---
license: mit
tags:
- Vision-Language-Action
- OpenHelix Team
base_model:
- Qwen/Qwen2.5-0.5B
language:
- en
pipeline_tag: robotics
---
<p align="center">
<img src="https://huggingface.co/datasets/VLA-Adapter/Figures/resolve/main/Logo.png" width="1000"/>
</p>
# Model Card for VLA-Adapter Libero-Spatial
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model, trained on the LIBERO-Spatial task suite.
- 💬 Project page: [https://vla-adapter.github.io/](https://vla-adapter.github.io/)
- 🖥️ Dataset: [https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main](https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main)
- 🤗 HuggingFace: [https://huggingface.co/VLA-Adapter](https://huggingface.co/VLA-Adapter)
- Github: [https://github.com/OpenHelix-Team/VLA-Adapter](https://github.com/OpenHelix-Team/VLA-Adapter)
## Model Details
We have developed and released the VLA-Adapter family of VLA models, a series of fine-tuned generative
action models. The VLA-Adapter VLM follows the Prismatic-VLM architecture and uses only a very small
LLM backbone (Qwen2.5-0.5B). On common robotics benchmarks, it surpasses open-source VLA models built
on 8.5B, 7B, 4B, 3B, and 2B backbones.
**Input:** The model takes an image and a text instruction as input.
**Output:** The model generates robot actions only.
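To show the intended input-output flow, here is a minimal inference sketch. The repo id `VLA-Adapter/LIBERO-Spatial`, the OpenVLA-style `predict_action` entry point, and the `unnorm_key` value are assumptions made for illustration only; the official GitHub repository documents the actual interface and preprocessing.
```python
# Minimal inference sketch (NOT the official interface): assumes an
# OpenVLA-style `predict_action` head exposed via trust_remote_code.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "VLA-Adapter/LIBERO-Spatial"  # hypothetical repo id for illustration

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

image = Image.open("observation.png")  # current camera frame from the robot
prompt = "In: What action should the robot take to pick up the black bowl?\nOut:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", dtype=torch.bfloat16)
# `predict_action` and `unnorm_key` follow the OpenVLA convention and are assumed here.
action = model.predict_action(**inputs, unnorm_key="libero_spatial", do_sample=False)
print(action)  # continuous end-effector action (e.g., Δposition, Δrotation, gripper)
```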
**Model Architecture:** VLA-Adapter consists of a VLM that receives and processes image and text
information, and a policy that generates actions. We systematically analyzed the benefits the VLM
provides to different types of policy conditions and derived a unified framework from this analysis.
Our Bridge Attention module then fuses the conditions produced by the VLM with the initial action
information in the policy, bridging the gap between vision-language representations and actions as far as possible.
The result is a high-performance VLA model built on a tiny-scale backbone, sketched below.
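To make the fusion step concrete, the following is an illustrative PyTorch sketch of a Bridge-Attention-style block: initial action queries in the policy attend to condition tokens taken from the VLM, and a gated residual injects the fused signal. The dimensions (e.g., the 896-dim hidden size of Qwen2.5-0.5B) and the gating scheme are assumptions for exposition, not the released implementation.
```python
# Illustrative Bridge-Attention-style fusion block (assumed design, not the
# official VLA-Adapter code): action queries cross-attend to VLM conditions,
# and a zero-initialized gate controls how much fused signal is injected.
import torch
import torch.nn as nn

class BridgeAttentionBlock(nn.Module):
    def __init__(self, action_dim: int = 256, cond_dim: int = 896, n_heads: int = 8):
        super().__init__()
        self.to_q = nn.Linear(action_dim, action_dim)
        self.to_kv = nn.Linear(cond_dim, 2 * action_dim)
        self.attn = nn.MultiheadAttention(action_dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # start with no injection
        self.norm = nn.LayerNorm(action_dim)

    def forward(self, action_tokens: torch.Tensor, vlm_conditions: torch.Tensor) -> torch.Tensor:
        # action_tokens:  (B, T_a, action_dim) initial action queries in the policy
        # vlm_conditions: (B, T_c, cond_dim)   intermediate VLM features used as conditions
        q = self.to_q(self.norm(action_tokens))
        k, v = self.to_kv(vlm_conditions).chunk(2, dim=-1)
        fused, _ = self.attn(q, k, v)
        return action_tokens + torch.tanh(self.gate) * fused  # gated residual injection

# Toy usage with random tensors
block = BridgeAttentionBlock()
actions = torch.randn(2, 8, 256)    # 8 action queries
conds = torch.randn(2, 64, 896)     # 64 condition tokens from the 0.5B VLM
print(block(actions, conds).shape)  # torch.Size([2, 8, 256])
```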
### Success Rate Comparison
<table>
<tr>
<td><strong>LIBERO</strong></td> <td><strong>Methods</strong></td>
<td><strong>Scale</strong></td> <td><strong>Spatial</strong></td>
<td><strong>Object</strong></td> <td><strong>Goal</strong></td>
<td><strong>Long</strong></td> <td><strong>Avg.</strong></td>
</tr>
<tr><td rowspan="10">Large-scale</td><td>FlowVLA (Zhong et al., 2025)</td>
<td>8.5B</td><td>93.2</td><td>95.0</td><td>91.6</td><td>72.6</td><td>88.1</td></tr>
<tr><td>UnifiedVLA (Wang et al., 2025)</td>
<td>8.5B</td><td>95.4</td><td><i><u>98.8*</u></i></td><td> 93.6 </td><td>94.0 </td><td>95.5</td></tr>
<tr><td>OpenVLA (Kim et al., 2024)</td>
<td>7B</td><td>84.7</td><td>88.4</td><td>79.2</td><td>53.7</td><td>76.5</td></tr>
<tr><td>OpenVLA-OFT (Kim et al., 2025)</td>
<td>7B</td><td><i><u>97.6*</u></i></td><td>98.4</td><td><b>97.9</b></td><td><i><u>94.5*</u></i></td><td><i><u>97.1*</u></i></td></tr>
<tr><td>UniVLA (Bu et al., 2025)</td>
<td>7B</td><td>96.5</td><td> 96.8</td><td> 95.6 </td><td>92.0 </td><td>95.2</td></tr>
<tr><td>CoT-VLA (Zhao et al., 2025)</td>
<td>7B</td><td>87.5 </td><td>91.6 </td><td>87.6</td><td> 69.0</td><td> 81.1</td></tr>
<tr><td>WorldVLA (Cen et al., 2025)</td>
<td>7B</td><td>87.6</td><td> 96.2</td><td> 83.4</td><td> 60.0</td><td> 81.8</td></tr>
<tr><td>TraceVLA (Zheng et al., 2025)</td>
<td>7B</td><td>84.6</td><td> 85.2</td><td> 75.1</td><td> 54.1</td><td> 74.8</td></tr>
<tr><td>MolmoAct (Lee et al., 2025)</td>
<td>7B</td><td>87.0</td><td> 95.4 </td><td>87.6</td><td> 77.2 </td><td>86.6</td></tr>
<tr><td>ThinkAct (Huang et al., 2025)</td>
<td>7B</td><td>88.3 </td><td>91.4</td><td> 87.1</td><td> 70.9</td><td> 84.4</td></tr>
<tr><td rowspan="7">Small-scale</td><td>4D-VLA (Zhang et al., 2025)</td>
<td>4B</td><td>88.9</td><td> 95.2</td><td> 90.9</td><td> 79.1 </td><td>88.6</td></tr>
<tr><td>SpatialVLA (Qu et al., 2025)</td>
<td>4B</td><td>88.2</td><td> 89.9</td><td> 78.6</td><td> 55.5 </td><td>78.1</td></tr>
<tr><td>π0 (Black et al., 2024)</td>
<td>3B</td><td>96.8</td><td><i><u>98.8*</u></i></td><td>95.8</td><td> 85.2</td><td> 94.2</td></tr>
<tr><td>π0-FAST (Pertsch et al., 2025)</td>
<td>3B</td><td>96.4</td><td> 96.8 </td><td>88.6</td><td> 60.2</td><td> 85.5</td></tr>
<tr><td>NORA (Hung et al., 2025)</td>
<td>3B</td><td>92.2 </td><td>95.4 </td><td>89.4</td><td> 74.6 </td><td>87.9</td></tr>
<tr><td>SmolVLA (Shukor et al., 2025)</td>
<td>2.2B</td><td>93.0</td><td> 94.0 </td><td>91.0</td><td> 77.0 </td><td>88.8</td></tr>
<tr><td>GR00T N1 (NVIDIA et al., 2025)</td>
<td>2B</td><td>94.4</td><td> 97.6 </td><td>93.0 </td><td>90.6</td><td> 93.9</td></tr>
<tr><td rowspan="5">Tiny-scale</td><td>Seer (Tian et al., 2025)</td>
<td>0.57B</td><td>-</td><td> - </td><td>- </td><td>78.7</td><td> 78.7</td></tr>
<tr><td>VLA-OS (Gao et al., 2025)</td>
<td>0.5B</td><td>87.0 </td><td>96.5</td><td> 92.7 </td><td>66.0</td><td> 85.6</td></tr>
<tr><td>Diffusion Policy (Chi et al., 2023)</td>
<td>-</td><td>78.3</td><td> 92.5</td><td> 68.3 </td><td>50.5 </td><td>72.4</td></tr>
<tr><td><b>VLA-Adapter (Ours)</b></td>
<td><b>0.5B</b></td><td><b>97.8</b></td><td><b>99.2</b></td><td><i><u>97.2*</u></i></td><td> <b>95.0 </b></td><td><b>97.3</b></td></tr>
<tr><td><b>VLA-Adapter-Pro (Ours)</b></td>
<td><b>0.5B</b></td><td><b><i>99.6</i></b></td><td><b><i>99.6</i></b> </td><td><b><i>98.2</i></b></td><td><b><i>96.4</i></b></td><td><b><i>98.5</i></b></td></tr>
</table>
<table>
<tr>
<td><strong>CALVIN</strong></td> <td><strong>Methods</strong></td>
<td><strong>Scale</strong></td> <td><strong>1</strong></td>
<td><strong>2</strong></td> <td><strong>3</strong></td>
<td><strong>4</strong></td> <td><strong>5</strong></td> <td><strong>Avg. len</strong></td>
</tr>
<tr><td rowspan="8">Large-scale</td><td>UniVLA (Bu et al., 2025)</td><td>7B</td><td>95.5</td><td>85.8</td><td>75.4</td><td>66.9</td><td>56.5</td><td>3.80</td></tr>
<tr><td>OpenVLA (Kim et al., 2024) </td><td> 7B</td><td> 91.3</td><td> 77.8 </td><td>62.0 </td><td>52.1 </td><td>43.5</td><td> 3.27</td></tr>
<tr><td>OpenVLA-OFT (Kim et al., 2025)</td><td> 7B</td><td> 96.3</td><td> 89.1 </td><td>82.4</td><td> 75.8</td><td> 66.5</td><td> 4.10</td></tr>
<tr><td>VLAS (Zhao et al., 2025b) </td><td> 7B</td><td> 87.2 </td><td>64.2</td><td> 40.9 </td><td>28.1</td><td> 19.6 </td><td>2.40</td></tr>
<tr><td>LCB (Shentu et al., 2024) </td><td> 7B</td><td> 73.6 </td><td>50.2 </td><td>28.5 </td><td>16.0 </td><td>9.9 </td><td>1.78</td></tr>
<tr><td>RoboDual (Bu et al., 2024a) </td><td> 7B</td><td> 94.4</td><td> 82.7</td><td> 72.1</td><td> 62.4 </td><td>54.4</td><td> 3.66</td></tr>
<tr><td>OpenHelix (Cui et al., 2025) </td><td> 7B</td><td> <i><u>97.1*</u></i> </td><td>91.4 </td><td>82.8</td><td> 72.6</td><td> 64.1 </td><td>4.08</td></tr>
<tr><td>ReconVLA (Song et al., 2025c) </td><td> 7B</td><td> 95.6 </td><td>87.6 </td><td>76.9</td><td> 69.3</td><td> 64.1 </td><td>3.95</td></tr>
<tr><td rowspan="4">Small-scale</td><td>DeeR (Yue et al., 2024) </td><td> 3B</td><td> 86.2</td><td> 70.1 </td><td>51.8</td><td> 41.5</td><td> 30.4 </td><td>2.82</td></tr>
<tr><td>RoboFlamingo (Li et al., 2024b) </td><td> 3B</td><td> 82.4 </td><td>61.9</td><td> 46.6 </td><td>33.1</td><td> 23.5</td><td> 2.48</td></tr>
<tr><td>VPP (Hu et al., 2025)</td><td> 1.5B</td><td> 95.7</td><td> 91.2</td><td> <i><u>86.3*</u></i></td><td> <i><u>81.0*</u></i></td><td> <i><u>75.0*</u></i></td><td> <i><u>4.33*</u></i></td></tr>
<tr><td>SuSIE (Black et al., 2024)</td><td>1.3B</td><td> 87.0</td><td> 69.0</td><td> 49.0 </td><td>38.0</td><td> 26.0</td><td> 2.69</td></tr>
<tr><td rowspan="5">Tiny-scale</td><td>Seer-Large (Tian et al., 2025)</td><td>0.57B</td><td> 96.3 </td><td><i><u>91.6*</u></i></td><td> 86.1 </td><td>80.3 </td><td>74.0</td><td> 4.28</td></tr>
<tr><td>MoDE (Reuss et al., 2025) </td><td> 0.44B </td><td>96.2</td><td> 88.9</td><td> 81.1</td><td> 71.8 </td><td>63.5 </td><td>4.01</td></tr>
<tr><td>Seer (Tian et al., 2025) </td><td> 0.32B</td><td> 94.4 </td><td>87.2 </td><td>79.9 </td><td>72.2 </td><td>64.3</td><td> 3.98</td></tr>
<tr><td><b>VLA-Adapter (Ours)</b></td>
<td><b>0.5B</b></td><td><b><i>99.1</i></b> </td><td><b>94.6</b> </td><td><b>88.8</b></td><td> <b>82.8</b> </td><td><b>76.5</b> </td><td><b>4.42</b></td></tr>
<tr><td><b>VLA-Adapter-Pro (Ours)</b></td>
<td><b>0.5B</b></td><td><b>98.5</b></td><td><b><i>95.0</i></b> </td><td><b><i>90.5</i></b></td><td><b><i>85.3</i></b></td><td><b><i>80.0</i></b></td><td><b><i>4.50</i></b></td></tr>
</table>
## Citation instructions
```BibTeX
@article{wang2025vlaadapter,
author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
journal={arXiv preprint arXiv:2509.09372},
year={2025}
}
``` |