---
license: mit
tags:
- Vision-Language-Action
- OpenHelix Team
base_model:
- Qwen/Qwen2.5-0.5B
language:
- en
pipeline_tag: robotics
---


<p align="center">
    <img src="https://huggingface.co/datasets/VLA-Adapter/Figures/resolve/main/Logo.png" width="1000"/>
</p>


# Model Card for VLA-Adapter Libero-Spatial
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model trained on Libero-Spatial.
- 💬 Project page: [https://vla-adapter.github.io/](https://vla-adapter.github.io/)
- 🖥️ Dataset: [https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main](https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main)
- 🤗 HuggingFace: [https://huggingface.co/VLA-Adapter](https://huggingface.co/VLA-Adapter)
- Github: [https://github.com/OpenHelix-Team/VLA-Adapter](https://github.com/OpenHelix-Team/VLA-Adapter)

## Model Details
We have developed and released the VLA-Adapter family of VLA models, a series of fine-tuned generative 
action models. The VLA-Adapter VLM follows the Prismatic-VLM architecture, using only a very small backbone 
(Qwen2.5-0.5B) for the LLM. On common robotics benchmarks, it surpasses open-source VLA models with 8.5B, 
7B, 4B, 3B, and 2B backbones.

**Input:** The model takes an image and a natural-language instruction as input.

**Output:** The model generates robot actions only.
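
A minimal usage sketch is shown below. The loader and `predict_action` call are hypothetical placeholder names, since this card does not document the inference API; the actual entry points are in the GitHub repository linked above.

```python
from PIL import Image

# Hypothetical import and helper names, for illustration only; see
# https://github.com/OpenHelix-Team/VLA-Adapter for the real inference API.
from vla_adapter import load_vla_adapter  # placeholder name

# Placeholder checkpoint path for the Libero-Spatial model.
model = load_vla_adapter("path/to/vla-adapter-libero-spatial")

image = Image.open("observation.png")  # current camera frame
instruction = "pick up the black bowl and place it on the plate"

# Input: image + language instruction. Output: a robot action
# (e.g. end-effector deltas and a gripper command) for the controller.
action = model.predict_action(image, instruction)
print(action)
```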

**Model Architecture:** VLA-Adapter consists of a VLM that receives and processes image and text 
information, and a policy that generates actions. We systematically analyzed the benefits the VLM 
provides to different types of policy conditions and derived a unified framework. We then used our 
Bridge Attention module to fuse the conditions produced by the VLM with the initial action 
information in the policy, bridging the gap between vision-language representations and actions as 
much as possible. The result is a high-performance VLA model built on a tiny-scale backbone.
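
The card describes Bridge Attention only at a high level. As a rough illustration of the general idea (initial action tokens attending to VLM-generated conditions via cross-attention), the PyTorch sketch below uses a standard `nn.MultiheadAttention` layer; the dimensions, residual wiring, and class name are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BridgeAttentionSketch(nn.Module):
    """Illustrative cross-attention fusion: action queries attend to VLM features.

    A simplified stand-in for the Bridge Attention module described above,
    not the actual VLA-Adapter implementation.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, action_queries: torch.Tensor, vlm_features: torch.Tensor) -> torch.Tensor:
        # action_queries: (B, num_action_tokens, dim) initial action information from the policy
        # vlm_features:   (B, num_vl_tokens, dim) vision-language conditions from the VLM
        fused, _ = self.cross_attn(action_queries, vlm_features, vlm_features)
        return self.norm(action_queries + fused)  # residual + norm (assumed wiring)


# Toy usage with random tensors, just to show the shapes involved.
bridge = BridgeAttentionSketch()
out = bridge(torch.randn(2, 8, 512), torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 8, 512])
```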


### Success Rate Comparison
<table>
  <tr>
   <td><strong>LIBERO</strong></td>  <td><strong>Methods</strong></td>
   <td><strong>Scale</strong></td>  <td><strong>Spatial</strong></td>
   <td><strong>Object</strong></td>  <td><strong>Goal</strong></td>
   <td><strong>Long</strong></td>  <td><strong>Avg.</strong></td>
  </tr>

  <tr><td rowspan="10">Large-scale</td><td>FlowVLA (Zhong et al., 2025)</td>
   <td>8.5B</td><td>93.2</td><td>95.0</td><td>91.6</td><td>72.6</td><td>88.1</td></tr>

  <tr><td>UnifiedVLA (Wang et al., 2025)</td>
   <td>8.5B</td><td>95.4</td><td><i><u>98.8*</u></i></td><td> 93.6 </td><td>94.0 </td><td>95.5</td></tr>

  <tr><td>OpenVLA (Kim et al., 2024)</td>
   <td>7B</td><td>84.7</td><td>88.4</td><td>79.2</td><td>53.7</td><td>76.5</td></tr>

  <tr><td>OpenVLA-OFT (Kim et al., 2025)</td>
   <td>7B</td><td><i><u>97.6*</u></i></td><td>98.4</td><td><b>97.9</b></td><td><i><u>94.5*</u></i></td><td><i><u>97.1*</u></i></td></tr>

  <tr><td>UniVLA (Bu et al., 2025)</td>
   <td>7B</td><td>96.5</td><td> 96.8</td><td> 95.6 </td><td>92.0 </td><td>95.2</td></tr>

  <tr><td>CoT-VLA (Zhao et al., 2025)</td>
   <td>7B</td><td>87.5 </td><td>91.6 </td><td>87.6</td><td> 69.0</td><td> 81.1</td></tr>

  <tr><td>WorldVLA (Cen et al., 2025)</td>
   <td>7B</td><td>87.6</td><td> 96.2</td><td> 83.4</td><td> 60.0</td><td> 81.8</td></tr>

  <tr><td>TraceVLA (Zheng et al., 2025)</td>
   <td>7B</td><td>84.6</td><td> 85.2</td><td> 75.1</td><td> 54.1</td><td> 74.8</td></tr>

  <tr><td>MolmoAct (Lee et al., 2025)</td>
   <td>7B</td><td>87.0</td><td> 95.4 </td><td>87.6</td><td> 77.2 </td><td>86.6</td></tr>

  <tr><td>ThinkAct (Huang et al., 2025)</td>
   <td>7B</td><td>88.3 </td><td>91.4</td><td> 87.1</td><td> 70.9</td><td> 84.4</td></tr>

  <tr><td rowspan="7">Small-scale</td><td>4D-VLA (Zhang et al., 2025)</td>
   <td>4B</td><td>88.9</td><td> 95.2</td><td> 90.9</td><td> 79.1 </td><td>88.6</td></tr>

  <tr><td>SpatialVLA (Qu et al., 2025)</td>
   <td>4B</td><td>88.2</td><td> 89.9</td><td> 78.6</td><td> 55.5 </td><td>78.1</td></tr>

  <tr><td>π0 (Black et al., 2024)</td>
   <td>3B</td><td>96.8</td><td><i><u>98.8*</u></i></td><td>95.8</td><td> 85.2</td><td> 94.2</td></tr>

  <tr><td>π0-FAST (Pertsch et al., 2025)</td>
   <td>3B</td><td>96.4</td><td> 96.8 </td><td>88.6</td><td> 60.2</td><td> 85.5</td></tr>

  <tr><td>NORA (Hung et al., 2025)</td>
   <td>3B</td><td>92.2 </td><td>95.4 </td><td>89.4</td><td> 74.6 </td><td>87.9</td></tr>

  <tr><td>SmolVLA (Shukor et al., 2025)</td>
   <td>2.2B</td><td>93.0</td><td> 94.0 </td><td>91.0</td><td> 77.0 </td><td>88.8</td></tr>

  <tr><td>GR00T N1 (NVIDIA et al., 2025)</td>
   <td>2B</td><td>94.4</td><td> 97.6 </td><td>93.0 </td><td>90.6</td><td> 93.9</td></tr>

  <tr><td rowspan="5">Tiny-scale</td><td>Seer (Tian et al., 2025)</td>
   <td>0.57B</td><td>-</td><td> - </td><td>- </td><td>78.7</td><td> 78.7</td></tr>

  <tr><td>VLA-OS (Gao et al., 2025)</td>
   <td>0.5B</td><td>87.0 </td><td>96.5</td><td> 92.7 </td><td>66.0</td><td> 85.6</td></tr>

  <tr><td>Diffusion Policy (Chi et al., 2023)</td>
   <td>-</td><td>78.3</td><td> 92.5</td><td> 68.3 </td><td>50.5 </td><td>72.4</td></tr>

  <tr><td><b>VLA-Adapter (Ours)</b></td>
   <td><b>0.5B</b></td><td><b>97.8</b></td><td><b>99.2</b></td><td><i><u>97.2*</u></i></td><td> <b>95.0 </b></td><td><b>97.3</b></td></tr>

  <tr><td><b>VLA-Adapter-Pro (Ours)</b></td>
   <td><b>0.5B</b></td><td><b><i>99.6</i></b></td><td><b><i>99.6</i></b> </td><td><b><i>98.2</i></b></td><td><b><i>96.4</i></b></td><td><b><i>98.5</i></b></td></tr>
  
</table>


<table>
  <tr>
   <td><strong>CALVIN</strong></td>  <td><strong>Methods</strong></td>
   <td><strong>Scale</strong></td>  <td><strong>1</strong></td>
   <td><strong>2</strong></td>  <td><strong>3</strong></td>
   <td><strong>4</strong></td>  <td><strong>5</strong></td> <td><strong>Avg. len</strong></td>
  </tr>

  <tr><td rowspan="8">Large-scale</td><td>UniVLA (Bu et al., 2025) </td><td>7B </td><td>95.5 </td><td>85.8 </td><td>75.4</td><td> 66.9 </td><td>56.5 </td><td>3.80</tr>

  <tr><td>OpenVLA (Kim et al., 2024) </td><td> 7B</td><td> 91.3</td><td> 77.8 </td><td>62.0 </td><td>52.1 </td><td>43.5</td><td> 3.27</td></tr>

  <tr><td>OpenVLA-OFT (Kim et al., 2025)</td><td> 7B</td><td> 96.3</td><td> 89.1 </td><td>82.4</td><td> 75.8</td><td> 66.5</td><td> 4.10</td></tr>

  <tr><td>VLAS (Zhao et al., 2025b) </td><td> 7B</td><td> 87.2 </td><td>64.2</td><td> 40.9 </td><td>28.1</td><td> 19.6 </td><td>2.40</td></tr>

  <tr><td>LCB (Shentu et al., 2024) </td><td> 7B</td><td> 73.6 </td><td>50.2 </td><td>28.5 </td><td>16.0 </td><td>9.9 </td><td>1.78</td></tr>

  <tr><td>RoboDual (Bu et al., 2024a) </td><td> 7B</td><td> 94.4</td><td> 82.7</td><td> 72.1</td><td> 62.4 </td><td>54.4</td><td> 3.66</td></tr>

  <tr><td>OpenHelix (Cui et al., 2025)  </td><td> 7B</td><td> <i><u>97.1*</u></i> </td><td>91.4 </td><td>82.8</td><td> 72.6</td><td> 64.1 </td><td>4.08</td></tr>

  <tr><td>ReconVLA (Song et al., 2025c)  </td><td> 7B</td><td> 95.6 </td><td>87.6 </td><td>76.9</td><td> 69.3</td><td> 64.1 </td><td>3.95</td></tr>

  <tr><td rowspan="4">Small-scale</td><td>DeeR (Yue et al., 2024) </td><td> 3B</td><td> 86.2</td><td> 70.1 </td><td>51.8</td><td> 41.5</td><td> 30.4 </td><td>2.82</td></tr>

  <tr><td>RoboFlamingo (Li et al., 2024b) </td><td> 3B</td><td> 82.4 </td><td>61.9</td><td> 46.6 </td><td>33.1</td><td> 23.5</td><td> 2.48</td></tr>

  <tr><td>VPP (Hu et al., 2025)</td><td>  1.5B</td><td>  95.7</td><td>  91.2</td><td>  <i><u>86.3*</u></i></td><td>  <i><u>81.0*</u></i></td><td>  <i><u>75.0*</u></i></td><td>  <i><u>4.33*</u></i></td></tr>

  <tr><td>SuSIE (Black et al., 2024)</td><td>1.3B</td><td> 87.0</td><td> 69.0</td><td> 49.0 </td><td>38.0</td><td> 26.0</td><td> 2.69</td></tr>

  <tr><td rowspan="5">Tiny-scale</td><td>Seer-Large (Tian et al., 2025)</td><td>0.57B</td><td> 96.3 </td><td><i><u>91.6*</u></i></td><td> 86.1 </td><td>80.3 </td><td>74.0</td><td> 4.28</td></tr>

  <tr><td>MoDE (Reuss et al., 2025) </td><td> 0.44B </td><td>96.2</td><td> 88.9</td><td> 81.1</td><td> 71.8 </td><td>63.5 </td><td>4.01</td></tr>

  <tr><td>Seer (Tian et al., 2025) </td><td> 0.32B</td><td> 94.4 </td><td>87.2 </td><td>79.9 </td><td>72.2 </td><td>64.3</td><td> 3.98</td></tr>

  <tr><td><b>VLA-Adapter (Ours)</b></td>
   <td><b>0.5B</b></td><td><b><i>99.1</i></b> </td><td><b>94.6</b> </td><td><b>88.8</b></td><td> <b>82.8</b> </td><td><b>76.5</b> </td><td><b>4.42</b></td></tr>

  <tr><td><b>VLA-Adapter-Pro (Ours)</b></td>
   <td><b>0.5B</b></td><td><b>98.5</b></td><td><b><i>95.0</i></b> </td><td><b><i>90.5</i></b></td><td><b><i>85.3</i></b></td><td><b><i>80.0</i></b></td><td><b><i>4.50</i></b></td></tr>
  
</table>

## Citation instructions

```BibTeX
@article{wang2025vlaadapter,
  author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
  title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
  journal={arXiv preprint arXiv:2509.09372},
  year={2025}
}
```