|
# WALL-OSS |
|
|
|
<div align="center">

<p align="center">
  <img src="assets/logo.png" width="600"/>
</p>

[📄 Paper](https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf) · [🤗 Hugging Face](https://huggingface.co/x-square-robot) · [💻 GitHub](https://github.com/X-Square-Robot/wall-x) · [🌐 Project Page](https://x2robot.com/en/research/68bc2cde8497d7f238dde690)

</div>
|
|
|
## <a href="https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf" target="_blank"><strong>WALL-OSS: Igniting VLMs toward the Embodied Space</strong></a> |
|
|
|
We introduce **WALL-OSS**, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability.
|
Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that together enable Unified Cross-Level Chain-of-Thought (CoT), seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework.
|
Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction following and sophisticated understanding and reasoning, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.
|
|
|
## 🎬 Video Demos |
|
|
|
<div align="center"> |
|
<video width="80%" controls> |
|
<source src="https://x2robot.com/api/videos/file/wall-oss_top_720p-1.mp4" type="video/mp4"> |
|
Your browser does not support the video tag. |
|
</video> |
|
<p><strong>WALL-OSS in Action: Demonstrating advanced manipulation capabilities and embodied AI performance</strong></p> |
|
</div> |
|
|
|
|
|
|
|
## 🚀 Quick Start |
|
|
|
### Installation |
|
|
|
```bash
# Create conda environment
conda create --name wallx python=3.10
conda activate wallx

# Install base requirements
pip install torch torchvision transformers
pip install huggingface_hub

# Install Wall-X from GitHub
git clone https://github.com/X-Square-Robot/wall-x.git
cd wall-x
pip install -e .
```
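As a quick sanity check before loading any weights (a minimal sketch; it assumes only PyTorch and the `wall_x` package installed above), you can confirm the editable install resolves and that a GPU is visible:

```python
import torch
import wall_x  # the package installed by `pip install -e .`

# Confirm the editable install resolves and report GPU availability
print(f"wall_x imported from: {wall_x.__file__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```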
|
|
|
### Basic Usage |
|
|
|
```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load the model
model_path = "X-Square-Robot/wall-oss-flow"  # or your local path
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Move to GPU (if available) in bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Your inference code here...
```
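If you prefer loading from a local directory (for example, on machines without Hub access at runtime), `huggingface_hub` can fetch the checkpoint ahead of time; this sketch assumes the same `X-Square-Robot/wall-oss-flow` repository used above:

```python
from huggingface_hub import snapshot_download

# Download the checkpoint (or reuse the cached copy), then pass the
# returned local directory to from_pretrained instead of the repo id
local_path = snapshot_download(repo_id="X-Square-Robot/wall-oss-flow")
print(f"Model files cached at: {local_path}")
```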
|
|
|
## 🎯 Supervised Fine-Tuning (SFT) |
|
|
|
To fine-tune WALL-OSS on your own robotics datasets, please refer to our comprehensive training guide:
|
|
|
**📖 [Training Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/workspace/README.md)** |
|
|
|
The training process includes: |
|
- **Dataset Preparation**: How to prepare your robotics datasets in LeRobot format |
|
- **Configuration Setup**: Detailed configuration for GPU setup, model paths, and robot DOF settings (see the illustrative sketch after this list)
|
- **Training Scripts**: Ready-to-use training scripts with proper hyperparameters |
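As a rough illustration of what that configuration covers, here is a hypothetical sketch (the field names below are placeholders, not the repository's actual schema; the real keys and file format are documented in workspace/README.md):

```python
# Hypothetical config sketch; see workspace/README.md for the actual schema.
train_config = {
    "model_path": "X-Square-Robot/wall-oss-flow",  # base checkpoint to fine-tune
    "dataset_root": "/path/to/lerobot_dataset",    # LeRobot-format dataset
    "num_gpus": 8,                                 # GPU setup
    "robot_dof": 20,                               # must match your robot's DOF layout
    "batch_size": 32,
    "learning_rate": 1e-4,
}
```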
|
|
|
### Quick Training Start |
|
|
|
```bash
# Run training (see workspace/README.md for detailed configuration)
bash ./workspace/lerobot_example/run.sh
```
|
|
|
## 🔮 Inference |
|
|
|
For detailed inference examples and model evaluation: |
|
|
|
**📖 [Inference Documentation](https://github.com/X-Square-Robot/wall-x/tree/main/scripts/)**
|
|
|
### Basic Inference Example |
|
|
|
```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load model
model_path = "X-Square-Robot/wall-oss-flow"  # or your local path
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Setup
batch_size = 1
seq_length = 50
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Prepare inputs (example with synthetic data)
torch.manual_seed(0)
input_ids = torch.randint(0, len(model.processor.tokenizer), (batch_size, seq_length), dtype=torch.long)
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
moe_token_types = torch.zeros((batch_size, seq_length), dtype=torch.long)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0).expand(batch_size, -1)

# Robotics-specific inputs (synthetic placeholders)
proprioception = torch.randn((batch_size, 1, 20), dtype=torch.float32)  # joint states, shape (batch, 1, proprio_dim)
agent_pos_mask = torch.ones((batch_size, 1, 20), dtype=torch.float32)   # validity mask for proprioception dims
dof_mask = torch.ones((batch_size, 32, 20), dtype=torch.float32)        # per-step DOF mask, shape (batch, 32, dof)
dataset_names = ["x2_normal"]

# Move to device
inputs = {
    "input_ids": input_ids.to(device),
    "attention_mask": attention_mask.to(device),
    "moe_token_types": moe_token_types.to(device),
    "position_ids": position_ids.to(device),
    "proprioception": proprioception.to(device).bfloat16(),
    "agent_pos_mask": agent_pos_mask.to(device).bfloat16(),
    "dof_mask": dof_mask.to(device).bfloat16(),
    "dataset_names": dataset_names,
    "mode": "validate",
}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    print(f"Output logits shape: {outputs.logits.shape}")
```
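For a rough sense of latency, you can time a forward pass on the same synthetic batch (a minimal sketch; real numbers depend on sequence length, image inputs, and hardware):

```python
import time

# Time one forward pass on the synthetic batch prepared above
with torch.no_grad():
    start = time.perf_counter()
    _ = model(**inputs)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

print(f"Single forward pass: {elapsed * 1000:.1f} ms")
```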
|
|
|
### Advanced Inference Scripts |
|
|
|
For production-ready inference and evaluation scripts: |
|
|
|
```bash
# Basic inference test
python ./scripts/fake_inference.py

# Generate open-loop comparison plots
python ./scripts/draw_openloop_plot.py
```
|
|
|
**📁 [View all inference scripts](https://github.com/X-Square-Robot/wall-x/tree/main/scripts)** |
|
|
|
## 📚 Complete Documentation |
|
|
|
For comprehensive setup, training, and inference instructions: |
|
|
|
### 🚀 **[Visit our GitHub Repository](https://github.com/X-Square-Robot/wall-x)** |
|
|
|
The repository contains: |
|
- **Detailed Installation Guide**: Complete environment setup with all dependencies |
|
- **Training Tutorials**: Step-by-step SFT process with LeRobot datasets |
|
- **Inference Examples**: Multiple inference scripts and evaluation tools |
|
- **Configuration Templates**: Ready-to-use configs for different robot setups |
|
- **Troubleshooting Guide**: Common issues and solutions |
|
|
|
## 📄 Cite Us |
|
|
|
If you find the WALL-OSS models useful, please cite:
|
|
|
```bibtex
@misc{walloss_paper_2025,
  title        = {WALL-OSS: Igniting VLMs toward the Embodied Space},
  author       = {X Square Robot},
  year         = {2025},
  howpublished = {\url{https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf}},
  note         = {White paper}
}
```
|
|