# WALL-OSS

<div align="left">

<p align="center">
    <img src="assets/logo.png" width="600"/>
</p>

<div align="center">

[![Paper](https://img.shields.io/badge/📄%20Paper-PDF-EA1B22?style=for-the-badge&logo=adobeacrobatreader&logoColor=fff)](https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf)
&nbsp;&nbsp;
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-x--square--robot-FFB000?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/x-square-robot)
&nbsp;&nbsp;
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=fff)](https://github.com/X-Square-Robot/wall-x)
&nbsp;&nbsp;
[![Project Page](https://img.shields.io/badge/Project-1E90FF?style=for-the-badge&logo=google-chrome&logoColor=fff)](https://x2robot.com/en/research/68bc2cde8497d7f238dde690)

</div>

</div>

## <a href="https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf" target="_blank"><strong>WALL-OSS: Igniting VLMs toward the Embodied Space</strong></a>

We introduce **WALL-OSS**, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability.
Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that together enable Unified Cross-Level CoT, seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework.
Our results show that WALL-OSS attains a high success rate on complex, long-horizon manipulation tasks, demonstrates strong instruction-following and complex understanding and reasoning capabilities, and outperforms strong baselines, providing a reliable and scalable path from VLMs to embodied foundation models.

## 🎬 Video Demos

<div align="center">
    <video width="80%" controls>
        <source src="https://x2robot.com/api/videos/file/wall-oss_top_720p-1.mp4" type="video/mp4">
        Your browser does not support the video tag.
    </video>
    <p><strong>WALL-OSS in Action: Demonstrating advanced manipulation capabilities and embodied AI performance</strong></p>
</div>



## 🚀 Quick Start

### Installation

```bash
# Create conda environment
conda create --name wallx python=3.10
conda activate wallx

# Install base requirements
pip install torch torchvision transformers
pip install huggingface_hub

# Install Wall-X from GitHub
git clone https://github.com/X-Square-Robot/wall-x.git
cd wall-x
pip install -e .
```
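
To sanity-check the editable install, you can import the model class used throughout this README. This is just a quick verification step and assumes the steps above completed without errors:

```bash
# Verify the install by importing the model class used in the examples below
python -c "from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction; print('wall-x import OK')"
```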

### Basic Usage

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load the model
model_path = "X-Square-Robot/wall-oss-flow"  # or your local path
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Your inference code here...
```
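
If the checkpoint is not already cached locally, `from_pretrained` will typically fetch it from the Hugging Face Hub. To download the weights explicitly first (for example, to pin a local path), here is a minimal sketch using `huggingface_hub` (installed above); the repo id is the same one used in the snippet:

```python
# Optional: download the checkpoint explicitly and load from the local path.
from huggingface_hub import snapshot_download

local_path = snapshot_download("X-Square-Robot/wall-oss-flow")
print(f"Checkpoint downloaded to: {local_path}")
```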

## 🎯 Supervised Fine-Tuning (SFT)

For training Wall-X on your robotics datasets, please refer to our comprehensive training guide:

**📖 [Training Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/workspace/README.md)**

The training process includes:
- **Dataset Preparation**: How to prepare your robotics datasets in LeRobot format (a minimal loading sketch follows this list)
- **Configuration Setup**: Detailed configuration for GPU setup, model paths, and robot DOF settings
- **Training Scripts**: Ready-to-use training scripts with proper hyperparameters
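
As a rough illustration of the LeRobot format mentioned above, the sketch below loads a converted dataset and inspects a single frame. This is a minimal sketch, not part of the Wall-X API: it assumes the `lerobot` package is installed (the import path may vary with the lerobot version), the repo id is a placeholder for your own dataset, and the exact keys depend on your conversion config; see the training documentation for the authoritative layout.

```python
# Hypothetical example: inspect a dataset converted to LeRobot format.
# "your-org/your-dataset" is a placeholder; substitute your converted dataset.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your-org/your-dataset")  # placeholder repo id
print(f"{len(dataset)} frames")

frame = dataset[0]
print(frame.keys())  # typically camera images, proprioceptive state, and actions
```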

### Quick Training Start

```bash
# Run training (see workspace/README.md for detailed configuration)
bash ./workspace/lerobot_example/run.sh
```

## 🔮 Inference

For detailed inference examples and model evaluation:

**📖 [Inference Documentation](https://github.com/X-Square-Robot/wall-x/tree/main/scripts)**

### Basic Inference Example

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load model
model_path = "X-Square-Robot/wall-x"
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Setup
batch_size = 1
seq_length = 50
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Prepare inputs (example with synthetic data)
torch.manual_seed(0)
input_ids = torch.randint(0, len(model.processor.tokenizer), (batch_size, seq_length), dtype=torch.long)
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
moe_token_types = torch.zeros((batch_size, seq_length), dtype=torch.long)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0).expand(batch_size, -1)

# Robotics-specific inputs
proprioception = torch.randn((batch_size, 1, 20), dtype=torch.float32)  # Joint states
agent_pos_mask = torch.ones((batch_size, 1, 20), dtype=torch.float32)
dof_mask = torch.ones((batch_size, 32, 20), dtype=torch.float32)  # DOF mask
dataset_names = ["x2_normal"]

# Move to device
inputs = {
    "input_ids": input_ids.to(device),
    "attention_mask": attention_mask.to(device),
    "moe_token_types": moe_token_types.to(device),
    "position_ids": position_ids.to(device),
    "proprioception": proprioception.to(device).bfloat16(),
    "agent_pos_mask": agent_pos_mask.to(device).bfloat16(),
    "dof_mask": dof_mask.to(device).bfloat16(),
    "dataset_names": dataset_names,
    "mode": "validate"
}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    print(f"Output logits shape: {outputs.logits.shape}")
```

### Advanced Inference Scripts

For production-ready inference and evaluation scripts:

```bash
# Basic inference test
python ./scripts/fake_inference.py

# Generate open-loop comparison plots
python ./scripts/draw_openloop_plot.py
```

**📁 [View all inference scripts](https://github.com/X-Square-Robot/wall-x/tree/main/scripts)**

## 📚 Complete Documentation

For comprehensive setup, training, and inference instructions:

### 🚀 **[Visit our GitHub Repository](https://github.com/X-Square-Robot/wall-x)**

The repository contains:
- **Detailed Installation Guide**: Complete environment setup with all dependencies
- **Training Tutorials**: Step-by-step SFT process with LeRobot datasets
- **Inference Examples**: Multiple inference scripts and evaluation tools
- **Configuration Templates**: Ready-to-use configs for different robot setups
- **Troubleshooting Guide**: Common issues and solutions

## 📄 Cite Us

If you find WALL-OSS models useful, please cite:

```bibtex
@misc{walloss_paper_2025,
  title        = {WALL-OSS: Igniting VLMs toward the Embodied Space},
  author       = {X Square Robot},
  year         = {2025},
  howpublished = {\url{https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf}},
  note         = {White paper}
}
```