---
license: apache-2.0
---

# QuixiGR00T-N1.5-3B-Zero

by Eric Hartford

I love GR00T, but NVIDIA's license - tsk-tsk, no no no, that won't do at all.

Also, all of their inference code is wrapped in hard-coded CUDA dependencies. Rude.

The world - our future, and our children's future - deserves a high-quality, permissively licensed robot control model that isn't tied to any specific hardware.

This repo contains a fully open-source, Apache-2.0 licensed, randomly initialized version of the GR00T-N1.5-3B architecture for humanoid robot control. It has exactly the same architecture as NVIDIA's GR00T-N1.5-3B, but with random weights.

And NO, it's NOT gonna be uncensored! It's driving a humanoid robot, you guys! I am not trying to burn down the world here! (You can easily finetune it to do ANYTHING you want it to.)

I created this model using [this script](init_DolphinGR00T_zero.py).

The purpose is to distill GR00T into an Apache-2.0 licensed version.

The whole job looks like this:

1) Make an Apache-2.0 licensed "blank slate" with the right shape (this repo).
2) Track down the sub-components that are already Apache-2.0 and bring those weights in (Qwen3-1.7B, for instance, is used as the language tower) - see the sketch after this list.
3) For the components that are still missing, find an initialization that's better than random - for example, merging similar models into the correct shape.
4) Distill GR00T onto it with online logit distillation (sketched after the diagram below). The model is small, so it's easy to load both models into VRAM.
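
For step 2, the transplant is essentially a shape-matched state-dict copy. Here is a minimal sketch, assuming the language tower's parameters live under a `backbone.eagle_model.language_model.` prefix with Qwen3-style names - check `model.state_dict().keys()` in your checkout first, because the real prefix may differ:

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM

student = AutoModel.from_pretrained("DolphinGR00T-N1.5-3B-Zero", trust_remote_code=True)
donor = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", torch_dtype=torch.bfloat16)

# Assumed prefix for the language tower inside the blank slate - verify against
# the actual key names before relying on this.
PREFIX = "backbone.eagle_model.language_model."

student_sd = student.state_dict()
copied, skipped = 0, 0
with torch.no_grad():
    for name, tensor in donor.state_dict().items():
        target = PREFIX + name
        # Only copy tensors whose shapes already match the blank slate.
        if target in student_sd and student_sd[target].shape == tensor.shape:
            student_sd[target].copy_(tensor.to(student_sd[target].dtype))
            copied += 1
        else:
            skipped += 1

print(f"copied {copied} tensors, skipped {skipped}")
student.save_pretrained("DolphinGR00T-N1.5-3B-qwen3-init")
```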
|

<img src="https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/SJo_YZmamnVRoBI4lMCoC.png" width="600" />
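
For step 4, "online logit distillation" just means running the NVIDIA teacher and this blank-slate student on the same batches and pushing the student's outputs toward the teacher's. A minimal sketch of one step, assuming both models expose a forward call that returns comparably shaped logits for the same batch - the real GR00T interfaces differ, so treat this as the shape of the loop, not a drop-in:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=2.0):
    """One online logit-distillation step: match the student to a frozen teacher."""
    teacher.eval()
    with torch.no_grad():
        # Assumed interface: both models map the same batch to logits of the same shape.
        teacher_logits = teacher(**batch)

    student_logits = student(**batch)

    # Standard soft-target KL distillation loss at a temperature.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```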
|

## Model Description

DolphinGR00T-N1.5-3B-Zero is a Vision-Language-Action (VLA) model designed for humanoid robot control:

- **Architecture**: Dual-system design with a vision-language backbone (Eagle-based, with a Qwen3 LLM) and a diffusion-transformer action head
- **Parameters**: 2,724M total (1,655M backbone in bfloat16, 1,069M action head in float32)
- **License**: Apache-2.0 (fully open source)
- **Weights**: Randomly initialized - no pre-training, ready for your own training
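
To sanity-check that parameter split on your own machine, you can count parameters per component after loading. A minimal sketch, assuming the top-level submodules are named `backbone` and `action_head` as they are in the usage examples below:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("DolphinGR00T-N1.5-3B-Zero", trust_remote_code=True)

def summarize(module, name):
    params = list(module.parameters())
    total = sum(p.numel() for p in params)
    dtypes = sorted({str(p.dtype) for p in params})
    print(f"{name}: {total / 1e6:.0f}M parameters, dtypes={dtypes}")

# Assumes the submodule names used throughout this card.
summarize(model.backbone, "backbone")
summarize(model.action_head, "action head")
summarize(model, "total")
```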
|

## Key Features

- ✅ **Exact architecture match** with NVIDIA GR00T-N1.5-3B
- ✅ **No license restrictions** - Apache-2.0 throughout
- ✅ **Mixed precision ready** - bfloat16 backbone, float32 action head
- ✅ **Multi-modal inputs** - images, language instructions, and robot proprioception
- ✅ **Continuous action output** via a diffusion transformer

## Installation

```bash
pip install torch transformers safetensors
```

## Usage

### Loading the Model

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the model (trust_remote_code is needed for the custom architecture)
model = AutoModel.from_pretrained(
    "DolphinGR00T-N1.5-3B-Zero",
    trust_remote_code=True,
    torch_dtype="auto"
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("DolphinGR00T-N1.5-3B-Zero")

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```

### Inference Example

```python
import torch
import numpy as np
from PIL import Image


def prepare_image(image_path, target_size=(224, 224)):
    """Prepare an image for model input."""
    image = Image.open(image_path).convert('RGB')
    image = image.resize(target_size)
    # Normalize to [-1, 1]
    image = np.array(image).astype(np.float32) / 127.5 - 1.0
    image = torch.from_numpy(image).permute(2, 0, 1)
    return image


def inference(model, tokenizer, image_paths, instruction, robot_state, device):
    """
    Run inference to generate robot actions.

    Args:
        image_paths: List of paths to camera images
        instruction: Natural language instruction
        robot_state: Current robot proprioception (joint angles, etc.)
        device: torch device

    Returns:
        actions: Predicted robot actions
    """
    model.eval()

    with torch.no_grad():
        # Prepare inputs
        images = torch.stack([prepare_image(path) for path in image_paths])
        images = images.unsqueeze(0).to(device)  # Add batch dimension

        # Tokenize the instruction
        text_inputs = tokenizer(
            instruction,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=256
        ).to(device)

        # Robot state (example: 32-dim joint angles)
        if isinstance(robot_state, list):
            robot_state = torch.tensor(robot_state, dtype=torch.float32)
        robot_state = robot_state.unsqueeze(0).to(device)

        # Forward pass through the backbone
        # Note: this is a simplified example - the exact calls depend on the model interface
        vision_features = model.backbone.eagle_model.vision_model(images)

        # Process language
        language_features = model.backbone.eagle_model.language_model.model(
            input_ids=text_inputs.input_ids,
            attention_mask=text_inputs.attention_mask
        ).last_hidden_state

        # Combine features (simplified - the actual fusion is more complex)
        combined_features = torch.cat([
            vision_features.mean(dim=1),    # Pool vision features
            language_features.mean(dim=1)   # Pool language features
        ], dim=-1)

        # Generate actions through the diffusion process
        # This is a simplified placeholder - actual diffusion requires multiple steps
        action_features = model.action_head.model(
            combined_features,
            timesteps=torch.zeros(1, device=device),
            context=robot_state
        )

        # Decode to the action space
        actions = model.action_head.action_decoder(action_features)

    return actions


# Example usage
image_paths = ["camera1.jpg", "camera2.jpg"]
instruction = "Pick up the red cube and place it on the table"
robot_state = torch.randn(32)  # Example: 32 joint angles

actions = inference(model, tokenizer, image_paths, instruction, robot_state, device)
print(f"Predicted actions shape: {actions.shape}")
```
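
The single call to the action head above collapses the diffusion process into one step to keep the example short. A trained action head generates actions by iterative denoising; here is a minimal Euler-style sketch of that loop, with an invented `noisy_actions` keyword - the real signature of `model.action_head.model` will differ, so adapt it to the actual interface:

```python
import torch

def sample_actions(action_head, condition, robot_state, action_dim=32, num_steps=10):
    """Illustrative iterative denoising loop (Euler integration of a learned velocity field)."""
    batch = condition.shape[0]
    device = condition.device
    # Start from Gaussian noise in the action space and refine it step by step.
    actions = torch.randn(batch, action_dim, device=device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((batch,), i * dt, device=device)
        # Hypothetical interface: the head predicts a velocity (or denoised estimate)
        # for the current noisy actions, conditioned on fused features and robot state.
        velocity = action_head.model(
            condition,
            timesteps=t,
            context=robot_state,
            noisy_actions=actions,
        )
        actions = actions + dt * velocity
    return actions
```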
|

### Training Example

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import get_linear_schedule_with_warmup


class RobotDataset(Dataset):
    """Example dataset for robot manipulation tasks"""

    def __init__(self, data_path, tokenizer, transform=None):
        self.data = []  # Load your data here
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a dict with keys: images, instruction, robot_state, target_actions
        sample = self.data[idx]

        # Process images
        images = torch.stack([self.transform(img) for img in sample['images']])

        # Tokenize instruction
        text = self.tokenizer(
            sample['instruction'],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=256
        )

        return {
            'images': images,
            'input_ids': text['input_ids'].squeeze(),
            'attention_mask': text['attention_mask'].squeeze(),
            'robot_state': torch.tensor(sample['robot_state'], dtype=torch.float32),
            'target_actions': torch.tensor(sample['target_actions'], dtype=torch.float32)
        }


def train_step(model, batch, criterion, device):
    """Single training step"""
    # Move batch to device
    images = batch['images'].to(device)
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    robot_state = batch['robot_state'].to(device)
    target_actions = batch['target_actions'].to(device)

    # Forward pass (simplified - the actual implementation may differ)
    # Process vision
    vision_features = model.backbone.eagle_model.vision_model(images)

    # Process language
    language_output = model.backbone.eagle_model.language_model.model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    language_features = language_output.last_hidden_state

    # Combine modalities
    combined_features = torch.cat([
        vision_features.mean(dim=1),
        language_features.mean(dim=1)
    ], dim=-1)

    # Generate actions (simplified diffusion)
    predicted_actions = model.action_head(
        combined_features,
        context=robot_state
    )

    # Compute loss
    loss = criterion(predicted_actions, target_actions)

    return loss


def train_model(model, train_dataset, val_dataset, config):
    """Main training loop"""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=config['batch_size'],
        shuffle=True,
        num_workers=4
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=config['batch_size'],
        shuffle=False,
        num_workers=4
    )

    # Set up the optimizer with different learning rates for backbone and action head
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.parameters(), 'lr': config['backbone_lr']},
        {'params': model.action_head.parameters(), 'lr': config['action_head_lr']}
    ], weight_decay=config['weight_decay'])

    # Learning rate scheduler
    num_training_steps = len(train_loader) * config['num_epochs']
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config['warmup_steps'],
        num_training_steps=num_training_steps
    )

    # Loss function
    criterion = nn.MSELoss()  # or nn.L1Loss() for action prediction

    # Training loop
    for epoch in range(config['num_epochs']):
        model.train()
        total_loss = 0

        for batch_idx, batch in enumerate(train_loader):
            optimizer.zero_grad()

            loss = train_step(model, batch, criterion, device)
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                model.parameters(),
                config['max_grad_norm']
            )

            optimizer.step()
            scheduler.step()

            total_loss += loss.item()

            if batch_idx % config['log_interval'] == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                loss = train_step(model, batch, criterion, device)
                val_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        avg_val_loss = val_loss / len(val_loader)

        print(f"Epoch {epoch}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        # Save checkpoint
        if (epoch + 1) % config['save_interval'] == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'train_loss': avg_train_loss,
                'val_loss': avg_val_loss,
            }, f"checkpoint_epoch_{epoch+1}.pt")


# Example configuration
config = {
    'batch_size': 16,
    'num_epochs': 100,
    'backbone_lr': 1e-5,
    'action_head_lr': 1e-4,
    'weight_decay': 0.01,
    'warmup_steps': 1000,
    'max_grad_norm': 1.0,
    'log_interval': 10,
    'save_interval': 10
}

# Create datasets (you need to implement the data loading)
# train_dataset = RobotDataset("path/to/train/data", tokenizer)
# val_dataset = RobotDataset("path/to/val/data", tokenizer)

# Train the model
# train_model(model, train_dataset, val_dataset, config)
```

### Fine-tuning Tips

1. **Mixed Precision Training**: The model is designed for mixed precision. Use `torch.cuda.amp` for faster training:

   ```python
   from torch.cuda.amp import GradScaler, autocast

   scaler = GradScaler()

   with autocast():
       loss = train_step(model, batch, criterion, device)

   scaler.scale(loss).backward()
   scaler.step(optimizer)
   scaler.update()
   ```

2. **Gradient Checkpointing**: For memory-efficient training:

   ```python
   model.backbone.eagle_model.language_model.gradient_checkpointing_enable()
   ```

3. **Frozen Backbone Training**: Start by training only the action head:

   ```python
   # Freeze the backbone
   for param in model.backbone.parameters():
       param.requires_grad = False

   # Train only the action head
   optimizer = torch.optim.AdamW(
       model.action_head.parameters(),
       lr=1e-4
   )
   ```

## Model Architecture

The model consists of two main components:

### 1. Vision-Language Backbone (System 2)
- **Vision Encoder**: Based on the Eagle vision model, with 27 transformer layers
- **Language Model**: Qwen3-based LLM with 12 layers and a hidden dimension of 2048
- **Cross-modal Fusion**: MLP connector between vision and language

### 2. Action Head (System 1)
- **Diffusion Transformer**: 16 DiT blocks for action generation
- **State Encoder**: Processes robot proprioception
- **Action Decoder**: Outputs continuous robot actions
- **Self-Attention Blocks**: 4 transformer blocks for vision-language features
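
If you want to check these numbers against the actual checkpoint, you can walk the module tree. A minimal sketch, assuming the submodule paths used elsewhere in this card (`backbone.eagle_model.vision_model`, `backbone.eagle_model.language_model`, `action_head`) and matching layers by class-name substrings that you may need to adjust:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("DolphinGR00T-N1.5-3B-Zero", trust_remote_code=True)

def count_modules(module, name_fragment):
    """Count submodules whose class name contains the given fragment (case-insensitive)."""
    return sum(1 for m in module.modules() if name_fragment.lower() in type(m).__name__.lower())

# The class-name fragments below are guesses - print the module tree to find the real ones.
print("LLM decoder layers:", count_modules(model.backbone.eagle_model.language_model, "DecoderLayer"))
print("Vision encoder layers:", count_modules(model.backbone.eagle_model.vision_model, "EncoderLayer"))
print("Action-head DiT blocks:", count_modules(model.action_head, "DiT"))

# Or inspect the top two levels of the layout directly.
for name, child in model.named_children():
    print(name, "->", [n for n, _ in child.named_children()])
```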
|

## Limitations

- This is a **blank model** with random weights - it requires training before use
- No pre-trained knowledge or capabilities
- Designed for humanoid robots, but it can be adapted to other embodiments
- Requires significant computational resources to train

## Citation

If you use this model in your research, please cite:

```bibtex
@software{DolphinGR00T2024,
  title={DolphinGR00T-N1.5-3B-Zero: A Permissively Licensed Reimplementation of GR00T-N1.5-3B},
  author={Eric Hartford},
  year={2024},
  license={Apache-2.0}
}
```

## License

Apache-2.0 - This model is fully open source, with no restrictions.

## Acknowledgments

This is an independent implementation of the GR00T architecture for the open-source community. The architecture is based on publicly available information about NVIDIA's GR00T-N1.5 model, but it contains no proprietary code or weights.