---
license: apache-2.0
---

# QuixiGR00T-N1.5-3B-Zero

by Eric Hartford

I love GR00T, but NVIDIA's license - tsk-tsk, no no no, that won't do at all.

Also, all of their inference code is wrapped in hard-coded CUDA dependencies. Rude.

The world - our future, and our children's future - deserves a high-quality, permissively licensed robot control model that isn't tied to any specific hardware.

This repo contains a fully open-source, Apache-2.0 licensed, randomly initialized version of the GR00T-N1.5-3B architecture for humanoid robot control. It has exactly the same architecture as NVIDIA's GR00T-N1.5-3B, but with random weights.

And NO, it's NOT gonna be uncensored! It's driving a humanoid robot, you guys! I am not trying to burn down the world here! (You can easily finetune it to do ANYTHING you want it to.)

I created this model using [this script](init_DolphinGR00T_zero.py).

The purpose is to distill GR00T into an Apache-2.0 licensed version.

The whole job looks like this:

1) Make an Apache-2.0 licensed "blank slate" with the right shape (this repo).
2) Track down the sub-components that are already Apache-2.0 and bring those weights in (Qwen3-1.7B, for instance, is used as the language tower) - see the sketch after this list.
3) For the components that are still missing, find an initialization that's better than random - for example, merging similar models into the correct shape.
4) Distill GR00T onto it with online logit distillation (sketched after the diagram below). The model is small, so it's easy to load both models into VRAM.
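
For step 2, the transplant is essentially a shape-matched state-dict copy. Here is a minimal sketch, assuming the language tower's parameters live under a `backbone.eagle_model.language_model.` prefix with Qwen3-style names - check `model.state_dict().keys()` in your checkout first, because the real prefix may differ:

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM

student = AutoModel.from_pretrained("DolphinGR00T-N1.5-3B-Zero", trust_remote_code=True)
donor = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", torch_dtype=torch.bfloat16)

# Assumed prefix for the language tower inside the blank slate - verify against
# the actual key names before relying on this.
PREFIX = "backbone.eagle_model.language_model."

student_sd = student.state_dict()
copied, skipped = 0, 0
with torch.no_grad():
    for name, tensor in donor.state_dict().items():
        target = PREFIX + name
        # Only copy tensors whose shapes already match the blank slate.
        if target in student_sd and student_sd[target].shape == tensor.shape:
            student_sd[target].copy_(tensor.to(student_sd[target].dtype))
            copied += 1
        else:
            skipped += 1

print(f"copied {copied} tensors, skipped {skipped}")
student.save_pretrained("DolphinGR00T-N1.5-3B-qwen3-init")
```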
|

<img src="https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/SJo_YZmamnVRoBI4lMCoC.png" width="600" />
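
For step 4, "online logit distillation" just means running the NVIDIA teacher and this blank-slate student on the same batches and pushing the student's outputs toward the teacher's. A minimal sketch of one step, assuming both models expose a forward call that returns comparably shaped logits for the same batch - the real GR00T interfaces differ, so treat this as the shape of the loop, not a drop-in:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=2.0):
    """One online logit-distillation step: match the student to a frozen teacher."""
    teacher.eval()
    with torch.no_grad():
        # Assumed interface: both models map the same batch to logits of the same shape.
        teacher_logits = teacher(**batch)

    student_logits = student(**batch)

    # Standard soft-target KL distillation loss at a temperature.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```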
|

## Model Description

DolphinGR00T-N1.5-3B-Zero is a Vision-Language-Action (VLA) model designed for humanoid robot control:

- **Architecture**: Dual-system design with a vision-language backbone (Eagle-based, with a Qwen3 LLM) and a diffusion-transformer action head
- **Parameters**: 2,724M total (1,655M backbone in bfloat16, 1,069M action head in float32)
- **License**: Apache-2.0 (fully open source)
- **Weights**: Randomly initialized - no pre-training, ready for your own training
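
To sanity-check that parameter split on your own machine, you can count parameters per component after loading. A minimal sketch, assuming the top-level submodules are named `backbone` and `action_head` as they are in the usage examples below:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("DolphinGR00T-N1.5-3B-Zero", trust_remote_code=True)

def summarize(module, name):
    params = list(module.parameters())
    total = sum(p.numel() for p in params)
    dtypes = sorted({str(p.dtype) for p in params})
    print(f"{name}: {total / 1e6:.0f}M parameters, dtypes={dtypes}")

# Assumes the submodule names used throughout this card.
summarize(model.backbone, "backbone")
summarize(model.action_head, "action head")
summarize(model, "total")
```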
|

## Key Features

- ✅ **Exact architecture match** with NVIDIA GR00T-N1.5-3B
- ✅ **No license restrictions** - Apache-2.0 throughout
- ✅ **Mixed precision ready** - bfloat16 backbone, float32 action head
- ✅ **Multi-modal inputs** - images, language instructions, and robot proprioception
- ✅ **Continuous action output** via a diffusion transformer

## Installation

```bash
pip install torch transformers safetensors
```

## Usage

### Loading the Model

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the model (trust_remote_code is needed for the custom architecture)
model = AutoModel.from_pretrained(
    "DolphinGR00T-N1.5-3B-Zero",
    trust_remote_code=True,
    torch_dtype="auto"
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("DolphinGR00T-N1.5-3B-Zero")

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```

### Inference Example

```python
import torch
import numpy as np
from PIL import Image


def prepare_image(image_path, target_size=(224, 224)):
    """Prepare an image for model input."""
    image = Image.open(image_path).convert('RGB')
    image = image.resize(target_size)
    # Normalize to [-1, 1]
    image = np.array(image).astype(np.float32) / 127.5 - 1.0
    image = torch.from_numpy(image).permute(2, 0, 1)
    return image


def inference(model, tokenizer, image_paths, instruction, robot_state, device):
    """
    Run inference to generate robot actions.

    Args:
        image_paths: List of paths to camera images
        instruction: Natural language instruction
        robot_state: Current robot proprioception (joint angles, etc.)
        device: torch device

    Returns:
        actions: Predicted robot actions
    """
    model.eval()

    with torch.no_grad():
        # Prepare inputs
        images = torch.stack([prepare_image(path) for path in image_paths])
        images = images.unsqueeze(0).to(device)  # Add batch dimension

        # Tokenize the instruction
        text_inputs = tokenizer(
            instruction,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=256
        ).to(device)

        # Robot state (example: 32-dim joint angles)
        if isinstance(robot_state, list):
            robot_state = torch.tensor(robot_state, dtype=torch.float32)
        robot_state = robot_state.unsqueeze(0).to(device)

        # Forward pass through the backbone
        # Note: this is a simplified example - the exact calls depend on the model interface
        vision_features = model.backbone.eagle_model.vision_model(images)

        # Process language
        language_features = model.backbone.eagle_model.language_model.model(
            input_ids=text_inputs.input_ids,
            attention_mask=text_inputs.attention_mask
        ).last_hidden_state

        # Combine features (simplified - the actual fusion is more complex)
        combined_features = torch.cat([
            vision_features.mean(dim=1),    # Pool vision features
            language_features.mean(dim=1)   # Pool language features
        ], dim=-1)

        # Generate actions through the diffusion process
        # This is a simplified placeholder - actual diffusion requires multiple steps
        action_features = model.action_head.model(
            combined_features,
            timesteps=torch.zeros(1, device=device),
            context=robot_state
        )

        # Decode to the action space
        actions = model.action_head.action_decoder(action_features)

    return actions


# Example usage
image_paths = ["camera1.jpg", "camera2.jpg"]
instruction = "Pick up the red cube and place it on the table"
robot_state = torch.randn(32)  # Example: 32 joint angles

actions = inference(model, tokenizer, image_paths, instruction, robot_state, device)
print(f"Predicted actions shape: {actions.shape}")
```
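
The single call to the action head above collapses the diffusion process into one step to keep the example short. A trained action head generates actions by iterative denoising; here is a minimal Euler-style sketch of that loop, with an invented `noisy_actions` keyword - the real signature of `model.action_head.model` will differ, so adapt it to the actual interface:

```python
import torch

def sample_actions(action_head, condition, robot_state, action_dim=32, num_steps=10):
    """Illustrative iterative denoising loop (Euler integration of a learned velocity field)."""
    batch = condition.shape[0]
    device = condition.device
    # Start from Gaussian noise in the action space and refine it step by step.
    actions = torch.randn(batch, action_dim, device=device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((batch,), i * dt, device=device)
        # Hypothetical interface: the head predicts a velocity (or denoised estimate)
        # for the current noisy actions, conditioned on fused features and robot state.
        velocity = action_head.model(
            condition,
            timesteps=t,
            context=robot_state,
            noisy_actions=actions,
        )
        actions = actions + dt * velocity
    return actions
```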
|

### Training Example

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import get_linear_schedule_with_warmup


class RobotDataset(Dataset):
    """Example dataset for robot manipulation tasks"""

    def __init__(self, data_path, tokenizer, transform=None):
        self.data = []  # Load your data here
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a dict with keys: images, instruction, robot_state, target_actions
        sample = self.data[idx]

        # Process images
        images = torch.stack([self.transform(img) for img in sample['images']])

        # Tokenize instruction
        text = self.tokenizer(
            sample['instruction'],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=256
        )

        return {
            'images': images,
            'input_ids': text['input_ids'].squeeze(),
            'attention_mask': text['attention_mask'].squeeze(),
            'robot_state': torch.tensor(sample['robot_state'], dtype=torch.float32),
            'target_actions': torch.tensor(sample['target_actions'], dtype=torch.float32)
        }


def train_step(model, batch, criterion, device):
    """Single training step"""
    # Move batch to device
    images = batch['images'].to(device)
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    robot_state = batch['robot_state'].to(device)
    target_actions = batch['target_actions'].to(device)

    # Forward pass (simplified - the actual implementation may differ)
    # Process vision
    vision_features = model.backbone.eagle_model.vision_model(images)

    # Process language
    language_output = model.backbone.eagle_model.language_model.model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    language_features = language_output.last_hidden_state

    # Combine modalities
    combined_features = torch.cat([
        vision_features.mean(dim=1),
        language_features.mean(dim=1)
    ], dim=-1)

    # Generate actions (simplified diffusion)
    predicted_actions = model.action_head(
        combined_features,
        context=robot_state
    )

    # Compute loss
    loss = criterion(predicted_actions, target_actions)

    return loss


def train_model(model, train_dataset, val_dataset, config):
    """Main training loop"""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=config['batch_size'],
        shuffle=True,
        num_workers=4
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=config['batch_size'],
        shuffle=False,
        num_workers=4
    )

    # Set up the optimizer with different learning rates for backbone and action head
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.parameters(), 'lr': config['backbone_lr']},
        {'params': model.action_head.parameters(), 'lr': config['action_head_lr']}
    ], weight_decay=config['weight_decay'])

    # Learning rate scheduler
    num_training_steps = len(train_loader) * config['num_epochs']
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config['warmup_steps'],
        num_training_steps=num_training_steps
    )

    # Loss function
    criterion = nn.MSELoss()  # or nn.L1Loss() for action prediction

    # Training loop
    for epoch in range(config['num_epochs']):
        model.train()
        total_loss = 0

        for batch_idx, batch in enumerate(train_loader):
            optimizer.zero_grad()

            loss = train_step(model, batch, criterion, device)
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                model.parameters(),
                config['max_grad_norm']
            )

            optimizer.step()
            scheduler.step()

            total_loss += loss.item()

            if batch_idx % config['log_interval'] == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                loss = train_step(model, batch, criterion, device)
                val_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        avg_val_loss = val_loss / len(val_loader)

        print(f"Epoch {epoch}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        # Save checkpoint
        if (epoch + 1) % config['save_interval'] == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'train_loss': avg_train_loss,
                'val_loss': avg_val_loss,
            }, f"checkpoint_epoch_{epoch+1}.pt")


# Example configuration
config = {
    'batch_size': 16,
    'num_epochs': 100,
    'backbone_lr': 1e-5,
    'action_head_lr': 1e-4,
    'weight_decay': 0.01,
    'warmup_steps': 1000,
    'max_grad_norm': 1.0,
    'log_interval': 10,
    'save_interval': 10
}

# Create datasets (you need to implement the data loading)
# train_dataset = RobotDataset("path/to/train/data", tokenizer)
# val_dataset = RobotDataset("path/to/val/data", tokenizer)

# Train the model
# train_model(model, train_dataset, val_dataset, config)
```

### Fine-tuning Tips

1. **Mixed Precision Training**: The model is designed for mixed precision. Use `torch.cuda.amp` for faster training:

   ```python
   from torch.cuda.amp import GradScaler, autocast

   scaler = GradScaler()

   with autocast():
       loss = train_step(model, batch, criterion, device)

   scaler.scale(loss).backward()
   scaler.step(optimizer)
   scaler.update()
   ```

2. **Gradient Checkpointing**: For memory-efficient training:

   ```python
   model.backbone.eagle_model.language_model.gradient_checkpointing_enable()
   ```

3. **Frozen Backbone Training**: Start by training only the action head:

   ```python
   # Freeze the backbone
   for param in model.backbone.parameters():
       param.requires_grad = False

   # Train only the action head
   optimizer = torch.optim.AdamW(
       model.action_head.parameters(),
       lr=1e-4
   )
   ```

## Model Architecture

The model consists of two main components:

### 1. Vision-Language Backbone (System 2)
- **Vision Encoder**: Based on the Eagle vision model, with 27 transformer layers
- **Language Model**: Qwen3-based LLM with 12 layers and a hidden dimension of 2048
- **Cross-modal Fusion**: MLP connector between vision and language

### 2. Action Head (System 1)
- **Diffusion Transformer**: 16 DiT blocks for action generation
- **State Encoder**: Processes robot proprioception
- **Action Decoder**: Outputs continuous robot actions
- **Self-Attention Blocks**: 4 transformer blocks for vision-language features
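
If you want to check these numbers against the actual checkpoint, you can walk the module tree. A minimal sketch, assuming the submodule paths used elsewhere in this card (`backbone.eagle_model.vision_model`, `backbone.eagle_model.language_model`, `action_head`) and matching layers by class-name substrings that you may need to adjust:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("DolphinGR00T-N1.5-3B-Zero", trust_remote_code=True)

def count_modules(module, name_fragment):
    """Count submodules whose class name contains the given fragment (case-insensitive)."""
    return sum(1 for m in module.modules() if name_fragment.lower() in type(m).__name__.lower())

# The class-name fragments below are guesses - print the module tree to find the real ones.
print("LLM decoder layers:", count_modules(model.backbone.eagle_model.language_model, "DecoderLayer"))
print("Vision encoder layers:", count_modules(model.backbone.eagle_model.vision_model, "EncoderLayer"))
print("Action-head DiT blocks:", count_modules(model.action_head, "DiT"))

# Or inspect the top two levels of the layout directly.
for name, child in model.named_children():
    print(name, "->", [n for n, _ in child.named_children()])
```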
|

## Limitations

- This is a **blank model** with random weights - it requires training before use
- No pre-trained knowledge or capabilities
- Designed for humanoid robots, but it can be adapted to other embodiments
- Requires significant computational resources to train

## Citation

If you use this model in your research, please cite:

```bibtex
@software{DolphinGR00T2024,
  title={DolphinGR00T-N1.5-3B-Zero: A Permissively Licensed Reimplementation of GR00T-N1.5-3B},
  author={Eric Hartford},
  year={2024},
  license={Apache-2.0}
}
```

## License

Apache-2.0 - This model is fully open source, with no restrictions.

## Acknowledgments

This is an independent implementation of the GR00T architecture for the open-source community. The architecture is based on publicly available information about NVIDIA's GR00T-N1.5 model, but it contains no proprietary code or weights.