QuixiAI/QuixiGR00T-N1.5-3B-Zero

Safetensors

gr00t_n1_5

ehartford committed on Jun 27

Commit fbc91b3 (verified) · 1 Parent(s): ff303bb

Update README.md

README.md +406 -1

	@@ -1 +1,406 @@
-Random GR00T-N1.5-3B | backbone bf16 | action_head fp32 | Apache-2.0

+---
+license: apache-2.0
+---
+# OpenGR00T-N1.5-3B-Zero
+A fully open-source, randomly initialized version of the GR00T-N1.5-3B architecture for humanoid robot control. This model has the exact same architecture as NVIDIA's GR00T-N1.5-3B but with random weights and Apache-2.0 licensing.
+## Model Description
+OpenGR00T-N1.5-3B-Zero is a Vision-Language-Action (VLA) model designed for humanoid robot control:
+- **Architecture**: Dual-system design with vision-language backbone (Eagle-based with Qwen3 LLM) and diffusion transformer action head
+- **Parameters**: 2,724M total (1,655M backbone in bfloat16, 1,069M action head in float32)
+- **License**: Apache-2.0 (fully open source)
+- **Weights**: Randomly initialized - no pre-training, ready for your own training
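The parameter counts and dtypes above imply a rough lower bound on the memory needed just to hold the weights. The following is a minimal sketch of that arithmetic; it uses only the figures quoted in the list, and the helper name `weight_bytes` is made up for illustration. Activations, gradients, and optimizer state are not included.

```python
# Rough weight-memory estimate from the parameter counts listed above.
# Only the weights are counted; training needs considerably more memory.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}


def weight_bytes(num_params: int, dtype: str) -> int:
    """Bytes needed to store `num_params` values of the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype]


backbone_bytes = weight_bytes(1_655_000_000, "bfloat16")   # 1,655M params in bf16
action_head_bytes = weight_bytes(1_069_000_000, "float32")  # 1,069M params in fp32
total_bytes = backbone_bytes + action_head_bytes

print(f"backbone:    {backbone_bytes / 1e9:.2f} GB")
print(f"action head: {action_head_bytes / 1e9:.2f} GB")
print(f"total:       {total_bytes / 1e9:.2f} GB")
```

Although the action head has fewer parameters than the backbone, it takes more memory because it is stored in float32. The total comes to roughly 7.6 GB for the weights alone.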
+## Key Features
+- ✅ **Exact architecture match** with NVIDIA GR00T-N1.5-3B
+- ✅ **Permissive licensing** - Apache-2.0 throughout
+- ✅ **Mixed precision ready** - bfloat16 backbone, float32 action head
+- ✅ **Multi-modal inputs** - images, language instructions, and robot proprioception
+- ✅ **Continuous action output** via diffusion transformer
+## Installation
+```bash
+pip install torch transformers safetensors
+```
+## Usage
+### Loading the Model
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
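+# Load model
+# Note: from_pretrained expects a local directory or a full Hub repo id
+# (for example "namespace/name"); adjust the string below to where the weights live
+model = AutoModel.from_pretrained(
+    "OpenGR00T-N1.5-3B-Zero",
+    trust_remote_code=True,
+    torch_dtype="auto"
+)
+# Load tokenizer (use the same path or repo id as for the model)
+tokenizer = AutoTokenizer.from_pretrained("OpenGR00T-N1.5-3B-Zero")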
+# Move to GPU if available
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = model.to(device)
+```
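After loading, it can be useful to confirm that the precision split described above (bfloat16 backbone, float32 action head) survived the load. The sketch below is one way to check this; `params_by_dtype` and `ToyVLA` are hypothetical names, and the toy module only stands in for the real model so the snippet runs without downloading weights.

```python
from collections import defaultdict

import torch
import torch.nn as nn


def params_by_dtype(module: nn.Module) -> dict:
    """Return {dtype: number of parameters stored in that dtype}."""
    counts = defaultdict(int)
    for p in module.parameters():
        counts[p.dtype] += p.numel()
    return dict(counts)


class ToyVLA(nn.Module):
    """Tiny stand-in with the same two top-level parts (not the real model)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8).to(torch.bfloat16)  # bf16 part
        self.action_head = nn.Linear(8, 4)                  # fp32 part


toy = ToyVLA()
backbone_counts = params_by_dtype(toy.backbone)
head_counts = params_by_dtype(toy.action_head)
print(backbone_counts, head_counts)
```

On the loaded model, the same helper could be called as `params_by_dtype(model.backbone)` and `params_by_dtype(model.action_head)`, assuming the model exposes those two attributes as the examples in this README do.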
+### Inference Example
+```python
+import torch
+from PIL import Image
+import numpy as np
+def prepare_image(image_path, target_size=(224, 224)):
+    """Prepare image for model input"""
+    image = Image.open(image_path).convert('RGB')
+    image = image.resize(target_size)
+    # Normalize to [-1, 1]
+    image = np.array(image).astype(np.float32) / 127.5 - 1.0
+    image = torch.from_numpy(image).permute(2, 0, 1)
+    return image
+def inference(model, tokenizer, image_paths, instruction, robot_state, device):
+    """
+    Run inference to generate robot actions
+    Args:
+        image_paths: List of paths to camera images
+        instruction: Natural language instruction
+        robot_state: Current robot proprioception (joint angles, etc.)
+        device: torch device
+    Returns:
+        actions: Predicted robot actions
+    """
+    model.eval()
+    with torch.no_grad():
+        # Prepare inputs
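+        images = torch.stack([prepare_image(path) for path in image_paths])
+        # Add batch dimension -> [1, num_cameras, 3, H, W]. Depending on the vision encoder,
+        # the views may need flattening to [num_cameras, 3, H, W] and casting to the backbone dtype
+        images = images.unsqueeze(0).to(device)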
+        # Tokenize instruction
+        text_inputs = tokenizer(
+            instruction,
+            return_tensors="pt",
+            padding=True,
+            truncation=True,
+            max_length=256
+        ).to(device)
+        # Robot state (example: 32-dim joint angles)
+        if isinstance(robot_state, list):
+            robot_state = torch.tensor(robot_state, dtype=torch.float32)
+        robot_state = robot_state.unsqueeze(0).to(device)
+        # Forward pass through backbone
+        # Note: This is a simplified example - actual implementation depends on model interface
+        vision_features = model.backbone.eagle_model.vision_model(images)
+        # Process language
+        language_features = model.backbone.eagle_model.language_model.model(
+            input_ids=text_inputs.input_ids,
+            attention_mask=text_inputs.attention_mask
+        ).last_hidden_state
+        # Combine features (simplified - actual fusion may be more complex)
+        combined_features = torch.cat([
+            vision_features.mean(dim=1),  # Pool vision features
+            language_features.mean(dim=1)  # Pool language features
+        ], dim=-1)
+        # Generate actions through diffusion process
+        # This is a simplified placeholder - actual diffusion requires multiple steps
+        action_features = model.action_head.model(
+            combined_features,
+            timesteps=torch.zeros(1, device=device),
+            context=robot_state
+        )
+        # Decode to action space
+        actions = model.action_head.action_decoder(action_features)
+    return actions
+# Example usage
+image_paths = ["camera1.jpg", "camera2.jpg"]
+instruction = "Pick up the red cube and place it on the table"
+robot_state = torch.randn(32)  # Example: 32 joint angles
+actions = inference(model, tokenizer, image_paths, instruction, robot_state, device)
+print(f"Predicted actions shape: {actions.shape}")
+```
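The inference example above calls the action head once and notes that real diffusion sampling takes multiple steps. The sketch below shows one generic form of such a loop: explicit Euler integration of a velocity field from noise toward actions, as used by flow-matching style samplers. This is an illustration under stated assumptions, not the model's confirmed interface: `euler_sample`, `toy_velocity`, the step count, and the `(batch, horizon, action_dim)` shape are all hypothetical, and the toy velocity replaces the learned action head.

```python
import torch


def euler_sample(velocity_fn, shape, num_steps=4, generator=None):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (actions)."""
    x = torch.randn(shape, generator=generator)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)  # one timestep value per batch element
        x = x + dt * velocity_fn(x, t)       # explicit Euler update
    return x


g = torch.Generator().manual_seed(0)
# Hypothetical target: batch of 2, horizon of 16 steps, 32-dim actions
target = torch.randn(2, 16, 32, generator=g)


def toy_velocity(x, t):
    # Velocity of a straight-line path that reaches `target` at t=1:
    # (target - x) / (1 - t). A trained network would predict this instead.
    return (target - x) / (1.0 - t).view(-1, 1, 1)


actions = euler_sample(toy_velocity, target.shape, num_steps=4, generator=g)
print(actions.shape)
```

With this toy velocity the loop lands on `target` up to floating-point error, which makes the sampler easy to sanity-check. In a real setup, `velocity_fn` would wrap the action head and be conditioned on the vision-language features and robot state.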
+### Training Example
+```python
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader, Dataset
+from transformers import get_linear_schedule_with_warmup
+class RobotDataset(Dataset):
+    """Example dataset for robot manipulation tasks"""
+    def __init__(self, data_path, tokenizer, transform=None):
+        self.data = []  # Load your data here
+        self.tokenizer = tokenizer
+        self.transform = transform
+    def __len__(self):
+        return len(self.data)
+    def __getitem__(self, idx):
+        # Return dict with keys: images, instruction, robot_state, target_actions
+        sample = self.data[idx]
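+        # Process images (the transform is skipped when none is given,
+        # in which case the stored images must already be tensors)
+        frames = sample['images']
+        if self.transform is not None:
+            frames = [self.transform(img) for img in frames]
+        images = torch.stack(list(frames))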
+        # Tokenize instruction
+        text = self.tokenizer(
+            sample['instruction'],
+            return_tensors="pt",
+            padding="max_length",
+            truncation=True,
+            max_length=256
+        )
+        return {
+            'images': images,
+            'input_ids': text['input_ids'].squeeze(0),  # drop only the tokenizer's batch dim
+            'attention_mask': text['attention_mask'].squeeze(0),
+            'robot_state': torch.tensor(sample['robot_state'], dtype=torch.float32),
+            'target_actions': torch.tensor(sample['target_actions'], dtype=torch.float32)
+        }
+def train_step(model, batch, criterion, device):
+    """Single training step"""
+    # Move batch to device
+    images = batch['images'].to(device)
+    input_ids = batch['input_ids'].to(device)
+    attention_mask = batch['attention_mask'].to(device)
+    robot_state = batch['robot_state'].to(device)
+    target_actions = batch['target_actions'].to(device)
+    # Forward pass (simplified - actual implementation may differ)
+    # Process vision
+    vision_features = model.backbone.eagle_model.vision_model(images)
+    # Process language
+    language_output = model.backbone.eagle_model.language_model.model(
+        input_ids=input_ids,
+        attention_mask=attention_mask
+    )
+    language_features = language_output.last_hidden_state
+    # Combine modalities
+    combined_features = torch.cat([
+        vision_features.mean(dim=1),
+        language_features.mean(dim=1)
+    ], dim=-1)
+    # Generate actions (simplified diffusion)
+    predicted_actions = model.action_head(
+        combined_features,
+        context=robot_state
+    )
+    # Compute loss
+    loss = criterion(predicted_actions, target_actions)
+    return loss
+# Training setup
+def train_model(model, train_dataset, val_dataset, config):
+    """Main training loop"""
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    model = model.to(device)
+    # Create dataloaders
+    train_loader = DataLoader(
+        train_dataset,
+        batch_size=config['batch_size'],
+        shuffle=True,
+        num_workers=4
+    )
+    val_loader = DataLoader(
+        val_dataset,
+        batch_size=config['batch_size'],
+        shuffle=False,
+        num_workers=4
+    )
+    # Setup optimizer with different learning rates for backbone and action head
+    optimizer = torch.optim.AdamW([
+        {'params': model.backbone.parameters(), 'lr': config['backbone_lr']},
+        {'params': model.action_head.parameters(), 'lr': config['action_head_lr']}
+    ], weight_decay=config['weight_decay'])
+    # Learning rate scheduler
+    num_training_steps = len(train_loader) * config['num_epochs']
+    scheduler = get_linear_schedule_with_warmup(
+        optimizer,
+        num_warmup_steps=config['warmup_steps'],
+        num_training_steps=num_training_steps
+    )
+    # Loss function
+    criterion = nn.MSELoss()  # or nn.L1Loss() for action prediction
+    # Training loop
+    for epoch in range(config['num_epochs']):
+        model.train()
+        total_loss = 0
+        for batch_idx, batch in enumerate(train_loader):
+            optimizer.zero_grad()
+            loss = train_step(model, batch, criterion, device)
+            loss.backward()
+            # Gradient clipping
+            torch.nn.utils.clip_grad_norm_(
+                model.parameters(),
+                config['max_grad_norm']
+            )
+            optimizer.step()
+            scheduler.step()
+            total_loss += loss.item()
+            if batch_idx % config['log_interval'] == 0:
+                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
+        # Validation
+        model.eval()
+        val_loss = 0
+        with torch.no_grad():
+            for batch in val_loader:
+                loss = train_step(model, batch, criterion, device)
+                val_loss += loss.item()
+        avg_train_loss = total_loss / len(train_loader)
+        avg_val_loss = val_loss / len(val_loader)
+        print(f"Epoch {epoch}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")
+        # Save checkpoint
+        if (epoch + 1) % config['save_interval'] == 0:
+            torch.save({
+                'epoch': epoch,
+                'model_state_dict': model.state_dict(),
+                'optimizer_state_dict': optimizer.state_dict(),
+                'scheduler_state_dict': scheduler.state_dict(),
+                'train_loss': avg_train_loss,
+                'val_loss': avg_val_loss,
+            }, f"checkpoint_epoch_{epoch+1}.pt")
+# Example configuration
+config = {
+    'batch_size': 16,
+    'num_epochs': 100,
+    'backbone_lr': 1e-5,
+    'action_head_lr': 1e-4,
+    'weight_decay': 0.01,
+    'warmup_steps': 1000,
+    'max_grad_norm': 1.0,
+    'log_interval': 10,
+    'save_interval': 10
+}
+# Create dataset (you need to implement data loading)
+# train_dataset = RobotDataset("path/to/train/data", tokenizer)
+# val_dataset = RobotDataset("path/to/val/data", tokenizer)
+# Train model
+# train_model(model, train_dataset, val_dataset, config)
+```
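The training loop above relies on `get_linear_schedule_with_warmup`, which scales each parameter group's base learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The sketch below reimplements that factor in plain Python to make the shape concrete. `linear_warmup_decay` is a name chosen here, and the total of 10,000 steps is an assumed value for illustration; in the loop it is `len(train_loader) * config['num_epochs']`.

```python
def linear_warmup_decay(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 over the warmup phase
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


warmup_steps = 1000    # matches config['warmup_steps'] above
total_steps = 10_000   # assumed value for illustration
base_lr = 1e-4         # matches config['action_head_lr'] above

for step in (0, 500, 1000, 5500, 10_000):
    factor = linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:>6}: factor {factor:.2f}, action head lr {base_lr * factor:.2e}")
```

One consequence worth checking before training: if the dataset is small enough that the total step count is below `warmup_steps`, the learning rate never reaches its configured peak.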
+### Fine-tuning Tips
+1. **Mixed Precision Training**: The model is designed for mixed precision. Use `torch.amp` (the successor to the deprecated `torch.cuda.amp`) for faster training:
+```python
+from torch.amp import GradScaler, autocast
+scaler = GradScaler("cuda")
+with autocast("cuda"):
+    loss = train_step(model, batch, criterion, device)
+scaler.scale(loss).backward()
+scaler.step(optimizer)
+scaler.update()
+```
+2. **Gradient Checkpointing**: For memory-efficient training:
+```python
+model.backbone.eagle_model.language_model.gradient_checkpointing_enable()
+```
+3. **Frozen Backbone Training**: Start by training only the action head:
+```python
+# Freeze backbone
+for param in model.backbone.parameters():
+    param.requires_grad = False
+# Train only action head
+optimizer = torch.optim.AdamW(
+    model.action_head.parameters(),
+    lr=1e-4
+)
+```
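After freezing the backbone, a quick count of trainable versus frozen parameters confirms that only the action head will receive updates. The sketch below uses a tiny stand-in module so it runs on its own; `count_parameters` and `ToyPolicy` are hypothetical names, not part of the model's code.

```python
import torch.nn as nn


def count_parameters(module: nn.Module) -> tuple:
    """Return (trainable, frozen) parameter counts for a module."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in module.parameters() if not p.requires_grad)
    return trainable, frozen


class ToyPolicy(nn.Module):
    """Tiny stand-in with a backbone and an action head (not the real model)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.action_head = nn.Linear(8, 4)


toy = ToyPolicy()
for param in toy.backbone.parameters():
    param.requires_grad = False  # same freezing pattern as in the tip above

trainable, frozen = count_parameters(toy)
print(f"trainable: {trainable}, frozen: {frozen}")
```

Applied to the real model, the trainable count should match the action head's size (about 1,069M according to the figures in this README) if the freeze took effect.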
+## Model Architecture
+The model consists of two main components:
+### 1. Vision-Language Backbone (System 2)
+- **Vision Encoder**: Based on Eagle vision model with 27 transformer layers
+- **Language Model**: Qwen3-based LLM with 12 layers, 2048 hidden dim
+- **Cross-modal Fusion**: MLP connector between vision and language
+### 2. Action Head (System 1)
+- **Diffusion Transformer**: 16 DiT blocks for action generation
+- **State Encoder**: Processes robot proprioception
+- **Action Decoder**: Outputs continuous robot actions
+- **Self-Attention Blocks**: 4 transformer blocks for vision-language features
+## Limitations
+- This is a **blank model** with random weights - it requires training before use
+- No pre-trained knowledge or capabilities
+- Designed for humanoid robots but can be adapted for other embodiments
+- Requires significant computational resources for training
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@software{opengr00t2024,
+  title={OpenGR00T-N1.5-3B-Zero: Open Source Blank GR00T Architecture},
+  author={Community Contributors},
+  year={2024},
+  license={Apache-2.0}
+}
+```
+## License
+Apache-2.0. This model is fully open source; use and redistribution are subject only to the terms of the Apache License 2.0.
+## Acknowledgments
+This is an independent implementation of the GR00T architecture for the open-source community. The architecture is based on publicly available information about NVIDIA's GR00T-N1.5 model, but contains no proprietary code or weights.