ehartford committed on
Commit fbc91b3 · verified · 1 Parent(s): ff303bb

Update README.md

Files changed (1):
  1. README.md +406 -1
README.md CHANGED
@@ -1 +1,406 @@
- Random GR00T-N1.5-3B | backbone bf16 | action_head fp32 | Apache-2.0
---
license: apache-2.0
---
# OpenGR00T-N1.5-3B-Zero

A fully open-source, randomly initialized version of the GR00T-N1.5-3B architecture for humanoid robot control. This model has exactly the same architecture as NVIDIA's GR00T-N1.5-3B, but with random weights and Apache-2.0 licensing.

## Model Description

OpenGR00T-N1.5-3B-Zero is a Vision-Language-Action (VLA) model designed for humanoid robot control:

- **Architecture**: Dual-system design with a vision-language backbone (Eagle-based with a Qwen3 LLM) and a diffusion transformer action head
- **Parameters**: 2,724M total (1,655M backbone in bfloat16, 1,069M action head in float32) - see the parameter-count sketch below
- **License**: Apache-2.0 (fully open source)
- **Weights**: Randomly initialized - no pre-training; ready for your own training

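If you want to verify the parameter split above on your own copy, you can count parameters per component once the model is loaded. A minimal sketch, assuming the `model.backbone` / `model.action_head` attribute names used in the code examples below:

```python
import torch

def millions(module: torch.nn.Module) -> float:
    """Parameter count of a module, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

# After loading (see "Loading the Model" below):
# print(f"backbone:    {millions(model.backbone):.0f}M")     # expected ~1,655M
# print(f"action_head: {millions(model.action_head):.0f}M")  # expected ~1,069M
# print(f"total:       {millions(model):.0f}M")              # expected ~2,724M
```
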
## Key Features

- ✅ **Exact architecture match** with NVIDIA GR00T-N1.5-3B
- ✅ **No license restrictions** - Apache-2.0 throughout
- ✅ **Mixed precision ready** - bfloat16 backbone, float32 action head
- ✅ **Multi-modal inputs** - images, language instructions, and robot proprioception
- ✅ **Continuous action output** via diffusion transformer

## Installation

```bash
pip install torch transformers safetensors
```

## Usage

### Loading the Model

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained(
    "OpenGR00T-N1.5-3B-Zero",
    trust_remote_code=True,
    torch_dtype="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("OpenGR00T-N1.5-3B-Zero")

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```
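
The published split keeps the backbone in bfloat16 and the action head in float32. After loading you can confirm, or re-apply, that split; a short sketch, again assuming the `backbone` / `action_head` attribute names used throughout this card:

```python
# Re-apply the documented precision split (sketch; attribute names assumed)
model.backbone = model.backbone.to(torch.bfloat16)
model.action_head = model.action_head.to(torch.float32)

# Confirm which dtypes each component actually holds
for name, part in [("backbone", model.backbone), ("action_head", model.action_head)]:
    print(name, {p.dtype for p in part.parameters()})
```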

### Inference Example

```python
import torch
from PIL import Image
import numpy as np

def prepare_image(image_path, target_size=(224, 224)):
    """Prepare an image for model input"""
    image = Image.open(image_path).convert('RGB')
    image = image.resize(target_size)
    # Normalize to [-1, 1]
    image = np.array(image).astype(np.float32) / 127.5 - 1.0
    image = torch.from_numpy(image).permute(2, 0, 1)
    return image

def inference(model, tokenizer, image_paths, instruction, robot_state, device):
    """
    Run inference to generate robot actions

    Args:
        image_paths: List of paths to camera images
        instruction: Natural language instruction
        robot_state: Current robot proprioception (joint angles, etc.)
        device: torch device

    Returns:
        actions: Predicted robot actions
    """
    model.eval()

    with torch.no_grad():
        # Prepare inputs
        images = torch.stack([prepare_image(path) for path in image_paths])
        images = images.unsqueeze(0).to(device)  # Add batch dimension

        # Tokenize instruction
        text_inputs = tokenizer(
            instruction,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=256
        ).to(device)

        # Robot state (example: 32-dim joint angles)
        if isinstance(robot_state, list):
            robot_state = torch.tensor(robot_state, dtype=torch.float32)
        robot_state = robot_state.unsqueeze(0).to(device)

        # Forward pass through backbone
        # Note: This is a simplified example - actual implementation depends on model interface
        vision_features = model.backbone.eagle_model.vision_model(images)

        # Process language
        language_features = model.backbone.eagle_model.language_model.model(
            input_ids=text_inputs.input_ids,
            attention_mask=text_inputs.attention_mask
        ).last_hidden_state

        # Combine features (simplified - actual fusion may be more complex)
        combined_features = torch.cat([
            vision_features.mean(dim=1),   # Pool vision features
            language_features.mean(dim=1)  # Pool language features
        ], dim=-1)

        # Generate actions through diffusion process
        # This is a simplified placeholder - actual diffusion requires multiple steps
        action_features = model.action_head.model(
            combined_features,
            timesteps=torch.zeros(1, device=device),
            context=robot_state
        )

        # Decode to action space
        actions = model.action_head.action_decoder(action_features)

        return actions

# Example usage
image_paths = ["camera1.jpg", "camera2.jpg"]
instruction = "Pick up the red cube and place it on the table"
robot_state = torch.randn(32)  # Example: 32 joint angles

actions = inference(model, tokenizer, image_paths, instruction, robot_state, device)
print(f"Predicted actions shape: {actions.shape}")
```
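
As the comments above note, the single call to the action head is only a placeholder for an iterative denoising procedure. Purely as an illustration of what a multi-step sampler looks like, here is a rough sketch; `denoise_fn`, the step count, and the update rule are hypothetical and not the model's actual API:

```python
import torch

def sample_actions(denoise_fn, context, action_dim=32, num_steps=10, device="cpu"):
    """Illustrative multi-step sampler: start from Gaussian noise and
    repeatedly refine it with a denoising network.

    `denoise_fn(noisy_actions, t, context)` is a hypothetical wrapper around
    the action head; adapt it to the real interface.
    """
    actions = torch.randn(1, action_dim, device=device)  # start from pure noise
    for step in reversed(range(num_steps)):
        t = torch.full((1,), step / num_steps, device=device)
        # Ask the network for its estimate of the clean actions at this noise level
        predicted = denoise_fn(actions, t, context)
        # Move part of the way toward that estimate; a real sampler would apply the
        # scheduler's update rule (DDPM/DDIM/flow matching) here instead.
        actions = actions + (predicted - actions) / (step + 1)
    return actions
```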

### Training Example

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import get_linear_schedule_with_warmup

class RobotDataset(Dataset):
    """Example dataset for robot manipulation tasks"""
    def __init__(self, data_path, tokenizer, transform=None):
        self.data = []  # Load your data here
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return dict with keys: images, instruction, robot_state, target_actions
        sample = self.data[idx]

        # Process images
        images = torch.stack([self.transform(img) for img in sample['images']])

        # Tokenize instruction
        text = self.tokenizer(
            sample['instruction'],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=256
        )

        return {
            'images': images,
            'input_ids': text['input_ids'].squeeze(),
            'attention_mask': text['attention_mask'].squeeze(),
            'robot_state': torch.tensor(sample['robot_state'], dtype=torch.float32),
            'target_actions': torch.tensor(sample['target_actions'], dtype=torch.float32)
        }

def train_step(model, batch, criterion, device):
    """Single training step"""
    # Move batch to device
    images = batch['images'].to(device)
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    robot_state = batch['robot_state'].to(device)
    target_actions = batch['target_actions'].to(device)

    # Forward pass (simplified - actual implementation may differ)
    # Process vision
    vision_features = model.backbone.eagle_model.vision_model(images)

    # Process language
    language_output = model.backbone.eagle_model.language_model.model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    language_features = language_output.last_hidden_state

    # Combine modalities
    combined_features = torch.cat([
        vision_features.mean(dim=1),
        language_features.mean(dim=1)
    ], dim=-1)

    # Generate actions (simplified diffusion)
    predicted_actions = model.action_head(
        combined_features,
        context=robot_state
    )

    # Compute loss
    loss = criterion(predicted_actions, target_actions)

    return loss

# Training setup
def train_model(model, train_dataset, val_dataset, config):
    """Main training loop"""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=config['batch_size'],
        shuffle=True,
        num_workers=4
    )

    val_loader = DataLoader(
        val_dataset,
        batch_size=config['batch_size'],
        shuffle=False,
        num_workers=4
    )

    # Setup optimizer with different learning rates for backbone and action head
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.parameters(), 'lr': config['backbone_lr']},
        {'params': model.action_head.parameters(), 'lr': config['action_head_lr']}
    ], weight_decay=config['weight_decay'])

    # Learning rate scheduler
    num_training_steps = len(train_loader) * config['num_epochs']
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config['warmup_steps'],
        num_training_steps=num_training_steps
    )

    # Loss function
    criterion = nn.MSELoss()  # or nn.L1Loss() for action prediction

    # Training loop
    for epoch in range(config['num_epochs']):
        model.train()
        total_loss = 0

        for batch_idx, batch in enumerate(train_loader):
            optimizer.zero_grad()

            loss = train_step(model, batch, criterion, device)

            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                model.parameters(),
                config['max_grad_norm']
            )

            optimizer.step()
            scheduler.step()

            total_loss += loss.item()

            if batch_idx % config['log_interval'] == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                loss = train_step(model, batch, criterion, device)
                val_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        avg_val_loss = val_loss / len(val_loader)

        print(f"Epoch {epoch}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        # Save checkpoint
        if (epoch + 1) % config['save_interval'] == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'train_loss': avg_train_loss,
                'val_loss': avg_val_loss,
            }, f"checkpoint_epoch_{epoch+1}.pt")

# Example configuration
config = {
    'batch_size': 16,
    'num_epochs': 100,
    'backbone_lr': 1e-5,
    'action_head_lr': 1e-4,
    'weight_decay': 0.01,
    'warmup_steps': 1000,
    'max_grad_norm': 1.0,
    'log_interval': 10,
    'save_interval': 10
}

# Create datasets (you need to implement data loading)
# train_dataset = RobotDataset("path/to/train/data", tokenizer)
# val_dataset = RobotDataset("path/to/val/data", tokenizer)

# Train model
# train_model(model, train_dataset, val_dataset, config)
```
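
The `train_step` above treats the action head as a direct regressor onto the target actions. If you instead train it as a denoising model, which is what a diffusion transformer head is built for, the usual recipe is to corrupt the target actions and regress the added noise. A rough sketch of that objective, with a hypothetical `action_head_forward` wrapper standing in for the real call signature:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(action_head_forward, combined_features, robot_state, target_actions):
    """Illustrative denoising objective: noise the ground-truth actions at a
    random timestep and train the network to predict the added noise.

    `action_head_forward(noisy_actions, t, features, state)` is hypothetical;
    adapt it to the action head's real signature.
    """
    batch_size = target_actions.shape[0]
    # Sample a random noise level in [0, 1) for each example
    t = torch.rand(batch_size, device=target_actions.device)
    noise = torch.randn_like(target_actions)
    # Linear interpolation between clean actions and pure noise
    noisy_actions = (1 - t[:, None]) * target_actions + t[:, None] * noise
    predicted_noise = action_head_forward(noisy_actions, t, combined_features, robot_state)
    return F.mse_loss(predicted_noise, noise)
```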

### Fine-tuning Tips

1. **Mixed Precision Training**: The model is designed for mixed precision. Use `torch.cuda.amp` for faster training:

   ```python
   from torch.cuda.amp import GradScaler, autocast

   scaler = GradScaler()

   with autocast():
       loss = train_step(model, batch, criterion, device)

   scaler.scale(loss).backward()
   scaler.step(optimizer)
   scaler.update()
   ```

2. **Gradient Checkpointing**: For memory-efficient training:

   ```python
   model.backbone.eagle_model.language_model.gradient_checkpointing_enable()
   ```

3. **Frozen Backbone Training**: Start by training only the action head:

   ```python
   # Freeze the backbone
   for param in model.backbone.parameters():
       param.requires_grad = False

   # Train only the action head
   optimizer = torch.optim.AdamW(
       model.action_head.parameters(),
       lr=1e-4
   )
   ```

## Model Architecture

The model consists of two main components (a short inspection sketch follows the lists below):

### 1. Vision-Language Backbone (System 2)
- **Vision Encoder**: Based on the Eagle vision model with 27 transformer layers
- **Language Model**: Qwen3-based LLM with 12 layers and a hidden dimension of 2048
- **Cross-modal Fusion**: MLP connector between vision and language

### 2. Action Head (System 1)
- **Diffusion Transformer**: 16 DiT blocks for action generation
- **State Encoder**: Processes robot proprioception
- **Action Decoder**: Outputs continuous robot actions
- **Self-Attention Blocks**: 4 transformer blocks for vision-language features

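To see how these components map onto the actual module tree, you can walk the named children of a loaded model. A small sketch using only standard PyTorch; the attribute layout follows the paths used in the usage examples above and may differ in your checkout:

```python
def summarize(module, prefix="", depth=2):
    """Print named child modules with their class and parameter count (in millions)."""
    for name, child in module.named_children():
        n_params = sum(p.numel() for p in child.parameters()) / 1e6
        print(f"{prefix}{name}: {type(child).__name__} ({n_params:.1f}M params)")
        if depth > 1:
            summarize(child, prefix + "  ", depth - 1)

# After loading the model (see "Loading the Model" above):
# summarize(model)
```
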
## Limitations

- This is a **blank model** with random weights - it requires training before use
- No pre-trained knowledge or capabilities
- Designed for humanoid robots but can be adapted to other embodiments
- Requires significant computational resources to train

## Citation

If you use this model in your research, please cite:

```bibtex
@software{opengr00t2024,
  title={OpenGR00T-N1.5-3B-Zero: Open Source Blank GR00T Architecture},
  author={Community Contributors},
  year={2024},
  license={Apache-2.0}
}
```

## License

Apache-2.0 - this model is fully open source, with no usage restrictions beyond the Apache-2.0 terms.

## Acknowledgments

This is an independent implementation of the GR00T architecture for the open-source community. The architecture is based on publicly available information about NVIDIA's GR00T-N1.5 model, but it contains no proprietary code or weights.