Update README.md
README.md (CHANGED)
@@ -85,17 +85,6 @@ Hugging Face 08/18/2025 via [https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano
## Model design

The model was trained on 20T tokens with a batch size of 736, using the Warmup-Stable-Decay (WSD) learning rate schedule with 8B tokens of learning-rate warmup, a peak learning rate of 4.5e-4, and a minimum learning rate of 4.5e-6. There are 62 layers in total: 28 MLP layers and 28 Mamba-2 layers, with the remaining layers using GQA with 8 groups. A sketch of the WSD schedule appears below.

- ## Computational load
- Cumulative compute: 1.45E+24 FLOPs
- Estimated energy consumption for model training: 708.3 MWh
- | | # of tokens | Compute [FLOPs] | Energy [MWh] |
- | :---- | :---- | :---- | :---- |
- | 12B Base Pre-training | 20T | 1.45E+24 | 708.3 |

## Input
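
For reference, here is a minimal sketch of the Warmup-Stable-Decay (WSD) schedule described under "Model design" above, using the values quoted there (8B warmup tokens, peak learning rate 4.5e-4, minimum 4.5e-6, 20T total tokens). The README does not specify the length or shape of the decay phase, so the linear decay over the final 10% of tokens is an illustrative assumption, not the actual training configuration.

```python
# Sketch of a Warmup-Stable-Decay (WSD) schedule with the values quoted above.
# DECAY_FRACTION and the linear decay shape are assumptions for illustration;
# the README does not state them.

PEAK_LR = 4.5e-4
MIN_LR = 4.5e-6
WARMUP_TOKENS = 8e9
TOTAL_TOKENS = 20e12
DECAY_FRACTION = 0.1  # assumed: decay over the last 10% of training tokens


def wsd_lr(tokens_seen: float) -> float:
    """Learning rate after `tokens_seen` training tokens."""
    decay_start = TOTAL_TOKENS * (1.0 - DECAY_FRACTION)
    if tokens_seen < WARMUP_TOKENS:
        # Warmup: linear ramp from 0 to the peak learning rate over 8B tokens.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    if tokens_seen < decay_start:
        # Stable phase: hold the peak learning rate.
        return PEAK_LR
    # Decay phase: linear ramp from the peak down to the minimum learning rate.
    frac = min((tokens_seen - decay_start) / (TOTAL_TOKENS - decay_start), 1.0)
    return PEAK_LR + (MIN_LR - PEAK_LR) * frac


if __name__ == "__main__":
    for t in (4e9, 1e12, 19e12, 20e12):
        print(f"{t:.2e} tokens -> lr = {wsd_lr(t):.2e}")
```

Note that the stated layer counts imply 62 - 28 - 28 = 6 GQA layers with 8 groups.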
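
The figures in the removed "Computational load" table are roughly consistent with the common 6 × parameters × tokens approximation for dense training FLOPs. The check below is a sketch under that assumption; the README does not say how the reported number was derived, and the hybrid Mamba-2/attention stack makes 6·N·D only a rough estimate.

```python
# Cross-check the removed "Computational load" figures against the common
# 6 * parameters * tokens approximation (an assumption; the README does not
# state how the reported compute was calculated).

params = 12e9    # 12B parameters
tokens = 20e12   # 20T pre-training tokens

approx_flops = 6 * params * tokens
print(f"6*N*D estimate : {approx_flops:.2e} FLOPs")   # ~1.44e+24
print("Reported value : 1.45e+24 FLOPs")

# Implied average efficiency from the reported 708.3 MWh energy figure.
energy_joules = 708.3e6 * 3600  # MWh -> Wh -> J
print(f"Implied efficiency: {1.45e24 / energy_joules:.2e} FLOPs per joule")
```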