Update README.md
README.md (CHANGED)
@@ -85,17 +85,6 @@ Hugging Face 08/18/2025 via [https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano
## Model design

The model was trained on 20T tokens with a batch size of 736, using the Warmup-Stable-Decay (WSD) learning rate schedule with 8B tokens of learning-rate warmup, a peak learning rate of 4.5e-4, and a minimum learning rate of 4.5e-6. There are 62 layers in total: 28 MLP layers and 28 Mamba-2 layers, with the remaining layers using GQA with 8 groups. A sketch of the WSD schedule appears below.

- ## Computational load
- Cumulative compute: 1.45E+24 FLOPs
- Estimated energy consumption for model training: 708.3 MWh
- | | # of tokens | Compute [FLOPs] | Energy [MWh] |
- | :---- | :---- | :---- | :---- |
- | 12B Base Pre-training | 20T | 1.45E+24 | 708.3 |

## Input
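
For reference, here is a minimal sketch of the Warmup-Stable-Decay (WSD) schedule described under "Model design" above, using the values quoted there (8B warmup tokens, peak learning rate 4.5e-4, minimum 4.5e-6, 20T total tokens). The README does not specify the length or shape of the decay phase, so the linear decay over the final 10% of tokens is an illustrative assumption, not the actual training configuration.

```python
# Sketch of a Warmup-Stable-Decay (WSD) schedule with the values quoted above.
# DECAY_FRACTION and the linear decay shape are assumptions for illustration;
# the README does not state them.

PEAK_LR = 4.5e-4
MIN_LR = 4.5e-6
WARMUP_TOKENS = 8e9
TOTAL_TOKENS = 20e12
DECAY_FRACTION = 0.1  # assumed: decay over the last 10% of training tokens


def wsd_lr(tokens_seen: float) -> float:
    """Learning rate after `tokens_seen` training tokens."""
    decay_start = TOTAL_TOKENS * (1.0 - DECAY_FRACTION)
    if tokens_seen < WARMUP_TOKENS:
        # Warmup: linear ramp from 0 to the peak learning rate over 8B tokens.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    if tokens_seen < decay_start:
        # Stable phase: hold the peak learning rate.
        return PEAK_LR
    # Decay phase: linear ramp from the peak down to the minimum learning rate.
    frac = min((tokens_seen - decay_start) / (TOTAL_TOKENS - decay_start), 1.0)
    return PEAK_LR + (MIN_LR - PEAK_LR) * frac


if __name__ == "__main__":
    for t in (4e9, 1e12, 19e12, 20e12):
        print(f"{t:.2e} tokens -> lr = {wsd_lr(t):.2e}")
```

Note that the stated layer counts imply 62 - 28 - 28 = 6 GQA layers with 8 groups.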
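
The figures in the removed "Computational load" table are roughly consistent with the common 6 × parameters × tokens approximation for dense training FLOPs. The check below is a sketch under that assumption; the README does not say how the reported number was derived, and the hybrid Mamba-2/attention stack makes 6·N·D only a rough estimate.

```python
# Cross-check the removed "Computational load" figures against the common
# 6 * parameters * tokens approximation (an assumption; the README does not
# state how the reported compute was calculated).

params = 12e9    # 12B parameters
tokens = 20e12   # 20T pre-training tokens

approx_flops = 6 * params * tokens
print(f"6*N*D estimate : {approx_flops:.2e} FLOPs")   # ~1.44e+24
print("Reported value : 1.45e+24 FLOPs")

# Implied average efficiency from the reported 708.3 MWh energy figure.
energy_joules = 708.3e6 * 3600  # MWh -> Wh -> J
print(f"Implied efficiency: {1.45e24 / energy_joules:.2e} FLOPs per joule")
```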