Update README.md
README.md (CHANGED)
@@ -13,7 +13,7 @@ base_model:
 # Granite-3.1-2B-Instruct
 
 **Model Summary:**
-Granite-3.1-2B-Instruct is a
+Granite-3.1-2B-Instruct is a 2B parameter long-context instruct model finetuned from Granite-3.1-2B-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.
 
 - **Developers:** Granite Team, IBM
 - **GitHub Repository:** [ibm-granite/granite-3.1-language-models](https://github.com/ibm-granite/granite-3.1-language-models)
@@ -56,7 +56,7 @@ import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 device = "auto"
-model_path = "ibm-granite/
+model_path = "ibm-granite/granite-3.1-2b-instruct"
 tokenizer = AutoTokenizer.from_pretrained(model_path)
 # drop device_map if running on CPU
 model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
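This hunk only completes the model path. For context, here is a minimal end-to-end sketch of how the card's loading snippet might be driven through the structured chat format mentioned in the summary; the prompt, dtype, and generation settings below are illustrative assumptions and are not part of this change.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path filled in by this change.
model_path = "ibm-granite/granite-3.1-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU; bfloat16 is an illustrative choice, not taken from the card
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.bfloat16
)
model.eval()

# Apply the model's chat template to a single-turn conversation (example prompt is made up).
chat = [{"role": "user", "content": "Summarize what a long-context instruct model is in one sentence."}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```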
@@ -82,21 +82,21 @@ Granite-3.1-2B-Instruct is based on a decoder-only dense transformer architecture
 
 | Model | 2B Dense | 8B Dense | 1B MoE | 3B MoE |
 | :-------- | :--------| :-------- | :------| :------|
-| Embedding size | 2048 |
-| Number of layers | 40 |
-| Attention head size | 64 |
-| Number of attention heads | 32 |
-| Number of KV heads | 8 |
-| MLP hidden size | 8192 |
-| MLP activation | SwiGLU |
-| Number of experts |
-| MoE TopK |
-| Initialization std | 0.1 |
-| Sequence length | 128K |
-| Position embedding | RoPE |
-| # Parameters | 2.5B |
-| # Active parameters | 2.5B |
-| # Training tokens | 12T |
+| Embedding size | **2048** | 4096 | 1024 | 1536 |
+| Number of layers | **40** | 40 | 24 | 32 |
+| Attention head size | **64** | 128 | 64 | 64 |
+| Number of attention heads | **32** | 32 | 16 | 24 |
+| Number of KV heads | **8** | 8 | 8 | 8 |
+| MLP hidden size | **8192** | 12800 | 512 | 512 |
+| MLP activation | **SwiGLU** | SwiGLU | SwiGLU | SwiGLU |
+| Number of experts | **—** | — | 32 | 40 |
+| MoE TopK | **—** | — | 8 | 8 |
+| Initialization std | **0.1** | 0.1 | 0.1 | 0.1 |
+| Sequence length | **128K** | 128K | 128K | 128K |
+| Position embedding | **RoPE** | RoPE | RoPE | RoPE |
+| # Parameters | **2.5B** | 8.1B | 1.3B | 3.3B |
+| # Active parameters | **2.5B** | 8.1B | 400M | 800M |
+| # Training tokens | **12T** | 12T | 10T | 10T |
 
 **Training Data:**
 Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the [Granite 3.0 Technical Report](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/paper.pdf), [Granite 3.1 Technical Report (coming soon)](https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d), and [Accompanying Author List](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/author-ack.pdf).
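The updated rows now carry values for all four Granite 3.1 variants instead of only the 2B Dense column. As a sanity check, the 2B Dense numbers can be compared against the published model config; this is a rough sketch that assumes the usual decoder-only transformers config field names (hidden_size, num_key_value_heads, and so on), which may not all apply to the Granite architecture.

```python
from transformers import AutoConfig

# Load the published config for the 2B instruct model and compare it with the table's 2B Dense column.
cfg = AutoConfig.from_pretrained("ibm-granite/granite-3.1-2b-instruct")

# Expected values taken from the updated table; the config key names are assumptions.
expected = {
    "hidden_size": 2048,                # Embedding size
    "num_hidden_layers": 40,            # Number of layers
    "num_attention_heads": 32,          # Number of attention heads
    "num_key_value_heads": 8,           # Number of KV heads
    "intermediate_size": 8192,          # MLP hidden size
    "max_position_embeddings": 131072,  # Sequence length (128K)
}
for key, value in expected.items():
    actual = getattr(cfg, key, None)
    print(f"{key}: table={value} config={actual} match={actual == value}")
```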