training_progress:
  context_length: 32000
---

# Model Card for Lucie-7B

<!-- inspired from the following template:
https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1
-->

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Example Code in Python](#example-code-in-python)
  * [Load the model](#load-the-model)
  * [Sentence completion](#sentence-completion)
  * [Load a checkpoint](#load-a-checkpoint)
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Training Procedure](#training-procedure)
    * [Neural Network Architecture](#neural-network-architecture)
    * [Training Hyperparameters](#training-hyperparameters)
      1. [Main Pre-training](#1-main-pre-training)
      2. [Context Extension](#2-context-extension)
      3. [Annealing](#3-annealing)
  * [Training Logs and Learning Curves](#training-logs-and-learning-curves)
<!-- * [Evaluation](#evaluation) -->
* [Disclaimer](#disclaimer)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B is a pretrained 7B-parameter causal language model built by [LINAGORA](https://labs.linagora.com/) and [OpenLLM-France](https://github.com/OpenLLM-France),
available under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).

Lucie-7B was trained on 3 trillion tokens of multilingual data, including
English (33.2%),
French (32.4%),
German (6.9%),
Spanish (6.6%),
Italian (3.8%),
and parallel data from those languages (2.5%),
as well as several programming languages (14.7%).

## Example Code in Python

### Load the model

Load the model (quantized version on GPU if possible, for efficient inference):
```python
import transformers

model_name = "OpenLLM-France/Lucie-7B"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
    device_map="auto",
    load_in_4bit=True  # For efficient inference, if quantization is supported by the GPU card
)
```
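
Note that recent versions of `transformers` expect quantization options to be passed through a `BitsAndBytesConfig` object rather than as a bare `load_in_4bit` argument. A minimal sketch of the equivalent call, assuming `bitsandbytes` is installed and the GPU supports 4-bit kernels:
```python
import transformers

# Equivalent 4-bit loading with an explicit quantization config
# (assumes the bitsandbytes package is installed and CUDA is available).
quantization_config = transformers.BitsAndBytesConfig(load_in_4bit=True)
model = transformers.AutoModelForCausalLM.from_pretrained(
    "OpenLLM-France/Lucie-7B",
    device_map="auto",
    quantization_config=quantization_config,
)
```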

### Sentence completion

Wrap the model in a text generation pipeline, and specify some generation parameters:
```python
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)

generation_kwargs = dict(
    num_return_sequences=1,                # Number of variants to generate.
    return_full_text=False,                # Do not include the prompt in the generated text.
    do_sample=True,
    temperature=1.0, top_p=1, top_k=None,  # Sampling parameters.
    max_new_tokens=200,                    # Maximum length of the output text (in number of tokens).
)
```

Try 1-shot question answering:
```python
prompt = """\
Quelle est la capitale de l'Espagne ? Madrid\n\
Quelle est la capitale de la France ?\
"""
completions = pipeline(prompt, **generation_kwargs)
for completion in completions:
    print(prompt + " […]" + completion['generated_text'])
```
This will print something like:
```
Quelle est la capitale de l'Espagne ? Madrid
Quelle est la capitale de la France ? […] Paris
Quelle est la capitale de l'Italie? Rome
Quelle est la capitale de la Grande-Bretagne? Londres
Quelle est la capitale de la Suisse? Berne
Quelle est la capitale du Portugal? Lisbonne
Quelle est la capitale de l'Algérie? Alger
...
```

If running on GPU (`cuda` device), you will need at least 6GB of VRAM to run inference using 4-bit quantization (16GB of VRAM without 4-bit quantization).
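
Since Lucie-7B is a base model (not instruction-tuned), it tends to continue with further question-answer pairs after the one requested, as in the example output above. A simple, purely illustrative way to keep only the first answer is to cut the completion at the first line break:
```python
# Keep only the text up to the first line break of the first completion.
completion = completions[0]["generated_text"]
first_answer = completion.split("\n", 1)[0].strip()
print(first_answer)  # e.g. "Paris"
```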

### Load a checkpoint

Checkpoints at several training steps are available under revision tags:
every 5000 steps during the first 25000 steps, and then every 25000 steps.

Intermediate checkpoints can be loaded using the `revision` parameter:
```python
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
    revision="step0753851",
    ...
)
```
where `revision` can be one of:
* "[`step0005000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000)", "[`step0010000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000)", "[`step0015000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000)", "[`step0020000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000)": every 5000 steps during the early pre-training steps (with a context length of 4096).
* "[`step0025000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000)", "[`step0050000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000)", "[`step0075000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000)", "[`step0100000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000)", ..., "[`step0750000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000)": every 25000 steps from 25k to 750k steps.
* "[`step0753851`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851)": last pre-training step before context extension and annealing.
* "[`extension_step0000250`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000250)", "[`extension_step0000500`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000500)", "[`extension_step0000750`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000750)", "[`extension_step0001000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001000)", "[`extension_step0001220`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001220)": several checkpoints during context extension (with a context length of 32000).

## Training Details

### Training Data

The training dataset used for the pretraining of Lucie-7B is available
at [OpenLLM-France/Lucie-Training-Dataset](https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset).
<!-- and described in ["The Lucie Training Dataset" (2024/12)](https://arxiv.org/abs/xxxx.xxxxx). -->

The initial composition of the training data is as follows:

![Initial Data Composition](figures/pie_dataset_composition.png)

Some of the data was upsampled to balance the training data distribution, yielding the following composition for training:

![Training Data Composition](figures/pie_dataset_composition_training.png)
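
The dataset is large, so for a quick inspection it can be streamed rather than downloaded in full. A minimal sketch with the `datasets` library (the configuration and field names below are assumptions; check the dataset card for the exact layout):
```python
from datasets import load_dataset

# Stream a few documents from the Lucie training corpus without a full download.
# NOTE: configuration/split names are illustrative; see the dataset card.
dataset = load_dataset(
    "OpenLLM-France/Lucie-Training-Dataset",
    split="train",
    streaming=True,
)
for i, example in enumerate(dataset):
    print(example)  # each example is a dict (typically with a text field and metadata)
    if i >= 2:
        break
```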

### Training Procedure

Lucie-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

It was pre-trained on 512 H100 80GB GPUs for about 550,000 GPU hours on the [Jean Zay supercomputer](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html).

The training code is available at [https://github.com/OpenLLM-France/Lucie-Training](https://github.com/OpenLLM-France/Lucie-Training).
It is based on [this fork of Megatron-DeepSpeed](https://github.com/OpenLLM-France/Megatron-DeepSpeed).

Optimizer checkpoints are available at [OpenLLM-France/Lucie-7B-optimizer-states](https://huggingface.co/OpenLLM-France/Lucie-7B-optimizer-states).

#### Neural Network Architecture

Lucie-7B has the same neural network architecture as [Llama3.1](https://huggingface.co/meta-llama/Llama-3.1-8B).
It has exactly 6 706 958 336 free parameters,
with the following hyperparameters:

| **Hyperparameter**          | **Value** |
|-----------------------------|-----------|
| Vocabulary size (\# tokens) | 65 024    |
| \# transformer blocks       | 32        |
| \# attention heads          | 32        |
| \# key-value heads          | 8         |
| Hidden size                 | 4 096     |
| Feed-forward hidden size    | 12 288    |
| Activation                  | `silu`    |
| RMS norm epsilon            | 1e-5      |

The "theta" parameter of Rotary Positional Embedding (RoPE) was increased during the training process. Its values are indicated in the tables with training hyperparameters below.

#### Training Hyperparameters

The training consisted of three main phases:
1. Main pre-training on 3.1T tokens, with a context length of 4096,
2. Context extension on 5B tokens, with a context length of 32000,
3. Annealing on 5B tokens of high-quality data composed of a mixture of new data and data seen during training.
<!-- perhaps cite the dataset for annealing -->

The details of each phase are given below.

##### 1. Main Pre-training

Training hyperparameters in torch/Megatron-DeepSpeed were as follows:

| **Hyperparameter**     | **Value**  |
|------------------------|------------|
| Total \# samples       | 762 144 586 (3.1T tokens) |
| Total \# steps         | 753 851    |
| RoPE theta             | 500 000    |
| Context length         | 4 096      |
| Initial batch size     | 256        |
| Final batch size       | 1 024      |
| Batch size rampup      | by steps of 64 over 10M samples |
| Learning rate schedule | warmup (2M samples) + cosine annealing |
| Maximum learning rate  | 3e-4       |
| Final learning rate    | 3e-5       |
| Weight decay           | 0.1        |
| Dropout                | _          |
| Gradient clipping      | 1          |
| Initializer range      | 0.009      |
| Optimizer              | `AdamW` (β₁=0.9, β₂=0.95, ε=1e-5) |
| Precision              | `bfloat16` |
| Tensor Parallelism (with 512 GPUs)   | 4  |
| Pipeline Parallelism (with 512 GPUs) | 4  |
| Data Parallelism (with 512 GPUs)     | 32 |
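
As a quick sanity check, the reported sample count, context length, and parallelism split are mutually consistent (a small illustrative computation, not part of the training code):
```python
# Main pre-training: rough consistency checks on the reported numbers.
samples = 762_144_586
context_length = 4_096
print(f"{samples * context_length / 1e12:.2f}T tokens")  # ~3.12T, i.e. the reported ~3.1T

# 512 GPUs = tensor parallelism x pipeline parallelism x data parallelism
print(4 * 4 * 32)  # 512
```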

##### 2. Context Extension

Training hyperparameters are the same as above, with the following changes:

| **Hyperparameter**     | **Value**  |
|------------------------|------------|
| Total \# samples       | 156 250 (5B tokens) |
| Total \# steps         | 1 220      |
| RoPE theta             | 20 000 000 |
| Context length         | 32 000     |
| Batch size             | 128        |
| Learning rate          | 2e-5       |
| Learning rate schedule | constant   |
| Tensor Parallelism (with 128 GPUs)   | 4 |
| Pipeline Parallelism (with 128 GPUs) | 4 |
| Data Parallelism (with 128 GPUs)     | 8 |
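
The RoPE theta and context length changes are reflected in the checkpoint configurations, which can be inspected without loading the full model. A minimal sketch (the printed values should match the table above, assuming the checkpoint config follows the standard Llama layout):
```python
import transformers

# Inspect the config of the last context-extension checkpoint.
config = transformers.AutoConfig.from_pretrained(
    "OpenLLM-France/Lucie-7B",
    revision="extension_step0001220",
)
print(config.rope_theta)               # 20 000 000 during this phase (per the table above)
print(config.max_position_embeddings)  # 32 000 after context extension
```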

##### 3. Annealing

Training hyperparameters are the same as for context extension, with the following changes:

| **Hyperparameter**     | **Value**        |
|------------------------|------------------|
| Learning rate schedule | linear annealing |
| Maximum learning rate  | 3e-5             |
| Final learning rate    | 0                |

### Training Logs and Learning Curves

#### Training Loss

Training logs can be found in TensorBoard format in:
* [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
  <br> ├── [`1_pretraining.zip`](metadata/training_logs/1_pretraining.zip): training logs for the main pre-training phase, in a zip file. Each file in the zip corresponds to a job of at most 20H of training (parallelized over 512 GPUs).
  <br> ├── [`2_extension/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/2_extension): folder containing the training logs for the context extension phase, which took around 13H of training (parallelized over 128 GPUs).
  <br> └── [`3_annealing/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/3_annealing): folder containing the training logs for the annealing phase, which also took around 13H of training (parallelized over 128 GPUs).

The convergence curves of the three pre-training phases are the following:

![figures/convergence-curve-pretraining.png](figures/convergence-curve-pretraining.png)

Data corresponding to these plots were extracted from TensorBoard logs and are available in the following CSV files:
* [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
  <br> ├── [`1_pretraining.csv`](metadata/training_logs/1_pretraining.csv)
  <br> ├── [`2_extension.csv`](metadata/training_logs/2_extension.csv)
  <br> └── [`3_annealing.csv`](metadata/training_logs/3_annealing.csv)
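
These CSV files can be fetched directly from the Hub and loaded, for example with pandas. A minimal sketch (the exact column names inside the CSVs are an assumption; inspect them before plotting):
```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download one of the loss-curve CSVs from the model repository.
csv_path = hf_hub_download(
    repo_id="OpenLLM-France/Lucie-7B",
    filename="metadata/training_logs/1_pretraining.csv",
)
df = pd.read_csv(csv_path)
print(df.head())  # inspect the available columns (e.g. step, loss) before plotting
```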

#### Evaluations

Multiple evaluations were conducted during Lucie-7B's training to assess its performance on standard benchmarks,
primarily in French and English, as well as in Spanish, German, and Italian.

Evaluation results on benchmark datasets for Lucie-7B checkpoints throughout the training process are available at
[metadata/evaluation_learning_curve_lucie.csv](metadata/evaluation_learning_curve_lucie.csv).
Evaluation results of baseline models on the same benchmark datasets are available at
[metadata/evaluation_baselines.csv](metadata/evaluation_baselines.csv).

Main results are summarized in the following figures:

##### French
![figures/learning-curve-evaluation-french-bench.png](figures/learning-curve-evaluation-french-bench.png)

##### English
![figures/learning-curve-evaluation-benchmarks-in-english.png](figures/learning-curve-evaluation-benchmarks-in-english.png)

##### Other
![figures/learning-curve-evaluation-multilingual-arc-benchmark.png](figures/learning-curve-evaluation-multilingual-arc-benchmark.png)

#### Needle in a Haystack

##### Pretraining
![figures/needle-in-a-haystack/Lucie-7B-main.png](figures/needle-in-a-haystack/Lucie-7B-main.png)

##### Context Extension
![figures/needle-in-a-haystack/Lucie-7B-extension.png](figures/needle-in-a-haystack/Lucie-7B-extension.png)

##### Annealing
![figures/needle-in-a-haystack/Lucie-7B-annealing.png](figures/needle-in-a-haystack/Lucie-7B-annealing.png)

## Disclaimer

Lucie-7B is a language model trained solely to predict the most probable next word in a sequence. Despite efforts to filter the [Lucie Training Dataset](https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset), it is possible that Lucie-7B encountered strings containing toxic or offensive language during its training, and as a result it may generate strings of a similar nature. To limit such behavior, it is advised to fine-tune Lucie-7B through instruction and/or preference tuning (DPO, RLHF, etc.).

## Citation

TODO

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444).

Lucie-7B was created by members of [LINAGORA](https://labs.linagora.com/) and the OpenLLM-France community, including in alphabetical order:
Christophe Cerisara (LORIA),
Evan Dufraisse (CEA),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Michel-Marie Maudet (LINAGORA),
Olivier Gouvert (LINAGORA), and
Yaya Sy (LORIA).

We thank
Anastasia Stasenko (OpSci/Pleias),
Clément Bénesse (Opsci),
Guokan Shang (MBZUAI),
Ismaïl Harrando (LINAGORA),
Joël Gombin (Opsci),
Jordan Ricker (Opsci),
Olivier Ferret (CEA),
Pierre-Carl Langlais (OpSci/Pleias), and
Rachel Bawden (INRIA)
for their helpful input.

## Contact

 