---
license: apache-2.0
pipeline_tag: text-generation
language:
- fr
- en
tags:
- openllm-france
datasets:
- cmh/alpaca_data_cleaned_fr_52k
- OpenLLM-France/Croissant-Aligned-Instruct
- Gael540/dataSet_ens_sup_fr-v1
- ai2-adapt-dev/flan_v2_converted
- teknium/OpenHermes-2.5
- allenai/tulu-3-sft-personas-math
- allenai/tulu-3-sft-personas-math-grade
- allenai/WildChat-1M
base_model:
- OpenLLM-France/Lucie-7B
widget:
- text: |-
    Quelle est la capitale de l'Espagne ? Madrid.
    Quelle est la capitale de la France ?
  example_title: Capital cities in French
  group: 1-shot Question Answering
training_progress:
  context_length: 32000
---
					
					
						

# Model Card for Lucie-7B-Instruct-v1.1
					
					
						

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Preprocessing](#preprocessing)
  * [Instruction template](#instruction-template)
  * [Training Procedure](#training-procedure)
<!-- * [Evaluation](#evaluation) -->
* [Testing the model](#testing-the-model)
  * [Test with ollama](#test-with-ollama)
  * [Test with vLLM](#test-with-vllm)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)
					
					
						

## Model Description

Lucie-7B-Instruct-v1.1 is a fine-tuned version of [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B), an open-source, multilingual causal language model created by OpenLLM-France. It is meant to replace the original Lucie-7B-Instruct model that was released in January 2025.

Lucie-7B-Instruct is fine-tuned on a mixture of human-templated and synthetic instructions (produced by ChatGPT) and a small set of customized prompts about OpenLLM and Lucie.

Note that this instruction training is light and is meant to allow Lucie to produce responses of a desired type (answer, summary, list, etc.). Lucie-7B-Instruct-v1.1 would need further training before being deployed in pipelines for specific use cases or for particular generation tasks such as code generation or mathematical problem solving. It is also susceptible to hallucinations, that is, to producing false answers as an artifact of its training. Its performance and accuracy can be improved through further fine-tuning and alignment with methods such as DPO or RLHF.

Due to its size, Lucie-7B is limited in the information that it can memorize; its ability to produce correct answers could be improved by using the model in a retrieval-augmented generation (RAG) pipeline.

While Lucie-7B-Instruct is trained on sequences of 4096 tokens, its base model, Lucie-7B, has a context size of 32K tokens. Based on Needle-in-a-haystack evaluations, Lucie-7B-Instruct-v1.1 has a context window size of 22K tokens. This window could be increased by fine-tuning on longer data samples.
					
					
						

## Training details

### Training data

Lucie-7B-Instruct-v1.1 is trained on the following datasets:
* [Alpaca-cleaned-fr](https://huggingface.co/datasets/cmh/alpaca_data_cleaned_fr_52k) (French; 51,655 samples)
* [Croissant-Aligned-Instruct](https://huggingface.co/datasets/OpenLLM-France/Croissant-Aligned-Instruct) (English-French; 20,000 samples taken from 80,000 total)
* [ENS](https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1) (French; 394 samples)
* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English; 78,580 samples)
* [Open Hermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) (English; 1,000,495 samples)
* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French; 4,613 samples)
* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French; 1,849 samples)
* [TULU3 Personas Math](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math)
* [TULU3 Personas Math Grade](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math-grade)
* [Wildchat](https://huggingface.co/datasets/allenai/WildChat-1M) (French subset; 26,436 samples)
* Hard-coded prompts concerning OpenLLM and Lucie (based on [allenai/tulu-3-hard-coded-10x](https://huggingface.co/datasets/allenai/tulu-3-hard-coded-10x))
    * French: openllm_french.jsonl (24x10 samples)
    * English: openllm_english.jsonl (24x10 samples)

The model was trained for one epoch on each dataset. The only exception is Croissant-Aligned-Instruct, from which we used a random selection of 20,000 translation pairs.
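
The card does not say how the 20,000 Croissant-Aligned-Instruct pairs were drawn. As a minimal sketch, assuming uniform sampling without replacement, the selection could look like the following. The function name `subsample_pairs`, the seed, and the toy data are hypothetical and are not taken from the Lucie training code.

```python
import random


def subsample_pairs(pairs, k=20_000, seed=0):
    """Return k pairs drawn uniformly at random, without replacement.

    The seed is an arbitrary choice that makes the draw reproducible.
    """
    if k >= len(pairs):
        return list(pairs)
    return random.Random(seed).sample(pairs, k)


# Toy stand-in for the 80,000 English-French translation pairs.
pairs = [(f"english sentence {i}", f"phrase française {i}") for i in range(80_000)]
subset = subsample_pairs(pairs)
print(len(subset))
```

Fixing the seed means the same subset is selected on every run, which makes a training mixture easier to reproduce.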
					
					
						

### Preprocessing
* Filtering by keyword: In the four synthetic datasets, examples were removed if an assistant response contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
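
The keyword filter described above can be sketched as follows. This is an illustration only. The short `FILTER_STRINGS` list is a placeholder for the real `filter_strings` list linked above, and the `messages` record layout and case-insensitive substring matching are assumptions, not the behavior of the Lucie training code.

```python
# Placeholder keywords for illustration; the actual list is `filter_strings`
# in the Lucie-Training repository.
FILTER_STRINGS = ["ChatGPT", "OpenAI", "Gemma", "Llama"]


def mentions_other_model(messages, keywords=FILTER_STRINGS):
    """Return True if any assistant turn contains one of the keywords."""
    lowered = [k.lower() for k in keywords]
    for turn in messages:
        if turn["role"] != "assistant":
            continue  # only assistant responses are inspected
        text = turn["content"].lower()
        if any(k in text for k in lowered):
            return True
    return False


def filter_examples(examples):
    """Keep only the examples whose assistant turns pass the keyword filter."""
    return [ex for ex in examples if not mentions_other_model(ex["messages"])]
```

Only assistant turns are checked, so a user question that mentions another model does not cause the example to be dropped.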
					
					
						

### Instruction template
Lucie-7B-Instruct-v1.1 was trained on the chat template from Llama 3.1 with the sole difference that `<|begin_of_text|>` is replaced with `<s>`. The resulting template:

```
<s><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>
```

An example:

```
<s><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Give me three tips for staying in shape.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. Eat a balanced diet and be sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.<|eot_id|>
```
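
To make the layout of the template concrete, here is a minimal sketch that renders a list of chat messages into the string format shown above. The helper name `render_prompt` and its `add_generation_prompt` flag are hypothetical. In practice, if the model's tokenizer ships a chat template, `tokenizer.apply_chat_template` from `transformers` would normally do this for you.

```python
def render_prompt(messages, add_generation_prompt=True):
    """Render chat messages with the template shown above.

    Each turn is a header naming the role, a blank line, the content,
    and a closing <|eot_id|>; the whole prompt starts with <s>.
    """
    parts = ["<s>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so that the model writes the reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me three tips for staying in shape."},
])
print(prompt)
```

With `add_generation_prompt=True`, the string ends with an open assistant header, which is the point at which generation starts at inference time.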
					
					
						

### Training procedure

The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase with the following exceptions:
* context length: 4096<sup>*</sup>
* batch size: 1024
* max learning rate: 3e-5
* min learning rate: 3e-6

<sup>*</sup>As noted above, although Lucie-7B-Instruct is trained on sequences of 4096 tokens, it can handle longer contexts: its base model, Lucie-7B, has a context size of 32K tokens, and Needle-in-a-haystack evaluations indicate a context window of 22K tokens for Lucie-7B-Instruct-v1.1.
					
					
						

## Testing the model

### Test with ollama

* Download and install [Ollama](https://ollama.com/download).
* Download the [GGUF model](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1-gguf/blob/main/Lucie-7B-Instruct-v1.1-q4_k_m.gguf).
* Copy the [`Modelfile`](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1-gguf/blob/main/Modelfile), adapting the path to the GGUF file if necessary (line starting with `FROM`).
* Run in a shell:
    * `ollama create -f Modelfile Lucie`
    * `ollama run Lucie`
* Once ">>>" appears, type your prompt(s) and press Enter.
* Optionally, restart a conversation by typing "`/clear`".
* End the session by typing "`/bye`".
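
Besides the interactive prompt, a model created with the steps above can be queried over Ollama's local HTTP API. The sketch below assumes the Ollama server is running on its default port (11434) and that the model was created under the name `Lucie` as in the commands above. The helper names are hypothetical.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumption).
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt, model="Lucie"):
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model="Lucie"):
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as response:
        return json.loads(response.read())["message"]["content"]


# Example (requires a running Ollama server with the model created):
# print(chat("Bonjour Lucie"))
```

Only the Python standard library is used, so no extra client package is needed.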
					
					
						

Useful for debugging:
* [How to print input requests and output responses in Ollama server?](https://stackoverflow.com/a/78831840)
* [Documentation on Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter)
   * Examples: [Ollama model library](https://github.com/ollama/ollama#model-library)
      * Llama 3 example: https://ollama.com/library/llama3.1
* Add a GUI: https://docs.openwebui.com/
					
					
						

### Test with vLLM

#### 1. Run vLLM Docker Container

Use the following command to deploy the model,
replacing `INSERT_YOUR_HF_TOKEN` with your Hugging Face Hub token.

```bash
docker run --runtime nvidia --gpus=all \
    --env "HUGGING_FACE_HUB_TOKEN=INSERT_YOUR_HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model OpenLLM-France/Lucie-7B-Instruct-v1.1
```
					
					
						

#### 2. Test using OpenAI Client in Python

To test the deployed model, use the OpenAI Python client as follows:

```python
from openai import OpenAI

# Initialize the client
client = OpenAI(base_url='http://localhost:8000/v1', api_key='empty')

# Define the input content
content = "Hello Lucie"

# Generate a response
chat_response = client.chat.completions.create(
    model="OpenLLM-France/Lucie-7B-Instruct-v1.1",
    messages=[
        {"role": "user", "content": content}
    ],
)
print(chat_response.choices[0].message.content)
```
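
The request above is a single turn. As a sketch of a multi-turn conversation against the same server, the helper below keeps a running message history and appends each reply to it. The names `make_client` and `chat_turn`, the system prompt, and the sampling values are illustrative assumptions, not recommended settings for Lucie.

```python
MODEL_NAME = "OpenLLM-France/Lucie-7B-Instruct-v1.1"


def make_client(base_url="http://localhost:8000/v1"):
    """Create an OpenAI client pointing at the local vLLM server."""
    from openai import OpenAI  # imported here so chat_turn has no hard dependency

    return OpenAI(base_url=base_url, api_key="empty")


def chat_turn(client, history, user_text, model=MODEL_NAME, **sampling):
    """Send one user turn, record the assistant reply in history, and return it."""
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model=model,
        messages=history,
        **sampling,  # e.g. temperature=0.2, max_tokens=256 (arbitrary values)
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply


# Example (requires the vLLM container from step 1 to be running):
# client = make_client()
# history = [{"role": "system", "content": "You are a helpful assistant."}]
# print(chat_turn(client, history, "Hello Lucie", temperature=0.2))
# print(chat_turn(client, history, "Can you answer in French?"))
```

Because the full history is resent on every call, long conversations eventually approach the model's context window, so older turns may need to be truncated or summarized.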
					
					
						

## Citation

When using the Lucie-7B-Instruct model, please cite the following paper:

✍ Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cérisara,
Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré (2025).
The Lucie-7B LLM and the Lucie Training Dataset:
open resources for multilingual language generation

```bibtex
@misc{openllm2025lucie,
      title={The Lucie-7B LLM and the Lucie Training Dataset:
      open resources for multilingual language generation}, 
      author={Olivier Gouvert and Julie Hunter and Jérôme Louradour and Christophe Cérisara and Evan Dufraisse and Yaya Sy and Laura Rivière and Jean-Pierre Lorré},
      year={2025},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
					
					
						

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444). We gratefully acknowledge support from GENCI and IDRIS and from Pierre-François Lavallée (IDRIS) and Stephane Requena (GENCI) in particular.

Lucie-7B-Instruct-v1.1 was created by members of [LINAGORA](https://labs.linagora.com/) and the [OpenLLM-France](https://www.openllm-france.fr/) community, including, in alphabetical order:
Olivier Gouvert (LINAGORA),
Ismaïl Harrando (LINAGORA/SciencesPo),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Michel-Marie Maudet (LINAGORA), and
Laura Rivière (LINAGORA).

We thank
Clément Bénesse (Opsci),
Christophe Cerisara (LORIA),
Émile Hazard (Opsci),
Evan Dufraisse (CEA List),
Guokan Shang (MBZUAI),
Joël Gombin (Opsci),
Jordan Ricker (Opsci),
and
Olivier Ferret (CEA List)
for their helpful input.

Finally, we thank the entire OpenLLM-France community, whose members have helped in diverse ways.
					
					
						

## Contact

[email protected]