Update README.md
--- a/README.md
+++ b/README.md
@@ -9,9 +9,13 @@ tags:
 - llama-3
 - openllm-france
 datasets:
-- yahma/alpaca-cleaned
 - cmh/alpaca_data_cleaned_fr_52k
-- 
+- OpenLLM-France/Croissant-Aligned-Instruct
+- Gael540/dataSet_ens_sup_fr-v1
+- ai2-adapt-dev/flan_v2_converted
+- teknium/OpenHermes-2.5
+- allenai/tulu-3-sft-personas-math
+- allenai/tulu-3-sft-personas-math-grade
 - allenai/WildChat-1M
 base_model:
 - OpenLLM-France/Lucie-7B
@@ -22,8 +26,6 @@ widget:
 example_title: Capital cities in French
 group: 1-shot Question Answering
 training_progress:
-num_steps: 756291
-num_tokens: 3131736326144
 context_length: 32000
 ---
 
@@ -49,7 +51,11 @@ training_progress:
 
 Lucie-7B-Instruct-v1.1 is a fine-tuned version of [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B), an open-source, multilingual causal language model created by OpenLLM-France.
 
-Lucie-7B-Instruct is fine-tuned on synthetic instructions produced by ChatGPT and
+Lucie-7B-Instruct is fine-tuned on a mixture of human-templated and synthetic instructions (produced by ChatGPT) and a small set of customized prompts about OpenLLM and Lucie.
+
+Note that this instruction training is light and is meant to allow Lucie to produce responses of a desired type (answer, summary, list, etc.). Lucie-7B-Instruct-v1.1 would need further training before being implemented in pipelines for specific use-cases or for particular generation tasks such as code generation or mathematical problem solving. It is also susceptible to hallucinations; that is, producing false answers that result from its training. Its performance and accuracy can be improved through further fine-tuning and alignment with methods such as DPO, RLHF, etc.
+
+Due to its size, Lucie-7B is limited in the information that it can memorize; its ability to produce correct answers could be improved by implementing the model in a retrieval augmented generation pipeline.
 
 While Lucie-7B-Instruct is trained on sequences of 4096 tokens, its base model, Lucie-7B has a context size of 32K tokens. Based on Needle-in-a-haystack evaluations, Lucie-7B-Instruct maintains the capacity of the base model to handle 32K-size context windows.
 
@@ -59,21 +65,21 @@ While Lucie-7B-Instruct is trained on sequences of 4096 tokens, its base model,
 ### Training data
 
 Lucie-7B-Instruct-v1.1 is trained on the following datasets:
-* [Alpaca-cleaned-fr](https://huggingface.co/datasets/cmh/alpaca_data_cleaned_fr_52k) (French;
+* [Alpaca-cleaned-fr](https://huggingface.co/datasets/cmh/alpaca_data_cleaned_fr_52k) (French; 51,655 samples)
 * [Croissant-Aligned-Instruct](https://huggingface.co/datasets/OpenLLM-France/Croissant-Aligned-Instruct) (English-French; 20,000 samples taken from 80,000 total)
 * [ENS](https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1) (French, 394 samples)
-* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English,
-* [Open Hermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) (English,
-* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French,
-* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French,
+* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English, 78,580 samples)
+* [Open Hermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) (English, 1,000,495 samples)
+* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French, 4,613 samples)
+* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French, 1,849 samples)
 * [TULU3 Personas Math](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math)
 * [TULU3 Personas Math Grade](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math-grade)
-* [Wildchat](https://huggingface.co/datasets/allenai/WildChat-1M) (French subset;
+* [Wildchat](https://huggingface.co/datasets/allenai/WildChat-1M) (French subset; 26,436 samples)
 * Hard-coded prompts concerning OpenLLM and Lucie (based on [allenai/tulu-3-hard-coded-10x](https://huggingface.co/datasets/allenai/tulu-3-hard-coded-10x))
     * French: openllm_french.jsonl (24x10 samples)
     * English: openllm_english.jsonl (24x10 samples)
 
-One epoch was passed on each dataset
+One epoch was passed on each dataset except for Croissant-Aligned-Instruct for which we randomly selected 20,000 translation pairs.
 
 ### Preprocessing
 * Filtering by keyword: Examples containing assistant responses were filtered out from the four synthetic datasets if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
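
The keyword filter described in the Preprocessing item could look roughly like the sketch below. It is not the code from the Lucie-Training repository: the three keywords are placeholders taken from the examples in the text (the real list is `filter_strings`), the `messages` / `role` / `content` layout is an assumed chat schema, and case-insensitive substring matching is an assumption as well.

```python
# Placeholder keywords; the actual list is `filter_strings` in Lucie-Training.
FILTER_STRINGS = ["chatgpt", "gemma", "llama"]


def mentions_other_model(messages, keywords=FILTER_STRINGS):
    """Return True if any assistant turn contains one of the keywords.

    Matching is a case-insensitive substring test (an assumption).
    Only assistant turns are inspected; user turns are ignored.
    """
    lowered = [k.lower() for k in keywords]
    for turn in messages:
        if turn.get("role") != "assistant":
            continue
        text = turn.get("content", "").lower()
        if any(k in text for k in lowered):
            return True
    return False


def filter_examples(examples, keywords=FILTER_STRINGS):
    """Keep only the examples whose assistant responses contain no keyword."""
    return [ex for ex in examples
            if not mentions_other_model(ex["messages"], keywords)]
```

Because only assistant turns are checked, an example in which the user asks about another model is kept as long as the assistant does not present itself as that model.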
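
The training-data notes say that 20,000 translation pairs were randomly selected from the 80,000 in Croissant-Aligned-Instruct. A minimal sketch of such a selection follows, assuming uniform sampling without replacement; the function name `sample_pairs` and the fixed seed are hypothetical, and the source does not say how the selection was actually implemented.

```python
import random


def sample_pairs(pairs, k=20_000, seed=0):
    """Draw k items uniformly at random, without replacement.

    The seed is a hypothetical choice that makes the draw reproducible.
    If fewer than k items are available, all of them are returned.
    """
    if len(pairs) <= k:
        return list(pairs)
    rng = random.Random(seed)
    return rng.sample(pairs, k)
```

Using a dedicated `random.Random` instance keeps the selection independent of any other random state in the training script.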