RichardErkhov committed on commit 6b4b1aa (verified) · 1 parent: 7f8187e

uploaded readme

Files changed (1): README.md (+206 -0, new file)
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

ChocoLlama-2-7B-instruct - bnb 8bits
- Model creator: https://huggingface.co/ChocoLlama/
- Original model: https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct/
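
This repository provides the weights quantized to 8 bits with bitsandbytes. As a rough, unofficial sketch (not part of the original model card), loading the model in 8-bit with `transformers` looks like the following; the repository id below is a placeholder for this quantized repo or the original checkpoint:

```python
# Minimal 8-bit loading sketch (assumes recent transformers + bitsandbytes
# installed). Point `model_id` at this quantized repo or at the original
# ChocoLlama/ChocoLlama-2-7B-instruct weights.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ChocoLlama/ChocoLlama-2-7B-instruct"  # placeholder id

quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # store linear-layer weights in int8
    device_map="auto",                 # spread layers over available GPUs
)
```
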
Original model description:
---
language:
- nl
license: cc-by-nc-4.0
base_model: ChocoLlama/ChocoLlama-2-7B-base
datasets:
- BramVanroy/ultrachat_200k_dutch
- BramVanroy/stackoverflow-chat-dutch
- BramVanroy/alpaca-cleaned-dutch
- BramVanroy/dolly-15k-dutch
- BramVanroy/no_robots_dutch
- BramVanroy/ultra_feedback_dutch

---

<p align="center" style="margin:0;padding:0">
  <img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
</p>
<div style="margin:auto; text-align:center">
<h1 style="margin-bottom: 0">ChocoLlama</h1>
<em>A Llama-2/3-based family of Dutch language models</em>
</div>
## ChocoLlama-2-7B-instruct: Getting Started

We present **ChocoLlama-2-7B-instruct**, an instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets using SFT followed by DPO.
Its base model, [ChocoLlama-2-7B-base](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base), is a language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct', device_map="auto")

# Dutch system and user prompts, as the model is tuned for Dutch conversations.
# The system prompt roughly translates to: "You are an artificial intelligence
# assistant and give helpful, detailed and polite answers to the user's questions."
messages = [
    {"role": "system", "content": "Je bent een artificiële intelligentie-assistent en geeft behulpzame, gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."},
    {"role": "user", "content": "Jacques brel, Willem Elsschot en Jan Jambon zitten op café. Waar zouden ze over babbelen?"},
]

# Render the conversation with the model's chat template and tokenize it.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the EOS token or the end-of-turn token used by the chat template.
new_terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=new_terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
# Decode only the newly generated tokens (everything after the prompt).
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```

Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model cannot be used for commercial purposes.
Hence, for any commercial applications, we recommend fine-tuning the base model on your own Dutch data.

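As an illustrative, unofficial starting point for such a fine-tune (not the recipe used for ChocoLlama itself), the base model can be wrapped with LoRA adapters via `peft`; the adapter hyperparameters and target modules below are common Llama-2 defaults, not values taken from this card:

```python
# Sketch only: attach LoRA adapters to the commercially usable base model so it
# can be fine-tuned on your own Dutch data. Hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "ChocoLlama/ChocoLlama-2-7B-base"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, train on your own Dutch corpus with your preferred trainer
# (e.g. the Hugging Face Trainer or TRL's SFTTrainer).
```
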
## Model Details

ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state-of-the-art of Dutch open LLMs in their weight class.

We provide 6 variants (3 base and 3 instruction-tuned models):
- **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
- **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
- **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **Llama-3-ChocoLlama-8B-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.

For benchmark results for all models, including comparisons with their base models and other Dutch LLMs, we refer to our paper [here](https://arxiv.org/pdf/2412.07633).

### Model Description

- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approx. 40K GPU hours (NVIDIA A100-80GB)
- **Language(s):** Dutch
- **License:** cc-by-nc-4.0
- **Finetuned from model:** [ChocoLlama-2-7B-base](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)

### Model Sources

- **Repository:** [on GitHub here](https://github.com/ChocoLlamaModel/ChocoLlama).
- **Paper:** [on arXiv here](https://arxiv.org/pdf/2412.07633).

## Uses

### Direct Use

This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings.
For optimal behavior, we advise using the model only with the correct chat template (see the Python code above), potentially supported by a system prompt.

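To verify that the template is being applied as intended, one illustrative check (not from the original card) is to render the prompt as plain text before tokenizing:

```python
# Illustrative only: inspect the exact prompt string produced by the chat
# template, reusing the `messages` list from the example above.
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt_text)
```
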
### Out-of-Scope Use

Use cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, hence we expect significant catastrophic forgetting to have occurred for English, the language Llama-2 was originally trained on.

## Bias, Risks, and Limitations

We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.

## Training Details

We adopt the same strategy as used to align GEITje-7B to [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra).
First, we apply supervised fine-tuning (SFT), utilizing the data made available by [Vanroy](https://arxiv.org/pdf/2312.12852):
- [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch)
- [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch)
- [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch)
- [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch)
- [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch)

Next, we apply Direct Preference Optimization (DPO) to the SFT version of all the pretrained models we develop here,
now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, [BramVanroy/ultra_feedback_dutch](https://huggingface.co/datasets/BramVanroy/ultra_feedback_dutch).

For both the SFT and DPO stages, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
- learning_rate: 5e-07
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 4
- total_train_batch_size: 64
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1

Further, we leverage the publicly available [alignment handbook](https://github.com/huggingface/alignment-handbook) and use a set of 4 NVIDIA A100 GPUs (80 GB) for both stages.

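As an unofficial, illustrative sketch of what the DPO stage looks like with TRL (the actual runs used the alignment handbook; this assumes a recent `trl` release and that the preference data is available in a prompt/chosen/rejected format), it could be set up roughly as follows:

```python
# Rough DPO sketch mirroring the hyperparameters listed above:
# per-device batch size 4, 4 GPUs, gradient accumulation 4 -> effective batch 64.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "path/to/your-sft-checkpoint"  # placeholder: the SFT model goes here
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Dutch preference data used for the DPO stage (columns may need remapping).
dataset = load_dataset("BramVanroy/ultra_feedback_dutch", split="train")

args = DPOConfig(
    output_dir="chocollama-dpo",
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older trl releases
)
trainer.train()
```
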
## Evaluation

### Quantitative evaluation

We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.

| Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
|-----------------------------------------|----------|----------|----------|----------|----------|
| **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
| llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
| llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
| llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 |
| Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
| **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
| zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
| geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
| mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
| **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** |
| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
| llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
| llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |

On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.

### Qualitative evaluation

In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find to be more reliable.
For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).

### Compute Infrastructure

All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.

## Citation

If you found this useful for your work, kindly cite our paper:

```
@article{meeus2024chocollama,
  title={ChocoLlama: Lessons Learned From Teaching Llamas Dutch},
  author={Meeus, Matthieu and Rath{\'e}, Anthony and Remy, Fran{\c{c}}ois and Delobelle, Pieter and Decorte, Jens-Joris and Demeester, Thomas},
  journal={arXiv preprint arXiv:2412.07633},
  year={2024}
}
```