RichardErkhov committed on commit 6b4b1aa (verified) · 1 parent: 7f8187e

uploaded readme

Files changed (1): README.md (+206 -0, new file)
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

ChocoLlama-2-7B-instruct - bnb 8bits
- Model creator: https://huggingface.co/ChocoLlama/
- Original model: https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct/
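
This repository provides the weights quantized to 8 bits with bitsandbytes. As a rough, unofficial sketch (not part of the original model card), loading the model in 8-bit with `transformers` looks like the following; the repository id below is a placeholder for this quantized repo or the original checkpoint:

```python
# Minimal 8-bit loading sketch (assumes recent transformers + bitsandbytes
# installed). Point `model_id` at this quantized repo or at the original
# ChocoLlama/ChocoLlama-2-7B-instruct weights.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ChocoLlama/ChocoLlama-2-7B-instruct"  # placeholder id

quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # store linear-layer weights in int8
    device_map="auto",                 # spread layers over available GPUs
)
```
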
Original model description:
---
language:
- nl
license: cc-by-nc-4.0
base_model: ChocoLlama/ChocoLlama-2-7B-base
datasets:
- BramVanroy/ultrachat_200k_dutch
- BramVanroy/stackoverflow-chat-dutch
- BramVanroy/alpaca-cleaned-dutch
- BramVanroy/dolly-15k-dutch
- BramVanroy/no_robots_dutch
- BramVanroy/ultra_feedback_dutch

---

<p align="center" style="margin:0;padding:0">
  <img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
</p>
<div style="margin:auto; text-align:center">
<h1 style="margin-bottom: 0">ChocoLlama</h1>
<em>A Llama-2/3-based family of Dutch language models</em>
</div>
## ChocoLlama-2-7B-instruct: Getting Started

We present **ChocoLlama-2-7B-instruct**, an instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets using SFT followed by DPO.
Its base model, [ChocoLlama-2-7B-base](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base), is a language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct', device_map="auto")

# Dutch system and user prompts, as the model is tuned for Dutch conversations.
# The system prompt roughly translates to: "You are an artificial intelligence
# assistant and give helpful, detailed and polite answers to the user's questions."
messages = [
    {"role": "system", "content": "Je bent een artificiële intelligentie-assistent en geeft behulpzame, gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."},
    {"role": "user", "content": "Jacques brel, Willem Elsschot en Jan Jambon zitten op café. Waar zouden ze over babbelen?"},
]

# Render the conversation with the model's chat template and tokenize it.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the EOS token or the end-of-turn token used by the chat template.
new_terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=new_terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
# Decode only the newly generated tokens (everything after the prompt).
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```

Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model cannot be used for commercial purposes.
Hence, for any commercial applications, we recommend fine-tuning the base model on your own Dutch data.

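As an illustrative, unofficial starting point for such a fine-tune (not the recipe used for ChocoLlama itself), the base model can be wrapped with LoRA adapters via `peft`; the adapter hyperparameters and target modules below are common Llama-2 defaults, not values taken from this card:

```python
# Sketch only: attach LoRA adapters to the commercially usable base model so it
# can be fine-tuned on your own Dutch data. Hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "ChocoLlama/ChocoLlama-2-7B-base"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, train on your own Dutch corpus with your preferred trainer
# (e.g. the Hugging Face Trainer or TRL's SFTTrainer).
```
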
## Model Details

ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state-of-the-art of Dutch open LLMs in their weight class.

We provide 6 variants (3 base and 3 instruction-tuned models):
- **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
- **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
- **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **Llama-3-ChocoLlama-8B-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.

For benchmark results for all models, including comparisons with their base models and other Dutch LLMs, we refer to our paper [here](https://arxiv.org/pdf/2412.07633).

### Model Description

- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approx. 40K GPU hours (NVIDIA A100-80GB)
- **Language(s):** Dutch
- **License:** cc-by-nc-4.0
- **Finetuned from model:** [ChocoLlama-2-7B-base](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)

### Model Sources

- **Repository:** [on GitHub here](https://github.com/ChocoLlamaModel/ChocoLlama).
- **Paper:** [on arXiv here](https://arxiv.org/pdf/2412.07633).

## Uses

### Direct Use

This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings.
For optimal behavior, we advise using the model only with the correct chat template (see the Python code above), potentially supported by a system prompt.

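To verify that the template is being applied as intended, one illustrative check (not from the original card) is to render the prompt as plain text before tokenizing:

```python
# Illustrative only: inspect the exact prompt string produced by the chat
# template, reusing the `messages` list from the example above.
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt_text)
```
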
### Out-of-Scope Use

Use cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, hence we expect significant catastrophic forgetting to have occurred for English, the language Llama-2 was originally trained on.

## Bias, Risks, and Limitations

We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.

## Training Details

We adopt the same strategy as used to align GEITje-7B to [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra).
First, we apply supervised fine-tuning (SFT), utilizing the data made available by [Vanroy](https://arxiv.org/pdf/2312.12852):
- [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch)
- [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch)
- [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch)
- [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch)
- [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch)

Next, we apply Direct Preference Optimization (DPO) to the SFT version of all the pretrained models we develop here,
now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, [BramVanroy/ultra_feedback_dutch](https://huggingface.co/datasets/BramVanroy/ultra_feedback_dutch).

For both the SFT and DPO stages, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
- learning_rate: 5e-07
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 4
- total_train_batch_size: 64
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1

Further, we leverage the publicly available [alignment handbook](https://github.com/huggingface/alignment-handbook) and use a set of 4 NVIDIA A100 GPUs (80 GB) for both stages.

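As an unofficial, illustrative sketch of what the DPO stage looks like with TRL (the actual runs used the alignment handbook; this assumes a recent `trl` release and that the preference data is available in a prompt/chosen/rejected format), it could be set up roughly as follows:

```python
# Rough DPO sketch mirroring the hyperparameters listed above:
# per-device batch size 4, 4 GPUs, gradient accumulation 4 -> effective batch 64.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "path/to/your-sft-checkpoint"  # placeholder: the SFT model goes here
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Dutch preference data used for the DPO stage (columns may need remapping).
dataset = load_dataset("BramVanroy/ultra_feedback_dutch", split="train")

args = DPOConfig(
    output_dir="chocollama-dpo",
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older trl releases
)
trainer.train()
```
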
## Evaluation

### Quantitative evaluation

We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.

| Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
|-----------------------------------------|----------|----------|----------|----------|----------|
| **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
| llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
| llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
| llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 |
| Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
| **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
| zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
| geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
| mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
| **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** |
| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
| llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
| llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |

On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.

### Qualitative evaluation

In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find to be more reliable.
For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).

### Compute Infrastructure

All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.

## Citation

If you found this useful for your work, kindly cite our paper:

```
@article{meeus2024chocollama,
  title={ChocoLlama: Lessons Learned From Teaching Llamas Dutch},
  author={Meeus, Matthieu and Rath{\'e}, Anthony and Remy, Fran{\c{c}}ois and Delobelle, Pieter and Decorte, Jens-Joris and Demeester, Thomas},
  journal={arXiv preprint arXiv:2412.07633},
  year={2024}
}
```