loghugging25 committed · verified · Commit f70e885 · Parent: e577553

Update README.md

Files changed (1): README.md (+82 −52)
README.md CHANGED
@@ -7,52 +7,79 @@ base_model:
  pipeline_tag: question-answering
  tags:
  - Connect-Transport
- - ConnectTransport
  - Connect
  - chatbot
  library_name: transformers
  ---

  # Model Card for logicsct-mistral-nemo-instruct
 
- logicsct-mistral-nemo-instruct is a QLoRA 4-bit finetuning of [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407).
-
- ## Model usage
- We are currently evaluating and training models to be a support chatbot for [**Connect-Transport**](https://www.logics-connect.de), a transport management system from Logics Software GmbH.
-
- ## Finding a good base model - speaking German and following instructions well enough
- We have evaluated over 70 models on basic tech instruction tasks in German. The evaluation was done manually by checking the answers to the following questions:
- 1. Wie kann ich in Chrome machen dass meine Downloads immer am gleichen Ort gespeichert werden?
- 2. Wie kann ich in Outlook meine Mail Signatur anpassen und einen Link und Bild dort einfügen?
-
- The best models according to our subjective scale from 1 (bad) to 5 (very good):
- - 5 star rating:
-   - Big proprietary models like OpenAI o1, OpenAI GPT-4o, OpenAI o1-mini
-   - Huge models: [deepseek-ai/DeepSeek-R1 (685B)](https://huggingface.co/deepseek-ai/DeepSeek-R1), [deepseek-ai/DeepSeek-V3 (685B)](https://huggingface.co/deepseek-ai/DeepSeek-V3), [mistralai/Mistral-Large-Instruct-2411 (123B)](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)
-   - Large models: [Nexusflow/Athene-V2-Chat (72.7B)](https://huggingface.co/Nexusflow/Athene-V2-Chat), [nvidia/Llama-3.1-Nemotron-70B-Instruct (70.6B)](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct)
- - 4 star rating:
-   - Huge models: [mistralai/Mixtral-8x22B-Instruct-v0.1 (141B)](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1), [alpindale/WizardLM-2-8x22B (141B)](https://huggingface.co/alpindale/WizardLM-2-8x22B) and [CohereForAI/c4ai-command-r-plus-08-2024 (104B)](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024)
-   - Large models: [meta-llama/Llama-3.3-70B-Instruct (70.6B)](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) and [NousResearch/Hermes-3-Llama-3.1-70B (70.6B)](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B)
-   - Big models: [mistralai/Mixtral-8x7B-Instruct-v0.1 (46.7B)](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
-   - Medium-big models: [google/gemma-2-27b (27.2B)](https://huggingface.co/google/gemma-2-27b) and [mistralai/Mistral-Small-Instruct-2409 (22.2B)](https://huggingface.co/mistralai/Mistral-Small-Instruct-2409)
- - **Small-sized models (main focus currently)**:
-   - [mistralai/Mistral-Nemo-Instruct-2407 (12.2B)](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
-   - [microsoft/phi-4](https://huggingface.co/microsoft/phi-4)
- - 3 stars and lower: not listed here. We have tested dozens of <20B and <10B models, but most do not understand or speak German well enough, or do not perform well enough when answering support-chatbot tech questions.
- - Some models also have smaller versions that aren't listed above; those smaller versions did not perform well enough for a 4+ rating.
- - Furthermore, some models like Hermes 3 have bigger versions available that aren't listed; we were not impressed by their performance-per-model-size ratio and thus not particularly interested in testing their huge 405B versions.
- - We mainly focus on <20B models and compare their performance with some of the bigger models, too.
-
- ## How we fine-tune our base model
- - Because of our small training dataset and GPU VRAM constraints, we use QLoRA fine-tuning only.
- - After trying out our own scripts, we finally settled on https://github.com/hiyouga/LLaMA-Factory, which fits our needs in terms of easy training, inference, and export functionality for a big set of models.
-
- ### Training data
- - Our training data currently consists of about **220 prompt-response pairs**.
- - We have built a webapp for our employees to enter training data, with gamification in the form of a daily and weekly high-score system. The webapp is furthermore connected to a selection of current evaluation models to see how the models answer both prompts within their training data and prompts outside of it.
-
- ### QLoRA settings
- Full settings of `logicsct_train_Mistral_Nemo_qlora_sft_otfq.yaml`:
  ```
  ### model
  model_name_or_path: mistralai/Mistral-Nemo-Instruct-2407
@@ -99,8 +126,8 @@ eval_strategy: steps # or "epoch" if you prefer evaluating at the end of each e
  eval_steps: 500 # adjust this if needed (e.g., if you use "steps", it determines evaluation frequency)
  ```

- ### Training, inference, and export
- Following https://github.com/hiyouga/LLaMA-Factory?tab=readme-ov-file#quickstart:

  ```
  llamafactory-cli train logicsct_train_Mistral_Nemo_qlora_sft_otfq.yaml # VRAM used: 10099MiB for 4 bit QLoRA training
@@ -110,14 +137,17 @@ llamafactory-cli export logicsct_export_Mistral_Nemo_qlora_sft_Q4.yaml # V
  llamafactory-cli chat logicsct_inference_Mistral_Nemo_qlora_sft_otfq_Q4.yaml # VRAM used: 8541MiB-9569MiB VRAM for inference of the 4bit quant merged model (increasing with increasing context length)
  ```

- ### Comparison of open source training/models with OpenAI proprietary finetuning
- - We have finetuned both OpenAI GPT-4o and GPT-4o-mini and compared their performance to our best small-sized models.
- - After some initial runs with very unsatisfying results, we needed to adjust the hyperparameters a lot, and mainly continued experimenting with GPT-4o-mini.
- - With our current training data, it seems like both GPT-4o and GPT-4o-mini need 5 epochs with the default learning rate, and the training loss ends pretty close to 0; with fewer epochs the models seem not to learn enough, maybe because of our small training dataset.
- - Unusable overfitting occurs at about 7 epochs for both models.
- - Best settings so far: 5 epochs, batch size of 3, automatic learning rate.
- - But currently our small-sized open-source models perform about equal to or even better than such a finetuning of GPT-4o-mini.
- - We will continue further testing with OpenAI finetuning once we have a larger training dataset.
-
- ## Next steps
- Number one priority is currently collecting more training data.
 
 
 
 
  pipeline_tag: question-answering
  tags:
  - Connect-Transport
+ - Connect Transport
  - Connect
+ - Logics Software
+ - KI-Chatbot Kundenservice
+ - KI Chatbot
+ - Deutscher Chatbot
+ - Deutscher KI Chatbot
+ - KI-Chatbot Deutsch
+ - KI-Chatbots für Unternehmen
+ - German chat bot
+ - German support chatbot
+ - German AI chatbot
  - chatbot
  library_name: transformers
  ---

  # Model Card for logicsct-mistral-nemo-instruct
+ **logicsct-mistral-nemo-instruct** is a QLoRA 4-bit fine-tuned version of [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407). This model has been adapted with domain-specific knowledge to serve as a support chatbot for [**Connect-Transport**](https://www.logics-connect.de), our transport management system developed at Logics Software GmbH.

+ While tailored for our internal use, the training principles and techniques we employed can also be applied by others interested in developing their own chatbot assistants.
+
+ We are continuously evaluating and refining our models to enhance the performance of our support chatbot for Connect-Transport.
+
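Once the QLoRA adapter has been merged and exported (see the export step further below), the result can be loaded like any Mistral-Nemo-style chat model. A minimal inference sketch with transformers; the repository id and dtype choice here are assumptions, not necessarily the exact published artifact:

```
# Minimal sketch (assumptions: hypothetical repo id, bf16 weights; adjust to the actual export).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "loghugging25/logicsct-mistral-nemo-instruct"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Wie kann ich eine Tour umbenennen?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
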
+ ## Finding a Good Base Model – Proficient in German and Following Instructions
+ We have evaluated over 70 models for basic technical instruction tasks in German. The evaluation was carried out manually by reviewing the responses to the following questions:
+
+ - Wie kann ich in Chrome machen dass meine Downloads immer am gleichen Ort gespeichert werden?
+ - Wie kann ich in Outlook meine Mail Signatur anpassen und einen Link und Bild dort einfügen?
+
+ The best models according to our subjective rating scale (1 = poor, 5 = excellent) are:
+
+ **5-Star Rating**:
+ - Big proprietary models such as OpenAI o1, OpenAI GPT-4o, and OpenAI o1-mini
+ - Huge models: [deepseek-ai/DeepSeek-R1 (685B)](https://huggingface.co/deepseek-ai/DeepSeek-R1), [deepseek-ai/DeepSeek-V3 (685B)](https://huggingface.co/deepseek-ai/DeepSeek-V3), [mistralai/Mistral-Large-Instruct-2411 (123B)](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)
+ - Large models: [Nexusflow/Athene-V2-Chat (72.7B)](https://huggingface.co/Nexusflow/Athene-V2-Chat), [nvidia/Llama-3.1-Nemotron-70B-Instruct (70.6B)](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct)
+
+ **4-Star Rating**:
+ - Huge models: [mistralai/Mixtral-8x22B-Instruct-v0.1 (141B)](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1), [alpindale/WizardLM-2-8x22B (141B)](https://huggingface.co/alpindale/WizardLM-2-8x22B) and [CohereForAI/c4ai-command-r-plus-08-2024 (104B)](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024)
+ - Large models: [meta-llama/Llama-3.3-70B-Instruct (70.6B)](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) and [NousResearch/Hermes-3-Llama-3.1-70B (70.6B)](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B)
+ - Big models: [mistralai/Mixtral-8x7B-Instruct-v0.1 (46.7B)](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
+ - Medium-sized models: [google/gemma-2-27b (27.2B)](https://huggingface.co/google/gemma-2-27b) and [mistralai/Mistral-Small-Instruct-2409 (22.2B)](https://huggingface.co/mistralai/Mistral-Small-Instruct-2409)
+
+ **Small-Sized Models (Current Main Focus)**:
+ - [mistralai/Mistral-Nemo-Instruct-2407 (12.2B)](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
+ - [microsoft/phi-4](https://huggingface.co/microsoft/phi-4)
+
+ Models rated 3 stars or lower are not listed here. We have tested dozens of models under 20B (and under 10B) parameters, but most do not understand or speak German well enough, or do not perform adequately when answering support-chatbot technical questions.
+
+ Some models also have smaller versions that are not listed above because they did not achieve a 4+ rating. Additionally, some models (e.g., Hermes 3) have larger versions available that are not included, as their performance relative to model size was not impressive, making their massive 405B versions less interesting for our purposes.
+
+ Given our goal of training, exporting, and running inference on our dedicated server hardware, we primarily focus on models with fewer than 20B parameters while comparing their performance with that of some larger models.
+
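For repeatability, the candidate answers to the two test questions can be collected with a small script and then rated by hand. A minimal sketch, assuming the transformers text-generation pipeline; the candidate list, generation settings, and output file are illustrative and not part of our actual workflow:

```
# Hedged sketch: collect answers from candidate models to the two German test
# questions for manual 1-5 rating. Model list and settings are illustrative only.
from transformers import pipeline

QUESTIONS = [
    "Wie kann ich in Chrome machen dass meine Downloads immer am gleichen Ort gespeichert werden?",
    "Wie kann ich in Outlook meine Mail Signatur anpassen und einen Link und Bild dort einfügen?",
]
CANDIDATES = ["mistralai/Mistral-Nemo-Instruct-2407", "microsoft/phi-4"]

with open("candidate_answers.md", "w", encoding="utf-8") as out:
    for model_id in CANDIDATES:
        generator = pipeline("text-generation", model=model_id, device_map="auto")
        for question in QUESTIONS:
            result = generator([{"role": "user", "content": question}], max_new_tokens=512)
            answer = result[0]["generated_text"][-1]["content"]  # last message = assistant reply
            out.write(f"## {model_id}\n\n**{question}**\n\n{answer}\n\n")
```
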
+ ## How We Fine-Tune Our Base Model
+ Due to our small training dataset and GPU VRAM constraints, we exclusively use QLoRA fine-tuning. After testing our own scripts, we ultimately settled on [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), which meets our needs for straightforward training, inference, and export functionality across a wide range of models.
+
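Conceptually, QLoRA keeps the base model frozen in 4-bit (NF4) precision and trains only small LoRA adapter matrices on top of it. The following is an illustrative sketch of that idea with peft and bitsandbytes; the LoRA hyperparameters shown are assumptions, and our actual training is driven entirely by the LLaMA-Factory YAML config below:

```
# Illustrative QLoRA setup (not our actual training script; LLaMA-Factory handles this for us).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # keep the frozen base weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values, not taken from our YAML
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()           # only the small adapter matrices are trainable
```
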
+ ### Training Data
+ Our training data currently consists of approximately **220 prompt-response pairs** in a carefully curated dataset.
+
+ To collect this training data, we developed an internal web app for our employees. In this app:
+ - Employees enter both a prompt and a response; each pair forms a single training record.
+ - Gamification is implemented via daily and weekly high-score systems that highlight the employee who contributes the most training data.
+ - The web app is also connected to a selection of current evaluation models to facilitate internal testing of how the models respond both to prompts from the training data and to new prompts.
+
+ Here are three example records from our training data:
+ > Prompt: "Wie kann ich eine Tour umbenennen?"
+ > <br>Response: "1. Öffnen Sie die Stoppliste der Tour und klicken Sie auf das Tourmenü (graues Zahnradsymbol ⚙️) rechts oben in der Stoppliste.\n2. Wählen Sie im Tourmenü \"Tour umbenennen\".\n3. Folgen Sie der Anweisung im Dialog und klicken Sie auf \"OK\".\nBeachten Sie: Fahrer können die Tour nicht umbenennen, sondern nur ein Disponent (Admin Account)."
+
+ > Prompt: "Wie werden Daten in Connect-Transport importiert?"
+ > <br>Response: "Der Flex Importer liest Auftragsdaten aus Excel-Dateien ein. Alternativ können Daten per Schnittstelle direkt aus ERP- oder Warenwirtschaftssystemen importiert werden."
+
+ > Prompt: "Gibt es eine Möglichkeit zur Sendungsverfolgung für Endkunden?"
+ > <br>Response: "Ja, mit dem Avisierungs-Modul können Disponenten per SMS oder E-Mail voraussichtliche Lieferzeiten an Empfänger senden, die ihre Sendungen live verfolgen können."
+
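On disk, such records can be stored in LLaMA-Factory's alpaca-style dataset format and registered in `data/dataset_info.json`. A minimal sketch; the file name and dataset key are hypothetical, not our actual layout:

```
# Hypothetical dataset layout for LLaMA-Factory's alpaca format
# (field names follow the documented format; file/dataset names are assumptions).
import json

records = [
    {
        "instruction": "Wie kann ich eine Tour umbenennen?",
        "input": "",
        "output": "1. Öffnen Sie die Stoppliste der Tour und klicken Sie auf das Tourmenü ...",
    },
    {
        "instruction": "Wie werden Daten in Connect-Transport importiert?",
        "input": "",
        "output": "Der Flex Importer liest Auftragsdaten aus Excel-Dateien ein. ...",
    },
]

with open("data/logicsct_support.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Then register the file in data/dataset_info.json, for example:
# "logicsct_support": { "file_name": "logicsct_support.json" }
```
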
+ ### QLoRA Settings
+ Full settings for `logicsct_train_Mistral_Nemo_qlora_sft_otfq.yaml`:
  ```
  ### model
  model_name_or_path: mistralai/Mistral-Nemo-Instruct-2407
 
  eval_steps: 500 # adjust this if needed (e.g., if you use "steps", it determines evaluation frequency)
  ```

+ ### Training, Inference, and Export
+ We follow the instructions provided in the [LLaMA-Factory Quickstart Guide](https://github.com/hiyouga/LLaMA-Factory?tab=readme-ov-file#quickstart):

  ```
  llamafactory-cli train logicsct_train_Mistral_Nemo_qlora_sft_otfq.yaml # VRAM used: 10099MiB for 4 bit QLoRA training

  llamafactory-cli chat logicsct_inference_Mistral_Nemo_qlora_sft_otfq_Q4.yaml # VRAM used: 8541MiB-9569MiB VRAM for inference of the 4bit quant merged model (increasing with increasing context length)
  ```

+ ### Comparison of Open Source Training/Models with OpenAI Proprietary Fine-Tuning
+ We have fine-tuned both OpenAI GPT-4o and GPT-4o-mini and compared their performance to that of our best small-sized models. After some initial runs with unsatisfactory results, we significantly adjusted the hyperparameters and focused primarily on experimenting with GPT-4o-mini.
+
+ With our current training data, both GPT-4o and GPT-4o-mini appear to require 5 epochs using the default learning rate, with the training loss approaching zero. With fewer epochs, however, the models seem not to learn sufficiently, perhaps due to the small size of our training dataset. Significant overfitting occurs at approximately 7 epochs for both models.
+
+ Our best settings so far are:
+ - Epochs: 5
+ - Batch Size: 3
+ - Learning Rate: Automatically determined
+
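For reference, a fine-tuning job with these settings can be launched via the OpenAI Python SDK roughly as follows (a hedged sketch: the JSONL file name and the exact model snapshot are placeholders, not our actual run):

```
# Hedged sketch of launching an OpenAI fine-tuning job with the settings above.
from openai import OpenAI

client = OpenAI()

# 1. Upload the training data (chat-formatted JSONL).
training_file = client.files.create(
    file=open("logicsct_support_chat.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Create the fine-tuning job with 5 epochs and batch size 3.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 5, "batch_size": 3},
)
print(job.id, job.status)
```
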
+ Currently, our small-sized open-source models perform comparably to or even better than the fine-tuned GPT-4o-mini. We will continue testing with OpenAI fine-tuning once we have a larger training dataset.
+
+ ## Next Steps
+ Our top priority at the moment is to collect more training data.