---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- lora
- peft
- llama-3.2
- llm-judge
library_name: peft
---

# DPO Fine-Tune of Llama-3.2-1B using an LLM Judge

This repository contains the LoRA adapters for a `meta-llama/Llama-3.2-1B-Instruct` model fine-tuned with Direct Preference Optimization (DPO).

The preference dataset for this training was generated with a custom-built **LLM Judge** powered by GPT-3.5-Turbo. The judge evaluated pairs of model-generated responses against a fixed set of criteria, producing a high-quality dataset for preference alignment.

- **Preference Dataset:** [NilayR/llm-judge-preferences-llama32](https://huggingface.co/datasets/NilayR/llm-judge-preferences-llama32)

## Model Details

### Model Description

This model is a fine-tuned version of `meta-llama/Llama-3.2-1B-Instruct`. It was trained with DPO on a dataset of 483 preference pairs. These pairs were created by having the base model generate multiple responses to instructions from the LIMA dataset, which were then evaluated and ranked by a GPT-3.5-Turbo-based LLM Judge.

The goal of this fine-tuning was to align the model more closely with human-like preferences for helpfulness, accuracy, and clarity, as defined by the judge's evaluation criteria. In a comparative analysis, this model outperformed both the base model and a model trained on PairRM-generated preferences.

- **Developed by:** NilayR
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** apache-2.0
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct`

## How to Get Started with the Model

To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-dpo-llm-judge"

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)

# --- Generate a response ---
prompt = "Explain the concept of dark matter and dark energy in simple terms."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

# Strip the prompt portion and keep only the assistant's reply
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("assistant")[-1].strip())
```

## Training Details

### Training Data

The model was trained on a preference dataset generated using a custom LLM Judge.

- **Data Generation Process:**
  1. **Instructions:** 50 instructions were extracted from the LIMA dataset.
  2. **Response Generation:** The base `Llama-3.2-1B` model generated 5 diverse responses for each instruction.
  3. **Preference Labeling:** A custom LLM Judge powered by `GPT-3.5-Turbo` evaluated all possible pairs of responses for each instruction, resulting in a dataset of **483 chosen/rejected pairs** (an illustrative sketch of this judging step follows this list).
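The exact judge prompt and scoring logic are not published in this repository. The snippet below is only a minimal sketch of how such pairwise judging with GPT-3.5-Turbo could be implemented via the `openai` client; the prompt wording, the `judge_pair` helper, and the criteria phrasing are assumptions, not the actual pipeline.

```python
# Illustrative only -- the real judge prompt and parsing logic are not included in this repo.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_SYSTEM_PROMPT = (
    "You are an impartial judge. Compare two responses to the same instruction "
    "on helpfulness, accuracy, and clarity. Answer with 'A' or 'B' only."
)

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    """Hypothetical helper: asks GPT-3.5-Turbo which of two responses is better."""
    user_prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is better? Answer 'A' or 'B'."
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return completion.choices[0].message.content.strip()

# The 5 responses per instruction give up to 10 unordered pairs; each verdict
# becomes one chosen/rejected row in the preference dataset.
```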
### Training Procedure

The model was trained for one epoch using the TRL library's `DPOTrainer` (a minimal configuration sketch appears at the end of this card).

#### Training Hyperparameters

- **Framework:** `trl.DPOTrainer`
- **Epochs:** 1
- **Batch Size:** 1
- **Gradient Accumulation Steps:** 4 (effective batch size: 4)
- **Optimizer:** `paged_adamw_8bit`
- **Learning Rate:** 5e-5
- **LR Scheduler:** `cosine` with a warmup ratio of 0.1
- **DPO Beta (β):** 0.1
- **Final Training Loss:** `0.5545`

#### LoRA Configuration

- **Rank (`r`):** 16
- **Alpha (`lora_alpha`):** 32
- **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Dropout:** 0.05

### Compute Infrastructure

- **Hardware:** 1x NVIDIA A100 40GB GPU
- **Cloud Provider:** Google Colab
- **Software:** `transformers`, `peft`, `trl`, `bitsandbytes`
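For reference, below is a minimal sketch of how the hyperparameters and LoRA settings above could map onto `trl` and `peft`. It is not the actual training script: the dataset column names (`prompt`/`chosen`/`rejected`), the `train` split, `output_dir`, and `bf16` are assumptions, and argument names vary across `trl` versions (e.g. `processing_class` vs. `tokenizer`).

```python
# Illustrative sketch assuming a recent trl release with DPOConfig; not the original training script.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Preference pairs produced by the LLM judge (prompt/chosen/rejected columns assumed)
dataset = load_dataset("NilayR/llm-judge-preferences-llama32", split="train")

model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration from the card above
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Training hyperparameters from the card above
training_args = DPOConfig(
    output_dir="llama32-dpo-llm-judge",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="paged_adamw_8bit",
    beta=0.1,
    bf16=True,
)

# With a peft_config, DPOTrainer builds the frozen reference model internally.
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older trl releases
    peft_config=peft_config,
)
trainer.train()
```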