---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- lora
- peft
- llama-3.2
- llm-judge
library_name: peft
---
# DPO Fine-Tune of Llama-3.2-1B using an LLM Judge
This repository contains the LoRA adapters for a `meta-llama/Llama-3.2-1B-Instruct` model that has been fine-tuned using Direct Preference Optimization (DPO).
The preference dataset for this training was generated using a custom-built **LLM Judge** powered by GPT-3.5-Turbo. The judge was designed to evaluate pairs of model-generated responses based on a clear set of criteria, creating a high-quality dataset for preference alignment.
- **Preference Dataset:** [NilayR/llm-judge-preferences-llama32](https://huggingface.co/datasets/NilayR/llm-judge-preferences-llama32)
## Model Details
### Model Description
This model is a fine-tuned version of `meta-llama/Llama-3.2-1B-Instruct`. It was trained using DPO on a dataset of 483 preference pairs. These pairs were created by having the base model generate multiple responses to instructions from the LIMA dataset, which were then evaluated and ranked by a GPT-3.5-Turbo-based LLM Judge.
The goal of this fine-tuning was to align the model more closely with human-like preferences for helpfulness, accuracy, and clarity, as defined by the judge's evaluation criteria. This model demonstrated the best performance in a comparative analysis against the base model and a model trained with PairRM data.
- **Developed by:** NilayR
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** apache-2.0
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct`
## How to Get Started with the Model
To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository.
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-dpo-llm-judge"
# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
base_model_id,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token
# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)
# --- Generate a response ---
prompt = "Explain the concept of dark matter and dark energy in simple terms."
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
outputs = model.generate(
input_ids,
max_new_tokens=200,
do_sample=True,
temperature=0.7,
top_p=0.95
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("assistant")[-1].strip())
```
## Training Details
### Training Data
The model was trained on a preference dataset generated using a custom LLM Judge.
* **Data Generation Process:**
1. **Instructions:** 50 instructions were extracted from the LIMA dataset.
2. **Response Generation:** The base `Llama-3.2-1B` model generated 5 diverse responses for each instruction.
3. **Preference Labeling:** A custom LLM Judge powered by `GPT-3.5-Turbo` evaluated all possible pairs of responses for each instruction, resulting in a dataset of **483 chosen/rejected pairs** (a sketch of this judging step follows the list).
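The snippet below is a minimal sketch of how such pairwise judging can be implemented with the OpenAI client. The prompt wording, helper names, and verdict parsing are illustrative assumptions, not the exact code used to build the dataset.

```python
# Illustrative sketch of the pairwise judging step (prompt text and helper
# names are hypothetical, not the exact ones used for this dataset).
from itertools import combinations
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Compare the two responses to the instruction "
    "for helpfulness, accuracy, and clarity, then answer with only 'A' or 'B'."
)

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    """Ask the GPT-3.5-Turbo judge which of two responses is preferred."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Instruction:\n{instruction}\n\n"
                f"Response A:\n{response_a}\n\n"
                f"Response B:\n{response_b}"
            )},
        ],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()

def build_preference_pairs(instruction: str, responses: list[str]) -> list[dict]:
    """Judge every unordered pair of responses and emit chosen/rejected rows."""
    pairs = []
    for resp_a, resp_b in combinations(responses, 2):
        verdict = judge_pair(instruction, resp_a, resp_b)
        chosen, rejected = (resp_a, resp_b) if verdict.startswith("A") else (resp_b, resp_a)
        pairs.append({"prompt": instruction, "chosen": chosen, "rejected": rejected})
    return pairs
```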
### Training Procedure
The model was trained for one epoch using the TRL library's `DPOTrainer`.
#### Training Hyperparameters
* **Framework:** `trl.DPOTrainer`
* **Epochs:** 1
* **Batch Size:** 1
* **Gradient Accumulation Steps:** 4 (Effective Batch Size: 4)
* **Optimizer:** `paged_adamw_8bit`
* **Learning Rate:** 5e-5
* **LR Scheduler:** `cosine` with a warmup ratio of 0.1
* **DPO Beta (β):** 0.1
* **Final Training Loss:** `0.5545`
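These settings map roughly to the following TRL configuration. This is a sketch assuming a recent TRL release in which `DPOConfig` carries the DPO `beta`; the output directory is a placeholder.

```python
from trl import DPOConfig

# Training configuration mirroring the hyperparameters listed above
training_args = DPOConfig(
    output_dir="llama32-dpo-llm-judge",  # placeholder output directory
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,       # effective batch size of 4
    optim="paged_adamw_8bit",
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                            # DPO beta
)
```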
#### LoRA Configuration
* **Rank (`r`):** 16
* **Alpha (`lora_alpha`):** 32
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
* **Dropout:** 0.05
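A corresponding sketch of the adapter configuration and trainer hookup is shown below, reusing `base_model`, `tokenizer`, and `training_args` from the snippets above. The dataset split name and the `processing_class` keyword are assumptions and may need adjusting for your TRL version.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOTrainer

# LoRA configuration matching the values listed above
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Assumes the preference dataset exposes a "train" split
train_dataset = load_dataset("NilayR/llm-judge-preferences-llama32", split="train")

trainer = DPOTrainer(
    model=base_model,            # quantized base model loaded as shown earlier
    args=training_args,          # DPOConfig from the sketch above
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
    peft_config=peft_config,     # adapters are created and trained by the trainer
)
trainer.train()
```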
### Compute Infrastructure
* **Hardware:** 1x NVIDIA A100 40GB GPU
* **Cloud Provider:** Google Colab
* **Software:** `transformers`, `peft`, `trl`, `bitsandbytes`