---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- lora
- peft
- llama-3.2
- llm-judge
library_name: peft
---

# DPO Fine-Tune of Llama-3.2-1B using an LLM Judge

This repository contains the LoRA adapters for a `meta-llama/Llama-3.2-1B-Instruct` model that has been fine-tuned using Direct Preference Optimization (DPO).

The preference dataset for this training was generated using a custom-built **LLM Judge** powered by GPT-3.5-Turbo. The judge was designed to evaluate pairs of model-generated responses based on a clear set of criteria, creating a high-quality dataset for preference alignment.

- **Preference Dataset:** [NilayR/llm-judge-preferences-llama32](https://huggingface.co/datasets/NilayR/llm-judge-preferences-llama32)
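
The preference data itself can be inspected directly from the Hub. The split name and the `prompt`/`chosen`/`rejected` column layout below are assumptions based on the standard DPO format; check the dataset card for the exact schema.

```python
from datasets import load_dataset

# Load the preference dataset used for DPO training.
# The "train" split and column names ("prompt", "chosen", "rejected")
# are assumed here; see the dataset card for the exact schema.
dataset = load_dataset("NilayR/llm-judge-preferences-llama32", split="train")
print(dataset)
print(dataset[0])
```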

## Model Details

### Model Description

This model is a fine-tuned version of `meta-llama/Llama-3.2-1B-Instruct`. It was trained using DPO on a dataset of 483 preference pairs. These pairs were created by having the base model generate multiple responses to instructions from the LIMA dataset, which were then evaluated and ranked by a GPT-3.5-Turbo-based LLM Judge.

The goal of this fine-tuning was to align the model more closely with human-like preferences for helpfulness, accuracy, and clarity, as defined by the judge's evaluation criteria. This model demonstrated the best performance in a comparative analysis against the base model and a model trained with PairRM data.

- **Developed by:** NilayR
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** apache-2.0
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct`

## How to Get Started with the Model

To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-dpo-llm-judge"

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)

# --- Generate a response ---
prompt = "Explain the concept of dark matter and dark energy in simple terms."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("assistant")[-1].strip())
```

## Training Details

### Training Data

The model was trained on a preference dataset generated using a custom LLM Judge.

  * **Data Generation Process:**
    1.  **Instructions:** 50 instructions were extracted from the LIMA dataset.
    2.  **Response Generation:** The base `Llama-3.2-1B` model generated 5 diverse responses for each instruction.
    3.  **Preference Labeling:** A custom LLM Judge powered by `GPT-3.5-Turbo` evaluated all possible pairs of responses for each instruction, resulting in a dataset of **483 chosen/rejected pairs** (a minimal sketch of this judging step is shown below).
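
The sketch below illustrates how such a pairwise judging step can be implemented with the OpenAI API. The system prompt, helper names, and verdict parsing are illustrative assumptions; the actual prompt and evaluation criteria used to build this dataset are not reproduced here.

```python
from itertools import combinations
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Illustrative judge prompt; the actual criteria used for this dataset differ.
JUDGE_SYSTEM = (
    "You are an impartial judge. Given an instruction and two candidate "
    "responses, answer with 'A' or 'B' to indicate which response is more "
    "helpful, accurate, and clear."
)

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    """Ask GPT-3.5-Turbo which of two responses is preferred ('A' or 'B')."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": (
                f"Instruction:\n{instruction}\n\n"
                f"Response A:\n{response_a}\n\n"
                f"Response B:\n{response_b}"
            )},
        ],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()

def build_preference_pairs(instruction: str, responses: list[str]) -> list[dict]:
    """Judge every pair of generated responses and emit chosen/rejected rows."""
    pairs = []
    for a, b in combinations(responses, 2):
        verdict = judge_pair(instruction, a, b)
        chosen, rejected = (a, b) if verdict.startswith("A") else (b, a)
        pairs.append({"prompt": instruction, "chosen": chosen, "rejected": rejected})
    return pairs
```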

### Training Procedure

The model was trained for one epoch using the TRL library's `DPOTrainer`. Configuration sketches follow the hyperparameter and LoRA listings below.

#### Training Hyperparameters

  * **Framework:** `trl.DPOTrainer`
  * **Epochs:** 1
  * **Batch Size:** 1
  * **Gradient Accumulation Steps:** 4 (Effective Batch Size: 4)
  * **Optimizer:** `paged_adamw_8bit`
  * **Learning Rate:** 5e-5
  * **LR Scheduler:** `cosine` with a warmup ratio of 0.1
  * **DPO Beta (β):** 0.1
  * **Final Training Loss:** `0.5545`
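
For reference, a minimal sketch of how these hyperparameters map onto `DPOConfig` from recent versions of `trl`. Argument names can vary slightly across `trl` versions, so treat this as an approximation rather than the exact training script.

```python
from trl import DPOConfig

# Training arguments mirroring the hyperparameters listed above
# (names may differ slightly between trl versions).
dpo_args = DPOConfig(
    output_dir="llama32-dpo-llm-judge",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size of 4
    optim="paged_adamw_8bit",
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                        # DPO beta
)
```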

#### LoRA Configuration

  * **Rank (`r`):** 16
  * **Alpha (`lora_alpha`):** 32
  * **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
  * **Dropout:** 0.05
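
These settings correspond to a `peft.LoraConfig` like the one sketched below. Combined with the `DPOConfig` above and the base model, tokenizer, and preference dataset loaded earlier, this is roughly what the training run would look like; again an approximate sketch under those assumptions, not the exact training script.

```python
from peft import LoraConfig
from trl import DPOTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model=base_model,            # quantized base model from the usage example
    args=dpo_args,               # DPOConfig from the previous subsection
    train_dataset=dataset,       # prompt/chosen/rejected preference pairs
    processing_class=tokenizer,  # older trl versions use tokenizer= instead
    peft_config=lora_config,
)
trainer.train()
```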

### Compute Infrastructure

  * **Hardware:** 1x NVIDIA A100 40GB GPU
  * **Cloud Provider:** Google Colab
  * **Software:** `transformers`, `peft`, `trl`, `bitsandbytes`
