---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- lora
- peft
- llama-3.2
- iterative-dpo
- self-rewarding
library_name: peft
---

# Iterative DPO Fine-Tune of Llama-3.2-1B (Iteration 2)

This repository contains the LoRA adapters from the **second and final iteration** of a Direct Preference Optimization (DPO) fine-tuning process on the `meta-llama/Llama-3.2-1B-Instruct` model.

This model represents a further refinement of the Iteration 1 model, demonstrating a self-improvement loop in which the model learns from preferences over its own generated outputs. This work was inspired by the "Self-Rewarding Language Models" paper (Yuan et al., 2024).

- **Repository for Iteration 1:** [NilayR/llama32-iterative-dpo-iter1](https://huggingface.co/NilayR/llama32-iterative-dpo-iter1)

## Model Details

### Model Description

This model is the result of the second fine-tuning cycle in an iterative DPO pipeline. The process began with the model from Iteration 1 generating a new set of responses. These responses were then evaluated by an LLM Judge (GPT-3.5-Turbo) to create a fresh preference dataset. This new dataset was used to further fine-tune the model, resulting in the adapters contained in this repository.

The goal of this iteration was to demonstrate that the model could continue to improve its alignment with desired behaviors (accuracy, helpfulness, clarity) using its own outputs as a foundation for learning.

- **Developed by:** NilayR
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** apache-2.0
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct` (with adapters from Iteration 1)

## How to Get Started with the Model

To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository. The example below loads the base model in 4-bit precision with `bitsandbytes` to keep memory usage low; if you have enough GPU memory, the quantization config can be omitted.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-iterative-dpo-iter2"

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)

# --- Generate a response ---
prompt = "What are the key benefits of meditation?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

# Decode only the newly generated tokens (everything after the prompt)
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response.strip())
```
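
If you prefer a standalone checkpoint (for example, for deployment without `peft`), the adapters can also be merged into the base weights. The snippet below is a minimal sketch: merging is not supported on 4-bit quantized weights, so the base model is reloaded in `bfloat16`, and the output directory name is only an example.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the base model in bfloat16 (adapter merging does not work on 4-bit weights)
base_fp = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Apply the adapters, then fold them into the base weights
merged = PeftModel.from_pretrained(base_fp, "NilayR/llama32-iterative-dpo-iter2")
merged = merged.merge_and_unload()

# Save the merged model (directory name is illustrative)
merged.save_pretrained("llama32-dpo-iter2-merged")
tokenizer.save_pretrained("llama32-dpo-iter2-merged")
```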

## Training Details

### Training Data

The model was trained on a preference dataset generated by the **Iteration 1 model** (`NilayR/llama32-iterative-dpo-iter1`).

  * **Data Generation Process:**
    1.  **Response Generation:** The model from Iteration 1 generated candidate responses to 20 instructions from the LIMA dataset.
    2.  **Preference Labeling:** A custom LLM Judge powered by `GPT-3.5-Turbo` compared the candidate responses for each instruction, producing a dataset of **57 chosen/rejected pairs** (a construction sketch follows below).
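
The exact judging prompt and pairing logic are not part of this repository; the snippet below is only an illustration of how judged candidates could be turned into `prompt`/`chosen`/`rejected` records for `DPOTrainer`. The OpenAI client usage, the 1–5 scoring prompt, and the helper names (`judge_score`, `build_preference_pair`) are assumptions.

```python
# Illustrative only: pairing judged candidate responses into DPO records.
# Assumes OPENAI_API_KEY is set; prompt wording and helper names are made up.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Rate the response to the instruction from 1 to 5 for accuracy, "
    "helpfulness, and clarity. Reply with a single number.\n\n"
    "Instruction: {instruction}\n\nResponse: {response}"
)


def judge_score(instruction: str, response: str) -> float:
    """Ask the GPT-3.5-Turbo judge for a numeric score."""
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            instruction=instruction, response=response)}],
        temperature=0.0,
    )
    return float(reply.choices[0].message.content.strip())


def build_preference_pair(instruction: str, candidates: list[str]) -> dict:
    """Keep the best- and worst-scored candidates as a chosen/rejected pair."""
    ranked = sorted(candidates, key=lambda c: judge_score(instruction, c))
    return {"prompt": instruction, "chosen": ranked[-1], "rejected": ranked[0]}
```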

### Training Procedure

The model was trained for one epoch using the TRL library's `DPOTrainer`.

#### Training Hyperparameters

  * **Framework:** `trl.DPOTrainer`
  * **Epochs:** 1
  * **Batch Size:** 1
  * **Gradient Accumulation Steps:** 2 (Effective Batch Size: 2)
  * **Optimizer:** `paged_adamw_8bit`
  * **Learning Rate:** 2e-5
  * **DPO Beta (β):** 0.1
  * **Max Steps:** 50
  * **Final Training Loss:** `0.6343`
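
For reference, a run with the hyperparameters above can be expressed in TRL roughly as follows. This is a sketch, not the original training script: `preference_dataset` stands in for the 57-pair dataset, `model` and `tokenizer` for the 4-bit base model and tokenizer loaded as in the usage example, `lora_config` for the configuration shown in the next section, and argument names differ slightly between TRL releases (older versions take `tokenizer=` instead of `processing_class=` and pass `beta` directly to `DPOTrainer`).

```python
from trl import DPOConfig, DPOTrainer

# Hyperparameters mirroring the list above (names follow recent TRL releases)
dpo_args = DPOConfig(
    output_dir="llama32-dpo-iter2",   # illustrative output path
    num_train_epochs=1,
    max_steps=50,                     # caps the run at 50 optimizer steps
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,    # effective batch size 2
    learning_rate=2e-5,
    optim="paged_adamw_8bit",
    beta=0.1,                         # DPO beta
    bf16=True,                        # assumption: bf16 compute, as in the 4-bit config above
)

trainer = DPOTrainer(
    model=model,                        # 4-bit base model loaded as in the usage example
    args=dpo_args,
    train_dataset=preference_dataset,   # 57 chosen/rejected pairs from Iteration 1
    processing_class=tokenizer,         # `tokenizer=` in older TRL versions
    peft_config=lora_config,            # LoraConfig shown in the next section
)
trainer.train()
```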

#### LoRA Configuration

  * **Rank (`r`):** 16
  * **Alpha (`lora_alpha`):** 32
  * **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`
  * **Dropout:** 0.05
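
Expressed as a `peft.LoraConfig` (a sketch; the `bias` and `task_type` values are assumptions, the rest match the list above):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",            # assumption: bias terms not trained
    task_type="CAUSAL_LM",
)
```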

### Compute Infrastructure

  * **Hardware:** 1x NVIDIA A100 40GB GPU
  * **Cloud Provider:** Google Colab
  * **Software:** `transformers`, `peft`, `trl`, `bitsandbytes`
