---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- lora
- peft
- llama-3.2
- llm-judge
library_name: peft
---

# DPO Fine-Tune of Llama-3.2-1B using an LLM Judge

This repository contains the LoRA adapters for a `meta-llama/Llama-3.2-1B-Instruct` model fine-tuned with Direct Preference Optimization (DPO).

The preference dataset for this training was generated with a custom-built **LLM Judge** powered by GPT-3.5-Turbo. The judge evaluated pairs of model-generated responses against a fixed set of criteria, producing a high-quality dataset for preference alignment.

- **Preference Dataset:** [NilayR/llm-judge-preferences-llama32](https://huggingface.co/datasets/NilayR/llm-judge-preferences-llama32)

## Model Details

### Model Description

This model is a fine-tuned version of `meta-llama/Llama-3.2-1B-Instruct`. It was trained with DPO on a dataset of 483 preference pairs. These pairs were created by having the base model generate multiple responses to instructions from the LIMA dataset, which were then evaluated and ranked by a GPT-3.5-Turbo-based LLM Judge.

The goal of this fine-tuning was to align the model more closely with human-like preferences for helpfulness, accuracy, and clarity, as defined by the judge's evaluation criteria. In a comparative analysis, this model outperformed both the base model and a model trained on PairRM-generated preferences.

- **Developed by:** NilayR
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** apache-2.0
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct`

## How to Get Started with the Model

To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-dpo-llm-judge"

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)

# --- Generate a response ---
prompt = "Explain the concept of dark matter and dark energy in simple terms."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

# Strip the prompt portion and keep only the assistant's reply
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("assistant")[-1].strip())
```

## Training Details

### Training Data

The model was trained on a preference dataset generated using a custom LLM Judge.

- **Data Generation Process:**
  1. **Instructions:** 50 instructions were extracted from the LIMA dataset.
  2. **Response Generation:** The base `Llama-3.2-1B` model generated 5 diverse responses for each instruction.
  3. **Preference Labeling:** A custom LLM Judge powered by `GPT-3.5-Turbo` evaluated all possible pairs of responses for each instruction, resulting in a dataset of **483 chosen/rejected pairs** (an illustrative sketch of this judging step follows this list).
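The exact judge prompt and scoring logic are not published in this repository. The snippet below is only a minimal sketch of how such pairwise judging with GPT-3.5-Turbo could be implemented via the `openai` client; the prompt wording, the `judge_pair` helper, and the criteria phrasing are assumptions, not the actual pipeline.

```python
# Illustrative only -- the real judge prompt and parsing logic are not included in this repo.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_SYSTEM_PROMPT = (
    "You are an impartial judge. Compare two responses to the same instruction "
    "on helpfulness, accuracy, and clarity. Answer with 'A' or 'B' only."
)

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    """Hypothetical helper: asks GPT-3.5-Turbo which of two responses is better."""
    user_prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is better? Answer 'A' or 'B'."
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return completion.choices[0].message.content.strip()

# The 5 responses per instruction give up to 10 unordered pairs; each verdict
# becomes one chosen/rejected row in the preference dataset.
```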
### Training Procedure

The model was trained for one epoch using the TRL library's `DPOTrainer` (a minimal configuration sketch appears at the end of this card).

#### Training Hyperparameters

- **Framework:** `trl.DPOTrainer`
- **Epochs:** 1
- **Batch Size:** 1
- **Gradient Accumulation Steps:** 4 (effective batch size: 4)
- **Optimizer:** `paged_adamw_8bit`
- **Learning Rate:** 5e-5
- **LR Scheduler:** `cosine` with a warmup ratio of 0.1
- **DPO Beta (β):** 0.1
- **Final Training Loss:** `0.5545`

#### LoRA Configuration

- **Rank (`r`):** 16
- **Alpha (`lora_alpha`):** 32
- **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Dropout:** 0.05

### Compute Infrastructure

- **Hardware:** 1x NVIDIA A100 40GB GPU
- **Cloud Provider:** Google Colab
- **Software:** `transformers`, `peft`, `trl`, `bitsandbytes`
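For reference, below is a minimal sketch of how the hyperparameters and LoRA settings above could map onto `trl` and `peft`. It is not the actual training script: the dataset column names (`prompt`/`chosen`/`rejected`), the `train` split, `output_dir`, and `bf16` are assumptions, and argument names vary across `trl` versions (e.g. `processing_class` vs. `tokenizer`).

```python
# Illustrative sketch assuming a recent trl release with DPOConfig; not the original training script.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Preference pairs produced by the LLM judge (prompt/chosen/rejected columns assumed)
dataset = load_dataset("NilayR/llm-judge-preferences-llama32", split="train")

model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration from the card above
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Training hyperparameters from the card above
training_args = DPOConfig(
    output_dir="llama32-dpo-llm-judge",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="paged_adamw_8bit",
    beta=0.1,
    bf16=True,
)

# With a peft_config, DPOTrainer builds the frozen reference model internally.
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older trl releases
    peft_config=peft_config,
)
trainer.train()
```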