---
license: mit
base_model: Qwen/Qwen2.5-Coder-3B
datasets:
- GPUMODE/KernelBook
tags:
- qwen2
- code-generation
- triton
- pytorch
- kernel-generation
- kernelbook
- lora
- finetune
---

# Qwen2.5-Coder-3B-KernelBook: Fine-tuned for PyTorch to Triton Kernel Generation

This repository contains a fine-tuned version of the **[Qwen/Qwen2.5-Coder-3B](https://huggingface.co/Qwen/Qwen2.5-Coder-3B)** model, specialized for transpiling PyTorch `nn.Module` code into high-performance Triton kernels.

The model was trained on the **[GPUMODE/KernelBook](https://huggingface.co/datasets/GPUMODE/KernelBook)** dataset, which contains thousands of pairs of equivalent PyTorch and Triton code snippets generated by `torch.compile`. This fine-tuning enables the model to understand the patterns of PyTorch operations and translate them into efficient, fused GPU kernels written in the Triton language.
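
If you are curious how such pairs are produced, `torch.compile`'s Inductor backend emits Triton kernels for CUDA workloads and can dump them for inspection. The snippet below is an illustrative sketch only, not part of this repository: it assumes a CUDA GPU, and the exact debug-dump location can vary across PyTorch versions.

```python
# Illustrative: see the Triton code Inductor generates for a small module.
import os
os.environ["TORCH_COMPILE_DEBUG"] = "1"  # ask Inductor to dump generated code under ./torch_compile_debug/

import torch
import torch.nn as nn

class SumAggregator(nn.Module):
    def forward(self, neighbor):
        return torch.sum(neighbor, dim=1)

model = SumAggregator().cuda()
compiled = torch.compile(model)                        # Inductor emits Triton kernels for CUDA inputs
_ = compiled(torch.randn(8, 16, 32, device="cuda"))    # first call triggers compilation
# The generated Triton source lands in torch_compile_debug/*/output_code.py
```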

This model was fine-tuned as part of a demonstration of an end-to-end workflow: from dataset preparation and model training to benchmarking with the official `KernelBench` framework.

## Model Details

- **Base Model:** `Qwen/Qwen2.5-Coder-3B`
- **Fine-tuning Dataset:** `GPUMODE/KernelBook`
- **Method:** Low-Rank Adaptation (LoRA)
- **Framework:** PyTorch 2.5.0, Transformers, PEFT, TRL

### Training Summary

The model was trained for **1 full epoch** on the `GPUMODE/KernelBook` dataset (18,162 examples) and converged to the metrics below.

- **Final Training Loss:** **`0.0922`**
- **Final Mean Token Accuracy:** **`98.34%`**
- **Training Runtime:** `5818.25 seconds` (approx. 1 hour 37 minutes)
- **Hardware:** 1x NVIDIA H100 80GB

**Key Training Hyperparameters:**
- `learning_rate`: 2e-4
- `per_device_train_batch_size`: 1
- `gradient_accumulation_steps`: 8 (effective batch size of 8)
- `max_seq_length`: 4096
- `optimizer`: adamw_torch_fused
- `precision`: bfloat16

For a detailed view of the training progress, you can visit the [Weights & Biases run page](https://wandb.ai/tarunreddi-university-at-buffalo/huggingface/runs/ew21hn3w).
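
For reference, a roughly equivalent training setup can be expressed with PEFT and TRL. The sketch below is illustrative, not the exact script used for this run: the LoRA rank/alpha and target modules, the KernelBook column names used in `to_text()`, and the `train` split are assumptions, and some argument names differ between `trl` releases.

```python
# Illustrative LoRA + TRL SFT setup consistent with the hyperparameters listed above.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("GPUMODE/KernelBook", split="train")

def to_text(example):
    # Builds the same prompt layout shown in "How to Use" below.
    # The column names here are assumptions; check the dataset schema first.
    return {
        "text": (
            "### INSTRUCTION\nGenerate the Triton code for the following Python code.\n\n"
            f"### PYTHON CODE:\n{example['python_code']}\n\n"
            f"### TRITON CODE:\n{example['triton_code']}"
        )
    }

dataset = dataset.map(to_text)

peft_config = LoraConfig(
    r=16,                                                      # illustrative LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # illustrative target modules
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="qwen2.5-coder-3b-kernelbook-lora",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    bf16=True,
    max_seq_length=4096,  # may be named max_length in newer trl releases
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-3B",
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```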

## How to Use

The model expects the structured prompt format it was trained on: provide the PyTorch code under a `### PYTHON CODE:` header and end the prompt with `### TRITON CODE:` so the model completes it with the kernel.

### Installation

First, make sure you have the necessary libraries installed:

```bash
pip install torch transformers peft accelerate
```

### Example Usage

Here is a Python snippet demonstrating how to generate a Triton kernel from a PyTorch `nn.Module`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository ID of this model on the Hugging Face Hub
model_id = "TEEN-D/Qwen2.5-Coder-3B-KernelBook-Finetuned"

print("Loading model and tokenizer...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
print("Model loaded successfully.")

# --- 1. Define your PyTorch code ---
pytorch_code = """
import torch
import torch.nn as nn

class SumAggregator(nn.Module):

    def __init__(self):
        super(SumAggregator, self).__init__()

    def forward(self, neighbor):
        return torch.sum(neighbor, dim=1)
"""

# --- 2. Format the prompt as used during training ---
prompt = f"""### INSTRUCTION
Generate the Triton code for the following Python code.

### PYTHON CODE:
{pytorch_code}

### TRITON CODE:
"""

# --- 3. Generate the Triton kernel ---
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,  # Use greedy decoding for reproducibility
    pad_token_id=tokenizer.eos_token_id
)

# Decode the first (and only) sequence in the batch
full_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

# --- 4. Extract and print only the Triton code ---
try:
    triton_code = full_output.split("### TRITON CODE:")[1].strip()
    print("\n--- Generated Triton Code ---")
    print(triton_code)
except IndexError:
    print("Could not parse the output. Full generated text:")
    print(full_output)
```
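
To sanity-check a generated kernel, compare it against the original module's output on a random input. The snippet below (continuing from the example above) only computes the PyTorch reference and saves the generated code to a file; how you actually invoke the generated Triton kernel depends on the structure of the code the model produced, and running it requires a CUDA GPU and ideally `torch==2.5.0`.

```python
# Build a reference output from the original PyTorch module defined in `pytorch_code`.
namespace = {}
exec(pytorch_code, namespace)                  # materialise SumAggregator from the prompt string
reference_module = namespace["SumAggregator"]()

x = torch.randn(4, 8, 16)
reference = reference_module(x)                # expected shape: (4, 16)
print("Reference output shape:", tuple(reference.shape))

# Save the generated kernel for inspection; verify its output matches `reference`
# (entry-point names vary, so adapt this step to the code the model actually produced).
with open("generated_triton_kernel.py", "w") as f:
    f.write(triton_code)
```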

## Fine-tuning Dataset: GPUMODE/KernelBook

This model's capabilities are a direct result of the high-quality `GPUMODE/KernelBook` dataset.

- **Content:** The dataset contains 18,162 pairs of PyTorch programs and their equivalent Triton kernels, as generated by `torch.compile`.
- **Creation Process:** The authors collected PyTorch repositories, extracted `nn.Module` classes, generated Triton code with `torch.compile`, and enriched the data with metadata.
- **Recommended Usage:** For best results when using or evaluating the generated Triton code, it is recommended to use the same PyTorch version the dataset was created with (`torch==2.5.0`).
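
To inspect the data yourself, it can be loaded with the `datasets` library. This is a minimal sketch: the `train` split name is an assumption, and you should check the printed schema before relying on specific column names.

```python
# Load KernelBook and look at one PyTorch/Triton pair.
from datasets import load_dataset

dataset = load_dataset("GPUMODE/KernelBook", split="train")
print(dataset)                    # row count and column names
print(sorted(dataset[0].keys()))  # inspect the schema before relying on specific fields
```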

## Base Model: Qwen2.5-Coder-3B

`Qwen2.5-Coder` is a series of code-specific large language models. The 3B model has the following characteristics:
- **Parameters:** 3.09B
- **Context Length:** 32,768 tokens
- **Architecture:** Transformer with RoPE, SwiGLU, and RMSNorm.

## Citation

If you use this model or the dataset in your work, please cite the original authors.

**To cite the dataset:**
```bibtex
@software{kernelbook2025,
  title={KernelBook},
  author={Paliskara, Sahan and Saroufim, Mark},
  year={2025},
  month={5},
  url={https://huggingface.co/datasets/GPUMODE/KernelBook},
}
```

**To cite the base model:**
```bibtex
@article{hui2024qwen2,
  title={Qwen2.5-Coder Technical Report},
  author={Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Dang, Kai and others},
  journal={arXiv preprint arXiv:2409.12186},
  year={2024}
}
```