	---
	license: mit
	base_model: Qwen/Qwen2.5-Coder-3B
	datasets:
	- GPUMODE/KernelBook
	tags:
	- qwen2
	- code-generation
	- triton
	- pytorch
	- kernel-generation
	- kernelbook
	- lora
	- finetune
	---

	# Qwen2.5-Coder-3B-KernelBook: Fine-tuned for PyTorch to Triton Kernel Generation

	This repository contains a fine-tuned version of the [Qwen/Qwen2.5-Coder-3B](https://huggingface.co/Qwen/Qwen2.5-Coder-3B) model, specialized for transpiling PyTorch `nn.Module` code into high-performance Triton kernels.

	The model was trained on the [GPUMODE/KernelBook](https://huggingface.co/datasets/GPUMODE/KernelBook) dataset, which contains 18,162 pairs of equivalent PyTorch and Triton code, with the Triton side generated by `torch.compile`. Fine-tuning on these pairs teaches the model to recognize patterns of PyTorch operations and translate them into efficient, fused GPU kernels written in the Triton language.

	This model was fine-tuned as part of a demonstration of an end-to-end workflow: from dataset preparation and model training to benchmarking with the official `KernelBench` framework.

	## Model Details

	- Base Model: `Qwen/Qwen2.5-Coder-3B`
	- Fine-tuning Dataset: `GPUMODE/KernelBook`
	- Method: Low-Rank Adaptation (LoRA)
	- Framework: PyTorch 2.5.0, Transformers, PEFT, TRL
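
As a rough illustration of why LoRA keeps the number of trained weights small: for one weight matrix of shape `d_out x d_in`, a rank-`r` adapter trains two low-rank factors instead of the full matrix. The sketch below counts both. The dimensions and rank are hypothetical placeholders chosen for illustration; this card does not state the LoRA rank or the target modules used for this model.

```python
def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in a full d_out x d_in weight matrix."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


# Hypothetical values for illustration only (not taken from this model's config).
D_IN, D_OUT, RANK = 2048, 2048, 16

dense = dense_param_count(D_IN, D_OUT)
lora = lora_param_count(D_IN, D_OUT, RANK)
print(f"dense: {dense:,}  lora: {lora:,}  ratio: {lora / dense:.4%}")
```

With these placeholder values the adapter holds about 1.6% as many parameters as the matrix it adapts; the actual ratio for this model depends on settings the card does not list.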

	### Training Summary

	The model was trained for one full epoch on the `GPUMODE/KernelBook` dataset (18,162 examples). The final logged training metrics were:

	- Final Training Loss: `0.0922`
	- Final Mean Token Accuracy: `98.34%`
	- Training Runtime: `5818.25 seconds` (approx. 1 hour 37 minutes)
	- Hardware: 1x NVIDIA H100 80GB

	Key Training Hyperparameters:
	- `learning_rate`: 2e-4
	- `per_device_train_batch_size`: 1
	- `gradient_accumulation_steps`: 8 (effective batch size of 8)
	- `max_seq_length`: 4096
	- `optimizer`: adamw_torch_fused
	- `precision`: bfloat16
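
The figures above can be cross-checked with a little arithmetic. The sketch below assumes that all 18,162 examples were used for training and that sequences were not packed together, neither of which this card states explicitly.

```python
# Values copied from the training summary and hyperparameter list.
num_examples = 18_162
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
runtime_s = 5818.25

# Effective batch size on a single GPU.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# Full optimizer steps in one epoch, plus examples left in a partial batch.
full_steps, leftover = divmod(num_examples, effective_batch)

# Break the runtime into hours, minutes, and seconds.
hours, rem = divmod(runtime_s, 3600)
minutes, seconds = divmod(rem, 60)

# Average training throughput in examples per second.
throughput = num_examples / runtime_s

print(f"effective batch size: {effective_batch}")
print(f"full optimizer steps: {full_steps} (leftover examples: {leftover})")
print(f"runtime: {int(hours)} h {int(minutes)} min {seconds:.2f} s")
print(f"throughput: {throughput:.2f} examples/s")
```

Under these assumptions one epoch corresponds to roughly 2,270 optimizer steps, and the runtime breakdown agrees with the rounded figure of 1 hour 37 minutes.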

	For a detailed view of the training progress, you can visit the [Weights & Biases run page](https://wandb.ai/tarunreddi-university-at-buffalo/huggingface/runs/ew21hn3w).

	## How to Use

	The model expects the structured prompt it was trained on: supply the PyTorch code under a `### PYTHON CODE:` header, and the model writes the Triton code after the `### TRITON CODE:` header.

	### Installation

	First, make sure you have the necessary libraries installed:

	```bash
	pip install torch transformers peft accelerate
	```

	### Example Usage

	Here is a Python snippet demonstrating how to generate a Triton kernel from a PyTorch `nn.Module`.

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# The repository ID of this model on the Hugging Face Hub
	model_id = "TEEN-D/Qwen2.5-Coder-3B-KernelBook-Finetuned"

	print("Loading model and tokenizer...")
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
	print("Model loaded successfully.")

	# --- 1. Define your PyTorch code ---
	pytorch_code = """
	import torch
	import torch.nn as nn

	class SumAggregator(nn.Module):
	    def __init__(self):
	        super(SumAggregator, self).__init__()

	    def forward(self, neighbor):
	        return torch.sum(neighbor, dim=1)
	"""

	# --- 2. Format the prompt as used during training ---
	prompt = f"""### INSTRUCTION
	Generate the Triton code for the following Python code.

	### PYTHON CODE:
	{pytorch_code}

	### TRITON CODE:
	"""

	# --- 3. Generate the Triton kernel ---
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(
	    **inputs,
	    max_new_tokens=2048,
	    do_sample=False,  # Use greedy decoding for reproducibility
	    pad_token_id=tokenizer.eos_token_id,
	)

	# generate() returns a batch of sequences; decode the first (and only) one
	full_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

	# --- 4. Extract and print only the Triton code ---
	try:
	    # Keep everything after the first "### TRITON CODE:" marker
	    triton_code = full_output.split("### TRITON CODE:", 1)[1].strip()
	    print("\n--- Generated Triton Code ---")
	    print(triton_code)
	except IndexError:
	    print("Could not parse the output. Full generated text:")
	    print(full_output)
	```
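
The prompt construction and output parsing in the example can also be factored into two small helpers, which makes them easy to test without loading the model. This is a minimal sketch: the function names `build_prompt` and `extract_triton_code` are hypothetical and not part of this repository, and the template is copied from the example above.

```python
from typing import Optional

# Marker that separates the prompt from the generated Triton code.
TRITON_MARKER = "### TRITON CODE:"


def build_prompt(pytorch_code: str) -> str:
    """Wrap PyTorch source in the prompt template used in the example."""
    return (
        "### INSTRUCTION\n"
        "Generate the Triton code for the following Python code.\n"
        "\n"
        "### PYTHON CODE:\n"
        f"{pytorch_code}\n"
        "\n"
        f"{TRITON_MARKER}\n"
    )


def extract_triton_code(full_output: str) -> Optional[str]:
    """Return the text after the first marker, or None if it is missing."""
    _, marker, tail = full_output.partition(TRITON_MARKER)
    return tail.strip() if marker else None


# Simulate a decoded model output: the prompt followed by generated text.
prompt = build_prompt("x = torch.sum(y, dim=1)")
fake_output = prompt + "import triton\nimport triton.language as tl\n"
print(extract_triton_code(fake_output))
```

Returning `None` when the marker is absent lets the caller decide how to handle an unparseable generation instead of relying on an exception.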

	## Fine-tuning Dataset: GPUMODE/KernelBook

	The model's PyTorch-to-Triton behavior comes from fine-tuning on the `GPUMODE/KernelBook` dataset.

	- Content: The dataset contains 18,162 pairs of PyTorch programs and their equivalent Triton kernels, as generated by `torch.compile`.
	- Creation Process: The authors collected PyTorch repositories, extracted `nn.Module` classes, generated Triton code with `torch.compile`, and enriched the data with metadata.
	- Recommended Usage: For best results when using or evaluating the generated Triton code, it is recommended to use the same PyTorch version the dataset was created with (`torch==2.5.0`).
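
One way to act on the version recommendation is to check the installed PyTorch release before running generated kernels. The sketch below uses only the standard library; the helper names are hypothetical, and treating a local build suffix such as `+cu121` as irrelevant to the match is an assumption.

```python
from importlib import metadata
from typing import Optional

# Version the dataset card recommends for running the generated code.
EXPECTED_TORCH = "2.5.0"


def matches_expected(installed: str, expected: str = EXPECTED_TORCH) -> bool:
    """Compare release numbers, ignoring a local build suffix like '+cu121'."""
    return installed.split("+", 1)[0] == expected


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


version = installed_torch_version()
if version is None:
    print("torch is not installed")
elif matches_expected(version):
    print(f"torch {version} matches the recommended {EXPECTED_TORCH}")
else:
    print(f"torch {version} differs from the recommended {EXPECTED_TORCH}")
```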

	## Base Model: Qwen2.5-Coder-3B

	`Qwen2.5-Coder` is a series of code-specific large language models. The 3B model has the following characteristics:
	- Parameters: 3.09B
	- Context Length: 32,768 tokens
	- Architecture: Transformer with RoPE, SwiGLU, RMSNorm.

	## Citation

	If you use this model or the dataset in your work, please cite the original authors.

	To cite the dataset:
	```bibtex
	@software{kernelbook2025,
	title={KernelBook},
	author={Paliskara, Sahan and Saroufim, Mark},
	year={2025},
	month={5},
	url={https://huggingface.co/datasets/GPUMODE/KernelBook},
	}
	```

	To cite the base model:
	```bibtex
	@article{hui2024qwen2,
	title={Qwen2.5-Coder Technical Report},
	author={Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Dang, Kai and others},
	journal={arXiv preprint arXiv:2409.12186},
	year={2024}
	}
	```