---
license: apache-2.0
language: c++
tags:
- code-generation
- codellama
- peft
- unit-tests
- causal-lm
- text-generation
- embedded-systems
base_model: codellama/CodeLlama-7b-hf
model_type: llama
pipeline_tag: text-generation
---
# 🧪 CodeLLaMA Unit Test Generator — Full Merged Model (v2)
This is a **merged model** that combines [`codellama/CodeLlama-7b-hf`](https://huggingface.co/codellama/CodeLlama-7b-hf) with a LoRA adapter
fine-tuned on embedded C/C++ code paired with high-quality GoogleTest and CppUTest unit tests. This version adds improved output formatting,
a dedicated stop token, and test-cleanup mechanisms.
---
## 🎯 Use Cases
- Generate comprehensive unit tests for embedded C/C++ functions
- Emphasize edge cases, boundary conditions, and error handling
---
## 🧠 Training Summary
- Base model: `codellama/CodeLlama-7b-hf`
- LoRA fine-tuned with:
  - Special tokens: `<|system|>`, `<|user|>`, `<|assistant|>`, `// END_OF_TESTS`
  - Instruction-style prompts
  - Explicit test-output formatting
  - Test labels cleaned via a regex pass that strips headers and `main()`
- Dataset: [`athrv/Embedded_Unittest2`](https://huggingface.co/datasets/athrv/Embedded_Unittest2)
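The exact cleanup regexes used during training are not published with the model; a minimal sketch of what such a pass could look like (hypothetical patterns, not the actual training code) is:

```python
import re

def clean_test_labels(source: str) -> str:
    """Strip #include headers and any trailing main() from a test file,
    keeping only the TEST(...) bodies (hypothetical sketch)."""
    # Drop preprocessor include lines.
    source = re.sub(r"^\s*#include.*$", "", source, flags=re.MULTILINE)
    # Drop a main() function (naive match: assumes no nested braces inside).
    source = re.sub(r"int\s+main\s*\([^)]*\)\s*\{[^}]*\}", "", source, flags=re.DOTALL)
    return source.strip()

sample = (
    "#include <gtest/gtest.h>\n"
    "TEST(AddTest, Basic) { EXPECT_EQ(add(1, 2), 3); }\n"
    "int main(int argc, char** argv) { return RUN_ALL_TESTS(); }"
)
print(clean_test_labels(sample))
```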
---
## 📌 Example Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "Utkarsh524/codellama_utests_full_new_ver2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
prompt = """<|system|>
Generate comprehensive unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate
<|user|>
Create tests for:
int add(int a, int b) { return a + b; }
<|assistant|>
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.convert_tokens_to_ids("// END_OF_TESTS"),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
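Because generation stops at `// END_OF_TESTS`, the decoded text may still contain the stop marker and (if decoded with `skip_special_tokens=False`) the echoed prompt. A small post-processing helper, sketched here as an assumption rather than part of the model's published tooling, can isolate just the test code:

```python
STOP_MARKER = "// END_OF_TESTS"

def extract_tests(decoded: str) -> str:
    """Return only the assistant's test code: take everything after the
    last <|assistant|> tag (if present) and cut at the stop marker."""
    # The decoded string may echo the prompt; split off the assistant turn.
    # rpartition returns the whole string unchanged if the tag was stripped.
    _, _, tests = decoded.rpartition("<|assistant|>")
    # generate() may include the stop token's text in the output; trim it.
    tests = tests.split(STOP_MARKER, 1)[0]
    return tests.strip()

decoded = (
    "<|user|>\nCreate tests for:\nint add(int a, int b) { return a + b; }\n"
    "<|assistant|>\n"
    "TEST(AddTest, Positive) { EXPECT_EQ(add(2, 3), 5); }\n"
    "// END_OF_TESTS"
)
print(extract_tests(decoded))
```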
## Training & Optimization Details
| Step | Description |
|---------------------|-----------------------------------------------------------------------------|
| **Dataset** | athrv/Embedded_Unittest2 (filtered for valid code-test pairs) |
| **Preprocessing** | Token length filtering (≤4096), special token injection |
| **Quantization** | 8-bit (BitsAndBytesConfig), llm_int8_threshold=6.0 |
| **LoRA Config** | r=64, alpha=32, dropout=0.1 on q_proj/v_proj/k_proj/o_proj |
| **Training** | 4 epochs, batch=4 (effective 8), lr=2e-4, FP16 |
| **Optimization** | Paged AdamW 8-bit, gradient checkpointing, custom data collator |
| **Special Tokens** | Added `<|system|>`, `<|user|>`, `<|assistant|>` |
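The quantization and LoRA rows above map directly onto `transformers` and `peft` configuration objects. A sketch of those settings (reconstructed from the table, not the actual training script):

```python
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 8-bit loading, per the Quantization row above.
bnb_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

# LoRA settings from the table: r=64, alpha=32, dropout=0.1 on the
# attention projection matrices.
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```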
---
## Tips for Best Results
- **Temperature:** 0.2–0.4
- **Top-p:** 0.85–0.95
- **Max New Tokens:** 256–512 (increase to 1024–2048 for longer functions)
- **Input Formatting:**
- Include complete function signatures
- Remove unnecessary comments
- Keep functions under 200 lines
- For long functions, split into logical units
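The sampling tips above translate into keyword arguments for `model.generate`. A sketch using midpoints of the suggested ranges (tune per use case):

```python
# Generation settings matching the tips above.
gen_kwargs = {
    "do_sample": True,
    "temperature": 0.3,    # suggested range: 0.2-0.4
    "top_p": 0.9,          # suggested range: 0.85-0.95
    "max_new_tokens": 512,
}

# Usage (assumes `model` and `inputs` from the example above):
# outputs = model.generate(**inputs, **gen_kwargs)
print(gen_kwargs)
```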
---
## Feedback & Citation
**Dataset Credit:** `athrv/Embedded_Unittest2`
**Report Issues:** [Model's Hugging Face page](https://huggingface.co/Utkarsh524/codellama_utests_full_new_ver2)
**Maintainer:** Utkarsh524
**Model Version:** v2 (4-epoch trained)
---