🧠 SmallCoder (303M)

SmallCoder is a 303M parameter LLaMA-style language model trained from scratch for code generation and algorithmic reasoning.

This checkpoint represents a 6B-token Supervised Fine-Tuning (SFT) run that fixed a critical End-of-Sequence (EOS) token bug from earlier versions.

Despite its compact size, SmallCoder achieves state-of-the-art (SOTA) coding performance among <500M-parameter models and stays competitive with 1B–7B parameter LLMs on HumanEval.

Trained with support from Google’s TPU Research Cloud (TRC) program.


🚀 Key Results

| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
|---|---|---|---|
| SmallCoder (Stage 4.1) | 303M | 27.4 % | 31.0 % |
| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |

⚖️ SmallCoder nearly matches Mistral 7B on HumanEval while being 23× smaller.


🧬 Model Architecture

A LLaMA-type causal decoder with standard Multi-Head Attention (MHA).

LlamaConfig(
  vocab_size=49152,               # StarCoder tokenizer
  hidden_size=768,
  num_hidden_layers=24,
  num_attention_heads=8,
  num_key_value_heads=8,
  intermediate_size=3072,
  max_position_embeddings=1024,
)
| Parameter | Value |
|---|---|
| Total parameters | ≈ 303 M |
| Context length | 1,024 tokens |
| Tokenizer | bigcode/starcoder |
| Architecture type | LLaMA (MHA, non-GQA) |
| Precision | bfloat16 |
| Optimizer | AdamW (XLA) |
| Hardware | TPU v4-32 (TRC) |
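
As a sanity check on the size, the config above can be instantiated directly (a minimal sketch, assuming it maps one-to-one onto transformers' LlamaConfig / LlamaForCausalLM with untied embeddings; the printed count should land close to the quoted 303 M):

# Minimal sketch: rebuild the architecture from the card's config and count
# parameters. Assumes untied input/output embeddings, as in standard LLaMA.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,               # StarCoder tokenizer
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,          # equal to num_attention_heads, i.e. plain MHA
    intermediate_size=3072,
    max_position_embeddings=1024,
)

model = LlamaForCausalLM(config)
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.1f}M parameters")   # roughly 302-303M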

📚 Training Curriculum (4 Stages, 29.8B tokens)

| Stage | Tokens (B) | Dataset | Objective | Loss ↓ |
|---|---|---|---|---|
| 1. Linguistic Base | 6.3 | FineWeb-Edu | General English grounding | 10.87 → 2.58 |
| 2. Code Specialization | 7.5 | 60 % Nemotron Synthetic Code / 40 % StarCoderData | Code syntax & reasoning | 5.00 → 1.25 |
| 3. Math & Knowledge | 10.0 | Nemotron CC-Math / FineWiki / OpenWebMath | Mathematical reasoning | 2.77 → 1.55 |
| 4.1 SFT (EOS fixed) | 6.0 | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |

🧩 Total ≈ 29.8 B tokens of curated curriculum learning.
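
Purely as an illustration of the schedule (the layout below is an assumption, not the actual training configuration), the four stages can be written down as a simple list:

# Illustrative restatement of the curriculum table; not the training config.
CURRICULUM = [
    {"stage": "1. Linguistic Base",     "tokens_b": 6.3,  "data": ["FineWeb-Edu"]},
    {"stage": "2. Code Specialization", "tokens_b": 7.5,  "data": ["Nemotron Synthetic Code (60%)", "StarCoderData (40%)"]},
    {"stage": "3. Math & Knowledge",    "tokens_b": 10.0, "data": ["Nemotron CC-Math", "FineWiki", "OpenWebMath"]},
    {"stage": "4.1 SFT (EOS fixed)",    "tokens_b": 6.0,  "data": ["Nemotron SFT", "OpenCodeInstruct", "OpenMathInstruct-2"]},
]

assert abs(sum(s["tokens_b"] for s in CURRICULUM) - 29.8) < 0.05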


📊 Detailed Benchmarks (Stage 4.1 SFT)

| Domain | Benchmark | Metric | Score |
|---|---|---|---|
| Code | HumanEval (0-shot) | pass@1 | 27.4 % |
| Code | MBPP (3-shot) | pass@1 | 31.0 % |
| Math | GSM8K (0-shot) | exact match | 4.55 % |
| Knowledge | WikiText-2 | perplexity ↓ | 167.6 |
| Reasoning | ARC (Easy / Challenge) | acc_norm | 34.6 % / 22.8 % |
| Commonsense | HellaSwag | acc_norm | 28.3 % |

HumanEval and MBPP were scored with a manual evaluation harness (max_new_tokens=512, temperature=0.2) because the SFT chat format caused truncation issues in lm-eval.
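
To make the manual protocol concrete, here is a rough sketch of what such a pass@1 loop could look like. It is not the harness used for the numbers above: the openai_humaneval dataset name, the prompt template, and the naive exec()-based scoring are all assumptions, and generated code should only be executed inside a sandbox.

# Rough sketch of a manual HumanEval pass@1 loop, NOT the card's actual harness.
# Runs untrusted generated code with exec(); use a sandbox or container.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

problems = load_dataset("openai_humaneval", split="test")
passed = 0

for problem in problems:
    prompt = f"User: Complete this Python function.\n{problem['prompt']}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2,
                             eos_token_id=tokenizer.eos_token_id,
                             pad_token_id=tokenizer.eos_token_id)
    completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Naive scoring: assume the reply contains a full function definition and
    # run it against the reference tests.
    program = completion + "\n" + problem["test"] + f"\ncheck({problem['entry_point']})"
    try:
        exec(program, {})
        passed += 1
    except Exception:
        pass

print(f"pass@1 = {passed / len(problems):.1%}")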


⚠️ Known Limitations

  1. Code-specialized model: tuned for Python and algorithmic reasoning, so performance on general text, math, and commonsense tasks is weak.

  2. Short context: trained on 1,024-token sequences only; performance degrades on longer inputs (see the truncation sketch after this list).

  3. Tokenizer bias: uses the bigcode/starcoder BPE vocabulary, which is optimized for code rather than prose.
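
A minimal sketch of working around the short context (an assumption about usage, not an official recommendation): truncate from the left so the tail of the prompt, including the trailing "Assistant:" cue, stays inside the 1,024-token window.

# Keep the end of the prompt (where "Assistant:" sits) when the input is
# longer than the 1,024-token training window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Beebey/smallcoder-303m")
tokenizer.truncation_side = "left"

long_prompt = "User: Explain what this code does.\n" + "x = x + 1\n" * 2000 + "Assistant:"
inputs = tokenizer(long_prompt, return_tensors="pt", truncation=True, max_length=1024)
print(inputs["input_ids"].shape)  # torch.Size([1, 1024])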


💻 Usage Example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and the bfloat16 weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# The model was fine-tuned on the "User:" / "Assistant:" dialogue format.
prompt = """User: Write a Python function to compute Fibonacci numbers.
Assistant:"""
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

💡 Trained using the “User:” / “Assistant:” dialogue format.
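
The exact multi-turn template is not documented beyond the single-turn example above, so the helper below is only an illustrative sketch of the "User:" / "Assistant:" convention, plus a simple post-processing step that cuts the reply at the next "User:" turn in case the model keeps generating instead of emitting EOS.

# Illustrative sketch only: the multi-turn layout and the stop handling are
# assumptions, not a documented template.
def build_prompt(turns):
    """turns: list of (role, text) pairs, role being "User" or "Assistant"."""
    lines = [f"{role}: {text}" for role, text in turns]
    lines.append("Assistant:")  # cue the model to produce the next reply
    return "\n".join(lines)

def extract_reply(decoded, prompt):
    """Keep only the first assistant turn of a decoded generation."""
    reply = decoded[len(prompt):]
    return reply.split("\nUser:")[0].strip()

prompt = build_prompt([
    ("User", "Write a Python function to reverse a string."),
    ("Assistant", "def reverse(s):\n    return s[::-1]"),
    ("User", "Now make it ignore leading and trailing whitespace."),
])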


🧾 Citation

If you use SmallCoder (303M) in your research, please cite:

@misc{smallcoder303m,
  title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
  author = {Da Silva, Ilan},
  year   = {2025},
  url    = {https://huggingface.co/Beebey/smallcoder-303m},
  note   = {Trained with Google TPU Research Cloud (TRC) support}
}

🙏 Acknowledgements

This model was trained with support from the Google TPU Research Cloud (TRC) program. Special thanks to the open datasets that enabled this work: FineWeb, StarCoderData, Nemotron, and OpenWebMath.


🧩 Summary

| Category | Description |
|---|---|
| Type | Code LLM (LLaMA-style) |
| Parameters | 303 M |
| Training tokens | ~29.8 B |
| Specialty | Code generation & reasoning |
| Context window | 1,024 tokens |
| Tokenizer | bigcode/starcoder |
| License | Apache 2.0 |
| Hardware | TPU v4 (TRC program) |

🔬 SmallCoder (303M) demonstrates that a carefully designed <500M model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval — proving that efficient, compact, open models still matter.

