README.md · onnx-community/Aryabhata-1.0 at main

Aryabhata-1.0 / README.md

Prince-1

Add files using upload-large-folder tool

a2d7d40 verified about 2 months ago

preview code

raw

history blame contribute delete

7.72 kB

	---
	license: apache-2.0
	tags:
	- small-language-model
	- jee
	- exam-centric
	- indian-education
	- reinforcement-learning
	- supervised-finetuning
	- model-merging
	- rejection-sampling
	- mathematics
	- ai4education
	- physicswallah
	- onnx
	- onnxruntime-genai
	- onnxruntime
	language:
	- en
	library_name: onnxruntime-genai
	base_model_relation: quantized
	base_model: Prince-1/Aryabhata-1.0
	pipeline_tag: text-generation
	model_creator: Physics Wallah AI Research
	model_type: Causal decoder-based model
	---

	# Aryabhatta 1.0 : An exam-focused language model for JEE Math

	![](benchmark.png)

	## Overview

	Aryabhata 1.0 is a 7B parameter small language model for mathematics developed by Physics Wallah AI Research, optimized for high-stakes Indian competitive exams like JEE Mains. Despite its compact size, Aryabhata 1.0 achieves state-of-the-art performance on exam-centric reasoning tasks with impressive token efficiency and low inference cost.


	> 🚧 Aryabhata 1.0 is an experimental release. We are actively seeking feedback — please contribute in the Discussion tab of this repo.
	---

	## 🧠 Key Features

	- Architecture: 7B parameter causal decoder-based model.
	- Exam-Centric Optimization: Specifically tuned for JEE-level Mathematics reasoning.
	- High Accuracy:
	- 86% on JEE Mains January 2025 session.
	- 90.2% on JEE Mains April 2025 session.
	- Token Efficiency: Operates effectively around a ~2K token window, compared to ~8K required by other reasoning models.
	- Compute Efficient: Trained on a 1x2 NVIDIA H100 GPU using optimized pipeline.

	---

	## 🛠️ Training Details

	- Training Data: ~130K problem-solution pairs curated from proprietary Physics Wallah exam datasets.
	- Training Pipeline:
	- Model Merging
	- Rejection Sampling
	- Supervised Fine-Tuning (SFT)
	- Reinforcement Learning with Verifiable Rewards (RLVR)

	### 🔀 Model Merging
	We began with model merging (Weighted average) to build a strong initialization (Aryabhata 0.5) by combining diverse model capabilities:
	* Qwen 2.5 Math: A robust math-centric LLM with solid symbolic math foundations.
	* Ace Math: An enhanced version of Qwen 2.5 Math, fine-tuned by NVIDIA for improved accuracy in mathematics benchmarks.
	* DeepSeek R1 Distill Qwen: A long-form reasoning model, fine-tuned on reasoning traces distilled from DeepSeek R1.

	### 📚 Data Curation + Rejection Sampling
	We extracted ~250K raw questions from Physics Wallah's internal database and applied aggressive filtering and cleaning:
	* Removed: diagram-based, non-English, and option-heavy questions.
	* Kept: questions matching the distribution of JEE Main 2019–2024.
	Final curated dataset: ~130K high-quality questions.

	For each question:
	* Generated 4 CoTs using Aryabhata 0.5.
	* Retained only those leading to correct final answers.

	Resulting Dataset:
	* ~100K questions
	* ~350K high-quality CoTs

	We used this dataset for SFT.

	### 🎯 Reinforcement Learning with Verifiable Rewards (RLVR)
	We used a custom in-house variant of Group Relative Policy Optimization (GRPO), adapted for math-specific reward functions.
	* Removed KL-divergence penalty
	* Removed clipping

	We used RLVR on the remaining ~30K questions.

	This multi-phase training strategy allows Aryabhata 1.0 to capture pedagogy-aligned reasoning patterns, making it highly effective for solving real student queries in mathematics.

	---

	## 📊 Performance Highlights

	### Evaluation Setup
	All evaluations were performed with temperature = 0.0, and we report pass@1 accuracy.

	#### Evaluation Datasets
	We evaluated the model on two sets of official JEE Mains 2025 mathematics papers:
	* January Session: 10 question papers containing 250 questions.
	* April Session: 9 question papers containing 225 questions.

	Each paper includes a mix of:
	* Multiple Choice Questions (MCQs) with one correct option
	* Numeric Answer Type (NAT) questions requiring precise numerical responses

	#### Evaluation Metric
	We used a composite evaluation metric to reflect real-world grading rigor and reduce false positives:

	1. Float Match
	* Compares predicted and target answers within a tolerance (±1e-9)
	* Handles rounding artifacts and small numerical errors robustly
	2. String Match
	* Used for symbolic answers (e.g., fractions, radicals)
	* Uses strict exact match — predictions must match ground truth character-for-character
	3. LLM-as-Judge (GPT-4o-mini)
	* Used for Mathematical equivalence for ambiguous formats

	### 🔹 Accuracy Comparison Across Models
	![](accuracy.png)
	> Aryabhata has the best accuracy on JEE Main Maths, on par with frontier models

	### 🔹 Accuracy vs Token Usage
	![](accuracy-vs-token.png)
	> Aryabhata is on par with frontier models in terms of accuracy vs token usage

	---

	## 🔧 Intended Use

	Primary Use Cases:
	- Competitive exam preparation (JEE Main level mathematics problems)
	- Question answering and doubt-solving systems
	- Educational tutoring and concept explanation


	## 💡 How to Use

	### 🧪 Using with 🤗 Transformers

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

	model_id = "PhysicsWallahAI/Aryabhata-1.0"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id)


	# Define stop strings
	stop_strings = ["<\|im_end\|>", "<\|end\|>", "<im_start\|>", "⁠```python\n", "⁠<\|im_start\|>", "]}}]}}]"]

	def strip_bad_tokens(s, stop_strings):
	for suffix in stop_strings:
	if s.endswith(suffix):
	return s[:-len(suffix)]
	return s


	# Create generation config (can also set temperature, top_p, etc.)
	generation_config = GenerationConfig(
	max_new_tokens=4096,
	stop_strings = stop_strings
	)

	query = 'Find all the values of \\sqrt[3]{1}'
	messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
	{'role': 'user', 'content': query}]

	text = tokenizer.apply_chat_template(
	messages,
	tokenize=False,
	add_generation_prompt=True
	)
	inputs = tokenizer([text], return_tensors="pt")
	outputs = model.generate(**inputs, generation_config=generation_config, tokenizer=tokenizer)

	print(strip_bad_tokens(tokenizer.decode(outputs[0], skip_special_tokens=True), stop_strings))
	````

	---

	### ⚡ Using with vLLM

	To run the model efficiently using vLLM:

	```python
	from vllm import LLM, SamplingParams

	# Initialize model (downloads from Hugging Face if not local)
	llm = LLM(model="PhysicsWallahAI/Aryabhata-1.0")

	# Define prompt and sampling configuration
	query = 'Find all the values of \\sqrt[3]{1}'
	messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
	{'role': 'user', 'content': query}]
	sampling_params = SamplingParams(temperature=0.0, max_tokens=4*1024, stop=["<\|im_end\|>", "<\|end\|>", "<im_start\|>", "⁠```python\n", "⁠<\|im_start\|>", "]}}]}}]"])

	# Run inference
	results = llm.chat(messages, sampling_params)

	# Print result
	print(results[0].outputs[0].text.strip())
	```

	---

	## 🚀 Roadmap

	Aryabhata 2.0 (Upcoming):
	- Extending domain coverage to Physics and Chemistry
	- Supporting JEE Advanced, NEET, and Foundation syllabus
	- Further optimization for affordability and accuracy in real-time deployments

	---

	## 🤝 Citation

	If you use this model, please cite:

	```bibtex
	@misc{Aryabhata2025,
	title = {Aryabhata 1.0: A compact, exam-focused language model tailored for mathematics in Indian competitive exams, especially JEE Main.},
	author = {Physics Wallah AI Research},
	year = {2025},
	note = {\url{https://huggingface.co/PhysicsWallahAI/Aryabhata-1.0}},
	}