---
license: cc-by-nc-4.0
tags:
- small-language-model
- jee
- exam-centric
- indian-education
- reinforcement-learning
- supervised-finetuning
- model-merging
- rejection-sampling
- mathematics
- ai4education
- physicswallah
language:
- en
model_name: PhysicsWallah/Aryabhata-1.0
model_creator: Physics Wallah AI Research
model_type: Causal decoder-based model
base_model: Qwen/Qwen2.5-Math-7B
pipeline_tag: text-generation
library_name: transformers
---

# Aryabhata 1.0: An exam-focused language model for JEE Math

![](benchmark.png)

## Overview

**Aryabhata 1.0** is a 7B parameter small language model for mathematics developed by **Physics Wallah AI Research**, optimized for high-stakes Indian competitive exams like **JEE Mains**. Despite its compact size, Aryabhata 1.0 achieves **state-of-the-art performance** on exam-centric reasoning tasks with impressive **token efficiency** and low inference cost.


> 🚧 *Aryabhata 1.0 is an **experimental release**. We are actively seeking feedback; please share it in the Discussions tab of this repo.*

---

## 🧠 Key Features

- **Architecture**: 7B parameter causal decoder-based model.
- **Exam-Centric Optimization**: Specifically tuned for JEE-level Mathematics reasoning.
- **High Accuracy**:
  - **86%** on **JEE Mains January 2025** session.
  - **90.2%** on **JEE Mains April 2025** session.
- **Token Efficiency**: Operates effectively around a **~2K token window**, compared to ~8K required by other reasoning models.
- **Compute Efficient**: Trained on a **1x2 NVIDIA H100 GPU** using an optimized pipeline.

---

## 🛠️ Training Details

- **Training Data**: ~130K problem-solution pairs curated from proprietary Physics Wallah exam datasets.
- **Training Pipeline**:
  - **Model Merging**
  - **Rejection Sampling**
  - **Supervised Fine-Tuning (SFT)**
  - **Reinforcement Learning with Verifiable Rewards (RLVR)**

### 🔀 Model Merging
We began with model merging (weighted averaging) to build a strong initialization (Aryabhata 0.5) by combining diverse model capabilities:
* Qwen 2.5 Math: A robust math-centric LLM with solid symbolic math foundations.
* Ace Math: An enhanced version of Qwen 2.5 Math, fine-tuned by NVIDIA for improved accuracy in mathematics benchmarks.
* DeepSeek R1 Distill Qwen: A long-form reasoning model, fine-tuned on reasoning traces distilled from DeepSeek R1.
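The merging step above can be sketched as a weighted average over parent checkpoints. This is an illustrative sketch, not the production merge: real merging averages full transformer state dicts (e.g., `torch` tensors), whereas here each "model" is a plain dict of parameter lists, and the weights shown are placeholders.

```python
# Illustrative sketch of weighted-average model merging.
# Each "state dict" maps a parameter name to a flat list of values;
# the real merge operates on full tensor state dicts.

def merge_state_dicts(state_dicts, weights):
    """Return the element-wise weighted average of parameter dicts."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
    return merged

# Three parents (standing in for Qwen 2.5 Math, AceMath, DeepSeek R1
# Distill Qwen) with toy one-parameter "state dicts" and placeholder weights:
parents = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}]
merged = merge_state_dicts(parents, [0.5, 0.25, 0.25])
print(merged["w"])  # [2.5, 3.5]
```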

### 📚 Data Curation + Rejection Sampling
We extracted ~250K raw questions from Physics Wallah's internal database and applied aggressive filtering and cleaning:
* Removed: diagram-based, non-English, and option-heavy questions.
* Kept: questions matching the distribution of JEE Main 2019–2024.

This yielded a final curated dataset of ~130K high-quality questions.

For each question:
* Generated 4 chain-of-thought (CoT) solutions using Aryabhata 0.5.
* Retained only those leading to the correct final answer.

Resulting Dataset:
* ~100K questions
* ~350K high-quality CoTs

We used this dataset for SFT.
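The generate-and-filter loop above amounts to rejection sampling against the final answer. A minimal sketch, where `fake_cot` and `boxed` are deterministic stand-ins for Aryabhata 0.5 and the real answer parser:

```python
from itertools import cycle

def rejection_sample(questions, generate_cot, extract_answer, n_samples=4):
    """Keep only CoTs whose final answer matches the ground truth."""
    kept = {}
    for question, gold in questions:
        cots = (generate_cot(question) for _ in range(n_samples))
        good = [cot for cot in cots if extract_answer(cot) == gold]
        if good:  # questions with no correct CoT are dropped entirely
            kept[question] = good
    return kept

# Deterministic stand-ins for demonstration (not the real model/parser):
_answers = cycle(["5", "4", "5", "5"])
def fake_cot(question):
    return f"... so the answer is \\boxed{{{next(_answers)}}}"
def boxed(cot):
    return cot.split("\\boxed{")[1].rstrip("}")

data = [("2+2=?", "4"), ("unanswerable", "7")]
kept = rejection_sample(data, fake_cot, boxed)
print(list(kept))  # ['2+2=?'] -- only questions with a correct CoT survive
```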

### 🎯 Reinforcement Learning with Verifiable Rewards (RLVR)
We used a custom in-house variant of Group Relative Policy Optimization (GRPO), adapted for math-specific reward functions, with two modifications:
* Removed the KL-divergence penalty
* Removed clipping

We applied RLVR to the remaining ~30K questions.
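Under the two modifications above, the objective reduces to plain group-normalized policy gradients. A toy sketch of that reduced form (not the actual training code):

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response = reward normalized within its group."""
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def policy_loss(log_probs, advantages):
    """Plain policy-gradient loss: -A * log pi. No clipping, no KL term."""
    return -mean(a * lp for a, lp in zip(advantages, log_probs))

# Binary verifiable rewards for a group of 4 sampled solutions,
# with illustrative per-response log-probabilities:
rewards = [1.0, 0.0, 1.0, 0.0]
adv = grpo_advantages(rewards)
loss = policy_loss([-0.2, -1.5, -0.4, -1.1], adv)
print(adv, loss)
```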

This multi-phase training strategy allows Aryabhata 1.0 to capture **pedagogy-aligned reasoning patterns**, making it highly effective for solving real student queries in mathematics.

---

## 📊 Performance Highlights

### Evaluation Setup
All evaluations were performed with temperature = 0.0, and we report pass@1 accuracy.
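With greedy decoding (temperature = 0.0), pass@1 reduces to plain accuracy over one sample per question. A sketch of the bookkeeping, where `is_correct` is a placeholder grading function:

```python
def pass_at_1(predictions, targets, is_correct):
    """pass@1 under greedy decoding = fraction of questions answered correctly."""
    assert len(predictions) == len(targets)
    correct = sum(is_correct(p, t) for p, t in zip(predictions, targets))
    return correct / len(targets)

# Toy example with exact-match grading as the placeholder metric:
preds = ["4", "7", "1/2"]
golds = ["4", "6", "1/2"]
print(pass_at_1(preds, golds, lambda p, t: p == t))  # 2 of 3 correct
```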

#### Evaluation Datasets
We evaluated the model on two sets of official JEE Mains 2025 mathematics papers:
* January Session: 10 question papers containing 250 questions.
* April Session: 9 question papers containing 225 questions.

Each paper includes a mix of:
* Multiple Choice Questions (MCQs) with one correct option
* Numeric Answer Type (NAT) questions requiring precise numerical responses

#### Evaluation Metric
We used a composite evaluation metric to reflect real-world grading rigor and reduce false positives:

1. Float Match
  * Compares predicted and target answers within a tolerance (±1e-9)
  * Handles rounding artifacts and small numerical errors robustly
2. String Match
  * Used for symbolic answers (e.g., fractions, radicals)
  * Requires an exact match: predictions must match the ground truth character for character
3. LLM-as-Judge (GPT-4o-mini)
  * Used to judge mathematical equivalence when answers appear in ambiguous formats
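The three checks cascade from cheapest to most expensive. A hypothetical sketch of that cascade, with `llm_judge` left as a stub for the GPT-4o-mini equivalence call:

```python
def grade(prediction: str, target: str, llm_judge=None) -> bool:
    """Composite grading cascade: float match, then string match, then LLM judge."""
    # 1. Float match within tolerance; if both parse as numbers,
    #    the numeric comparison is authoritative.
    try:
        return abs(float(prediction) - float(target)) <= 1e-9
    except ValueError:
        pass
    # 2. Strict string match for symbolic answers
    if prediction.strip() == target.strip():
        return True
    # 3. Fall back to an LLM judge for ambiguous formats (stubbed out here)
    return bool(llm_judge and llm_judge(prediction, target))

print(grade("0.5000000000", "0.5"))           # True  (float match)
print(grade("\\frac{1}{2}", "\\frac{1}{2}"))  # True  (string match)
print(grade("1/2", "\\frac{1}{2}"))           # False without an LLM judge
```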

### 🔹 Accuracy Comparison Across Models
![](accuracy.png)
> *Aryabhata has the best accuracy on JEE Main Maths, on par with frontier models*

### 🔹 Accuracy vs Token Usage
![](accuracy-vs-token.png)
> *Aryabhata is on par with frontier models in terms of accuracy vs token usage*

---

## 🔧 Intended Use

**Primary Use Cases**:
- Competitive exam preparation (JEE Main level mathematics problems)
- Question answering and doubt-solving systems
- Educational tutoring and concept explanation


## 💡 How to Use

### 🧪 Using with 🤗 Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_id = "PhysicsWallahAI/Aryabhata-1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)


# Define stop strings that terminate generation cleanly
stop_strings = ["<|im_end|>", "<|end|>", "<im_start|>", "```python\n", "<|im_start|>", "]}}]}}]"]

def strip_bad_tokens(s, stop_strings):
    """Remove a trailing stop string from the decoded output, if present."""
    for suffix in stop_strings:
        if s.endswith(suffix):
            return s[:-len(suffix)]
    return s


# Create generation config (can also set temperature, top_p, etc.)
generation_config = GenerationConfig(
    max_new_tokens=4096,
    stop_strings=stop_strings
)

query = 'Find all the values of \\sqrt[3]{1}'
messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
            {'role': 'user', 'content': query}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config, tokenizer=tokenizer)

print(strip_bad_tokens(tokenizer.decode(outputs[0], skip_special_tokens=True), stop_strings))
```

---

### ⚡ Using with vLLM

To run the model efficiently using vLLM:

```python
from vllm import LLM, SamplingParams

# Initialize model (downloads from Hugging Face if not local)
llm = LLM(model="PhysicsWallahAI/Aryabhata-1.0")

# Define prompt and sampling configuration
query = 'Find all the values of \\sqrt[3]{1}'
messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
            {'role': 'user', 'content': query}]
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4 * 1024,
    stop=["<|im_end|>", "<|end|>", "<im_start|>", "```python\n", "<|im_start|>", "]}}]}}]"],
)

# Run inference
results = llm.chat(messages, sampling_params)

# Print result
print(results[0].outputs[0].text.strip())
```

---

Read more about Aryabhata 1.0 in our [Technical Report](https://arxiv.org/abs/2508.08665).

---

## 🚀 Roadmap

**Aryabhata 2.0** (Upcoming):
- Extending domain coverage to **Physics** and **Chemistry**
- Supporting **JEE Advanced**, **NEET**, and **Foundation syllabus**
- Further optimization for affordability and accuracy in real-time deployments

---

## 🤝 Citation

If you use this model, please cite:

```bibtex
@misc{Aryabhata2025,
  title = {Aryabhata 1.0: A compact, exam-focused language model tailored for mathematics in Indian competitive exams, especially JEE Main.},
  author = {Physics Wallah AI Research},
  year = {2025},
  note = {\url{https://huggingface.co/PhysicsWallahAI/Aryabhata-1.0}},
}
```