pw-ai-research committed on
Commit
42c14ee
·
verified ·
1 Parent(s): 0b4ed35

Update README.md

Files changed (1)
  1. README.md +220 -3
README.md CHANGED
@@ -1,3 +1,220 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ tags:
4
+ - small-language-model
5
+ - jee
6
+ - exam-centric
7
+ - indian-education
8
+ - reinforcement-learning
9
+ - supervised-finetuning
10
+ - model-merging
11
+ - rejection-sampling
12
+ - mathematics
13
+ - ai4education
14
+ - physicswallah
15
+ language:
16
+ - en
17
+ model_name: PhysicsWallahAI/Aryabhatta-1.0
18
+ model_creator: Physics Wallah AI Research
19
+ model_type: Causal decoder-based model
20
+ base_model: Qwen/Qwen2.5-Math-7B
21
+ pipeline_tag: text-generation
22
+ ---
23
+
24
+ # Aryabhatta 1.0 🌟
25
+
26
+ **Aryabhatta 1.0** is a 7B-parameter small language model for mathematics developed by **Physics Wallah AI Research** and optimized for high-stakes Indian competitive exams such as **JEE Main**. Despite its compact size, it reports **state-of-the-art accuracy** on exam-centric reasoning tasks (86% and 90.2% on the two JEE Main 2025 sessions) with strong **token efficiency** and low inference cost.
27
+
28
+
29
+ > 🚧 *Aryabhatta 1.0 is an **experimental release**. We are actively seeking feedback; please contribute in the Discussion tab of this repo.*
30
+ ---
31
+
32
+ ## 🧠 Key Features
33
+
34
+ - **Architecture**: 7B parameter causal decoder-based model.
35
+ - **Exam-Centric Optimization**: Specifically tuned for JEE-level Mathematics reasoning.
36
+ - **High Accuracy**:
37
+   - **86%** on the **JEE Main January 2025** session.
38
+   - **90.2%** on the **JEE Main April 2025** session.
39
+ - **Token Efficiency**: Operates effectively within a **~2K-token window**, compared with the ~8K tokens required by other reasoning models.
40
+ - **Compute Efficient**: Trained on a **1x2 NVIDIA H100 GPU** setup using an optimized pipeline.
41
+
42
+ ---
43
+
44
+ ## 🛠️ Training Details
45
+
46
+ - **Training Data**: ~130K problem-solution pairs curated from proprietary Physics Wallah exam datasets.
47
+ - **Training Pipeline**:
48
+   - **Model Merging**
49
+   - **Rejection Sampling**
50
+   - **Supervised Fine-Tuning (SFT)**
51
+   - **Reinforcement Learning with Verifiable Rewards (RLVR)**
52
+
53
+ ### 🔀 Model Merging
54
+ We began with model merging (a weighted average of model weights) to build a strong initialization (Aryabhatta 0.5) by combining diverse model capabilities:
55
+ * Qwen 2.5 Math: A robust math-centric LLM with solid symbolic math foundations.
56
+ * AceMath: An enhanced version of Qwen 2.5 Math, fine-tuned by NVIDIA for improved accuracy on mathematics benchmarks.
57
+ * DeepSeek R1 Distill Qwen: A long-form reasoning model, fine-tuned on reasoning traces distilled from DeepSeek R1.
58
+
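As a rough illustration of weighted-average merging, the sketch below averages parameters element-wise across checkpoints that share one architecture. It is a toy, not the pipeline used for Aryabhatta 0.5: the function name `merge_weighted_average`, the list-based `StateDict` stand-in, and the merge weights are all assumptions made for the example, since the README does not state the actual weights or tooling.

```python
from typing import Dict, List, Sequence

# Toy stand-in for a checkpoint: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def merge_weighted_average(
    state_dicts: Sequence[StateDict], weights: Sequence[float]
) -> StateDict:
    """Element-wise weighted average of checkpoints with identical parameters."""
    if not state_dicts or len(state_dicts) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]  # normalise so the weights sum to 1

    names = set(state_dicts[0])
    if any(set(sd) != names for sd in state_dicts):
        raise ValueError("checkpoints must have identical parameter names")

    merged: StateDict = {}
    for name in state_dicts[0]:
        # zip(*...) walks the same position of this parameter in every checkpoint
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(w * v for w, v in zip(norm, col)) for col in columns]
    return merged


# Hypothetical example: three tiny "models" and illustrative merge weights.
base = {"layer.weight": [1.0, 2.0]}
math_tuned = {"layer.weight": [3.0, 4.0]}
reasoner = {"layer.weight": [5.0, 6.0]}
merged = merge_weighted_average([base, math_tuned, reasoner], [2, 1, 1])
print(merged)
```

With real checkpoints the same loop would run over tensors rather than Python lists, and it only makes sense when all models share the same parameter names and shapes.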
59
+ ### 📚 Data Curation + Rejection Sampling
60
+ We extracted ~250K raw questions from Physics Wallah's internal database and applied aggressive filtering and cleaning:
61
+ * Removed: diagram-based, non-English, and option-heavy questions.
62
+ * Kept: questions matching the distribution of JEE Main 2019–2024.
63
+ Final curated dataset: ~130K high-quality questions.
64
+
65
+ For each question:
66
+ * Generated 4 CoTs using Aryabhatta 0.5.
67
+ * Retained only those leading to correct final answers.
68
+
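The filtering step described above can be sketched as follows. This is a minimal illustration under assumptions: `generate` is a hypothetical callable standing in for sampling one chain of thought from Aryabhatta 0.5, and correctness is checked by comparing the last `\boxed{...}` value with the reference answer as a plain string, which may be simpler than the check actually used.

```python
import re
from typing import Callable, Dict, List, Optional

# Matches a simple, non-nested \boxed{...} answer.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(cot: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a chain of thought."""
    matches = BOXED.findall(cot)
    return matches[-1].strip() if matches else None


def rejection_sample(
    problems: Dict[str, str],
    generate: Callable[[str], str],
    n_samples: int = 4,
) -> Dict[str, List[str]]:
    """Sample n_samples CoTs per question; keep those with a correct final answer."""
    kept: Dict[str, List[str]] = {}
    for question, answer in problems.items():
        samples = [generate(question) for _ in range(n_samples)]
        good = [cot for cot in samples if extract_boxed(cot) == answer]
        if good:  # questions with no correct CoT are dropped from the SFT set
            kept[question] = good
    return kept


# Hypothetical generator that alternates between a right and a wrong answer.
canned = iter(["so \\boxed{4}", "so \\boxed{5}", "so \\boxed{4}", "so \\boxed{5}"])
kept = rejection_sample({"2+2": "4"}, lambda q: next(canned), n_samples=4)
print(kept)
```

Questions for which no sample survives are left out of `kept`, which matches the drop from ~130K curated questions to ~100K questions in the resulting dataset.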
69
+ Resulting Dataset:
70
+ * ~100K questions
71
+ * ~350K high-quality CoTs
72
+
73
+ We used this dataset for SFT.
74
+
75
+ ### 🎯 Reinforcement Learning with Verifiable Rewards (RLVR)
76
+ We used a custom in-house variant of Group Relative Policy Optimization (GRPO), adapted for math-specific reward functions.
77
+ * Removed KL-divergence penalty
78
+ * Removed clipping
79
+
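The in-house GRPO variant is not published here, so the following is only a sketch of the general idea under stated assumptions: rewards are binary and verifiable (an exact answer match), advantages are normalised within the group of samples drawn for one question, and the objective is a plain policy-gradient surrogate with no clipping and no KL term. The function names, the use of the population standard deviation, and the `eps` value are illustrative choices, not details from the source.

```python
from statistics import mean, pstdev
from typing import List


def verifiable_reward(predicted: str, target: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if predicted.strip() == target.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def unclipped_surrogate_loss(advantages: List[float], ratios: List[float]) -> float:
    """Negative mean of advantage * probability ratio (no clipping, no KL term)."""
    return -mean(a * r for a, r in zip(advantages, ratios))


# Hypothetical group of four sampled answers to one question whose answer is "7".
rewards = [verifiable_reward(p, "7") for p in ["7", "5", "12", "7"]]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

When every sample in a group gets the same reward, all advantages are zero and the group contributes no gradient, which is one reason to apply RL to questions the model answers inconsistently.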
80
+ We applied RLVR to the remaining ~30K questions, i.e. the curated questions (~130K) that are not part of the ~100K-question SFT set.
81
+
82
+ This multi-phase training strategy allows Aryabhatta 1.0 to capture **pedagogy-aligned reasoning patterns**, making it highly effective for solving real student queries in mathematics.
83
+
84
+ ---
85
+
86
+ ## 📊 Performance Highlights
87
+
88
+ ### Evaluation Setup
89
+ All evaluations were performed with temperature = 0.0, and we report pass@1 accuracy.
90
+
91
+ #### Evaluation Datasets
92
+ We evaluated the model on two sets of official JEE Main 2025 mathematics papers:
93
+ * January Session: 10 question papers containing 250 questions.
94
+ * April Session: 9 question papers containing 225 questions.
95
+
96
+ Each paper includes a mix of:
97
+ * Multiple Choice Questions (MCQs) with one correct option
98
+ * Numeric Answer Type (NAT) questions requiring precise numerical responses
99
+
100
+ #### Evaluation Metric
101
+ We used a composite evaluation metric to reflect real-world grading rigor and reduce false positives:
102
+
103
+ 1. Float Match
104
+    * Compares predicted and target answers within a tolerance (±1e-9)
105
+    * Handles rounding artifacts and small numerical errors robustly
106
+ 2. String Match
107
+    * Used for symbolic answers (e.g., fractions, radicals)
108
+    * Uses strict exact match: predictions must match the ground truth character for character
109
+ 3. LLM-as-Judge (GPT-4o-mini)
110
+    * Used to check mathematical equivalence when answer formats are ambiguous
111
+
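One way to combine the three checks is a simple cascade, sketched below. The order of the checks, the absolute (rather than relative) tolerance, and the `judge` callable standing in for the GPT-4o-mini call are assumptions for illustration; the README does not specify how the checks are chained.

```python
from typing import Callable, Optional


def float_match(pred: str, target: str, tol: float = 1e-9) -> bool:
    """True if both strings parse as floats and differ by at most tol."""
    try:
        return abs(float(pred) - float(target)) <= tol
    except ValueError:
        return False  # at least one side is not a plain number


def string_match(pred: str, target: str) -> bool:
    """Strict exact match after trimming surrounding whitespace."""
    return pred.strip() == target.strip()


def is_correct(
    pred: str,
    target: str,
    judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Try float match, then string match, then an optional LLM judge."""
    if float_match(pred, target) or string_match(pred, target):
        return True
    # `judge` is a hypothetical hook for an LLM equivalence check.
    return bool(judge is not None and judge(pred, target))


print(is_correct("2.50", "2.5"))                        # numeric formats differ
print(is_correct("\\frac{1}{2}", "\\frac{1}{2}"))       # symbolic, exact match
print(is_correct("1/2", "0.5"))                         # needs a judge to pass
```

Keeping the judge as the last step means the LLM is consulted only when the cheaper deterministic checks fail.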
112
+ ### 🔹 Accuracy Comparison Across Models
113
+ ![](accuracy.png)
114
+ > *Aryabhatta has the best accuracy on JEE Main Maths, on par with frontier models*
115
+
116
+ ### 🔹 Accuracy vs Token Usage
117
+ ![](accuracy-vs-token.png)
118
+ > *Aryabhatta is on par with frontier models in terms of accuracy vs token usage*
119
+
120
+ ---
121
+
122
+ ## 🔧 Intended Use
123
+
124
+ **Primary Use Cases**:
125
+ - Competitive exam preparation (JEE Main level mathematics problems)
126
+ - Question answering and doubt-solving systems
127
+ - Educational tutoring and concept explanation
128
+
129
+
130
+ ## 💡 How to Use
131
+
132
+ ### 🧪 Using with 🤗 Transformers
133
+
134
+ ```python
135
+ from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
136
+
137
+ model_id = "PhysicsWallahAI/Aryabhatta-1.0"
138
+
139
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
140
+ model = AutoModelForCausalLM.from_pretrained(model_id)
141
+
142
+
143
+ # Define stop strings
144
+ stop_strings = ["<|im_end|>", "<|end|>", "<im_start|>", "```python\n", "<|im_start|>", "]}}]}}]"]
145
+
146
+ def strip_bad_tokens(s, stop_strings):
147
+     for suffix in stop_strings:
148
+         if s.endswith(suffix):
149
+             return s[:-len(suffix)]
150
+     return s
151
+
152
+
153
+ # Create generation config (can also set temperature, top_p, etc.)
154
+ generation_config = GenerationConfig(
155
+     max_new_tokens=4096,
156
+     stop_strings=stop_strings
157
+ )
158
+
159
+ query = 'Find all the values of \\sqrt[3]{1}'
160
+ messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
161
+             {'role': 'user', 'content': query}]
162
+
163
+ text = tokenizer.apply_chat_template(
164
+     messages,
165
+     tokenize=False,
166
+     add_generation_prompt=True
167
+ )
168
+ inputs = tokenizer([text], return_tensors="pt")
169
+ outputs = model.generate(**inputs, generation_config=generation_config, tokenizer=tokenizer)
170
+
171
+ print(strip_bad_tokens(tokenizer.decode(outputs[0], skip_special_tokens=True), stop_strings))
172
+ ```
173
+
174
+ ---
175
+
176
+ ### ⚡ Using with vLLM
177
+
178
+ To run the model efficiently using vLLM:
179
+
180
+ ```python
181
+ from vllm import LLM, SamplingParams
182
+
183
+ # Initialize model (downloads from Hugging Face if not local)
184
+ llm = LLM(model="PhysicsWallahAI/Aryabhatta-1.0")
185
+
186
+ # Define prompt and sampling configuration
187
+ query = 'Find all the values of \\sqrt[3]{1}'
188
+ messages = [{'role': 'system', 'content': 'Think step-by-step; put only the final answer inside \\boxed{}.'},
189
+             {'role': 'user', 'content': query}]
190
+ sampling_params = SamplingParams(temperature=0.0, max_tokens=4*1024, stop=["<|im_end|>", "<|end|>", "<im_start|>", "```python\n", "<|im_start|>", "]}}]}}]"])
191
+
192
+ # Run inference
193
+ results = llm.chat(messages, sampling_params)
194
+
195
+ # Print result
196
+ print(results[0].outputs[0].text.strip())
197
+ ```
198
+
199
+ ---
200
+
201
+ ## 🚀 Roadmap
202
+
203
+ **Aryabhatta 2.0** (Upcoming):
204
+ - Extending domain coverage to **Physics** and **Chemistry**
205
+ - Supporting **JEE Advanced**, **NEET**, and **Foundation syllabus**
206
+ - Further optimization for affordability and accuracy in real-time deployments
207
+
208
+ ---
209
+
210
+ ## 🤝 Citation
211
+
212
+ If you use this model, please cite:
213
+
214
+ ```bibtex
215
+ @misc{aryabhatta2025,
216
+ title = {Aryabhatta 1.0: A compact, exam-focused language model tailored for mathematics in Indian competitive exams, especially JEE Main.},
217
+ author = {Physics Wallah AI Research},
218
+ year = {2025},
219
+ note = {\url{https://huggingface.co/PhysicsWallahAI/Aryabhatta-1.0}},
220
+ }