Update README.md

added evaluation metric.

README.md
@@ -26,6 +26,26 @@ This model is a LoRA (Low-Rank Adaptation) fine-tuned version of **Qwen2.5-1.5B-Instruct**

---
## Evaluation on MATH-500 Benchmark

Following the sampling-based Pass@1 methodology of [DeepSeek R1](https://arxiv.org/abs/2501.12948), we evaluated the model with the following settings:

| Parameter       | Value                    |
|-----------------|--------------------------|
| **Dataset**     | `HuggingFaceH4/MATH-500` |
| **Temperature** | `0.6`                    |
| **Top_p**       | `0.95`                   |
| **Num_samples** | `16` per question        |
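For reference, below is a minimal sketch of how these sampling settings could be reproduced with `transformers` and `peft`; the adapter repo ID, prompt, and `max_new_tokens` are illustrative placeholders, not the exact evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"
ADAPTER = "your-username/your-lora-adapter"  # placeholder: replace with this repo's ID

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the LoRA adapter

question = "What is the sum of the first 100 positive integers?"  # example prompt
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Draw 16 independent samples per question with the settings from the table above.
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    num_return_sequences=16,
    max_new_tokens=1024,
)
completions = tokenizer.batch_decode(
    outputs[:, inputs.shape[1]:], skip_special_tokens=True
)
```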
### Results

- **At-least-one-correct Rate:** **54.60%** (273 out of 500 questions)

*This metric is the percentage of questions with at least one correct solution among the 16 sampled attempts.*
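To make the metric concrete, here is a small sketch of how the at-least-one-correct rate is computed; the correctness flags are made up for illustration and stand in for a real answer checker.

```python
# results[q][i] is True if sample i of question q was judged correct.
# These flags are illustrative, not actual evaluation output.
results = {
    "q1": [False] * 15 + [True],  # solved on the 16th sample
    "q2": [True] * 16,            # solved every time
    "q3": [False] * 16,           # never solved
}

# A question counts as solved if any of its 16 samples is correct.
solved = sum(any(flags) for flags in results.values())
rate = solved / len(results)
print(f"At-least-one-correct rate: {rate:.2%} ({solved} of {len(results)} questions)")
```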
---
## How to Use
### Example Python Script