zzzzzhy committed
Commit 3259870 (verified) · Parent(s): 0bd3717

Update README.md

Files changed (1):
  1. README.md +21 -47

README.md CHANGED
@@ -42,50 +42,24 @@ base_model:
 </head>
 <body>
 <div class="badges">
- <a href="#" class="badge" target="_blank" rel="noopener">
- <svg viewBox="-64 0 512 512" xmlns="http://www.w3.org/2000/svg">
- <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5
- 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1
- 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48zm250.2-143.7c-12.2-12-47-8.7
- -64.4-6.5-17.2-10.5-28.7-25-36.8-46.3 3.9-16.1 10.1-40.6 5.4-56-4.2-26.2-37.8-23.6
- -42.6-5.9-4.4 16.1-.4 38.5 7 67.1-10 23.9-24.9 56-35.4 74.4-20 10.3-47 26.2-51
- 46.2-3.3 15.8 26 55.2 76.1-31.2 22.4-7.4 46.8-16.5 68.4-20.1 18.9 10.2 41 17 55.8
- 17 25.5 0 28-28.2 17.5-38.7zm-198.1 77.8c5.1-13.7 24.5-29.5 30.4-35-19 30.3-30.4
- 35.7-30.4 35zm81.6-190.6c7.4 0 6.7 32.1 1.8 40.8-4.4-13.9-4.3-40.8-1.8-40.8zm-24.4
- 136.6c9.7-16.9 18-37 24.7-54.7 8.3 15.1 18.9 27.2 30.1 35.5-20.8 4.3-38.9 13.1-54.8
- 19.2zm131.6-5s-5 6-37.3-7.8c35.1-2.6 40.9 5.4 37.3 7.8z"/>
- </svg>arXiv (coming soon)</a><a href="http://blog.goedel-prover.com"
+ <a href="http://blog.goedel-prover.com"
 class="badge" target="_blank" rel="noopener">
 <span class="emoji"><h1>🌐</h1></span>Website</a><a href="https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B"
 class="badge" target="_blank" rel="noopener">
- <span class="emoji"><h1>🤗</h1></span>HuggingFace</a><a href="#"
- class="badge" target="_blank" rel="noopener">
- <svg viewBox="0 0 64 64" xmlns="http://www.w3.org/2000/svg">
- <path d="M32.029,8.345c-13.27,0-24.029,10.759-24.029,24.033c0,10.617 6.885,19.624
- 16.435,22.803c1.202,0.22 1.64-0.522 1.64-1.16c0-0.569-0.02-2.081-0.032-4.086
- c-6.685,1.452-8.095-3.222-8.095-3.222c-1.093-2.775-2.669-3.514-2.669-3.514
- c-2.182-1.492,0.165-1.462,0.165-1.462c2.412,0.171 3.681,2.477 3.681,2.477
- c2.144,3.672 5.625,2.611 6.994,1.997c0.219-1.553 0.838-2.612 1.526-3.213
- c-5.336-0.606-10.947-2.669-10.947-11.877c0-2.623 0.937-4.769 2.474-6.449
- c-0.247-0.608-1.072-3.051 0.235-6.36c0,0 2.018-0.646 6.609,2.464c1.917-0.533
- 3.973-0.8 6.016-0.809c2.041,0.009 4.097,0.276 6.017,0.809c4.588-3.11
- 6.602-2.464 6.602-2.464c1.311,3.309 0.486,5.752 0.239,6.36c1.54,1.68
- 2.471,3.826 2.471,6.449c0,9.232-5.62,11.263-10.974,11.858c0.864,0.742
- 1.632,2.208 1.632,4.451c0,3.212-0.029,5.804-0.029,6.591c0,0.644
- 0.432,1.392 1.652,1.157c9.542-3.185 16.421-12.186 16.421-22.8c0-13.274
- -10.76-24.033-24.034-24.033"/>
- </svg>Code (coming soon)</a>
+ <span class="emoji"><h1>🤗</h1></span>HuggingFace</a>
 </div>
 </body>
 
 ## 1. Introduction
 
- We introduce Goedel-Prover-V2, an open-source language model series that achieves state-of-the-art performance in automated formal proof generation. Our models build on the standard expert iteration and reinforcement learning pipeline, with three novel methodologies: (1) data synthesis for curriculums, which creates and augments intermediate problems to bridge the gap between simple and difficult proofs; (2) compiler-guided self-correction, where the model is trained to iteratively correct its own proofs using lean compiler feedback, mimicking human problem-solving; and (3) model averaging.
+ We introduce Goedel-Prover-V2, an open-source language model series that sets a new state-of-the-art in automated formal proof generation. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) <strong>Scaffolded data synthesis</strong>: We generate synthetic proof tasks of increasing difficulty to progressively train the model, enabling it to master increasingly complex theorems; (2) <strong>Verifier-guided self-correction</strong>: The model learns to iteratively revise its own proofs by leveraging feedback from Lean's compiler, closely mimicking how humans refine their work; (3) <strong>Model averaging</strong>: We combine multiple model checkpoints to improve robustness and overall performance.
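The commit text above names model averaging but does not spell out the recipe. As a generic illustration only, uniform weight-space averaging of checkpoints (in the spirit of "model soups") can be sketched as follows; the file names, the uniform weighting, and the PyTorch state-dict format are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: uniform weight-space averaging of checkpoints.
# The exact scheme used for Goedel-Prover-V2 is not described in this commit;
# file names and uniform weighting are assumptions.
import torch

def average_checkpoints(paths: list[str]) -> dict:
    """Return a state dict whose parameters are the elementwise mean
    of the corresponding parameters across the given checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Usage (hypothetical checkpoint files):
# merged = average_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])
# model.load_state_dict(merged)
```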
- Our small model, Goedel-Prover-V2-8B, reaches 83.0% on MiniF2F test set at Pass@32, matching the performance of DeepSeek-Prover-V2-671B while being nearly 100 times smaller in model size. Our flagship model, Goedel-Prover-V2-32B, achieves 88.0% on MiniF2F at Pass@32 and 90.4% with additional two rounds of self-correction (only increasing max output length from 32K to 40K), outperforming prior SOTA DeepSeek-Prover-V2-671B and concurrent work Kimina-Prover-72B by a large margin. Additionaly, Goedel-Prover-V2-32B solves 57 problems on PutnamBench with 32 sample budget and 2 rounds of self-correction per sample, securing the 1st on the leaderboard by beating DeepSeek-Prover-V2-671B's 47 by doing Pass@1024.
+ Our small model, Goedel-Prover-V2-8B, reaches 83.0% on the MiniF2F test set at Pass@32, matching the performance of the prior state-of-the-art DeepSeek-Prover-V2-671B while being nearly 100 times smaller in model size. Our flagship model, Goedel-Prover-V2-32B, achieves 88.0% on MiniF2F at Pass@32 in standard mode and 90.4% in self-correction mode, outperforming the prior SOTA DeepSeek-Prover-V2-671B and the concurrent Kimina-Prover-72B by a large margin. Additionally, our flagship model with self-correction solves 64 problems on PutnamBench at Pass@64, securing first place on the leaderboard and surpassing DeepSeek-Prover-V2-671B's record of 47 problems at Pass@1024.
 
 ## 2. Benchmark Performance
 
+ <strong>Self-correction mode</strong>: Our model improves proof quality by first generating an initial candidate and then using Lean compiler feedback to iteratively revise it. We perform two rounds of self-correction, which remains computationally efficient: the total output length (including the initial proof and two revisions) grows only modestly, from the standard 32K to 40K tokens.
+
 
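A minimal sketch of the generate-compile-revise loop that the self-correction paragraph above describes, assuming hypothetical `generate` (model call) and `check_with_lean` (Lean compilation) interfaces; the released system's prompt format and verifier plumbing are not shown in this commit.

```python
# Hypothetical sketch of two-round, compiler-guided self-correction.
# `generate` (model call) and `check_with_lean` (Lean compilation) are
# assumed interfaces, not the released implementation.

def prove_with_self_correction(statement, generate, check_with_lean, rounds=2):
    proof = generate(statement)                         # initial full proof
    for _ in range(rounds):
        ok, errors = check_with_lean(statement, proof)  # compile in Lean
        if ok:
            return proof                                # verified proof
        # Feed the compiler diagnostics back and ask the model to revise.
        proof = generate(
            f"{statement}\n\nPrevious attempt:\n{proof}\n"
            f"Lean compiler errors:\n{errors}\n\nRevise the proof."
        )
    return proof  # last attempt; may still fail verification
```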
 <style>
 .fig-row {
@@ -128,13 +102,11 @@ Our small model, Goedel-Prover-V2-8B, reaches 83.0% on MiniF2F test set at Pass@
 </div>
 </div>
 <figcaption>
- <strong>Figure 1</strong>: <em>Pass@32 performance on MiniF2F, PutnamBench, and our new FoMOBench containing 360 IMO-level problems.</em> </figcaption>
+ <strong>Figure 1</strong>: <em>Pass@32 performance on MiniF2F, PutnamBench, and our new MathOlympiadBench containing 360 IMO-level problems.</em>
+ </figcaption>
 </figure>
 
- The charts above demonstrate the state-of-the-art performance of Goedel-Prover-V2. Goedel-Prover-V2+2 rounds of self-correction indicates that the model first do full proof generation and then correct the proof according to the lean compilation feedback. <strong>Goedel-Prover-V2+2 rounds of self-correction only increases the max output length from 32K to 40K over other models.</strong>
- <strong>(Left)</strong> On the widely-used MiniF2F benchmark, our 8B model achieves 83.0%, matching much larger models, while our 32B model reaches 88.0% (90.4% with self-correction), outperforming all previous models.
- <strong>(Middle)</strong> On the challenging PutnamBench, our 32B model solves 43 problems, a significant leap over the previous best of 23.
- <strong>(Right)</strong> On our newly curated FoMOBench, which contains 360 IMO-level problems, our 32B model solves 73 problems, surpassing DeepSeek-Prover-V2-671B's 50.
+ The charts above demonstrate the state-of-the-art performance of Goedel-Prover-V2. We report all numbers at Pass@32: (1) across all three datasets, our flagship 32B model, in both standard and self-correction mode, significantly outperforms the prior state-of-the-art DeepSeek-Prover-V2-671B and Kimina-Prover-72B; (2) on MiniF2F, our 8B model matches the performance of DeepSeek-Prover-V2-671B while being nearly 100 times smaller in model size.
 
 
 <div align="center">
@@ -148,9 +120,9 @@ The charts above demonstrate the state-of-the-art performance of Goedel-Prover-V
 </tr>
 </thead>
 <tbody>
- <tr><td>1</td><td>Goedel-Prover-V2-32B (two rounds of self-correction)</td><td>64</td><td>Pass@64</td></tr>
- <tr><td>1</td><td>Goedel-Prover-V2-32B (two rounds of self-correction)</td><td>57</td><td>Pass@32</td></tr>
- <tr><td>1</td><td>Goedel-Prover-V2-32B</td><td>43</td><td>Pass@32</td></tr>
+ <tr><td>1</td><td><strong>Goedel-Prover-V2-32B (self-correction mode)</strong></td><td><strong>64</strong></td><td><strong>Pass@64</strong></td></tr>
+ <tr><td>1</td><td><strong>Goedel-Prover-V2-32B (self-correction mode)</strong></td><td><strong>57</strong></td><td><strong>Pass@32</strong></td></tr>
+ <tr><td>1</td><td><strong>Goedel-Prover-V2-32B</strong></td><td><strong>43</strong></td><td><strong>Pass@32</strong></td></tr>
 <tr><td>2</td><td>DeepSeek‑Prover‑V2-671B</td><td>47</td><td>pass@1024</td></tr>
 <tr><td>2</td><td>DeepSeek‑Prover‑V2-671B</td><td>22</td><td>pass@32</td></tr>
 <tr><td>3</td><td>DSP+</td><td>23</td><td>pass@128</td></tr>
@@ -158,7 +130,7 @@ The charts above demonstrate the state-of-the-art performance of Goedel-Prover-V
 </tbody>
 </table>
 <!-- table caption -->
- <caption align="bottom"><strong>Table 1</strong>: <em>PutnamBench leaderboard. Goedel-Prover-V2-32B secures the top rank with significantly less compute than the previous SOTA. <strong>Goedel-Prover-V2 (self-correction) only increase the max output length from 32K to 40K over other models.</strong> Note that DeepSeek-Prover-V2-671B's score of 47 was achieved with a much larger sample budget (pass@1024).</em>
+ <caption align="bottom"><strong>Table 1</strong>: <em>PutnamBench leaderboard. Goedel-Prover-V2-32B secures the top rank with a significantly smaller sample budget (pass number) than the previous state-of-the-art.</em>
 </div>
 
 ## 3. Compelling Scaling Performance
@@ -208,13 +180,11 @@ The charts above demonstrate the state-of-the-art performance of Goedel-Prover-V
 </figcaption>
 </figure>
 
- The scaling curves highlight our models' efficiency and effectiveness.
- <strong>(Left)</strong> The raw performance (Pass@K) shows that both our 8B and 32B models consistently improve as the number of samples (K) increases, demonstrating robust problem-solving capability.
- <strong>(Right)</strong> The compute-adjusted performance plot further reveals that our models are highly efficient, achieving strong results even at lower compute budgets compared to other models. This underscores the effectiveness of our training methodology.
+ The scaling curves above show that our 32B model consistently outperforms all prior state-of-the-art models across the entire range of inference-time compute budgets.
 
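All figures above are Pass@K numbers. For reference, when n ≥ k proof samples are drawn per problem and c of them verify, Pass@K is typically computed with the unbiased estimator of Chen et al. (2021); the snippet below is a reference implementation of that standard formula, not code from this repository.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    a random size-k subset of the n samples contains at least one of the
    c verified-correct samples."""
    if n - c < k:  # too few failures to fill a k-subset: success is certain
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 12 of 32 sampled proofs compile.
# pass_at_k(32, 12, 1)  -> 0.375
# pass_at_k(32, 12, 32) -> 1.0
```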
 ## 4. Model & Dataset Downloads
 
- We release Goedel-Prover-V2 in two model sizes: 8B and 32B parameters, which are trained on top of Qwen3-8B and Qwen3-32B respectively. We also release FoMOBench, which serves as a benchmark more close to the International Mathematics Olympiad (IMO).
+ We release our Goedel-Prover-V2 models and the new MathOlympiadBench benchmark to foster future research.
 
 <div align="center">
 
@@ -229,10 +199,14 @@ We release Goedel-Prover-V2 in two model sizes: 8B and 32B parameters, which are
 
 | Dataset | Download |
 | -------- | -------- |
- | FoMOBench | [🤗Download](https://huggingface.co/datasets/Goedel-LM/FoMOBench) |
+ | MathOlympiadBench | [🤗Download](https://huggingface.co/datasets/Goedel-LM/MathOlympiadBench) |
 
 </div>
 
+ <strong>MathOlympiadBench</strong> comprises human-verified formalizations of Olympiad-level mathematical competition problems, sourced from the Compfiles and IMOSLLean4 repositories. MathOlympiadBench contains 360 problems, including 158 IMO problems from 1959 to 2024, 131 IMO shortlist problems covering 2006 to 2023, 68 regional mathematical Olympiad problems, and 3 additional mathematical puzzles.
+
+ This model is being released to aid other open-source projects, including those geared towards the upcoming IMO competition. A full paper with all details will be released in the coming weeks.
+
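For a sense of what such a formalization looks like, here is a toy Lean 4 example written in the spirit of the Compfiles sources; it is illustrative only and not an actual MathOlympiadBench entry.

```lean
-- Illustrative only: a toy competition-style statement formalized in
-- Lean 4 + Mathlib, in the spirit of Compfiles; not a benchmark entry.
import Mathlib

/-- For every natural number n, n² + n is even. -/
theorem even_n_sq_add_n (n : ℕ) : Even (n ^ 2 + n) := by
  have h : n ^ 2 + n = n * (n + 1) := by ring
  rw [h]
  exact Nat.even_mul_succ_self n
```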
 ## 5. Quick Start
 You can directly use [Huggingface's Transformers](https://github.com/huggingface/transformers) for model inference.
 
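The README's full Quick Start snippet is elided from this diff (only its closing `print(time.time() - start)` appears as hunk context below). A minimal sketch consistent with it, using only standard Transformers calls, follows; the prompt wording, dtype, and generation settings are assumptions, and only the model ID comes from the links above.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Goedel-LM/Goedel-Prover-V2-32B"  # from the model card links
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Prompt format is an assumption; the README's exact prompt is elided here.
prompt = (
    "Complete the following Lean 4 proof:\n\n"
    "theorem test (a : ℝ) : 0 ≤ a ^ 2 := by\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - start)  # the README times generation the same way
```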
@@ -288,7 +262,7 @@ print(time.time() - start)
 ```bibtex
 @misc{lin2025goedelproverv2,
 title={Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date},
- author={Yong Lin and Shange Tang and Bohan Lyu and Ziran Yang and Jui-Hui Chung and Haoyu Zhao and Hongzhou Lin and Lai Jiang and Yihan Geng and Jiawei Ge and Jingruo Sun and Jiayun Wu and Jiri Gesi and David Acuna and Kaiyu Yang and Yejin Choi and Danqi Chen and Sanjeev Arora and Chi Jin},
+ author={Yong Lin and Shange Tang and Bohan Lyu and Ziran Yang and Jui-Hui Chung and Haoyu Zhao and Lai Jiang and Yihan Geng and Jiawei Ge and Jingruo Sun and Jiayun Wu and Jiri Gesi and David Acuna and Kaiyu Yang and Hongzhou Lin and Yejin Choi and Danqi Chen and Sanjeev Arora and Chi Jin},
 year={2025}
 }
 ```