---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
---
<div align="center">
<h1> <a href="http://blog.goedel-prover.com"> <strong>Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date</strong></a></h1>
</div>

<head>
<meta charset="UTF-8">
<style>
  .badges {
    display: flex;
    justify-content: center;
    gap: 0.5rem;                  /* space between badges */
    padding: 0.8rem;
  }
  .badge {
    display: inline-flex;         /* inline flex for pill layout */
    align-items: center;          /* center icon & text vertically */
    padding: 0.4rem 0.4rem;       /* larger badge size */
    background: #333;
    color: #fff;                  /* white text */
    text-decoration: none;
    font-family: sans-serif;
    font-size: 0.8rem;            /* larger font */
    border-radius: 999em;         /* full pill shape */
    transition: background 0.15s; /* smooth hover */
  }
  .badge:hover {
    background: #444;
  }
  .badge svg,
  .badge .emoji {
    width: 1.7em;                 /* scale icons with badge text */
    height: 1.7em;
    margin-right: 0.5em;
    fill: currentColor;           /* inherit badge color */
  }
</style>
</head>
<body>
<div class="badges">
  <a href="#" class="badge" target="_blank" rel="noopener">
    <svg viewBox="-64 0 512 512" xmlns="http://www.w3.org/2000/svg">
      <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5
      48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1
      76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48zm250.2-143.7c-12.2-12-47-8.7
      -64.4-6.5-17.2-10.5-28.7-25-36.8-46.3 3.9-16.1 10.1-40.6 5.4-56-4.2-26.2-37.8-23.6
      -42.6-5.9-4.4 16.1-.4 38.5 7 67.1-10 23.9-24.9 56-35.4 74.4-20 10.3-47 26.2-51
      46.2-3.3 15.8 26 55.2 76.1-31.2 22.4-7.4 46.8-16.5 68.4-20.1 18.9 10.2 41 17 55.8
      17 25.5 0 28-28.2 17.5-38.7zm-198.1 77.8c5.1-13.7 24.5-29.5 30.4-35-19 30.3-30.4
      35.7-30.4 35zm81.6-190.6c7.4 0 6.7 32.1 1.8 40.8-4.4-13.9-4.3-40.8-1.8-40.8zm-24.4
      136.6c9.7-16.9 18-37 24.7-54.7 8.3 15.1 18.9 27.2 30.1 35.5-20.8 4.3-38.9 13.1-54.8
      19.2zm131.6-5s-5 6-37.3-7.8c35.1-2.6 40.9 5.4 37.3 7.8z"/>
    </svg>arXiv (coming soon)</a><a href="http://blog.goedel-prover.com"
     class="badge" target="_blank" rel="noopener">
    <span class="emoji"><h1>🌐</h1></span>Website</a><a href="https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B"
     class="badge" target="_blank" rel="noopener">
    <span class="emoji"><h1>🤗</h1></span>HuggingFace</a><a href="#"
     class="badge" target="_blank" rel="noopener">
    <svg viewBox="0 0 64 64" xmlns="http://www.w3.org/2000/svg">
      <path d="M32.029,8.345c-13.27,0-24.029,10.759-24.029,24.033c0,10.617 6.885,19.624
      16.435,22.803c1.202,0.22 1.64-0.522 1.64-1.16c0-0.569-0.02-2.081-0.032-4.086
      c-6.685,1.452-8.095-3.222-8.095-3.222c-1.093-2.775-2.669-3.514-2.669-3.514
      c-2.182-1.492,0.165-1.462,0.165-1.462c2.412,0.171 3.681,2.477 3.681,2.477
      c2.144,3.672 5.625,2.611 6.994,1.997c0.219-1.553 0.838-2.612 1.526-3.213
      c-5.336-0.606-10.947-2.669-10.947-11.877c0-2.623 0.937-4.769 2.474-6.449
      c-0.247-0.608-1.072-3.051 0.235-6.36c0,0 2.018-0.646 6.609,2.464c1.917-0.533
      3.973-0.8 6.016-0.809c2.041,0.009 4.097,0.276 6.017,0.809c4.588-3.11
      6.602-2.464 6.602-2.464c1.311,3.309 0.486,5.752 0.239,6.36c1.54,1.68
      2.471,3.826 2.471,6.449c0,9.232-5.62,11.263-10.974,11.858c0.864,0.742
      1.632,2.208 1.632,4.451c0,3.212-0.029,5.804-0.029,6.591c0,0.644
      0.432,1.392 1.652,1.157c9.542-3.185 16.421-12.186 16.421-22.8c0-13.274
      -10.76-24.033-24.034"/>
    </svg>Code (coming soon)</a>
</div>
</body>

## 1. Introduction

We introduce Goedel-Prover-V2, an open-source language model series that achieves state-of-the-art performance in automated formal proof generation. Our models build on the standard expert iteration and reinforcement learning pipeline, with three novel methodologies: (1) curriculum data synthesis, which creates and augments intermediate problems to bridge the gap between simple and difficult proofs; (2) compiler-guided self-correction, where the model is trained to iteratively correct its own proofs using Lean compiler feedback, mimicking human problem-solving; and (3) model averaging.

Our small model, Goedel-Prover-V2-8B, reaches 83.0% on the MiniF2F test set at Pass@32, matching the performance of DeepSeek-Prover-V2-671B while being nearly 100 times smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.0% on MiniF2F at Pass@32 and 90.4% with two additional rounds of self-correction (which only increases the max output length from 32K to 40K tokens), outperforming the prior SOTA DeepSeek-Prover-V2-671B and the concurrent Kimina-Prover-72B by a large margin. Additionally, Goedel-Prover-V2-32B solves 57 problems on PutnamBench with a sample budget of 32 and two rounds of self-correction per sample, securing first place on the leaderboard; the previous best, DeepSeek-Prover-V2-671B, solves 47 at Pass@1024.

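The compiler-guided self-correction described above can be pictured as a short generate-compile-revise loop. The sketch below is illustrative only: `generate` and `compile_lean` are hypothetical stand-ins for the model's inference call and a Lean 4 compiler harness, and the feedback message format is invented, not the training format.

```python
def self_correct(statement, generate, compile_lean, max_rounds=2):
    """Generate a proof, then let the model revise it with compiler feedback.

    `generate` maps a chat history to a model completion; `compile_lean`
    returns (success, diagnostics) for a candidate proof. Both are
    hypothetical stand-ins for real components.
    """
    history = [{"role": "user", "content": f"Complete the following Lean 4 code:\n{statement}"}]
    proof = generate(history)
    for _ in range(max_rounds):
        ok, diagnostics = compile_lean(proof)
        if ok:
            break
        # Feed the compiler errors back so the model can repair its own proof.
        history += [
            {"role": "assistant", "content": proof},
            {"role": "user", "content": f"The Lean compiler reported:\n{diagnostics}\nPlease fix the proof."},
        ]
        proof = generate(history)
    return proof
```

Each round only appends the diagnostics and a revision to the same context, which is why two rounds cost just 8K extra output tokens.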
## 2. Benchmark Performance

<style>
  .fig-row {
    display: flex;
    justify-content: space-between; /* spread them out */
    align-items: flex-start;        /* align tops */
    gap: 1rem;                      /* space between images */
  }
  .fig-row img {
    display: block;
    width: 100%;
    height: auto;
  }
  .fig-row .panel {
    /* override per-panel width as needed */
    /* e.g. .panel-1 { width:25%; } .panel-2 { width:40%; } etc. */
  }
  figure {
    margin: 0;
  }
  figure figcaption {
    text-align: center;
    font-size: 0.9em;
    margin-top: 0.75rem;
    color: #555;
  }
  figure figcaption strong {
    font-weight: bold;
  }
  /* Italicize the rest of the caption */
  figure figcaption em {
    font-style: italic;
  }
</style>

<figure>
<div class="fig-row">
  <div class="panel panel-1" style="width:100%;">
    <img src="https://github.com/Goedel-LM/Goedel-Prover-V2/blob/main/assets/combined_performance_plots_varied_width.png?raw=true" alt="…">
  </div>
</div>
<figcaption>
<strong>Figure 1</strong>: <em>Pass@32 performance on MiniF2F, PutnamBench, and our new FoMOBench containing 360 IMO-level problems.</em>
</figcaption>
</figure>

The charts above demonstrate the state-of-the-art performance of Goedel-Prover-V2. "Goedel-Prover-V2 + 2 rounds of self-correction" means that the model first generates a full proof and then revises it twice according to Lean compiler feedback. <strong>Self-correction only increases the max output length from 32K to 40K tokens relative to other models.</strong>
<strong>(Left)</strong> On the widely used MiniF2F benchmark, our 8B model achieves 83.0%, matching much larger models, while our 32B model reaches 88.0% (90.4% with self-correction), outperforming all previous models.
<strong>(Middle)</strong> On the challenging PutnamBench, our 32B model solves 43 problems, a significant leap over the previous best of 23.
<strong>(Right)</strong> On our newly curated FoMOBench, which contains 360 IMO-level problems, our 32B model solves 73 problems, surpassing DeepSeek-Prover-V2-671B's 50.

<div align="center">
<table style="margin: 0 auto;">
  <thead>
    <tr>
      <th>#</th>
      <th>Model</th>
      <th>num-solved</th>
      <th>compute</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Goedel-Prover-V2-32B (two rounds of self-correction)</td><td>64</td><td>Pass@64</td></tr>
    <tr><td>1</td><td>Goedel-Prover-V2-32B (two rounds of self-correction)</td><td>57</td><td>Pass@32</td></tr>
    <tr><td>1</td><td>Goedel-Prover-V2-32B</td><td>43</td><td>Pass@32</td></tr>
    <tr><td>2</td><td>DeepSeek-Prover-V2-671B</td><td>47</td><td>Pass@1024</td></tr>
    <tr><td>2</td><td>DeepSeek-Prover-V2-671B</td><td>22</td><td>Pass@32</td></tr>
    <tr><td>3</td><td>DSP+</td><td>23</td><td>Pass@128</td></tr>
    <tr><td>4</td><td>Kimina-Prover-7B-Distill</td><td>10</td><td>Pass@192</td></tr>
  </tbody>
</table>
<!-- table caption -->
<caption align="bottom"><strong>Table 1</strong>: <em>PutnamBench leaderboard. Goedel-Prover-V2-32B secures the top rank with significantly less compute than the previous SOTA. <strong>Self-correction only increases the max output length from 32K to 40K tokens relative to other models.</strong> Note that DeepSeek-Prover-V2-671B's score of 47 was achieved with a much larger sample budget (Pass@1024).</em>
</caption>
</div>

## 3. Compelling Scaling Performance

<figure>
<div class="fig-row">
  <div class="panel panel-1" style="width:80%;">
    <img src="https://github.com/Goedel-LM/Goedel-Prover-V2/blob/main/assets/inference_scale_performance.png?raw=true" alt="…">
  </div>
</div>
<figcaption>
<strong>Figure 2</strong>: <em>Performance on the MiniF2F test set under different sample budgets.</em>
</figcaption>
</figure>

The scaling curves highlight our models' efficiency and effectiveness.
<strong>(Left)</strong> The raw performance (Pass@K) shows that both our 8B and 32B models consistently improve as the number of samples K increases, demonstrating robust problem-solving capability.
<strong>(Right)</strong> The compute-adjusted plot shows that our models are highly efficient, achieving strong results at lower compute budgets than other models. This underscores the effectiveness of our training methodology.

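For reference, Pass@K figures like those above are commonly computed with the standard unbiased estimator popularized by Codex-style code evaluations; a minimal implementation (a sketch for readers, not taken from this repository):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: probability that at least one of k
    samples, drawn without replacement from n total with c correct,
    solves the problem. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With n = k (e.g. exactly 32 samples generated and k = 32), this reduces to checking whether any sample solved the problem.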
## 4. Model & Dataset Downloads

We release Goedel-Prover-V2 in two model sizes, 8B and 32B parameters, trained on top of Qwen3-8B and Qwen3-32B respectively. We also release FoMOBench, a benchmark closer in difficulty to the International Mathematical Olympiad (IMO).

<div align="center">

| Model | Download |
| -------- | -------- |
| Goedel-Prover-V2-32B | [🤗 Download](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B) |
| Goedel-Prover-V2-8B | [🤗 Download](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-8B) |

</div>

<div align="center">

| Dataset | Download |
| -------- | -------- |
| FoMOBench | [🤗 Download](https://huggingface.co/datasets/Goedel-LM/FoMOBench) |

</div>

## 5. Quick Start
You can use [Hugging Face Transformers](https://github.com/huggingface/transformers) directly for model inference.

````python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(30)

model_id = "Goedel-LM/Goedel-Prover-V2-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)

formal_statement = """
import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat


theorem square_equation_solution {x y : ℝ} (h : x^2 + y^2 = 2*x - 4*y - 5) : x + y = -1 := by
  sorry
""".strip()

prompt = """
Complete the following Lean 4 code:

```lean4
{}```

Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies.
The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
""".strip()

chat = [
    {"role": "user", "content": prompt.format(formal_statement)},
]

inputs = tokenizer.apply_chat_template(
    chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

start = time.time()
outputs = model.generate(inputs, max_new_tokens=32768)
print(tokenizer.batch_decode(outputs))
print(f"Generation took {time.time() - start:.1f}s")
````

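The prompt above asks the model to wrap its final proof in a ```lean4 fence, so the proof can be pulled out of the decoded response with a small helper. This is a convenience sketch, not part of an official pipeline:

```python
import re

def extract_lean_proof(response: str):
    """Return the contents of the last ```lean4 fenced block, or None.

    The last block is taken because the model may quote the statement or
    intermediate attempts before its final proof.
    """
    blocks = re.findall(r"```lean4\s*\n(.*?)```", response, re.DOTALL)
    return blocks[-1].strip() if blocks else None
```

The extracted proof can then be handed to a Lean 4 checker (e.g. `lake env lean` inside a Mathlib project) for verification.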
### Cite
```bibtex
@misc{lin2025goedelproverv2,
      title={Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date},
      author={Yong Lin and Shange Tang and Bohan Lyu and Ziran Yang and Jui-Hui Chung and Haoyu Zhao and Hongzhou Lin and Lai Jiang and Yihan Geng and Jiawei Ge and Jingruo Sun and Jiayun Wu and Jiri Gesi and David Acuna and Kaiyu Yang and Yejin Choi and Danqi Chen and Sanjeev Arora and Chi Jin},
      year={2025}
}
```