---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
---
<div align="center">
<h1> <a href="http://blog.goedel-prover.com"> <strong>Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date</strong></a></h1>
</div>
<head>
<meta charset="UTF-8">
<style>
.badges {
display: flex;
justify-content: center;
gap: 0.5rem;              /* space between badges */
padding: 0.8rem;
}
.badge {
display: inline-flex; /* inline flex for pill layout */
align-items: center; /* center icon & text vertically */
padding: 0.4rem 0.4rem; /* larger badge size */
background: #333;
color: #fff; /* white text */
text-decoration: none;
font-family: sans-serif;
font-size: 0.8rem; /* larger font */
border-radius: 999em; /* full pill shape */
transition: background 0.15s;/* smooth hover */
}
.badge:hover {
background: #444;
}
.badge svg,
.badge .emoji {
width: 1.7em; /* scale icons with badge text */
height: 1.7em;
margin-right: 0.5em;
fill: currentColor; /* inherit badge color */
}
</style>
</head>
<body>
<div class="badges">
<a href="http://blog.goedel-prover.com"
class="badge" target="_blank" rel="noopener">
<span class="emoji">🌐</span>Website</a><a href="https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B"
class="badge" target="_blank" rel="noopener">
<span class="emoji">🤗</span>HuggingFace</a>
</div>
</body>
## 1. Introduction
We introduce Goedel-Prover-V2, an open-source language model series that sets a new state-of-the-art in automated formal proof generation. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) <strong>Scaffolded data synthesis</strong>: We generate synthetic proof tasks of increasing difficulty to progressively train the model, enabling it to master increasingly complex theorems; (2) <strong>Verifier-guided self-correction</strong>: The model learns to iteratively revise its own proofs by leveraging feedback from Lean’s compiler, closely mimicking how humans refine their work; (3) <strong>Model averaging</strong>: We combine multiple model checkpoints to improve robustness and overall performance.
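Of the three innovations, model averaging is the simplest to illustrate: checkpoints are merged by taking an element-wise mean of their weights. A minimal sketch, using plain Python lists to stand in for weight tensors (the helper name `average_checkpoints` is illustrative, not part of our release):

```python
def average_checkpoints(state_dicts):
    """Element-wise mean over a list of state dicts (name -> list of weights).

    Assumes every checkpoint has identical keys and shapes.
    """
    return {
        key: [sum(vals) / len(state_dicts)
              for vals in zip(*(sd[key] for sd in state_dicts))]
        for key in state_dicts[0]
    }

# Toy example: two "checkpoints" with a single weight vector each.
ckpt_a = {"w": [1.0, 3.0]}
ckpt_b = {"w": [3.0, 5.0]}
merged = average_checkpoints([ckpt_a, ckpt_b])
print(merged["w"])  # [2.0, 4.0]
```

In practice the same averaging is applied to the model's tensors (e.g. via `torch` state dicts); the point is only that merged weights are a simple mean of the selected checkpoints.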
Our small model, Goedel-Prover-V2-8B, reaches 83.0% on the MiniF2F test set at Pass@32, matching the performance of the prior state-of-the-art DeepSeek-Prover-V2-671B while being nearly 100 times smaller in model size. Our flagship model, Goedel-Prover-V2-32B, achieves 88.0% on MiniF2F at Pass@32 in standard mode and 90.4% in self-correction mode, outperforming the prior SOTA DeepSeek-Prover-V2-671B and the concurrent Kimina-Prover-72B by a large margin. Additionally, our flagship model with self-correction solves 64 problems on PutnamBench at Pass@64, securing 1st place on the leaderboard and surpassing DeepSeek-Prover-V2-671B's record of 47 problems at Pass@1024.
## 2. Benchmark Performance
<strong>Self-correction mode</strong>: The model improves proof quality by first generating an initial candidate and then using Lean compiler feedback to iteratively revise it. We perform two rounds of self-correction, which remains computationally efficient: the total output length (including the initial proof and two revisions) increases only modestly, from the standard 32K to 40K tokens.
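A hedged sketch of this loop in Python: the model drafts a proof, the Lean compiler checks it, and any compiler errors are fed back for revision. Here `generate_proof` and `check_with_lean` are hypothetical stand-ins for the model call and the Lean toolchain, not APIs we ship.

```python
def self_correct(statement, generate_proof, check_with_lean, max_rounds=2):
    """Generate a proof, then revise it up to `max_rounds` times using
    compiler feedback (self-correction mode uses two revision rounds)."""
    proof = generate_proof(statement, feedback=None)
    for _ in range(max_rounds):
        ok, errors = check_with_lean(statement, proof)
        if ok:
            return proof
        # Ask the model for a revision conditioned on the error messages.
        proof = generate_proof(statement, feedback=errors)
    return proof

# Toy demonstration with a stubbed model and compiler:
attempts = iter(["broken proof", "valid proof"])
fake_generate = lambda stmt, feedback: next(attempts)
fake_check = lambda stmt, proof: (proof == "valid proof", "unsolved goals")
print(self_correct("theorem ...", fake_generate, fake_check))  # valid proof
```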
<style>
.fig-row {
display: flex;
justify-content: space-between; /* spread them out */
align-items: flex-start; /* align tops */
gap: 1rem; /* space between images */
}
.fig-row img {
display: block;
width: 100%;
height: auto;
}
.fig-row .panel {
/* override per‐panel width as needed */
/* e.g. .panel-1 { width:25%; } .panel-2 { width:40%; } etc. */
}
figure {
margin: 0;
}
figure figcaption {
text-align: center;
font-size: 0.9em;
margin-top: 0.75rem;
color: #555;
}
figure figcaption strong {
font-weight: bold;
}
/* Italicize the rest of the caption */
figure figcaption em {
font-style: italic;
}
</style>
<figure>
<div class="fig-row">
<div class="panel panel-1" style="width:100%;">
<img src="https://github.com/Goedel-LM/Goedel-Prover-V2/blob/main/assets/combined_performance_plots_varied_width.png?raw=true" alt="…">
</div>
</div>
<figcaption>
<strong>Figure 1</strong>: <em>Pass@32 performance on MiniF2F, PutnamBench, and our new MathOlympiadBench containing 360 IMO-level problems.</em>
</figcaption>
</figure>
The charts above demonstrate the state-of-the-art performance of Goedel-Prover-V2. All numbers are reported at Pass@32: (1) across all three datasets, our flagship 32B model, in both standard and self-correction mode, significantly outperforms the prior state-of-the-art DeepSeek-Prover-V2-671B and Kimina-Prover-72B; (2) on MiniF2F, our 8B model matches the performance of DeepSeek-Prover-V2-671B while being nearly 100 times smaller in model size.
<div align="center">
<table style="margin: 0 auto;">
<thead>
<tr>
<th>#</th>
<th>Model</th>
<th>num‑solved</th>
<th>compute</th>
</tr>
</thead>
<tbody>
<tr><td>1</td><td><strong>Goedel-Prover-V2-32B (self-correction mode)</strong></td><td><strong>64</strong></td><td><strong>Pass@64</strong></td></tr>
<tr><td>1</td><td><strong>Goedel-Prover-V2-32B (self-correction mode)</strong></td><td><strong>57</strong></td><td><strong>Pass@32</strong></td></tr>
<tr><td>1</td><td><strong>Goedel-Prover-V2-32B</strong></td><td><strong>43</strong></td><td><strong>Pass@32</strong></td></tr>
<tr><td>2</td><td>DeepSeek‑Prover‑V2-671B</td><td>47</td><td>Pass@1024</td></tr>
<tr><td>2</td><td>DeepSeek‑Prover‑V2-671B</td><td>22</td><td>Pass@32</td></tr>
<tr><td>3</td><td>DSP+</td><td>23</td><td>Pass@128</td></tr>
<tr><td>4</td><td>Kimina‑Prover‑7B‑Distill</td><td>10</td><td>Pass@192</td></tr>
<tr><td>5</td><td>Self-play Theorem Prover</td><td>8</td><td>Pass@3200</td></tr>
<tr><td>6</td><td>Goedel-Prover-V1</td><td>7</td><td>Pass@512</td></tr>
</tbody>
</table>
<!-- table caption -->
<p><strong>Table 1</strong>: <em>PutnamBench leaderboard. Goedel-Prover-V2-32B secures the top rank with significantly less compute (pass number) than the previous state-of-the-art.</em></p>
</div>
## 3. Compelling Scaling Performance
<figure>
<div class="fig-row">
<div class="panel panel-1" style="width:80%;">
<img src="https://github.com/Goedel-LM/Goedel-Prover-V2/blob/main/assets/inference_scale_performance.png?raw=true" alt="…">
</div>
</div>
<figcaption>
<strong>Figure 2</strong>: <em>Performance on MiniF2F test set under different sample budgets.</em>
</figcaption>
</figure>
The scaling curves above show that our 32B model consistently outperforms all prior state-of-the-art models across the entire range of inference-time compute budgets.
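The Pass@k numbers reported throughout are typically computed with the standard unbiased estimator pass@k = 1 - C(n-c, k)/C(n, k), where c of n sampled proofs compile. A minimal implementation (illustrative; not necessarily the exact evaluation script used here):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: the probability that at least one of k
    samples, drawn without replacement from n attempts of which c
    succeeded, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# If 2 of 4 sampled proofs compile, pass@2 = 1 - C(2,2)/C(4,2) = 5/6:
print(pass_at_k(4, 2, 2))  # ≈ 0.833
```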
## 4. Model & Dataset Downloads
We release our Goedel-Prover-V2 models and the new MathOlympiadBench benchmark to foster future research.
<div align="center">
| Model | Download |
| -------- | -------- |
| Goedel-Prover-V2-32B | [🤗Download](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B) |
| Goedel-Prover-V2-8B | [🤗Download](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-8B) |
</div>
<div align="center">
| Dataset | Download |
| -------- | -------- |
| MathOlympiadBench | [🤗Download](https://huggingface.co/datasets/Goedel-LM/MathOlympiadBench) |
</div>
<strong>MathOlympiadBench</strong> comprises human-verified formalizations of Olympiad-level mathematical competition problems, sourced from the Compfiles and IMOSLLean4 repositories. MathOlympiadBench contains 360 problems, including 158 IMO problems from 1959 to 2024, 131 IMO shortlist problems from 2006 to 2023, 68 regional mathematical Olympiad problems, and 3 additional mathematical puzzles.
This model is being released to aid other open-source projects, including those geared towards the upcoming IMO competition. A full paper with all details will be released in the coming weeks.
## 5. Quick Start
You can directly use [Huggingface's Transformers](https://github.com/huggingface/transformers) for model inference.
```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(30)

model_id = "Goedel-LM/Goedel-Prover-V2-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# The theorem to prove; the proof body is left as `sorry` for the model to fill in.
formal_statement = """
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem square_equation_solution {x y : ℝ} (h : x^2 + y^2 = 2*x - 4*y - 5) : x + y = -1 := by
  sorry
""".strip()

prompt = """
Complete the following Lean 4 code:
```lean4
{}```
Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies.
The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
""".strip()

chat = [
    {"role": "user", "content": prompt.format(formal_statement)},
]
inputs = tokenizer.apply_chat_template(
    chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

start = time.time()
outputs = model.generate(inputs, max_new_tokens=32768)
print(tokenizer.batch_decode(outputs))
print(f"Generation took {time.time() - start:.1f}s")
```
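The model's response wraps the final proof in a ```lean4 fence following the prompt format above; a small helper (illustrative, not part of the released tooling) to pull it out before sending it to Lean:

```python
import re

def extract_lean_block(text):
    """Return the contents of the last lean4-fenced block in `text`, or None."""
    blocks = re.findall(r"```lean4\n(.*?)```", text, re.DOTALL)
    return blocks[-1].strip() if blocks else None

# Toy response: proof plan followed by the fenced Lean proof.
sample = "Proof plan...\n```lean4\ntheorem t : 1 = 1 := rfl\n```\nDone."
print(extract_lean_block(sample))  # theorem t : 1 = 1 := rfl
```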
### Cite
```bibtex
@misc{lin2025goedelproverv2,
title={Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date},
author={Yong Lin and Shange Tang and Bohan Lyu and Ziran Yang and Jui-Hui Chung and Haoyu Zhao and Lai Jiang and Yihan Geng and Jiawei Ge and Jingruo Sun and Jiayun Wu and Jiri Gesi and David Acuna and Kaiyu Yang and Hongzhou Lin and Yejin Choi and Danqi Chen and Sanjeev Arora and Chi Jin},
year={2025}
}
``` |