MiniF2F & Open-Source Lean4 Theorem Proving

Benchmark context for developers searching for an open-source Lean4 theorem prover with strong MiniF2F-Test scores.

What is MiniF2F?

MiniF2F is a standard benchmark for neural theorem proving: a collection of 488 olympiad-style and competition problems formalized for proof assistants (Lean in the evaluations discussed here). MiniF2F-Test is the 244-problem held-out split on which prover models are usually compared.

Unlike math QA benchmarks that only require a final numeric answer, MiniF2F requires a complete proof that the proof assistant accepts; for the results below, that means a Lean4 proof.
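
To make the difference from answer-only benchmarks concrete, here is a toy statement in the MiniF2F style. It is not taken from the benchmark, the theorem name is made up, and it assumes a recent Lean4 toolchain in which the `omega` tactic is available without extra imports.

```lean
-- Hypothetical toy problem: knowing the answer "4" is not enough;
-- Lean must accept a proof that x = 4 follows from the hypothesis.
theorem toy_linear (x : Nat) (h : x + 3 = 7) : x = 4 := by
  omega
```

A submission counts only if Lean accepts the whole file without errors and without placeholders such as `sorry`.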

LongCat-Flash-Prover on MiniF2F-Test

LongCat-Flash-Prover is an open-source Lean4 theorem prover from Meituan LongCat. With Tool-Integrated Reasoning (TIR) and a 72-attempt budget, it reports:

  • 97.1% pass rate on MiniF2F-Test
  • 100% auto-formalization on MiniF2F-Test & ProofNet
  • 46.7% on MathOlympiad-Bench (180-attempt budget)
  • 41.5% on PutnamBench (118-attempt budget)

The model decomposes proving into auto-formalization, sketching, and whole-proof generation with Lean4 server verification.
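
The staged decomposition described above can be sketched roughly as follows. This is an illustrative outline only, not LongCat-Flash-Prover's actual code: the `Stages` container, the `attempt` and `solve` functions, and all four callables are hypothetical stand-ins for model calls and a Lean4 server.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Stages:
    """Hypothetical hooks; real systems would call a model and a Lean4 server."""
    formalize: Callable[[str], str]    # informal problem -> Lean4 statement
    sketch: Callable[[str], str]       # statement -> proof outline
    prove: Callable[[str, str], str]   # (statement, outline) -> full proof text
    verify: Callable[[str], bool]      # full proof text -> accepted by Lean4?


def attempt(problem: str, stages: Stages) -> Optional[str]:
    """Run one pass through all stages; return the proof only if it verifies."""
    statement = stages.formalize(problem)
    outline = stages.sketch(statement)
    proof = stages.prove(statement, outline)
    return proof if stages.verify(proof) else None


def solve(problem: str, stages: Stages, budget: int) -> Optional[str]:
    """Retry independent attempts until one verifies or the budget runs out."""
    for _ in range(budget):
        proof = attempt(problem, stages)
        if proof is not None:
            return proof
    return None
```

Here `budget` plays the role of the attempt budget quoted with each score: a problem counts as solved if any attempt within the budget verifies.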

Why Lean4 for open-source provers?

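Lean4 gives a prover a mechanical check at every step: tactics are elaborated as they run, so an error is reported at the offending line, and the finished proof term is checked by Lean's small trusted kernel. Open-source prover stacks typically pair a language model with a Lean4 server, search, and TIR feedback loops.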
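
A feedback loop of this kind might look like the minimal sketch below. The function name and both callables are assumptions made for illustration: `generate` stands in for a language model that sees earlier error messages, and `check` stands in for a Lean4 server that returns a success flag and a message.

```python
from typing import Callable, List, Optional, Tuple


def prove_with_feedback(
    statement: str,
    generate: Callable[[str, List[str]], str],   # hypothetical model call
    check: Callable[[str], Tuple[bool, str]],    # hypothetical Lean4 check
    max_rounds: int,
) -> Optional[str]:
    """Regenerate a proof, feeding verifier messages back, until it checks."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(statement, feedback)
        ok, message = check(candidate)
        if ok:
            return candidate
        # Keep the verifier's error so the next round can react to it.
        feedback.append(message)
    return None
```

The design choice worth noting is that the verifier is the only source of truth: the loop never returns a candidate that `check` has not accepted.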

Scores depend on attempt budget, sampling, and verifier configuration, so always cite the evaluation protocol when comparing models.
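
As a small illustration of why the attempt budget matters, the sketch below recomputes a pass rate from per-problem attempt logs at different budgets. The function name and the log format (problem id mapped to a list of verified/not-verified outcomes in attempt order) are assumptions for this example, not a published evaluation script.

```python
from typing import Dict, List


def pass_rate_at_budget(attempt_logs: Dict[str, List[bool]], budget: int) -> float:
    """Fraction of problems with at least one verified proof in the first `budget` attempts."""
    if not attempt_logs:
        raise ValueError("attempt_logs must contain at least one problem")
    solved = sum(any(results[:budget]) for results in attempt_logs.values())
    return solved / len(attempt_logs)
```

With the same logs, a larger budget can only keep the pass rate equal or raise it, which is why two scores are comparable only when their budgets (and sampling and verifier settings) match.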