LongCat-Flash-Prover

From correct answers to rigorous, machine-checkable proofs

Overview

LongCat-Flash-Prover is designed for formal mathematics and theorem proving. Unlike standard math QA tasks that only require a final numeric answer, theorem proving requires a strict step-by-step logical chain. Any ambiguity can invalidate the whole argument.

To address this, LongCat-Flash-Prover combines formal reasoning with Tool-Integrated Reasoning (TIR). With a budget of only 72 inference attempts, it reaches a 97.1% pass rate on MiniF2F-Test, a new open-source state of the art among prover models. It also achieves 46.7% on MathOlympiad-Bench (180-attempt budget) and 41.5% on PutnamBench (118-attempt budget).

Why Lean4 Formal Language?

Natural language is inherently ambiguous and difficult to verify at proof-step granularity. Lean4 can be viewed as a "programming language for mathematics": just as source code is compiled and checked, Lean4 proofs are validated line by line. If proof code is syntactically valid and accepted by the Lean4 verifier, the theorem is rigorously proven.
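To make this concrete, here is a small illustrative Lean4 proof (not from the paper; it assumes Mathlib is imported for the `Even` predicate). If the Lean4 verifier accepts this code, the statement is rigorously proven:

```lean
import Mathlib

-- Example: the sum of two even numbers is even.
theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨x, hx⟩ := ha   -- a = x + x
  obtain ⟨y, hy⟩ := hb   -- b = y + y
  exact ⟨x + y, by omega⟩ -- a + b = (x + y) + (x + y)
```

Each tactic line is checked by the kernel; a single unjustified step makes the whole file fail to compile.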

Three Atomic Capabilities

  • Auto-Formalization: Translate informal math statements into precise Lean4 formal statements
  • Sketching: Draft a high-level proof plan and decompose difficult goals into lemmas
  • Proving: Complete the formal proof details following the sketch and verifier feedback
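The division of labor between Sketching and Proving can be pictured with a hypothetical Lean4 fragment (illustrative only): the sketch states the plan and leaves subgoals as `sorry` placeholders, which the Proving stage then fills in under verifier feedback.

```lean
-- Hypothetical sketch output: the plan is stated with `have`, and the
-- hard subgoal is deferred with `sorry` for the Proving stage.
theorem example_goal (n : ℕ) (h : 2 ≤ n) : 2 ≤ n ^ 2 := by
  have key : n ≤ n ^ 2 := by sorry  -- lemma to be proven later
  exact le_trans h key              -- glue step combining h and key
```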

Expert Iteration with TIR

The training pipeline uses a mixed-expert iterative framework. Different experts synthesize trajectories for Auto-Formalization, Sketching, and Proving in both single-turn and multi-turn (TIR) modes, enabling self-correction and progressive curriculum learning.

Cold-Start Phase

  • Use ATF-32B (early Auto-Formalizer) to synthesize formal statements
  • Use LongCat-Flash-Thinking-2601 to produce high-quality trajectories with Lean4 feedback
  • Apply filtering, de-duplication, and diversity/difficulty sampling
  • Train a unified cold-start model with mixed-domain SFT
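The filtering and sampling steps above can be sketched as follows; the record fields (`lean_ok`, `statement`, `pass_rate`) and the difficulty threshold are hypothetical stand-ins, since the actual pipeline details are not public:

```python
# Minimal sketch of cold-start data curation: filter on verifier
# success, de-duplicate statements, then sample by difficulty bucket.
def curate(trajectories, per_bucket=1000):
    # 1. Filter: keep only trajectories that passed the Lean4 check.
    verified = [t for t in trajectories if t["lean_ok"]]

    # 2. De-duplicate on the formal statement text.
    seen, unique = set(), []
    for t in verified:
        key = t["statement"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(t)

    # 3. Difficulty sampling: bucket by empirical pass rate so hard
    #    problems are not drowned out by easy ones.
    buckets = {}
    for t in unique:
        level = "hard" if t["pass_rate"] < 0.2 else "easy"
        buckets.setdefault(level, []).append(t)
    return {k: v[:per_bucket] for k, v in buckets.items()}
```
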

Iteration Phase

  • Use the current iteration model as the next expert
  • Synthesize fresh trajectories per task and per capability
  • Mix in general-purpose data to preserve non-formal reasoning utility
  • Run repeated SFT + RL iterations to obtain the final Prover model

Data Synthesis Workflow

  1. Auto-Formalization: Generate N formal statements from an informal statement, verify syntax with Lean4 and semantics with a consistency judge; if all fail, switch to a TIR correction loop
  2. Whole-Proof: Generate N full proofs and validate with Lean4 plus theorem-consistency checks; if all fail, use TIR feedback or switch strategy
  3. Sketch-Proof: Generate lemma-based sketches, then prove each lemma with TIR-assisted proving
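The Whole-Proof step can be sketched as a best-of-N loop with a TIR fallback. The function names (`generate_proofs`, `lean_check`, `tir_repair`) are placeholders for the model and verifier calls, which are not public:

```python
# Sketch of the whole-proof synthesis loop: sample N candidates
# single-turn, fall back to multi-turn TIR repair if all fail.
def synthesize_proof(statement, generate_proofs, lean_check, tir_repair, n=8):
    # Single-turn: keep the first candidate that compiles and passes
    # the theorem-consistency check.
    for proof in generate_proofs(statement, n):
        if lean_check(statement, proof):
            return proof
    # All N failed: multi-turn TIR loop, feeding verifier error
    # messages back to the model for self-correction.
    return tir_repair(statement)
```
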

Verification Tools

  • Lean4 Server: Syntax and compilation verification for statements, sketches, and proofs
  • Semantic Consistency: LLM-as-a-judge checks alignment between informal and formal statements
  • Theorem Consistency: Rule-based checks prevent target tampering in proofs
  • Legality Validation: Detects cheating patterns such as early exits, fake axioms, and macro-based bypass tricks
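A simplified legality check might look like the following. The real validator is AST-based and covers more exploits; these string patterns are purely illustrative:

```python
import re

# Hypothetical patterns that can let a proof "pass" without proving
# anything; a production checker would inspect the parsed AST instead.
SUSPICIOUS = [
    r"\bsorry\b",   # placeholder that leaves the goal unproven
    r"\badmit\b",   # another escape hatch
    r"\baxiom\b",   # introducing a fake axiom to close the goal
    r"macro\s",     # macro tricks that redefine tactics
]

def is_legal(proof_text: str) -> bool:
    """Return False if the proof text matches a known cheating pattern."""
    return not any(re.search(p, proof_text) for p in SUSPICIOUS)
```
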

RL Stability and Anti-Cheating

During RL training, the team observed reward-hacking behavior when relying only on compiler pass/fail signals. To counter this, LongCat-Flash-Prover combines semantic and theorem-consistency checks with legality parsing, including AST-based inspection of known exploit patterns.

For MoE stability under TIR RL, training introduces hierarchical masking and staleness control:

  • Sequence-level masking: Filter sequences with excessive train-infer mismatch
  • Token-level masking: Remove unstable tokens with high mismatch
  • Token-level clipping: Control stale updates for stable optimization
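The three levels above can be sketched with per-token importance ratios between the training and inference engines. All thresholds here are illustrative, not the paper's values; `train_lp` and `infer_lp` are per-token log-probabilities of the same sampled tokens under the two engines:

```python
import math

# Sketch of hierarchical masking and clipping for train-infer mismatch.
def hierarchical_mask(train_lp, infer_lp, tok_thresh=2.0,
                      seq_thresh=1.5, clip=1.2):
    ratios = [math.exp(t - i) for t, i in zip(train_lp, infer_lp)]

    # Sequence-level masking: drop the whole sequence if the mean
    # mismatch is too large.
    mean_ratio = sum(ratios) / len(ratios)
    if mean_ratio > seq_thresh or mean_ratio < 1 / seq_thresh:
        return [0.0] * len(ratios)

    weights = []
    for r in ratios:
        if r > tok_thresh or r < 1 / tok_thresh:
            # Token-level masking: zero out tokens with extreme mismatch.
            weights.append(0.0)
        else:
            # Token-level clipping: bound surviving importance weights
            # to control stale updates.
            weights.append(max(min(r, clip), 1 / clip))
    return weights
```
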

Benchmark Highlights

  • MiniF2F-Test: 97.1% pass rate with 72-attempt budget
  • Auto-Formalization: 100% on MiniF2F-Test and ProofNet
  • MathOlympiad-Bench: 46.7% (180-attempt budget)
  • PutnamBench: 41.5% (118-attempt budget)
  • Sketching gain: ~10% higher success rate under an equal compute budget

Vision

LongCat-Flash-Prover aims to move AI from "answers that look correct" to "proofs that are verifiably correct." The long-term goal is to help translate textbooks and frontier papers into formal language, expand formal math knowledge bases, and support new research workflows in mathematics.