LongCat-Flash-Prover

From correct answers to rigorous, machine-checkable proofs

Overview

LongCat-Flash-Prover is designed for formal mathematics and theorem proving. Unlike standard math QA tasks that only require a final numeric answer, theorem proving requires a strict step-by-step logical chain. Any ambiguity can invalidate the whole argument.

To address this, LongCat-Flash-Prover combines formal reasoning with Tool-Integrated Reasoning (TIR). With a budget of only 72 inference attempts, it reaches a 97.1% pass rate on MiniF2F-Test, a new open-source state of the art among prover models. It also achieves 46.7% on MathOlympiad-Bench (180-attempt budget) and 41.5% on PutnamBench (118-attempt budget).

Why Lean4 Formal Language?

Natural language is inherently ambiguous and difficult to verify at proof-step granularity. Lean4 can be viewed as a "programming language for mathematics": just as source code is compiled and checked, Lean4 proofs are validated line by line. If proof code is syntactically valid and accepted by the Lean4 verifier, the theorem is rigorously proven.
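To make this concrete, here is a small illustrative Lean4 proof (not from the paper; it assumes Mathlib is imported for the `Even` predicate). If the Lean4 verifier accepts this code, the statement is rigorously proven:

```lean
import Mathlib

-- Example: the sum of two even numbers is even.
theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨x, hx⟩ := ha   -- a = x + x
  obtain ⟨y, hy⟩ := hb   -- b = y + y
  exact ⟨x + y, by omega⟩ -- a + b = (x + y) + (x + y)
```

Each tactic line is checked by the kernel; a single unjustified step makes the whole file fail to compile.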

Three Atomic Capabilities

  • Auto-Formalization: Translate informal math statements into precise Lean4 formal statements
  • Sketching: Draft a high-level proof plan and decompose difficult goals into lemmas
  • Proving: Complete the formal proof details following the sketch and verifier feedback
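The division of labor between Sketching and Proving can be pictured with a hypothetical Lean4 fragment (illustrative only): the sketch states the plan and leaves subgoals as `sorry` placeholders, which the Proving stage then fills in under verifier feedback.

```lean
-- Hypothetical sketch output: the plan is stated with `have`, and the
-- hard subgoal is deferred with `sorry` for the Proving stage.
theorem example_goal (n : ℕ) (h : 2 ≤ n) : 2 ≤ n ^ 2 := by
  have key : n ≤ n ^ 2 := by sorry  -- lemma to be proven later
  exact le_trans h key              -- glue step combining h and key
```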

Expert Iteration with TIR

The training pipeline uses a mixed-expert iterative framework. Different experts synthesize trajectories for Auto-Formalization, Sketching, and Proving in both single-turn and multi-turn (TIR) modes, enabling self-correction and progressive curriculum learning.

Cold-Start Phase

  • Use ATF-32B (early Auto-Formalizer) to synthesize formal statements
  • Use LongCat-Flash-Thinking-2601 to produce high-quality trajectories with Lean4 feedback
  • Apply filtering, de-duplication, and diversity/difficulty sampling
  • Train a unified cold-start model with mixed-domain SFT
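The filtering and sampling steps above can be sketched as follows; the record fields (`lean_ok`, `statement`, `pass_rate`) and the difficulty threshold are hypothetical stand-ins, since the actual pipeline details are not public:

```python
# Minimal sketch of cold-start data curation: filter on verifier
# success, de-duplicate statements, then sample by difficulty bucket.
def curate(trajectories, per_bucket=1000):
    # 1. Filter: keep only trajectories that passed the Lean4 check.
    verified = [t for t in trajectories if t["lean_ok"]]

    # 2. De-duplicate on the formal statement text.
    seen, unique = set(), []
    for t in verified:
        key = t["statement"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(t)

    # 3. Difficulty sampling: bucket by empirical pass rate so hard
    #    problems are not drowned out by easy ones.
    buckets = {}
    for t in unique:
        level = "hard" if t["pass_rate"] < 0.2 else "easy"
        buckets.setdefault(level, []).append(t)
    return {k: v[:per_bucket] for k, v in buckets.items()}
```
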

Iteration Phase

  • Use the current iteration model as the next expert
  • Synthesize fresh trajectories per task and per capability
  • Mix in general-purpose data to preserve non-formal reasoning utility
  • Run repeated SFT + RL iterations to obtain the final Prover model

Data Synthesis Workflow

  1. Auto-Formalization: Generate N formal statements from an informal statement, verify syntax with Lean4 and semantics with a consistency judge; if all fail, switch to a TIR correction loop
  2. Whole-Proof: Generate N full proofs and validate with Lean4 plus theorem-consistency checks; if all fail, use TIR feedback or switch strategy
  3. Sketch-Proof: Generate lemma-based sketches, then prove each lemma with TIR-assisted proving
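The Whole-Proof step can be sketched as a best-of-N loop with a TIR fallback. The function names (`generate_proofs`, `lean_check`, `tir_repair`) are placeholders for the model and verifier calls, which are not public:

```python
# Sketch of the whole-proof synthesis loop: sample N candidates
# single-turn, fall back to multi-turn TIR repair if all fail.
def synthesize_proof(statement, generate_proofs, lean_check, tir_repair, n=8):
    # Single-turn: keep the first candidate that compiles and passes
    # the theorem-consistency check.
    for proof in generate_proofs(statement, n):
        if lean_check(statement, proof):
            return proof
    # All N failed: multi-turn TIR loop, feeding verifier error
    # messages back to the model for self-correction.
    return tir_repair(statement)
```
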

Verification Tools

  • Lean4 Server: Syntax and compilation verification for statements, sketches, and proofs
  • Semantic Consistency: LLM-as-a-judge checks alignment between informal and formal statements
  • Theorem Consistency: Rule-based checks prevent target tampering in proofs
  • Legality Validation: Detects cheating patterns such as early exits, fake axioms, and macro-based bypass tricks
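A simplified legality check might look like the following. The real validator is AST-based and covers more exploits; these string patterns are purely illustrative:

```python
import re

# Hypothetical patterns that can let a proof "pass" without proving
# anything; a production checker would inspect the parsed AST instead.
SUSPICIOUS = [
    r"\bsorry\b",   # placeholder that leaves the goal unproven
    r"\badmit\b",   # another escape hatch
    r"\baxiom\b",   # introducing a fake axiom to close the goal
    r"macro\s",     # macro tricks that redefine tactics
]

def is_legal(proof_text: str) -> bool:
    """Return False if the proof text matches a known cheating pattern."""
    return not any(re.search(p, proof_text) for p in SUSPICIOUS)
```
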

RL Stability and Anti-Cheating

During RL training, the team observed reward-hacking behavior when relying only on compiler pass/fail signals. To counter this, LongCat-Flash-Prover combines semantic and theorem-consistency checks with legality parsing, including AST-based inspection of known exploit patterns.

For MoE stability under TIR RL, training introduces hierarchical masking and staleness control:

  • Sequence-level masking: Filter sequences with excessive train-infer mismatch
  • Token-level masking: Remove unstable tokens with high mismatch
  • Token-level clipping: Control stale updates for stable optimization
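The three levels above can be sketched with per-token importance ratios between the training and inference engines. All thresholds here are illustrative, not the paper's values; `train_lp` and `infer_lp` are per-token log-probabilities of the same sampled tokens under the two engines:

```python
import math

# Sketch of hierarchical masking and clipping for train-infer mismatch.
def hierarchical_mask(train_lp, infer_lp, tok_thresh=2.0,
                      seq_thresh=1.5, clip=1.2):
    ratios = [math.exp(t - i) for t, i in zip(train_lp, infer_lp)]

    # Sequence-level masking: drop the whole sequence if the mean
    # mismatch is too large.
    mean_ratio = sum(ratios) / len(ratios)
    if mean_ratio > seq_thresh or mean_ratio < 1 / seq_thresh:
        return [0.0] * len(ratios)

    weights = []
    for r in ratios:
        if r > tok_thresh or r < 1 / tok_thresh:
            # Token-level masking: zero out tokens with extreme mismatch.
            weights.append(0.0)
        else:
            # Token-level clipping: bound surviving importance weights
            # to control stale updates.
            weights.append(max(min(r, clip), 1 / clip))
    return weights
```
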

Benchmark Highlights

  • MiniF2F-Test: 97.1% pass rate with 72-attempt budget
  • Auto-Formalization: 100% on MiniF2F-Test and ProofNet
  • MathOlympiad-Bench: 46.7% (180-attempt budget)
  • PutnamBench: 41.5% (118-attempt budget)
  • Sketching gain: ~10% higher success rate under an equal compute budget

Vision

LongCat-Flash-Prover aims to move AI from "answers that look correct" to "proofs that are verifiably correct." The long-term goal is to help translate textbooks and frontier papers into formal language, expand formal math knowledge bases, and support new research workflows in mathematics.