# Benchmarks

Performance metrics and comparisons for LongCat AI models.

## Text-Based Benchmarks
Representative results reported by the authors (non-exhaustive):
| Category | Benchmark | Metric | LongCat-Flash |
|---|---|---|---|
| General Domains | MMLU | acc | 89.71 |
| Instruction Following | IFEval | acc | 89.65 |
| Math Reasoning | MATH500 | acc | 96.40 |
| General Reasoning | DROP | F1 | 79.06 |
| Coding | HumanEval+ | pass@1 | 88.41 |
| Agentic Tool Use | τ²-Bench (telecom) | avg@4 | 73.68 |
## UNO-Bench: Unified All-Modality Benchmark

The LongCat team releases UNO-Bench, a unified, high-quality all-modality benchmark designed to evaluate both single-modality and omni-modality intelligence under one framework, with strong Chinese-language support. It also reveals a "Combination Law" governing omni-modality performance.
### Why UNO-Bench?
- One-stop benchmark: evaluates image, audio, video, and text, plus their fusion
- High-quality curation: 1,250 omni samples and 2,480 single-modality samples; 98% require cross-modal fusion
- Chinese-centric: robust Chinese scenarios and tasks
- Open-ended reasoning: Multi-step Open-ended (MO) questions with weighted human grading, plus an automatic scoring model with 95% accuracy
### Combination Law (Synergistic Promotion)

Omni-modality performance follows a power law over single-modality abilities (audio and visual perception):

P_Omni ≈ 1.0332 · (P_A × P_V)^2.1918 + 0.2422

- Bottleneck effect: models with weak single-modality abilities improve slowly
- Synergistic gain: stronger models exhibit accelerated improvement (1 + 1 ≫ 2)
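As an illustration, the fitted law can be turned into a small helper; the coefficients are taken verbatim from the formula above, while the function name and sample inputs are ours:

```python
def predict_omni(p_audio: float, p_visual: float) -> float:
    """Predicted omni-modality score from single-modality audio
    (p_audio) and visual (p_visual) perception scores, using the
    power-law fit reported for UNO-Bench."""
    return 1.0332 * (p_audio * p_visual) ** 2.1918 + 0.2422

# The super-linear exponent (2.1918 > 1) yields both regimes noted
# above: weak single-modality ability compounds into slow omni gains
# (bottleneck), while strong ability compounds rapidly (synergy).
weak = predict_omni(0.5, 0.5)
strong = predict_omni(0.9, 0.9)
```

Raising both inputs from 0.5 to 0.9 roughly triples the predicted omni score, which is the "1 + 1 ≫ 2" effect described in the bullets above.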
### Data Pipeline & Quality

- Manual curation to avoid contamination; over 90% of visuals are private, crowdsourced material
- Audio-visual decoupling: independently designed/recorded audio paired with video to force real fusion
- Ablation checks: modality removal verifies cross-modal solvability (≥98%)
- Cluster-guided sampling: >90% compute reduction with rank consistency (SRCC/PLCC > 0.98)
### Highlights
- Omni SOTA: LongCat-Flash-Omni leads among open-source models on UNO-Bench
- Reasoning gap: spatial/temporal/complex reasoning remains the key separator among top models
### Multi-Modal Leaderboard

LongCat-Flash-Omni achieves open-source SOTA across modalities.
| Suite | Metric | LongCat-Flash-Omni | Qwen3-Omni | Gemini-2.5-Flash | Gemini-2.5-Pro |
|---|---|---|---|---|---|
| Omni-Bench | avg | SOTA | - | - | - |
| WorldSense | avg | SOTA | - | - | - |
*Replace placeholder entries ("SOTA", "-") with exact numbers when available.*
## Model-Specific Benchmarks

### Flash-Thinking
- AIME25: 64.5% token savings (from 19,653 tokens down to 6,965)
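The headline savings figure follows directly from the two token counts. As a quick arithmetic check, the exact ratio comes to ≈64.6%, consistent with the reported ~64.5% given rounding of the underlying averages:

```python
tokens_baseline = 19_653  # average output tokens on AIME25
tokens_reduced = 6_965    # reported reduced token count
savings_pct = 100 * (1 - tokens_reduced / tokens_baseline)
print(f"{savings_pct:.1f}% token savings")  # prints "64.6% token savings"
```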
*Values are summarized from public reports; consult the official resources for full details and evaluation conditions.*