# Benchmarks

Performance metrics and comparisons for LongCat AI models.

## Text-Based Benchmarks
Representative results reported by the authors (non-exhaustive):
| Category | Benchmark | Metric | LongCat-Flash |
|---|---|---|---|
| General Domains | MMLU | acc | 89.71 |
| Instruction Following | IFEval | acc | 89.65 |
| Math Reasoning | MATH500 | acc | 96.40 |
| General Reasoning | DROP | F1 | 79.06 |
| Coding | HumanEval+ | pass@1 | 88.41 |
| Agentic Tool Use | τ²-Bench (telecom) | avg@4 | 73.68 |
## UNO-Bench: Unified All-Modality Benchmark

The LongCat team releases UNO-Bench, a unified, high-quality all-modality benchmark designed to evaluate both single-modality and omni-modality intelligence under one framework, with strong Chinese-language support. It also reveals a "Combination Law" governing omni-modality performance.
### Why UNO-Bench?
- One-stop benchmark: evaluates image, audio, video, and text, plus their fusion
- High-quality curation: 1,250 omni samples and 2,480 single-modality samples; 98% require cross-modal fusion
- Chinese-centric: robust Chinese scenarios and tasks
- Open-ended reasoning: Multi-step Open-ended (MO) questions with weighted human grading, plus an automatic scoring model with 95% accuracy
### Combination Law (Synergistic Promotion)

Omni-modality performance follows a power law over single-modality abilities (audio and visual perception):

P_Omni ≈ 1.0332 · (P_A × P_V)^2.1918 + 0.2422

- Bottleneck effect: models with weak single-modality abilities improve slowly
- Synergistic gain: stronger models exhibit accelerated improvement (1 + 1 ≫ 2)
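As an illustration, the fitted law can be turned into a small helper; the coefficients are taken verbatim from the formula above, while the function name and sample inputs are ours:

```python
def predict_omni(p_audio: float, p_visual: float) -> float:
    """Predicted omni-modality score from single-modality audio
    (p_audio) and visual (p_visual) perception scores, using the
    power-law fit reported for UNO-Bench."""
    return 1.0332 * (p_audio * p_visual) ** 2.1918 + 0.2422

# The super-linear exponent (2.1918 > 1) yields both regimes noted
# above: weak single-modality ability compounds into slow omni gains
# (bottleneck), while strong ability compounds rapidly (synergy).
weak = predict_omni(0.5, 0.5)
strong = predict_omni(0.9, 0.9)
```

Raising both inputs from 0.5 to 0.9 roughly triples the predicted omni score, which is the "1 + 1 ≫ 2" effect described in the bullets above.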
### Data Pipeline & Quality

- Manual curation to avoid contamination; over 90% of visuals are private, crowdsourced material
- Audio-visual decoupling: independently designed/recorded audio paired with video to force real fusion
- Ablation checks: modality removal verifies cross-modal solvability (≥98%)
- Cluster-guided sampling: >90% compute reduction with rank consistency (SRCC/PLCC > 0.98)
### Highlights
- Omni SOTA: LongCat-Flash-Omni leads among open-source models on UNO-Bench
- Reasoning gap: spatial/temporal/complex reasoning remains the key separator among top models
### Multi-Modal Leaderboard

LongCat-Flash-Omni achieves open-source SOTA across modalities.
| Suite | Metric | LongCat-Flash-Omni | Qwen3-Omni | Gemini-2.5-Flash | Gemini-2.5-Pro |
|---|---|---|---|---|---|
| Omni-Bench | avg | SOTA | - | - | - |
| WorldSense | avg | SOTA | - | - | - |
*Replace placeholder entries ("SOTA", "-") with exact numbers when available.*
## Model-Specific Benchmarks

### Flash-Thinking
- AIME25: 64.5% token savings (from 19,653 tokens down to 6,965)
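The headline savings figure follows directly from the two token counts. As a quick arithmetic check, the exact ratio comes to ≈64.6%, consistent with the reported ~64.5% given rounding of the underlying averages:

```python
tokens_baseline = 19_653  # average output tokens on AIME25
tokens_reduced = 6_965    # reported reduced token count
savings_pct = 100 * (1 - tokens_reduced / tokens_baseline)
print(f"{savings_pct:.1f}% token savings")  # prints "64.6% token savings"
```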
*Values are summarized from public reports; consult the official resources for full details and evaluation conditions.*