LongCat-Flash-Omni
First open-source real-time all-modality interaction model (November 2025)
Overview
Since September 1, Meituan has released the LongCat-Flash series and open-sourced LongCat-Flash-Chat and LongCat-Flash-Thinking. Today, the family is upgraded with LongCat-Flash-Omni — the first open-source, real-time, all-modality interaction model.
Built upon the series’ efficient Shortcut-Connected MoE (ScMoE) backbone with Zero-Computation Experts, Omni adds multi-modal perception and a speech reconstruction module. Even at a 560B total parameter scale with ~27B activated, it delivers low-latency, real-time audio-video interaction, giving developers an efficient option for multi-modal applications.
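To make the efficiency point concrete, below is a minimal sketch of a mixture-of-experts layer that mixes ordinary FFN experts with zero-computation (identity) experts, so tokens routed to the latter add no matrix multiplies beyond routing. All names, sizes, and the routing policy are illustrative assumptions, not the actual LongCat-Flash implementation.

```python
# Minimal sketch of an MoE layer with "zero-computation" (identity) experts.
# Names, sizes, and routing details are assumptions, not the real model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroComputationMoE(nn.Module):
    def __init__(self, d_model=512, n_ffn_experts=8, n_zero_experts=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.n_ffn_experts = n_ffn_experts
        # The router scores both real FFN experts and identity ("zero-computation") experts.
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        weights, idx = torch.topk(F.softmax(self.router(x), dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.router.out_features):
                mask = idx[:, k] == e
                if not mask.any():
                    continue
                if e < self.n_ffn_experts:
                    y = self.experts[e](x[mask])   # real expert: FFN compute
                else:
                    y = x[mask]                    # zero-computation expert: identity
                out[mask] += weights[mask, k].unsqueeze(-1) * y
        return out

layer = ZeroComputationMoE()
out = layer(torch.randn(16, 512))  # 16 tokens; only routed FFN experts run compute
```

Tokens sent to an identity expert cost essentially nothing, which is one way a large total parameter count can coexist with a small activated-compute budget per token.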
Modalities Supported
- Text: instruction following, reasoning, coding
- Image: VQA, fine-grained recognition, OCR
- Audio: speech understanding, streaming ASR
- Video: temporal reasoning, event grounding
Architectural Highlights
- End-to-end: visual and audio encoders as perceptual front-ends; the LLM directly produces text/speech tokens (see the streaming sketch after this list).
- Speech reconstruction: lightweight audio decoder reconstructs natural speech waveforms for real-time dialogue.
- Unified ScMoE: single-trunk expert routing across modalities with Zero-Computation Experts.
- Streaming-efficient: lightweight codecs (~0.6B parameters each); all modules optimized for streaming inference.
- Efficiency/performance balance: retains LongCat-Flash efficiency while achieving strong multi-modal quality.
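As a rough illustration of the end-to-end flow above, the following sketch wires stub encoders, a stub LLM step, and a stub audio decoder into a streaming loop. Every function is a placeholder standing in for the real modules; chunk sizes and token rates are assumptions.

```python
# Hypothetical streaming loop: chunked audio-video input -> perception tokens
# -> LLM step -> speech tokens -> waveform. All functions are stubs.
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class AVChunk:
    audio_samples: List[float]   # e.g. one second of PCM audio
    video_frames: List[bytes]    # e.g. two frames sampled from that second

def encode_av(chunk: AVChunk) -> List[int]:
    """Stand-in for the visual/audio encoders: map a chunk to perception token ids."""
    return list(range(len(chunk.video_frames) + 1))

def llm_step(context: List[int]) -> List[int]:
    """Stand-in for one decoding step of the ScMoE LLM over the running context."""
    return [context[-1] + 1]

def audio_decoder(speech_tokens: List[int]) -> List[float]:
    """Stand-in for the lightweight speech-reconstruction decoder."""
    return [0.0] * len(speech_tokens)

def interact(av_stream: Iterator[AVChunk]) -> Iterator[List[float]]:
    context: List[int] = []
    for chunk in av_stream:
        context += encode_av(chunk)         # append perception tokens in time order
        speech_tokens = llm_step(context)   # model replies with speech tokens
        yield audio_decoder(speech_tokens)  # reconstruct a waveform chunk for playback

# Two one-second chunks in, two waveform chunks out, without waiting for the full clip.
chunks = [AVChunk([0.0] * 16000, [b"frame"] * 2) for _ in range(2)]
waveforms = list(interact(iter(chunks)))
```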
Scale & Real-Time IO
- 560B total / ~27B active with ScMoE backbone.
- 128K-token context and audio-video sessions longer than 8 minutes for long-horizon dialogue.
- Chunked audio-video feature interleaving for efficient temporal processing (see the sketch after this list).
- Low-latency speech generation with high fidelity.
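The chunked interleaving mentioned above can be pictured as grouping audio and video features into fixed time windows and emitting them in temporal order, so the model always sees synchronized audio and video for each window. The per-window token counts below (10 audio, 2 video) are assumptions for illustration, not the model's actual configuration.

```python
# Sketch of chunk-wise audio-video interleaving: features are grouped into fixed
# time windows and emitted in temporal order. Per-window counts are assumptions.
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def interleave_av(audio_feats: Sequence[T], video_feats: Sequence[T],
                  audio_per_chunk: int = 10, video_per_chunk: int = 2) -> List[T]:
    """Interleave audio and video features window by window."""
    out: List[T] = []
    a = v = 0
    while a < len(audio_feats) or v < len(video_feats):
        out += list(audio_feats[a:a + audio_per_chunk])  # audio tokens for this window
        out += list(video_feats[v:v + video_per_chunk])  # video tokens for the same window
        a += audio_per_chunk
        v += video_per_chunk
    return out

# Three seconds of features become one temporally ordered stream:
# [A0..A9, V0, V1, A10..A19, V2, V3, A20..A29, V4, V5]
stream = interleave_av([f"A{i}" for i in range(30)], [f"V{i}" for i in range(6)])
```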
Progressive Early Multi-Modal Fusion
Heterogeneous modality distributions are addressed with a staged training strategy (sketched as a schedule after this list):
- Stage 0: large-scale text pretraining to build a strong LLM base.
- Stage 1: introduce speech data and align acoustic-language spaces.
- Stage 2: add image–caption pairs and interleaved V-L corpora for vision–language alignment.
- Stage 3: incorporate video for spatio-temporal reasoning; strengthen image datasets.
- Stage 4: extend context from 8K to 128K.
- Stage 5: audio encoder alignment to mitigate discrete-token information loss.
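One way to picture the staged recipe is as a declarative curriculum that a training driver iterates over. The stage names, sampling weights, and the 131072-token value below are illustrative placeholders; only the stage ordering and the 8K-to-128K context extension mirror the description above.

```python
# Hypothetical curriculum encoding of the staged training recipe above.
# Sampling weights and step granularity are placeholders, not real settings.
from dataclasses import dataclass
from typing import Dict

@dataclass
class Stage:
    name: str
    data_mix: Dict[str, float]    # hypothetical sampling weights over data sources
    context_length: int = 8192    # tokens

CURRICULUM = [
    Stage("stage0_text", {"text": 1.0}),
    Stage("stage1_speech_align", {"text": 0.7, "speech": 0.3}),
    Stage("stage2_vision_language", {"text": 0.5, "speech": 0.2, "image": 0.3}),
    Stage("stage3_video", {"text": 0.4, "speech": 0.2, "image": 0.2, "video": 0.2}),
    Stage("stage4_long_context", {"text": 0.4, "speech": 0.2, "image": 0.2, "video": 0.2},
          context_length=131072),
    Stage("stage5_audio_encoder_align", {"text": 0.4, "speech": 0.6},
          context_length=131072),
]

for stage in CURRICULUM:
    print(f"{stage.name}: ctx={stage.context_length}, mix={stage.data_mix}")
```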
Benchmark Highlights
State-of-the-art among open-source models across modalities on comprehensive omni suites (e.g., OmniBench, WorldSense), with strong single-modality performance:
- Text: maintains and improves textual capabilities across domains.
- Image: RealWorldQA 74.8, comparable to Gemini-2.5-Pro and above the open-source Qwen3-Omni; strong on multi-image tasks.
- Audio: strong ASR on LibriSpeech/AISHELL-1 and S2TT on CoVoST2, top audio understanding on TUT2017/Nonspeech7k; close to closed-source models on real-time audio-video interaction.
- Video: state-of-the-art on video-to-text; clearly ahead on short-video understanding; on par with Gemini-2.5-Pro and Qwen3-VL on long video.
- Cross-modal: better than Gemini-2.5-Flash (non-thinking) and on par with Gemini-2.5-Pro (non-thinking); a clear advantage on WorldSense.
Applications & Resources
- Multi-modal assistants and voice agents
- Visual Q&A and scene understanding
- Real-time AI video customer support