LongCat-Flash-Lite
N-gram embedding expansion for lightweight MoE (Released: 2026)
Overview
LongCat-Flash-Lite is a lightweight Mixture-of-Experts (MoE) model that explores a new scaling direction: N-gram embedding expansion. Instead of relying primarily on adding more experts, it allocates a large share of its total parameters to an N-gram embedding layer that improves local-context semantic capture, while keeping inference sparse via dynamic activation.
Key Specs
- Total parameters: 68.5B
- Activated parameters per token: ~2.9B–4.5B
- Embedding allocation: 31.4B (46%) to N-gram embedding layer
- Context length: up to 256K tokens (via YaRN)
- Throughput: 500–700 tokens/s (typical load: 4K input / 1K output, LongCat API)
- Strengths: Agentic tool use and coding
Technology Highlights
N-gram Embedding Layer
The N-gram embedding layer enhances the model’s ability to capture local-context semantics. A hash function maps the current token together with its preceding N-1 tokens to an index into an N-gram embedding table; the looked-up vector is then fused with the token’s base embedding.
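Below is a minimal PyTorch sketch of this mechanism. The rolling polynomial hash, the window length N, the table size, and the additive fusion are illustrative assumptions, not the model’s published choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramEmbedding(nn.Module):
    """Fuse a hashed N-gram embedding with the standard token embedding."""

    def __init__(self, vocab_size: int, dim: int, ngram_table_size: int, n: int = 3):
        super().__init__()
        self.n = n
        self.table_size = ngram_table_size
        self.base = nn.Embedding(vocab_size, dim)         # per-token embedding
        self.ngram = nn.Embedding(ngram_table_size, dim)  # hashed N-gram table

    def ngram_ids(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Roll a polynomial hash over the current token and its N-1
        # predecessors; positions before the sequence start are padded with 0.
        h = torch.zeros_like(token_ids)
        for k in range(self.n):
            prev = F.pad(token_ids, (k, 0))[:, : token_ids.size(1)]  # token at t-k
            h = (h * 1000003 + prev) % self.table_size  # mod keeps ids in range
        return h

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq) -> (batch, seq, dim); simple addition stands in for
        # whatever fusion rule the model actually uses.
        return self.base(token_ids) + self.ngram(self.ngram_ids(token_ids))
```

Because the N-gram path is a single table lookup, its per-token cost stays O(1) however large the table grows, which is what makes the embedding layer a cheap place to park capacity.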
To reduce hash collisions, the design combines several measures (a code sketch follows the list):
- Sub-table decomposition + linear projection: split one large embedding table into several smaller sub-tables with independent hashes, projecting each separately, so a collision in one sub-table is unlikely to recur in another
- Vocabulary size design: carefully select table sizes to lower collision probability
- Embedding amplification: scale or normalize the N-gram embedding before it joins the output so the signal stays effective through residual paths
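The sketch below illustrates the first and third measures under assumed sizes: several hashed sub-tables with distinct hash multipliers, each followed by its own linear projection, plus a scalar amplification before the result joins the residual stream. All sizes, seeds, and the scale value are hypothetical.

```python
import torch
import torch.nn as nn

class SubTableNGramEmbedding(nn.Module):
    """Sub-table decomposition: a key that collides in one sub-table is
    unlikely to collide in another, because each sub-table hashes with a
    different multiplier. Sizes and the amplification factor are illustrative."""

    def __init__(self, dim: int, table_sizes=(1 << 20, 1 << 20),
                 sub_dim: int = 128, amplification: float = 4.0):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(s, sub_dim) for s in table_sizes)
        self.projs = nn.ModuleList(nn.Linear(sub_dim, dim, bias=False)
                                   for _ in table_sizes)
        self.table_sizes = table_sizes
        self.seeds = (1000003, 998244353)  # one hash multiplier per sub-table
        self.amplification = amplification

    def forward(self, ngram_keys: torch.Tensor) -> torch.Tensor:
        # ngram_keys: integer N-gram keys per position, shape (batch, seq).
        out = 0
        for table, proj, size, seed in zip(self.tables, self.projs,
                                           self.table_sizes, self.seeds):
            ids = (ngram_keys * seed) % size   # per-table hash and bucket
            out = out + proj(table(ids))       # sub_dim -> dim projection
        # Amplify so the signal stays effective through residual paths.
        return self.amplification * out
```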
System Co-Design for Speed
Despite the large total parameter count, LongCat-Flash-Lite combines sparse activation with system-level optimizations to convert theoretical sparsity gains into real throughput.
- Parameter allocation: shift parameters into O(1) embedding lookups, adding capacity without growing per-token compute or expert communication overhead
- N-gram Cache + kernel fusion: GPU-managed N-gram ID caching and fused CUDA kernels to reduce I/O latency and improve utilization (the caching idea is sketched after this list)
- Speculative decoding collaboration: co-designed with speculative decoding; the draft model uses standard token embeddings to avoid N-gram lookup overhead
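The host-side sketch below captures only the caching idea from the N-gram Cache bullet: keep a rolling window of the last N token IDs per sequence so each decode step hashes a constant-size window rather than re-hashing the growing prefix. In the real system this state is GPU-resident and the hash plus lookup are fused into a single kernel; the class and hash here are hypothetical.

```python
from collections import deque

class NGramIdCache:
    """Per-sequence decode-time cache of the last n token ids. Each step does
    constant work (hash a window of at most n ids) instead of touching the
    full prefix. Illustrative only: the production version lives on the GPU
    with fused hash + lookup kernels."""

    def __init__(self, n: int, table_size: int, seed: int = 1000003):
        self.n = n
        self.table_size = table_size
        self.seed = seed
        self.window = deque(maxlen=n)  # last n token ids

    def step(self, token_id: int) -> int:
        # Append the new token, then hash the window into an N-gram bucket id.
        self.window.append(token_id)
        h = 0
        for t in self.window:
            h = (h * self.seed + t) % self.table_size
        return h

# Usage during decoding (token ids are made up):
cache = NGramIdCache(n=3, table_size=1 << 20)
for tok in (17, 402, 9981):
    ngram_id = cache.step(tok)  # feeds the N-gram embedding lookup
```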
Benchmark Highlights
Agentic Tool Use
- τ²-Bench: Telecom 72.8, Retail 73.1, Aviation 58.0 (highest among compared models)
- VitaBench: 7.0 (leading)
Coding
- SWE-Bench: 54.4% (repository-level issue fixing)
- TerminalBench: 33.75 (terminal command execution)
- SWE-Bench Multilingual: 38.10%
General Knowledge & Reasoning
- MMLU: 85.52
- C-Eval / CMMLU: 86.55 / 82.48
- MMLU-Pro / GPQA-Diamond: 78.29 / 66.78
- MATH500: 96.80%
- AIME24 / AIME25: 72.19 / 63.23