LongCat-Flash-Lite
N-gram embedding expansion for lightweight MoE (Released: 2026)
Overview
LongCat-Flash-Lite is a lightweight Mixture-of-Experts (MoE) model that explores a new scaling direction: N-gram embedding expansion. Instead of relying primarily on adding more experts, it allocates a large share of its total parameters to an N-gram embedding layer that improves local-context semantic capture, while keeping inference sparse via dynamic activation.
Key Specs
- Total parameters: 68.5B
- Activated parameters per token: ~2.9B–4.5B
- Embedding allocation: 31.4B (46%) to N-gram embedding layer
- Context length: up to 256K tokens (via YaRN)
- Throughput: 500–700 tokens/s (typical load: 4K input / 1K output, LongCat API)
- Strengths: Agentic tool use and coding
Technology Highlights
N-gram Embedding Layer
The N-gram embedding layer enhances the model’s ability to capture local-context semantics. A hash function maps the current token together with its preceding N-1 tokens to an index into an N-gram embedding table; the looked-up vector is then fused with the token’s base embedding.
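Below is a minimal PyTorch sketch of this mechanism. The rolling polynomial hash, the window length N, the table size, and the additive fusion are illustrative assumptions, not the model’s published choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramEmbedding(nn.Module):
    """Fuse a hashed N-gram embedding with the standard token embedding."""

    def __init__(self, vocab_size: int, dim: int, ngram_table_size: int, n: int = 3):
        super().__init__()
        self.n = n
        self.table_size = ngram_table_size
        self.base = nn.Embedding(vocab_size, dim)         # per-token embedding
        self.ngram = nn.Embedding(ngram_table_size, dim)  # hashed N-gram table

    def ngram_ids(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Roll a polynomial hash over the current token and its N-1
        # predecessors; positions before the sequence start are padded with 0.
        h = torch.zeros_like(token_ids)
        for k in range(self.n):
            prev = F.pad(token_ids, (k, 0))[:, : token_ids.size(1)]  # token at t-k
            h = (h * 1000003 + prev) % self.table_size  # mod keeps ids in range
        return h

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq) -> (batch, seq, dim); simple addition stands in for
        # whatever fusion rule the model actually uses.
        return self.base(token_ids) + self.ngram(self.ngram_ids(token_ids))
```

Because the N-gram path is a single table lookup, its per-token cost stays O(1) however large the table grows, which is what makes the embedding layer a cheap place to park capacity.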
To reduce hash collisions, the design combines several measures (a code sketch follows the list):
- Sub-table decomposition + linear projection: split one large embedding table into several smaller sub-tables with independent hashes, projecting each separately, so a collision in one sub-table is unlikely to recur in another
- Vocabulary size design: carefully select table sizes to lower collision probability
- Embedding amplification: scale or normalize the N-gram embedding before it joins the output so the signal stays effective through residual paths
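The sketch below illustrates the first and third measures under assumed sizes: several hashed sub-tables with distinct hash multipliers, each followed by its own linear projection, plus a scalar amplification before the result joins the residual stream. All sizes, seeds, and the scale value are hypothetical.

```python
import torch
import torch.nn as nn

class SubTableNGramEmbedding(nn.Module):
    """Sub-table decomposition: a key that collides in one sub-table is
    unlikely to collide in another, because each sub-table hashes with a
    different multiplier. Sizes and the amplification factor are illustrative."""

    def __init__(self, dim: int, table_sizes=(1 << 20, 1 << 20),
                 sub_dim: int = 128, amplification: float = 4.0):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(s, sub_dim) for s in table_sizes)
        self.projs = nn.ModuleList(nn.Linear(sub_dim, dim, bias=False)
                                   for _ in table_sizes)
        self.table_sizes = table_sizes
        self.seeds = (1000003, 998244353)  # one hash multiplier per sub-table
        self.amplification = amplification

    def forward(self, ngram_keys: torch.Tensor) -> torch.Tensor:
        # ngram_keys: integer N-gram keys per position, shape (batch, seq).
        out = 0
        for table, proj, size, seed in zip(self.tables, self.projs,
                                           self.table_sizes, self.seeds):
            ids = (ngram_keys * seed) % size   # per-table hash and bucket
            out = out + proj(table(ids))       # sub_dim -> dim projection
        # Amplify so the signal stays effective through residual paths.
        return self.amplification * out
```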
System Co-Design for Speed
Despite the large total parameter count, LongCat-Flash-Lite combines sparse activation with system-level optimizations to convert theoretical sparsity gains into real throughput.
- Parameter allocation: shift parameters into O(1) embedding lookups, adding capacity without growing per-token compute or expert communication overhead
- N-gram Cache + kernel fusion: GPU-managed N-gram ID caching and fused CUDA kernels to reduce I/O latency and improve utilization (the caching idea is sketched after this list)
- Speculative decoding collaboration: co-designed with speculative decoding; the draft model uses standard token embeddings to avoid N-gram lookup overhead
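The host-side sketch below captures only the caching idea from the N-gram Cache bullet: keep a rolling window of the last N token IDs per sequence so each decode step hashes a constant-size window rather than re-hashing the growing prefix. In the real system this state is GPU-resident and the hash plus lookup are fused into a single kernel; the class and hash here are hypothetical.

```python
from collections import deque

class NGramIdCache:
    """Per-sequence decode-time cache of the last n token ids. Each step does
    constant work (hash a window of at most n ids) instead of touching the
    full prefix. Illustrative only: the production version lives on the GPU
    with fused hash + lookup kernels."""

    def __init__(self, n: int, table_size: int, seed: int = 1000003):
        self.n = n
        self.table_size = table_size
        self.seed = seed
        self.window = deque(maxlen=n)  # last n token ids

    def step(self, token_id: int) -> int:
        # Append the new token, then hash the window into an N-gram bucket id.
        self.window.append(token_id)
        h = 0
        for t in self.window:
            h = (h * self.seed + t) % self.table_size
        return h

# Usage during decoding (token ids are made up):
cache = NGramIdCache(n=3, table_size=1 << 20)
for tok in (17, 402, 9981):
    ngram_id = cache.step(tok)  # feeds the N-gram embedding lookup
```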
Benchmark Highlights
Agentic Tool Use
- τ²-Bench: Telecom 72.8, Retail 73.1, Aviation 58.0 (highest among compared models)
- VitaBench: 7.0 (leading)
Coding
- SWE-Bench: 54.4% (repository-level issue fixing)
- TerminalBench: 33.75 (terminal command execution)
- SWE-Bench Multilingual: 38.10%
General Knowledge & Reasoning
- MMLU: 85.52
- C-Eval / CMMLU: 86.55 / 82.48
- MMLU-Pro / GPQA-Diamond: 78.29 / 66.78
- MATH500: 96.80%
- AIME24 / AIME25: 72.19 / 63.23