Lightweight MoE LLM Inference
Sparse activation and N-gram embeddings for cost-efficient open-source inference.
The problem
Large language models are expensive to serve: in a dense model, per-token compute scales with the total parameter count. Mixture-of-Experts (MoE) models activate only a subset of experts per token, which cuts compute, but every expert's weights must still be loaded, so many MoE designs still require heavy infrastructure for hundred-billion-scale checkpoints.
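To make "activate only a subset of experts per token" concrete, here is a minimal sketch of generic top-k expert routing. It is an illustration only, not LongCat's router: the function names, the toy scalar experts, and the choice of k=2 are all assumptions.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    # Indices of the k largest logits, highest first (ties keep index order).
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts for one token and mix their outputs."""
    routes = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in routes)

# Toy example (hypothetical): four scalar "experts", only two run per token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
output = moe_forward(3.0, experts, logits, k=2)
```

The two unselected experts are never called, which is where the compute saving comes from; their weights still have to exist somewhere, which is why memory does not shrink with them.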
LongCat-Flash-Lite approach
LongCat-Flash-Lite targets developers who need strong agentic and coding performance with lighter inference than the full 560B Flash-Chat model:
- 68.5B total parameters; ~2.9B–4.5B activated per token
- N-gram embedding expansion: 31.4B params in a hash-based N-gram layer for local context
- 256K-token context via YaRN extension
- Typical throughput: 500–700 tokens/s (4K-token input / 1K-token output, API reference load)
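
The hash-based N-gram layer listed above can be pictured as a large lookup table indexed by a hash of the most recent tokens. The sketch below shows that general idea only; the table size, embedding width, CRC32 hash, and bigram window are assumptions chosen for illustration and are not taken from the model.

```python
import random
import zlib

NUM_BUCKETS = 1024  # hypothetical table size
DIM = 8             # hypothetical embedding width

# Stand-in for a learned table: one vector per hash bucket.
rng = random.Random(0)
ngram_table = [[rng.uniform(-1, 1) for _ in range(DIM)]
               for _ in range(NUM_BUCKETS)]

def ngram_bucket(ngram, num_buckets=NUM_BUCKETS):
    """Map a tuple of token ids to a table row via a deterministic hash."""
    key = ",".join(map(str, ngram)).encode("utf-8")
    return zlib.crc32(key) % num_buckets

def ngram_vectors(token_ids, n=2):
    """Look up one table row for every full n-gram in the sequence."""
    rows = []
    for end in range(n, len(token_ids) + 1):
        ngram = tuple(token_ids[end - n:end])
        rows.append(ngram_table[ngram_bucket(ngram)])
    return rows

# The bigram (17, 42) occurs twice and resolves to the same row both times.
vectors = ngram_vectors([17, 42, 17, 42], n=2)
```

A table like this holds `NUM_BUCKETS * DIM` parameters no matter how many distinct N-grams occur, and each lookup is a single indexed read rather than a matrix multiply. That is one plausible reading of how a large share of parameters can sit in such a layer while the activated count per token stays small.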
Flash-Lite vs Flash-Chat
| Model | Total params | Activated / token | Best for |
|---|---|---|---|
| Flash-Lite | 68.5B | ~2.9B–4.5B | Efficient agents, coding, long context |
| Flash-Chat | 560B | ~18.6B–31.3B (avg ~27B) | Maximum capability, multi-node serving |
For Flash-Chat production serving, see the vLLM and SGLang guides.
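The parameter counts in the comparison table translate into rough sizing numbers. The sketch below computes the fraction of parameters active per token and an approximate weight footprint. The 2 bytes per parameter figure is an assumption (16-bit weights), and the estimate ignores the KV cache, activations, and any quantization.

```python
BYTES_PER_PARAM = 2  # assumption: 16-bit weights, no quantization

# Figures copied from the comparison table (billions of parameters).
models = {
    "Flash-Lite": {"total_b": 68.5, "active_b": (2.9, 4.5)},
    "Flash-Chat": {"total_b": 560.0, "active_b": (18.6, 31.3)},
}

def active_fraction(total_b, active_b):
    """Return the (low, high) share of parameters activated per token."""
    low, high = active_b
    return low / total_b, high / total_b

def weight_gb(total_b, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_b * bytes_per_param

for name, spec in models.items():
    low, high = active_fraction(spec["total_b"], spec["active_b"])
    print(f"{name}: {low:.1%} to {high:.1%} active per token, "
          f"about {weight_gb(spec['total_b']):.0f} GB of weights")
```

Under these assumptions Flash-Lite needs roughly 137 GB for weights and Flash-Chat roughly 1,120 GB, which is consistent with the table's note that Flash-Chat calls for multi-node serving.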