Lightweight MoE LLM Inference

Sparse activation and N-gram embeddings for cost-efficient open-source inference.

The problem

Large language models are expensive to serve because, in a dense model, compute per token scales with total parameter count. Mixture-of-Experts (MoE) models activate only a subset of experts per token, but every expert must still be held in memory, so many MoE designs still require heavy infrastructure for hundred-billion-scale checkpoints.
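To make "activate only a subset of experts per token" concrete, here is a minimal sketch of generic top-k expert routing. It is not LongCat's router: the scalar toy experts, the logits, and the names route_top_k and moe_forward are assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and combine their outputs by weight."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Hypothetical toy setup: 8 scalar "experts", 2 of them active for this token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

selected = route_top_k(logits, k=2)
y = moe_forward(3.0, experts, logits, k=2)
print(selected, y)
```

Only two of the eight expert functions run for this token, which is the compute saving. All eight still have to exist in memory, which is why large MoE checkpoints remain heavy to host.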

LongCat-Flash-Lite approach

LongCat-Flash-Lite targets developers who need strong agentic and coding performance with lighter inference than the full 560B Flash-Chat model:

68.5B total parameters; ~2.9B–4.5B activated per token
N-gram embedding expansion: 31.4B params in a hash-based N-gram layer for local context
256K context via YaRN extension
Typical throughput: 500–700 tokens/s (4K-token input, 1K-token output, API reference load)
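The hash-based N-gram layer in the list above can be pictured as a lookup table indexed by a hash of the last few token ids. The sketch below is a generic illustration under stated assumptions, not the model's implementation: the table size, embedding width, BLAKE2 hash, and bigram window are all hypothetical choices.

```python
import hashlib
import random

TABLE_SIZE = 1024   # hypothetical number of hash buckets
DIM = 4             # hypothetical embedding width

def bucket(ngram):
    """Map a tuple of token ids to a stable bucket index in [0, TABLE_SIZE)."""
    key = ",".join(map(str, ngram)).encode("utf-8")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % TABLE_SIZE

def make_table(seed=0):
    """Build a randomly initialized embedding table (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(TABLE_SIZE)]

def ngram_embeddings(token_ids, table, n=2):
    """For each position, look up the embedding of the n-gram ending there.

    Positions near the start use a shorter n-gram because less context exists.
    """
    out = []
    for pos in range(len(token_ids)):
        start = max(0, pos - n + 1)
        ngram = tuple(token_ids[start:pos + 1])
        out.append(table[bucket(ngram)])
    return out

table = make_table()
tokens = [17, 42, 42, 7]          # hypothetical token ids
vectors = ngram_embeddings(tokens, table, n=2)
print(len(vectors), len(vectors[0]))
```

Each position costs one hash and one row read no matter how large the table is, which suggests how a layer like this can hold many parameters while adding little per-token compute. Distinct N-grams can collide in the same bucket; how a real model handles collisions is not covered by this sketch.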

Flash-Lite vs Flash-Chat

Model	Total params	Activated / token	Best for
Flash-Lite	68.5B	~2.9B–4.5B	Efficient agents, coding, long context
Flash-Chat	560B	~18.6B–31.3B (avg ~27B)	Maximum capability, multi-node serving
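As a quick check on the table, the share of parameters active per token can be computed directly from its figures. The dictionary layout and the function name active_share are illustrative; the numbers are the ones quoted above.

```python
# Parameter counts in billions, taken from the comparison table.
MODELS = {
    "Flash-Lite": {"total_b": 68.5, "active_b": (2.9, 4.5)},
    "Flash-Chat": {"total_b": 560.0, "active_b": (18.6, 31.3)},
}

def active_share(total_b, active_b):
    """Return the (low, high) fraction of parameters activated per token."""
    low, high = active_b
    return low / total_b, high / total_b

for name, m in MODELS.items():
    low, high = active_share(m["total_b"], m["active_b"])
    print(f"{name}: {low:.1%} to {high:.1%} of parameters active per token")

# Flash-Lite: 4.2% to 6.6% of parameters active per token
# Flash-Chat: 3.3% to 5.6% of parameters active per token
```

Both models activate a similar small fraction of their weights. The difference in serving cost comes mainly from absolute size: 68.5B versus 560B parameters to host, and roughly 3B–4.5B versus 19B–31B parameters to compute with per token.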

For Flash-Chat production serving, see the vLLM and SGLang guides.
