Lightweight MoE LLM Inference

Sparse activation and N-gram embeddings for cost-efficient open-source inference.

The problem

Large language models are expensive to serve because, in a dense model, per-token compute scales with total parameter count. Mixture-of-Experts (MoE) models activate only a subset of experts per token, which reduces compute, but the full checkpoint must still be loaded and served, so many MoE designs still require heavy infrastructure at the hundred-billion-parameter scale.
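
As a rough illustration of "activate only a subset of experts per token", here is a minimal top-k routing sketch in plain Python. It is a generic MoE pattern, not LongCat's actual router: the expert count, the value of k, and the toy "experts" are all assumptions made for the example.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Mix the outputs of the selected experts; the others are never evaluated."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup (hypothetical): 8 "experts" that simply scale their input by 1..8.
experts = [lambda v, s=s: [s * vi for vi in v] for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

routes = top_k_route(logits, k=2)
y = moe_forward([1.0, 2.0], experts, logits, k=2)
print(routes)
print(y)
```

Only two of the eight toy experts run for this token, which is where the compute saving comes from. All eight still have to exist in memory, which is why total parameter count continues to drive serving infrastructure.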

LongCat-Flash-Lite approach

LongCat-Flash-Lite targets developers who need strong agentic and coding performance with lighter inference requirements than the full 560B Flash-Chat model:

  • 68.5B total parameters; ~2.9B–4.5B activated per token
  • N-gram embedding expansion: 31.4B params in a hash-based N-gram layer for local context
  • 256K context via YaRN extension
  • Typical throughput: 500–700 tokens/s (4K input / 1K output tokens, API reference load)
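
To make the "hash-based N-gram layer" idea concrete, the sketch below hashes the n-gram of token IDs ending at each position into a fixed-size embedding table. This is an assumed, simplified scheme for illustration only: the table size, embedding width, n-gram order, hash function, and table values are all hypothetical, and the source does not describe how LongCat-Flash-Lite actually builds or combines these embeddings.

```python
import hashlib

NUM_BUCKETS = 1024  # hypothetical table size; a real table would be far larger
DIM = 4             # hypothetical embedding width
N = 2               # hypothetical n-gram order (bigrams)

def bucket(ngram, num_buckets=NUM_BUCKETS):
    """Map a tuple of token IDs to a table row using a stable hash."""
    data = ",".join(map(str, ngram)).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets

def make_table(rows, dim):
    """Deterministic toy values; a trained model would learn these."""
    return [[((r * 31 + c * 17) % 13) / 13.0 for c in range(dim)]
            for r in range(rows)]

ngram_table = make_table(NUM_BUCKETS, DIM)

def ngram_embeddings(token_ids, n=N):
    """For each position, look up the embedding of the n-gram ending there."""
    out = []
    for t in range(len(token_ids)):
        if t + 1 < n:
            out.append([0.0] * DIM)  # not enough left context yet
        else:
            ngram = tuple(token_ids[t - n + 1 : t + 1])
            out.append(ngram_table[bucket(ngram)])
    return out

tokens = [101, 7, 42, 7, 42]
embs = ngram_embeddings(tokens)
# Positions 2 and 4 both end the bigram (7, 42), so they share a table row.
print(embs[2] == embs[4])
```

In a scheme like this, each position costs one hash and one table read regardless of how many rows the table has, which suggests how a large share of parameters can sit in an embedding layer without raising per-token compute proportionally.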

Flash-Lite vs Flash-Chat

Model       Total params   Activated / token          Best for
Flash-Lite  68.5B          ~2.9B–4.5B                 Efficient agents, coding, long context
Flash-Chat  560B           ~18.6B–31.3B (avg ~27B)    Maximum capability, multi-node serving
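
The comparison above can be turned into activation ratios with a few lines of arithmetic. The snippet below is only a worked calculation over the figures already listed; the dictionary layout and function names are made up for the example.

```python
# Figures copied from the comparison above, in billions of parameters.
MODELS = {
    "Flash-Lite": {"total": 68.5, "active_min": 2.9, "active_max": 4.5},
    "Flash-Chat": {"total": 560.0, "active_min": 18.6, "active_max": 31.3},
}

def active_fraction(total, active):
    """Share of all parameters that participate in one token's forward pass."""
    return active / total

def summarize(models):
    """Return {model: (min_fraction, max_fraction)} for each entry."""
    return {
        name: (
            active_fraction(m["total"], m["active_min"]),
            active_fraction(m["total"], m["active_max"]),
        )
        for name, m in models.items()
    }

summary = summarize(MODELS)
for name, (low, high) in summary.items():
    print(f"{name}: {low:.1%} to {high:.1%} of parameters active per token")

size_ratio = MODELS["Flash-Chat"]["total"] / MODELS["Flash-Lite"]["total"]
print(f"Flash-Chat is about {size_ratio:.1f}x larger in total parameters")
```

Both models activate only a few percent of their weights per token. The total checkpoint that has to be stored and served is roughly eight times smaller for Flash-Lite.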

For production serving of Flash-Chat, see the vLLM and SGLang guides.