Deploy with vLLM

High-throughput inference for LongCat-Flash MoE models

Overview

vLLM includes LongCat-Flash adaptations for expert-parallel MoE serving (upstream PR). LongCat-Flash-Chat has 560B total parameters; plan hardware accordingly:

  • FP8 weights: at least one 8×GPU node (e.g. 8× H20 141G)
  • BF16 weights: at least two 8×GPU nodes (e.g. 16× H800 80G)

Official source: LongCat-Flash deployment guide

Single-node deployment (FP8)

Tensor parallelism + expert parallelism on one node:

vllm serve meituan-longcat/LongCat-Flash-Chat-FP8 \
    --trust-remote-code \
    --enable-expert-parallel \
    --tensor-parallel-size 8

Server exposes an OpenAI-compatible API (default http://localhost:8000/v1).

Multi-node deployment (BF16)

Two-node example with data parallelism across workers:

Node 0 (coordinator)

vllm serve meituan-longcat/LongCat-Flash-Chat \
   --trust-remote-code \
   --tensor-parallel-size 8 \
   --data-parallel-size 2 \
   --data-parallel-size-local 1 \
   --data-parallel-address $MASTER_IP \
   --data-parallel-rpc-port 13345

Node 1 (worker)

vllm serve meituan-longcat/LongCat-Flash-Chat \
   --trust-remote-code \
   --tensor-parallel-size 8 \
   --headless \
   --data-parallel-size 2 \
   --data-parallel-size-local 1 \
   --data-parallel-start-rank 1 \
   --data-parallel-address $MASTER_IP \
   --data-parallel-rpc-port 13345

Replace $MASTER_IP with the coordinator IP. Only worker nodes use --data-parallel-start-rank.

Enable Multi-Token Prediction (MTP)

Add speculative decoding for higher throughput:

vllm serve meituan-longcat/LongCat-Flash-Chat-FP8 \
    --trust-remote-code \
    --enable-expert-parallel \
    --tensor-parallel-size 8 \
    --speculative_config '{"model": "meituan-longcat/LongCat-Flash-Chat", "num_speculative_tokens": 1, "method":"longcat_flash_mtp"}'

Smaller models

LongCat-Flash-Lite and LongCat-Flash-Prover use fewer GPUs. Start with lower --tensor-parallel-size and omit expert-parallel flags if the model card does not require them.