Overview
vLLM includes LongCat-Flash adaptations for expert-parallel MoE serving (upstream PR). LongCat-Flash-Chat has 560B total parameters; plan hardware accordingly:
- FP8 weights: at least one 8×GPU node (e.g. 8× H20 141G)
- BF16 weights: at least two 8×GPU nodes (e.g. 16× H800 80G)
Official source: LongCat-Flash deployment guide
Single-node deployment (FP8)
Tensor parallelism + expert parallelism on one node:
vllm serve meituan-longcat/LongCat-Flash-Chat-FP8 \
--trust-remote-code \
--enable-expert-parallel \
--tensor-parallel-size 8
Server exposes an OpenAI-compatible API (default http://localhost:8000/v1).
Multi-node deployment (BF16)
Two-node example with data parallelism across workers:
Node 0 (coordinator)
vllm serve meituan-longcat/LongCat-Flash-Chat \
--trust-remote-code \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $MASTER_IP \
--data-parallel-rpc-port 13345
Node 1 (worker)
vllm serve meituan-longcat/LongCat-Flash-Chat \
--trust-remote-code \
--tensor-parallel-size 8 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $MASTER_IP \
--data-parallel-rpc-port 13345
Replace $MASTER_IP with the coordinator IP. Only worker nodes use --data-parallel-start-rank.
Enable Multi-Token Prediction (MTP)
Add speculative decoding for higher throughput:
vllm serve meituan-longcat/LongCat-Flash-Chat-FP8 \
--trust-remote-code \
--enable-expert-parallel \
--tensor-parallel-size 8 \
--speculative_config '{"model": "meituan-longcat/LongCat-Flash-Chat", "num_speculative_tokens": 1, "method":"longcat_flash_mtp"}'
Smaller models
LongCat-Flash-Lite and
LongCat-Flash-Prover use fewer GPUs.
Start with lower --tensor-parallel-size and omit expert-parallel flags if the model card does not require them.