Overview
SGLang supports LongCat-Flash via expert-parallel MoE and flashinfer (upstream PR). Hardware requirements match vLLM: FP8 on one 8-GPU node; BF16 typically needs two nodes.
Single-node deployment (FP8)
python3 -m sglang.launch_server \
--model meituan-longcat/LongCat-Flash-Chat-FP8 \
--trust-remote-code \
--attention-backend flashinfer \
--enable-ep-moe \
--tp 8
Multi-node deployment (BF16)
python3 -m sglang.launch_server \
--model meituan-longcat/LongCat-Flash-Chat \
--trust-remote-code \
--attention-backend flashinfer \
--enable-ep-moe \
--tp 16 \
--nnodes 2 \
--node-rank $NODE_RANK \
--dist-init-addr $MASTER_IP:5000
Set $NODE_RANK (0 on master, 1 on worker) and $MASTER_IP to your cluster coordinator.
Enable Multi-Token Prediction (MTP)
Append these flags to the launch command:
--speculative-draft-model-path meituan-longcat/LongCat-Flash-Chat \
--speculative-algorithm NEXTN \
--speculative-num-draft-tokens 2 \
--speculative-num-steps 1 \
--speculative-eagle-topk 1