Deploy with SGLang

Expert-parallel MoE serving with flashinfer attention

Overview

SGLang supports LongCat-Flash via expert-parallel MoE and flashinfer (upstream PR). Hardware requirements match vLLM: FP8 on one 8-GPU node; BF16 typically needs two nodes.

Single-node deployment (FP8)

python3 -m sglang.launch_server \
    --model meituan-longcat/LongCat-Flash-Chat-FP8 \
    --trust-remote-code \
    --attention-backend flashinfer \
    --enable-ep-moe \
    --tp 8

Multi-node deployment (BF16)

python3 -m sglang.launch_server \
    --model meituan-longcat/LongCat-Flash-Chat \
    --trust-remote-code \
    --attention-backend flashinfer \
    --enable-ep-moe \
    --tp 16 \
    --nnodes 2 \
    --node-rank $NODE_RANK \
    --dist-init-addr $MASTER_IP:5000

Set $NODE_RANK (0 on master, 1 on worker) and $MASTER_IP to your cluster coordinator.

Enable Multi-Token Prediction (MTP)

Append these flags to the launch command:

    --speculative-draft-model-path meituan-longcat/LongCat-Flash-Chat \
    --speculative-algorithm NEXTN \
    --speculative-num-draft-tokens 2 \
    --speculative-num-steps 1 \
    --speculative-eagle-topk 1