FAQ & Troubleshooting

Common issues when deploying and running LongCat models

General

Which model should I start with?

For local experimentation with limited GPU memory, start with Flash-Lite. For production LLM serving at scale, use Flash-Chat with vLLM or SGLang. For video or image generation, see the respective model pages.

Can I load LongCat-Flash-Chat (560B) with Transformers on one GPU?

No. The full Flash-Chat checkpoint requires multi-GPU expert-parallel serving. Use the FP8 weights on at least one 8-GPU node, or the BF16 weights across two nodes. See the deployment guides for details.
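
A rough back-of-the-envelope estimate shows why a single GPU cannot hold the checkpoint. The sketch below counts weight storage only, taking the parameter count from the "560B" in the model name. The per-GPU capacity GPU_MEMORY_GB is a hypothetical value chosen for illustration, not a figure from the deployment guides. KV cache, activations, and framework overhead come on top of this.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


PARAMS = 560e9  # from the "560B" in the model name

fp8_gb = weight_memory_gb(PARAMS, 1)   # FP8 stores 1 byte per parameter
bf16_gb = weight_memory_gb(PARAMS, 2)  # BF16 stores 2 bytes per parameter

# Hypothetical per-GPU capacity; replace with your hardware's real value.
GPU_MEMORY_GB = 80

print(f"FP8 weights:  {fp8_gb:.0f} GB")
print(f"BF16 weights: {bf16_gb:.0f} GB")
print("GPUs needed for FP8 weights alone:", math.ceil(fp8_gb / GPU_MEMORY_GB))
print("GPUs needed for BF16 weights alone:", math.ceil(bf16_gb / GPU_MEMORY_GB))
```

Under this assumption the weights alone fill most of an 8-GPU node in FP8 and exceed it in BF16, which is consistent with the node counts given above.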

Deployment (vLLM / SGLang)

CUDA out-of-memory (OOM) at startup

  • Use the FP8 checkpoint (LongCat-Flash-Chat-FP8) instead of BF16
  • Increase --tensor-parallel-size (vLLM) or --tp (SGLang) to spread weights across more GPUs
  • Enable expert parallelism: --enable-expert-parallel (vLLM) or --enable-ep-moe (SGLang)
  • Close other GPU processes; verify nvidia-smi shows expected free memory
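
The nvidia-smi check in the last item can be scripted. The following is a minimal sketch that parses the output of nvidia-smi's CSV query mode and lists GPUs that already hold memory before the server starts. The helper names and the 1024 MiB threshold are illustrative choices, and SAMPLE is made-up output used only to show the parsing.

```python
import subprocess

# Query mode prints one CSV row per GPU; with "nounits" the values are MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Turn 'index, used, total' rows into dicts with free memory in MiB."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({
            "index": int(index),
            "used_mib": int(used),
            "free_mib": int(total) - int(used),
        })
    return rows


def busy_gpus(rows: list[dict], max_used_mib: int = 1024) -> list[int]:
    """Return indices of GPUs that already use more than max_used_mib."""
    return [row["index"] for row in rows if row["used_mib"] > max_used_mib]


def query_gpus() -> list[dict]:
    """Run nvidia-smi on this machine and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Illustrative sample only: GPU 1 is occupied by another process.
SAMPLE = "0, 512, 81920\n1, 40000, 81920\n"
print(busy_gpus(parse_gpu_memory(SAMPLE)))  # prints [1]
```

On a real node, calling busy_gpus(query_gpus()) before launch should return an empty list if no other process is holding GPU memory.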

trust-remote-code required

LongCat models use custom architecture code on Hugging Face. Always pass --trust-remote-code (CLI) or trust_remote_code=True (Python).
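
As an illustration of the CLI form, the sketch below assembles a vLLM launch command that always includes --trust-remote-code, along with the parallelism flags mentioned in the OOM section. The helper vllm_serve_command is hypothetical, and MODEL is shown without an organization prefix because this page does not state one. Use the Hub ID or local path from the deployment guide.

```python
import shlex


def vllm_serve_command(
    model: str,
    tensor_parallel_size: int,
    expert_parallel: bool = True,
) -> list[str]:
    """Build an argument list for `vllm serve` with remote code enabled."""
    cmd = [
        "vllm", "serve", model,
        "--trust-remote-code",  # required: LongCat ships custom model code
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if expert_parallel:
        cmd.append("--enable-expert-parallel")
    return cmd


# Assumption: replace with the full Hub ID or a local checkpoint path.
MODEL = "LongCat-Flash-Chat-FP8"

# Print a shell-safe command line; pass the list to subprocess.run to launch.
print(shlex.join(vllm_serve_command(MODEL, tensor_parallel_size=8)))
```

Building the command as a list keeps each flag as a separate argument, so a forgotten --trust-remote-code shows up in code review instead of as a load failure at startup.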

Server starts but requests fail / empty responses

  • Verify that your client applies the model's chat template, including the prefix it expects
  • Confirm the served model name matches the checkpoint ID in your API call
  • Check server logs for tokenizer or max-model-len errors
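
A quick way to confirm the served model name is to ask the server which names it exposes. This sketch assumes the server provides the OpenAI-compatible /v1/models endpoint, as vLLM and SGLang do in their OpenAI-compatible modes. The function names, base URL, and sample payload are illustrative.

```python
import json
import urllib.request


def served_model_ids(models_payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_served_model_ids(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query a running server, e.g. base_url='http://localhost:8000'."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))


def check_model_name(requested: str, served: list[str]) -> str:
    """Report whether the name used in API calls matches a served model."""
    if requested in served:
        return "ok"
    return f"model {requested!r} is not served; available: {served}"


# Illustrative payload in the OpenAI list format (not real server output).
sample = {"object": "list", "data": [{"id": "LongCat-Flash-Chat-FP8", "object": "model"}]}
print(check_model_name("LongCat-Flash-Chat", served_model_ids(sample)))
```

Against a live server, pass the result of fetch_served_model_ids(...) to check_model_name with the exact string your client sends in the "model" field.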

Multi-node jobs hang at initialization

  • Ensure all nodes can reach $MASTER_IP on the RPC port (vLLM: 13345; SGLang: 5000)
  • Set --node-rank uniquely per node (SGLang)
  • Only worker nodes should set --data-parallel-start-rank (vLLM)
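
Reachability of the head node can be tested from each worker before launching. The sketch below attempts a plain TCP connection. It only succeeds once a process on the head node is listening on that port, so run it after the head node has started. The function name can_connect is an illustrative choice.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, on a worker node run can_connect(os.environ["MASTER_IP"], 13345) for vLLM or can_connect(os.environ["MASTER_IP"], 5000) for SGLang. A False result usually points to a firewall rule, a wrong address, or a head node that has not started yet.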

Flash-Prover (Lean4)

Proof verification fails

Flash-Prover requires a running Lean4 server for syntax and compilation checks. See the Flash-Prover repository for server setup and TIR (Tool-Integrated Reasoning) configuration.

Low pass rate vs. published benchmarks

Reported MiniF2F scores use specific attempt budgets (e.g. 72 attempts with TIR). Match inference strategy, temperature, and verifier settings from the technical report before comparing.
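
To see how strongly the attempt budget affects a reported pass rate, the sketch below implements the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n attempts were sampled for a problem and c of them verified. This page does not state which estimator the technical report uses, so treat this as an illustration of the budget effect and not as the report's exact metric.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts passes.

    n: attempts sampled for one problem
    c: attempts among those n that passed verification
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One passing attempt out of 72 samples: compare small and full budgets.
for budget in (1, 8, 72):
    print(budget, round(pass_at_k(n=72, c=1, k=budget), 3))
```

A problem solved once in 72 samples counts as solved at a 72-attempt budget but rarely at a budget of 1, which is why results measured under different budgets are not directly comparable.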

Video & Image models

Video-Avatar out of VRAM

Use DMD 8-step inference and LoRA adapters, as documented on the Video-Avatar 1.5 model page and in the LongCat-Video GitHub repository.

Chinese text renders incorrectly in Image model

LongCat-Image is optimized for Chinese glyph rendering. Use prompts that explicitly describe text content; see the Image model page and Chinese text generation guide.