Python Quick Start

Run your first LongCat inference in about 5 minutes

Prerequisites

  • Python 3.10+ and a CUDA-capable GPU (GPU memory requirements vary by model variant)
  • For the full LongCat-Flash-Chat (560B MoE): use vLLM or SGLang rather than a single-GPU Transformers load
  • For smaller models (e.g. Flash-Lite, Flash-Prover): Transformers works on fewer GPUs
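
To see why the 560B model is not a single-GPU load, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is a sketch under simple assumptions: 2 bytes per parameter for bfloat16, 1 byte for FP8, and no allowance for KV cache, activations, or framework overhead. An MoE model activates only some experts per token, but all expert weights normally still have to be resident in memory. The helper name weight_memory_gb is illustrative, not part of any library.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 560e9  # total parameter count of LongCat-Flash-Chat (560B MoE)

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)  # bfloat16: 2 bytes per parameter
fp8_gb = weight_memory_gb(TOTAL_PARAMS, 1)   # FP8: 1 byte per parameter

print(f"bf16 weights: ~{bf16_gb:.0f} GB")  # ~1120 GB
print(f"fp8 weights:  ~{fp8_gb:.0f} GB")   # ~560 GB
```

Either figure is far beyond a single GPU, which is why the full model calls for a multi-GPU serving stack.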

1. Install dependencies

pip install torch transformers accelerate huggingface_hub

For production throughput on Flash models, install vLLM or SGLang instead; see the deployment guides.
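
After installing, a quick sanity check can confirm that the interpreter is new enough and that the four packages are importable. This is a minimal sketch using only the standard library; the helper names missing_packages and python_ok are made up for illustration.

```python
import importlib.util
import sys

# Import names of the packages installed by the pip command above.
REQUIRED = ["torch", "transformers", "accelerate", "huggingface_hub"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 10)):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


print("Python >= 3.10:", python_ok())
print("Missing packages:", missing_packages(REQUIRED) or "none")
```

Once torch imports cleanly, torch.cuda.is_available() reports whether a CUDA device is visible to PyTorch.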

2. Apply the chat template

LongCat-Flash models use a custom chat template, defined in tokenizer_config.json. See the Chat Template page for the full reference.

from transformers import AutoTokenizer

model_id = "meituan-longcat/LongCat-Flash-Lite"  # example: fits smaller GPUs
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Explain mixture-of-experts in one paragraph."}]
# Use the model's chat template if available:
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Or build manually for Flash-Chat style:
# prompt = "[Round 0] USER:Explain mixture-of-experts in one paragraph. ASSISTANT:"
print(prompt)

3. Run inference (Transformers)

Suitable for smaller LongCat checkpoints. Large MoE models should use vLLM/SGLang.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meituan-longcat/LongCat-Flash-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

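# Manual Flash-Chat-style prompt; tokenizer.apply_chat_template (step 2) also works.
prompt = "[Round 0] USER:What is latent action representation? ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# generate() returns prompt + completion; slice off the prompt so only the reply is printed.
reply_ids = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(reply_ids, skip_special_tokens=True))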

4. Call via OpenAI-compatible API (vLLM / SGLang)

After starting a server (see deployment guides), use any OpenAI SDK client:

from openai import OpenAI

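# Local vLLM/SGLang servers usually accept any placeholder key unless one was configured at launch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meituan-longcat/LongCat-Flash-Chat-FP8",  # must match the model name the server is serving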
    messages=[{"role": "user", "content": "Summarize LongCat-Flash architecture."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
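
Large checkpoints can take a while to load, so it may help to wait until the server responds before sending the first request. The sketch below polls the OpenAI-style model listing endpoint with the standard library. It assumes the server exposes GET /v1/models, as OpenAI-compatible servers generally do; wait_for_server and models_endpoint_ok are hypothetical helper names, and the timeout values are arbitrary.

```python
import http.client
import time
import urllib.request


def models_endpoint_ok(base_url: str) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed response.
        return False


def wait_for_server(base_url, timeout_s=300.0, interval_s=2.0, probe=models_endpoint_ok):
    """Poll probe(base_url) until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage (assumes the local server address from the example above):
# if wait_for_server("http://localhost:8000/v1"):
#     ...  # send the chat completion request
```

The probe argument is injectable so the polling loop can be exercised without a running server.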