1. Install dependencies
pip install torch transformers accelerate huggingface_hub
For production throughput on Flash models, install vLLM or SGLang instead; see the deployment guides.
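After installing, it can help to confirm that the packages resolve in the active environment before loading a model. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration and is not part of any LongCat tooling.

```python
from importlib import metadata

# Distribution names taken from the pip command above.
REQUIRED = ["torch", "transformers", "accelerate", "huggingface_hub"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any entry reported as not installed points to a different interpreter or virtual environment than the one pip installed into.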
2. Apply the chat template
LongCat-Flash models use a custom prefix defined in tokenizer_config.json. See the full reference in Chat Template.
from transformers import AutoTokenizer
model_id = "meituan-longcat/LongCat-Flash-Lite" # example: fits smaller GPUs
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
messages = [{"role": "user", "content": "Explain mixture-of-experts in one paragraph."}]
# Use the model's chat template if available:
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Or build manually for Flash-Chat style:
# prompt = "[Round 0] USER:Explain mixture-of-experts in one paragraph. ASSISTANT:"
print(prompt)
3. Run inference (Transformers)
Suitable for smaller LongCat checkpoints. Large MoE models should use vLLM/SGLang.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "meituan-longcat/LongCat-Flash-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
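# Manual Flash-Chat style prompt; tokenizer.apply_chat_template (step 2) is the safer default.
prompt = "[Round 0] USER:What is latent action representation? ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sample up to 256 new tokens.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# outputs[0] holds the prompt tokens followed by the generated ones, so this prints both.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))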
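For a causal LM, `generate` returns the prompt tokens followed by the continuation, so decoding `outputs[0]` echoes the prompt. One common way to print only the completion is to drop the first `prompt_len` tokens before decoding. The helper name `new_tokens_only` is hypothetical, and plain lists stand in for tensors so the sketch runs without a model.

```python
def new_tokens_only(output_ids, prompt_len):
    """Return the generated part of one output sequence (prompt tokens removed)."""
    if prompt_len > len(output_ids):
        raise ValueError("prompt_len exceeds the output length")
    return output_ids[prompt_len:]


# Toy example: 3 prompt tokens followed by 2 generated tokens.
prompt_ids = [101, 7, 42]
output_ids = prompt_ids + [9, 13]
print(new_tokens_only(output_ids, len(prompt_ids)))  # [9, 13]
```

With the variables from the inference snippet, the equivalent call would be `tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)`.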
4. Call via OpenAI-compatible API (vLLM / SGLang)
After starting a server (see deployment guides), use any OpenAI SDK client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meituan-longcat/LongCat-Flash-Chat-FP8",
    messages=[{"role": "user", "content": "Summarize LongCat-Flash architecture."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
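For longer replies it is often nicer to stream tokens as they arrive. The sketch below separates building the request from sending it, so the request can be inspected without a server. The names `build_chat_request` and `stream_reply` are hypothetical, and the optional `system_text` assumes the served chat template accepts a system role, which is not confirmed here. Streaming uses the OpenAI SDK's `stream=True` option.

```python
def build_chat_request(model, user_text, system_text=None, max_tokens=512):
    """Assemble keyword arguments for client.chat.completions.create."""
    messages = []
    if system_text:
        # Assumption: the served chat template accepts a system message.
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


def stream_reply(request, base_url="http://localhost:8000/v1"):
    """Print the reply as it arrives; needs a running server and the openai package."""
    from openai import OpenAI  # imported lazily so the builder works without the SDK

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    for chunk in client.chat.completions.create(stream=True, **request):
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()


request = build_chat_request(
    "meituan-longcat/LongCat-Flash-Chat-FP8",
    "Summarize LongCat-Flash architecture.",
)
print(request["messages"])
# stream_reply(request)  # uncomment once the server is up
```

The `model` value is expected to match the name the server was started with; otherwise the server typically rejects the request.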