LongCat-Video-Avatar 1.5 Released

From high-fidelity to truly usable: commercial-grade digital human video generation, now open source.

Meituan LongCat officially open-sources LongCat-Video-Avatar 1.5, a digital human video model that advances from open-source SOTA toward commercial-grade applications. It delivers comprehensive improvements in lip synchronization, physical plausibility, long-video stability, multi-person interaction, and efficient inference—producing stable, natural, high-quality content even in complex business scenarios.

Three Capability Upgrades

Commercial-Grade Baseline Experience

With long sentences, fast speech, and singing, lip motion is more precise and smoother. Facial expressions, head pose, and body movements are better coordinated, giving a natural, stable overall performance.

Richer Open-Domain Scenes

A high-quality data system enables stable handling of real humans, anime, virtual idols, animals, and more. Multi-person dialogue is more natural, with accurate speaker/listener distinction.

Efficient Inference

DMD (Distribution Matching Distillation) reduces generation from 50 sampling steps to 8 (~15× speedup). A shared base model with LoRA adapters replaces the previous deployment of three models in parallel, reducing VRAM usage. A 10-second video generates in about 1 minute.
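The memory argument for a shared base plus LoRA adapters can be illustrated with a small parameter count. This is only a sketch in plain Python: the layer size, the rank, and the assumption of exactly one adapter per task are hypothetical and are not stated in the release. `lora_forward` shows the standard LoRA form y = W x + scale * B (A x) on a single vector.

```python
def full_copies_params(d_in, d_out, n_tasks):
    """Parameters when every task keeps its own full weight matrix."""
    return n_tasks * d_in * d_out


def shared_base_lora_params(d_in, d_out, rank, n_tasks):
    """Parameters for one shared matrix plus one low-rank (A, B) pair per task."""
    base = d_in * d_out
    adapter = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return base + n_tasks * adapter


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, scale=1.0):
    """Compute y = W x + scale * B (A x) for one input vector x."""
    base_out = matvec(W, x)
    low_rank_out = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base_out, low_rank_out)]


# Hypothetical sizes, chosen only for illustration.
D_IN, D_OUT, RANK, N_TASKS = 4096, 4096, 64, 3

full = full_copies_params(D_IN, D_OUT, N_TASKS)
shared = shared_base_lora_params(D_IN, D_OUT, RANK, N_TASKS)
print(f"three full copies:  {full:,} parameters")
print(f"shared base + LoRA: {shared:,} parameters")
print(f"ratio: {full / shared:.2f}x")
```

Under these assumed sizes the adapters add only a small fraction on top of the single base matrix, which is the effect the deployment change relies on. Actual savings depend on the real model dimensions and adapter ranks.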

Whisper-Large Audio Upgrade

The audio encoder moves from Wav2Vec2 to Whisper-large, capturing finer phoneme changes, pronunciation rhythm, and multilingual prosody. This improves lip sync and full-body temporal stability—reducing jitter, frame skips, frozen frames, and identity drift in long videos.
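To make the audio-to-video relationship concrete, the sketch below maps a video frame index to a window of Whisper encoder features. Whisper's encoder produces 1500 feature positions per 30-second window, i.e. 50 per second. The video frame rate, the context width, and the windowing scheme itself are assumptions for illustration; the release does not describe how the model consumes the features.

```python
# Whisper's encoder emits 1500 feature positions per 30 s window (50 per second).
WHISPER_FEATS_PER_SEC = 50


def audio_window_for_frame(frame_idx, fps=25, context=2, n_feats=None):
    """Return (start, end) audio-feature indices, end exclusive, for one video frame.

    fps must be an integer so that boundaries are computed exactly.
    context widens the window on both sides; its value is hypothetical.
    """
    start = (frame_idx * WHISPER_FEATS_PER_SEC) // fps - context
    # Ceiling division for the right edge of the frame's time span.
    end = -(-((frame_idx + 1) * WHISPER_FEATS_PER_SEC) // fps) + context
    start = max(start, 0)
    if n_feats is not None:
        end = min(end, n_feats)
    return start, end


for idx in range(3):
    print(idx, audio_window_for_frame(idx))
```

At an assumed 25 fps each video frame spans two feature positions, and the extra context lets neighboring phonemes influence the mouth shape for that frame.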

Data Engineering

  • Offline annotation: Face keypoints, person count, body composition, audio-visual sync
  • Online validation: Filter transitions, black frames, flicker, frame skips
  • Multi-person data: Active speaker detection for single-speaker segments
  • Silent data: Natural micro-expressions without spurious lip motion on non-speakers
  • Emotion data: Frame-level emotion recognition for speech-expression-body alignment
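The online validation step can be pictured as a per-clip filter. The sketch below is a toy version under stated assumptions: frames are flat lists of luma values in 0..255, and the thresholds and heuristics (mean brightness for black frames, mean brightness jump for flicker, mean absolute pixel change for transitions) are hypothetical rather than the pipeline's actual rules.

```python
def mean(values):
    return sum(values) / len(values)


def mean_abs_diff(a, b):
    """Average absolute per-pixel difference between two equal-size frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def validate_clip(frames, black_thresh=10.0, flicker_thresh=20.0, cut_thresh=60.0):
    """Return a list of (frame_index, issue) pairs; an empty list means the clip passes."""
    issues = []
    prev = None
    for i, frame in enumerate(frames):
        luma = mean(frame)
        if luma < black_thresh:
            issues.append((i, "black_frame"))
        if prev is not None:
            if mean_abs_diff(frame, prev) > cut_thresh:
                # Large pixel-level change: treat as a scene transition.
                issues.append((i, "transition"))
            elif abs(luma - mean(prev)) > flicker_thresh:
                # Moderate global brightness jump: treat as flicker.
                issues.append((i, "flicker"))
        prev = frame
    return issues


clip = [
    [100, 100, 100, 100],
    [100, 100, 100, 100],
    [130, 130, 130, 130],  # brightness jump
    [0, 255, 0, 255],      # abrupt content change
    [2, 2, 2, 2],          # near-black frame
]
print(validate_clip(clip))
```

A clip would be kept only when the returned list is empty. A production filter would also need frame-skip detection, which requires timestamps or motion estimates that this toy input does not carry.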

GRPO & Hand Stability

GRPO (Group Relative Policy Optimization) applies frame-level human preference alignment, correcting discontinuous motion, hand deformation, structural collapse, and expression-speech mismatch. First-frame hand detection for image-to-video and continuation tasks increases training on visible-hand samples, reducing hand artifacts in e-commerce, product showcase, and education scenarios.
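The core of GRPO is that each sampled output is scored relative to the other samples generated for the same input. A minimal sketch of that normalization follows. It assumes one scalar reward per candidate video; how frame-level preference scores are aggregated into that scalar is not described in the release, and the use of the population standard deviation is a choice made here (implementations differ).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate videos generated from the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Candidates above the group mean receive positive advantages and are reinforced, while those below are discouraged, so no separate value network is needed.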

EvalTalker Benchmark Results

The evaluation is built on EvalTalker and covers news, education, entertainment, and commercial scenarios, with 770 evaluators, 13,240 subjective scores, and 10 expert structured analyses.

User Preference

Compared against      LongCat-Video-Avatar 1.5 win rate
Kling Avatar 2.0      65.9%
OmniHuman 1.5         61.1%
HeyGen                54.3%

Key Metrics

Metric                         Result
Single-person score            3.336
Multi-person score             2.730 (vs. InfiniteTalk 2.339)
Subject deformation rate       23.1%
Background deformation rate    9.4%
Frame skip rate                0.8% (lowest)
Face-body sync issue rate      5.1% (best)
Lip sync issue rate            29.8% (best)

Open Source Resources