LongCat-Video-Avatar 1.5 Released

From high-fidelity to truly usable: commercial-grade digital human video generation, now open source.

LongCat officially open-sources LongCat-Video-Avatar 1.5, a digital human video model that advances from open-source SOTA toward commercial-grade applications. It delivers comprehensive improvements in lip synchronization, physical plausibility, long-video stability, multi-person interaction, and efficient inference—producing stable, natural, high-quality content even in complex business scenarios.

Three Capability Upgrades

Commercial-Grade Core Experience

Lip motion stays precise and smooth through long sentences, fast speech, and singing. Facial expressions, head pose, and body movements are better coordinated, giving a natural, stable overall performance.

Richer Open-Domain Scenes

A high-quality data system enables stable handling of real humans, anime, virtual idols, animals, and more. Multi-person dialogue is more natural, with accurate speaker/listener distinction.

Efficient Inference

DMD (Distribution Matching Distillation) reduces generation from 50 sampling steps to 8, for a reported ~15× speedup. A shared base model plus LoRA adapters replaces the previous parallel deployment of three models, cutting VRAM use. A 10-second video generates in about 1 minute.
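To make the two efficiency claims concrete, here is a minimal NumPy sketch, not the released implementation. It shows (a) how much of the reported ~15× the step reduction accounts for by itself, and (b) how one shared weight matrix with small low-rank LoRA updates can stand in for several full copies. The class `SharedBaseLinear`, the adapter names `task_a`/`task_b`/`task_c`, and the layer sizes are all hypothetical; the announcement does not say which three models were merged or what ranks are used.

```python
import numpy as np

def step_speedup(steps_before: int, steps_after: int) -> float:
    """Speedup attributable to the sampling-step reduction alone."""
    return steps_before / steps_after

class SharedBaseLinear:
    """One frozen base weight plus named low-rank (LoRA) adapters."""

    def __init__(self, weight: np.ndarray):
        self.weight = weight            # shape (d_out, d_in), shared by all tasks
        self.adapters = {}              # name -> (down, up, scale)

    def add_adapter(self, name, down, up, scale=1.0):
        # down: (rank, d_in), up: (d_out, rank)
        self.adapters[name] = (down, up, scale)

    def forward(self, x, adapter=None):
        y = x @ self.weight.T           # base path, identical for every task
        if adapter is not None:
            down, up, scale = self.adapters[adapter]
            y = y + scale * (x @ down.T) @ up.T   # low-rank task-specific update
        return y

    def num_params(self) -> int:
        n = self.weight.size
        for down, up, _ in self.adapters.values():
            n += down.size + up.size
        return n

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4           # toy sizes, chosen for illustration only
layer = SharedBaseLinear(rng.standard_normal((d_out, d_in)))
for name in ("task_a", "task_b", "task_c"):   # hypothetical task names
    # Zero-initialised `up` means a fresh adapter leaves the base output unchanged.
    layer.add_adapter(name, rng.standard_normal((rank, d_in)), np.zeros((d_out, rank)))

steps_only = step_speedup(50, 8)        # 6.25
residual_factor = 15 / steps_only       # what else must contribute to reach ~15x
three_full_copies = 3 * d_out * d_in
print(steps_only, residual_factor, layer.num_params(), three_full_copies)
```

The step reduction alone gives 6.25×, so the remaining factor of roughly 2.4× in the reported figure would have to come from other savings that the announcement does not break down. In the toy layer, one base plus three rank-4 adapters holds 5,632 parameters versus 12,288 for three full copies; the real ratio depends on the actual model dimensions and adapter ranks.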

Whisper-Large Audio Upgrade

The audio encoder moves from Wav2Vec2 to Whisper-large, capturing finer phoneme changes, pronunciation rhythm, and multilingual prosody. This improves lip sync and full-body temporal stability—reducing jitter, frame skips, frozen frames, and identity drift in long videos.
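One practical consequence of switching encoders is a change in feature rate. Whisper's encoder emits 1,500 positions per 30-second window, i.e. 50 feature vectors per second, so audio features generally need to be resampled onto the video frame timeline before they can condition individual frames. The sketch below shows one simple way to do that with linear interpolation. It is an assumption for illustration: the announcement does not describe how the model aligns audio and video, and the function name and the 25 fps value are hypothetical.

```python
import numpy as np

WHISPER_FEATURE_HZ = 50  # 1500 encoder positions per 30 s window

def align_audio_to_frames(features: np.ndarray, video_fps: float, num_frames: int) -> np.ndarray:
    """Linearly interpolate (T_audio, D) features onto num_frames video timestamps."""
    t_audio = np.arange(features.shape[0]) / WHISPER_FEATURE_HZ   # seconds
    t_video = np.arange(num_frames) / video_fps                   # seconds
    # np.interp works on 1-D data, so interpolate each feature channel separately.
    columns = [np.interp(t_video, t_audio, features[:, d]) for d in range(features.shape[1])]
    return np.stack(columns, axis=1)                              # (num_frames, D)

# Two seconds of toy "features" whose value equals the audio position index.
toy = np.arange(100, dtype=np.float64).reshape(-1, 1)
aligned = align_audio_to_frames(toy, video_fps=25.0, num_frames=50)
print(aligned.shape)
```

At 25 fps each video frame spans two audio positions, so frame k in the toy example picks up the value at audio index 2k. A real pipeline might instead pool a window of neighbouring audio features per frame.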

Data Engineering

Offline annotation: Face keypoints, person count, body composition, audio-visual sync
Online validation: Filter transitions, black frames, flicker, frame skips
Multi-person data: Active speaker detection isolates single-speaker segments
Silent data: Natural micro-expressions, with no spurious lip motion on non-speakers
Emotion data: Frame-level emotion recognition aligns speech, expression, and body motion
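The online validation step in the list above can be approximated with simple per-frame statistics. The following is a rough sketch under stated assumptions, not the project's actual filter: the function names and all three thresholds are hypothetical, and a production pipeline would likely use more robust detectors.

```python
import numpy as np

def frame_luma(frames: np.ndarray) -> np.ndarray:
    """Mean luminance per frame for (T, H, W, 3) uint8 RGB input."""
    weights = np.array([0.299, 0.587, 0.114])       # ITU-R BT.601 luma weights
    return (frames.astype(np.float64) @ weights).mean(axis=(1, 2))

def validate_clip(frames, black_thresh=10.0, cut_thresh=40.0, flicker_thresh=15.0):
    """Flag black frames, abrupt transitions, and brightness flicker in one clip."""
    luma = frame_luma(frames)
    black_frame = bool((luma < black_thresh).any())

    # Transition: large mean absolute pixel change between consecutive frames.
    pixel_diff = np.abs(np.diff(frames.astype(np.float64), axis=0)).mean(axis=(1, 2, 3))
    transition = bool((pixel_diff > cut_thresh).any())

    # Flicker: brightness jumps one way and immediately back by a large amount.
    delta = np.diff(luma)
    reversal = (delta[:-1] * delta[1:] < 0)
    large = (np.abs(delta[:-1]) > flicker_thresh) & (np.abs(delta[1:]) > flicker_thresh)
    flicker = bool((reversal & large).any())

    return {"black_frame": black_frame, "transition": transition, "flicker": flicker}

def gray_clip(levels, size=8):
    """Build a synthetic clip where frame i is a flat gray image of value levels[i]."""
    return np.stack([np.full((size, size, 3), v, dtype=np.uint8) for v in levels])

steady = validate_clip(gray_clip([128] * 6))
hard_cut = validate_clip(gray_clip([60, 60, 60, 200, 200, 200]))
flickering = validate_clip(gray_clip([100, 130, 100, 130, 100, 130]))
print(steady, hard_cut, flickering)
```

A clip would be kept only if every flag is False. Frame-skip detection is omitted here because it typically needs timestamps or motion estimates rather than brightness alone.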

GRPO & Hand Stability

GRPO (Group Relative Policy Optimization) applies frame-level human preference alignment, correcting discontinuous motion, hand deformation, structural collapse, and expression-speech mismatch. For image-to-video and video-continuation tasks, first-frame hand detection is used to increase the share of training samples with visible hands, reducing hand artifacts in e-commerce, product showcase, and education scenarios.

EvalTalker Benchmark Results

The evaluation is built on EvalTalker and covers news, education, entertainment, and commercial scenarios. It draws on 770 evaluators, 13,240 subjective scores, and 10 structured expert analyses.

User Preference

Compared Model	LongCat-Video-Avatar 1.5 Win Rate
Kling Avatar 2.0	65.9%
OmniHuman 1.5	61.1%
HeyGen	54.3%

Key Metrics

Metric	Result
Single-person score	3.336
Multi-person score	2.730 (vs. InfiniteTalk 2.339)
Subject deformation rate	23.1%
Background deformation rate	9.4%
Frame skip rate	0.8% (lowest)
Face-body sync issue rate	5.1% (best)
Lip sync issue rate	29.8% (best)

Open Source Resources

GitHub: https://github.com/meituan-longcat/LongCat-Video
Hugging Face: LongCat-Video-Avatar-1.5
ModelScope: LongCat-Video-Avatar-1.5
Project Page: LongCat-Video-Avatar-1.5-Page
Tech Report: LongCat-Video-Avatar-1.5-Tech-Report.pdf
