arxiv: 2605.26486 · v1 · pith:752YPGXGnew · submitted 2026-05-26 · 💻 cs.CV

LongCat-Video-Avatar 1.5 Technical Report

Pith reviewed 2026-06-29 18:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords audio-driven video generation · avatar animation · open-source model · lip synchronization · temporal stability · inference distillation · RLHF training

The pith

LongCat-Video-Avatar 1.5 reports results competitive with or superior to closed-source avatar systems, obtained by upgrading the audio encoder, scaling training, applying RLHF, and distilling inference to fewer steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LongCat-Video-Avatar 1.5 as an open-source audio-driven video framework that prioritizes systematic engineering improvements over new model architectures. It claims these changes—switching to a Whisper Large audio encoder, scaling training, applying RLHF, and distilling to 8 NFE—produce accurate lip synchronization, full-body stability across long videos, strict identity preservation, and reliable handling of multi-person scenes or stylized content. A sympathetic reader would care because the work shows that open models can reach production-level output without proprietary restrictions, narrowing the practical gap to commercial tools. Validation rests on quantitative metrics plus human ratings from over 500 diverse test cases that place the model on par with or ahead of listed closed-source alternatives.

Core claim

LongCat-Video-Avatar 1.5 shows that upgrading the audio encoder to Whisper Large, scaling training recipes, applying rigorous data curation and RLHF, plus step distillation to an optimal 8 NFE produces accurate lip-synchronization, full-body temporal stability, robust long-video generation with identity consistency, and native support for complex conditions including multi-person interactions and stylized domains such as anime and animals, while achieving competitive or superior human-likeness and expert quality scores against leading closed-source systems on an internal benchmark of over 500 cases.

What carries the argument

The argument rests on a suite of targeted upgrades (a Whisper Large audio encoder, scaled training with RLHF, and 8-NFE step distillation) that together deliver stable, identity-consistent output and fast inference.

If this is right

  • The model generalizes directly to stylized domains such as anime and animals without extra fine-tuning.
  • Native support for multi-person interactions and object handling appears in the generated videos.
  • Inference reaches an operating point of 8 NFE (number of function evaluations, i.e. network calls per sample) that balances serving speed and visual fidelity.
  • Open-source release enables industrial deployment without closed API dependence.
  • Rigorous data curation and RLHF training support consistent identity across extended sequences.
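
To make the 8 NFE operating point concrete: NFE counts how many times the generative network is called to produce one sample. The following is a minimal sketch, assuming a flow-matching style ODE sampler with a fixed-step Euler solver. The summary above does not state which sampler the paper actually uses, and `euler_sample`, `toy_velocity`, and `TARGET` are hypothetical stand-ins for the real video model.

```python
from typing import Callable, List, Tuple

Vector = List[float]

def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Tuple[Vector, int]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final sample and the number of function evaluations (NFE).
    """
    x = list(x0)
    dt = 1.0 / num_steps
    nfe = 0
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)  # one network call counts as one function evaluation
        nfe += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, nfe

# Hypothetical "data" point that the toy flow should reach at t=1.
TARGET = [1.0, -2.0]

def toy_velocity(x: Vector, t: float) -> Vector:
    # Straight-line flow toward TARGET; stands in for the learned network.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]

sample, nfe = euler_sample(toy_velocity, [0.0, 0.0], num_steps=8)
print(nfe, sample)
```

In this toy setting the path is a straight line, so eight Euler steps land on the target; a real model has curved trajectories, which is why distillation is needed to keep quality at low step counts. If guidance requires extra network passes per step, the NFE per step would be higher than one; how the paper counts this is not stated in the summary.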

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar engineering-focused upgrades could be applied to other generative video tasks to improve stability without architectural changes.
  • Widespread use of the released model would let developers test avatar pipelines locally rather than through paid services.
  • Public release of the 500-case benchmark would allow direct comparison of future open models against the same reference.

Load-bearing premise

The paper's internal benchmark of over 500 test cases plus its human evaluation protocol provides an unbiased measure of real-world performance against closed-source competitors.

What would settle it

An independent evaluation on a separate public test set with new raters that shows LongCat-Video-Avatar 1.5 scoring lower than the listed closed-source systems on human-likeness or expert quality.
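
One way such an independent check could be scored is an exact sign test over paired rater preferences. This is a minimal sketch, assuming each judgment is recorded as a win or a loss for the open model against one closed-source system, with ties dropped beforehand. The helper name `sign_test_p_value` and the counts in the usage line are hypothetical and are not results from the paper.

```python
from math import comb

def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test on paired preferences (ties already removed)."""
    n = wins + losses
    if n == 0:
        return 1.0  # no informative comparisons
    k = min(wins, losses)
    # Probability, under a fair coin, of a split at least this lopsided on one side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Illustrative counts only: 10 informative comparisons, all won by one system.
p = sign_test_p_value(wins=10, losses=0)
print(p)
```

A small p-value would indicate that the preference split is unlikely under no real difference; it says nothing about whether the test set itself is representative, which is the separate concern raised above.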

Figures

Figures reproduced from arXiv: 2605.26486 by Feng Gao, Hongyu Liu, Jiamu Li, Le Li, Meituan LongCat Team: Xunliang Cai, Meng Cheng, Shuai Tan, Tianyu Yang, Weiheng Li, Xiaoming Wei, Yong Zhang, Zhe Kong.

Figure 1: Human evaluation. The overarching benchmark includes over 500 test samples with varying audio-visual…
Figure 2: Demonstration of generated video frames across various application scenarios, including broadcasting, acting,…
Figure 3: Overview of the two stage data curation pipeline.
Figure 4: Overview of the Emotion Data Filtering and Captioning Pipeline.
Figure 5: The overall pipeline of LongCat-Video-Avatar 1.5.
Figure 6: The lip synchronization comparison between Wav2vec2 and Whisper-large.
Figure 7: Visual illustration of the background character driving strategy. (a) w/o Silent Condition. (b) w/ Silent…
Figure 8: Human-likeness comparison across different methods in single person talking and multiple person…
Figure 9: Background distortion in rationality
Figure 11: Tone error accumulation in stability
Figure 13: Lip synchronization in harmony
Figure 15: Body naturalness in harmony
Figure 17: Visual comparison in rationality. Panel labels: HeyGen, Kling Avatar 2.0, Ours, OmniHuman-1.5.
Figure 18: Visual comparison in stability.
  Adjacent paper text (Section 5.3, Expert-level Objective Quality Evaluation): Furthermore, we conduct an expert-level objective quality analysis across four complementary perceptual dimensions: temporal stability, physical rationality, identity consistency, and harmony (i.e., audio-visual harmony). This decomposed evaluation enables the precise identification of strengths and remaining challenges. As illu…
Figure 19: Visual comparison in talking head scenarios.
Figure 20: Visual comparison in music scenarios.
  Adjacent paper text (Rationality): Rationality evaluates whether the synthesized avatar's movements, expressions, and environmental interactions comply with real-world physical laws and biomechanics, encompassing aspects like subject and background distortion. Figs. 9 and 10 detail the issue rates of these artifacts across various methods, revealing that physical rationality remains a prev…
Figure 21: Visual comparison in anime scenarios. Panel labels captured with this caption: Kling-Avatar 2.0, OmniHuman-1.5, HeyGen, Ours; phoneme labels: /xiǎo/, /wǒ/, /nǐ/, /lǜ/, /hé/.
Figure 22: Visual comparison in performance scenarios.
Figure 23: Visual comparison in emotional expression scenarios.
Figure 24: Stability comparison with LC-Video-Avatar 1.0.
Figure 25: Lip synchronization comparison between v1.0 and v1.5.
Original abstract

Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework prioritizing systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF Training, the model readily generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions, such as multi-person interactions and object handling. Furthermore, addressing the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to an optimal 8 NFE, achieving a favorable trade-off between serving efficiency and visual fidelity. The superiority of our approach is validated through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents LongCat-Video-Avatar 1.5, an open-source audio-driven video avatar framework that upgrades the audio encoder to Whisper Large, scales training recipes, applies RLHF for domain generalization (including anime and animals), and uses step distillation to reach 8 NFE inference. It claims accurate lip synchronization, full-body temporal stability, identity consistency, and multi-person/object handling, with the headline result that v1.5 achieves competitive or superior human-likeness and expert-level quality versus closed-source systems (HeyGen, OmniHuman 1.5, Kling Avatar 2.0) on an internal benchmark of over 500 diverse test cases.

Significance. If the evaluation protocol and results can be substantiated, the work would be significant as a production-oriented open-source release that narrows the gap to commercial closed-source avatar systems while providing measurable gains in inference efficiency and robustness to real-world conditions.

major comments (1)
  1. [Abstract and Evaluation] Abstract and Evaluation section: The central superiority claims rest exclusively on a human evaluation conducted on an internal benchmark of >500 cases, yet no details are supplied on benchmark construction, diversity sampling strategy, prompt distribution, human-study protocol (blinding, rating scales, number of raters, inter-rater reliability statistics), or controls for inference settings when comparing against closed-source systems. This absence is load-bearing because it prevents any assessment of selection bias or reproducibility of the reported performance advantage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our evaluation protocol. We agree that the current description of the human study is insufficient to allow independent assessment of selection bias or reproducibility, and we will expand the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: The central superiority claims rest exclusively on a human evaluation conducted on an internal benchmark of >500 cases, yet no details are supplied on benchmark construction, diversity sampling strategy, prompt distribution, human-study protocol (blinding, rating scales, number of raters, inter-rater reliability statistics), or controls for inference settings when comparing against closed-source systems. This absence is load-bearing because it prevents any assessment of selection bias or reproducibility of the reported performance advantage.

    Authors: We agree with the referee that the absence of these details is a significant limitation. In the revised manuscript we will add a new subsection under Evaluation that explicitly describes: (1) benchmark construction and diversity sampling strategy (including prompt distribution across domains, identities, and conditions); (2) the human-study protocol, including blinding procedures, rating scales, number of raters, and inter-rater reliability statistics; and (3) the controls applied to ensure fair inference settings when comparing against closed-source systems. We will also clarify any constraints on releasing the full benchmark while providing sufficient methodological detail for reproducibility assessment.
    revision: yes
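
The inter-rater reliability statistic mentioned in this exchange is not specified, so the following is only a sketch of one common choice, Fleiss' kappa, under the assumption that every clip is rated by the same number of raters on a categorical scale. The function name `fleiss_kappa` and the tiny rating table are hypothetical; nothing here reflects the paper's actual protocol or data.

```python
from typing import List

def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for categorical ratings.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of raters")
    n_cats = len(counts[0])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_exp = sum(p * p for p in p_cat)

    if p_exp == 1.0:
        return 1.0  # choice made for this sketch: one category used throughout
    return (p_obs - p_exp) / (1.0 - p_exp)

# Illustrative table: 2 clips, 3 raters, 2 categories ("natural", "unnatural").
print(fleiss_kappa([[3, 0], [0, 3]]))
```

Values near 1 indicate agreement well above chance and values near 0 indicate chance-level agreement. For ordinal rating scales a weighted statistic may be more appropriate than this unweighted form.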

Circularity Check

0 steps flagged

No significant circularity; empirical superiority claims rest on external closed-source comparisons and stated benchmark evaluation rather than self-referential fits or derivations.

Full rationale

The paper is a technical report focused on engineering upgrades (Whisper encoder, data curation, RLHF, step distillation) and reports performance via quantitative metrics plus human evaluation on an internal >500-case benchmark against external systems (HeyGen, OmniHuman, Kling). No mathematical derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing premises appear in the provided text. The central claim is an empirical comparison to independent external models, which does not reduce to the paper's own inputs by construction. This matches the default expectation of no circularity for systems papers whose results are benchmarked externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, new physical entities, or formal axioms are present in the abstract; the contribution is an empirical model release whose performance claims rest on data curation and evaluation choices not detailed here.


