LPM 1.0: Video-based Character Performance Model
Pith reviewed 2026-05-10 17:04 UTC · model grok-4.3
The pith
LPM 1.0 generates expressive, identity-stable conversational videos in real time from audio and text prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LPM 1.0 rests on a multimodal human-centric dataset built through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction. On this dataset, a 17B-parameter Diffusion Transformer (Base LPM) is trained for highly controllable, identity-consistent performance via multimodal conditioning, then distilled into a causal streaming generator (Online LPM) that supports low-latency, infinite-length interaction. Given a character image with identity-aware references, the model outputs listening videos from user audio and speaking videos from synthesized audio, with text prompts controlling motion, all in real time.
What carries the argument
Multimodal conditioning on audio, identity references, and text prompts inside a 17B Diffusion Transformer that is distilled into a causal streaming generator for controllable, infinite-horizon performance synthesis.
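The distillation-to-streaming step is what makes infinite-length, real-time generation plausible: each chunk is produced causally from fixed conditioning plus a bounded window of history, so memory does not grow with video length. A minimal sketch of such a loop, assuming a sliding history window stands in for whatever causal attention cache the model actually uses (all names and structures here are illustrative, not the paper's API):

```python
from collections import deque

def stream_generate(audio_chunks, identity_ref, text_prompt, context_len=4):
    """Toy causal streaming generator: each video chunk depends on
    (a) the newest audio chunk, (b) fixed identity/text conditioning,
    and (c) a bounded window of previously generated chunks. Because
    the history is bounded, the loop can run indefinitely."""
    context = deque(maxlen=context_len)  # bounded history -> constant memory
    for t, audio in enumerate(audio_chunks):
        # a distilled one-step denoiser would go here; no attention over
        # future chunks (causal), so output is available with low latency
        chunk = {
            "t": t,
            "audio": audio,
            "identity": identity_ref,
            "text": text_prompt,
            "history": list(context),
        }
        context.append(t)
        yield chunk

frames = list(stream_generate(["a0", "a1", "a2", "a3", "a4", "a5"],
                              identity_ref="ref.png",
                              text_prompt="wave hello"))
print(len(frames), frames[-1]["history"])
```

The design point is the `maxlen` bound: an offline bidirectional diffusion model attends over the whole clip, whereas the streamed variant only ever sees a fixed-size past, which is why identity stability over long horizons becomes the thing to verify rather than assume.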
If this is right
- It functions as a visual engine that supplies real-time listening and speaking behaviors for conversational agents, live-stream characters, and game NPCs.
- The model supports infinite-length generation while maintaining identity consistency across extended conversational turns.
- It delivers state-of-the-art scores on all dimensions of the new LPM-Bench benchmark at real-time inference speeds.
- Text prompts allow explicit motion control on top of audio-driven performance without requiring 3D rigs.
Where Pith is reading between the lines
- The single-person conversational focus could be extended to multi-character scenes if the identity-aware conditioning generalizes to mutual reactions.
- The speaking-listening data pairing technique may transfer to training other interactive visual systems such as virtual-reality avatars or telepresence.
- Deployment in open-ended user sessions would test whether identity stability persists beyond the lengths examined in the benchmark.
- Pairing the model with separate audio synthesis would create an end-to-end pipeline from text input to synchronized speech and visual performance.
Load-bearing premise
The strict filtering, speaking-listening pairing, and identity-aware extraction used to build the dataset, together with the distillation step, actually preserve expressiveness and identity stability without introducing artifacts or benchmark leakage.
What would settle it
Independent evaluation on LPM-Bench of long generated sequences: visible identity drift, motion artifacts, or scores below those claimed would refute the headline result; reproduction of the reported numbers would confirm it.
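Identity drift of the kind described is directly measurable: embed a face per generated frame with any off-the-shelf recognizer and track cosine similarity to the reference embedding over time. A toy sketch, where the embedding vectors and the 0.8 threshold are illustrative assumptions, not values from the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identity_drift(ref_embedding, frame_embeddings, threshold=0.8):
    """Return (min similarity, index of first frame below threshold or
    None). In practice the embeddings would come from a face recognizer
    applied to the reference image and to each generated frame."""
    sims = [cosine(ref_embedding, e) for e in frame_embeddings]
    for i, s in enumerate(sims):
        if s < threshold:
            return min(sims), i
    return min(sims), None

ref = [1.0, 0.0]
frames = [[0.99, 0.05], [0.9, 0.3], [0.5, 0.9]]  # gradually rotating away
print(identity_drift(ref, frames))
```

Plotting the similarity curve over a multi-minute generation would make any slow drift visible even when individual frames look plausible.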
read the original abstract
Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LPM 1.0, a video-based Large Performance Model for single-person full-duplex audio-visual conversational character performance. It addresses the 'performance trilemma' (expressiveness, real-time inference, long-horizon identity stability) by constructing a multimodal human-centric dataset via strict filtering, speaking-listening pairing, performance understanding, and identity-aware multi-reference extraction; training a 17B-parameter Diffusion Transformer (Base LPM) with multimodal conditioning; distilling it into a causal streaming Online LPM; and evaluating on the newly proposed LPM-Bench benchmark, where it claims state-of-the-art results across all dimensions while achieving real-time speed.
Significance. If the quantitative results, ablations, and controls hold, the work could meaningfully advance real-time video generation for interactive applications such as conversational agents, live-streaming characters, and game NPCs. The introduction of LPM-Bench as a standardized evaluation protocol for interactive performance is a constructive contribution to the field. The combination of large-scale diffusion training followed by distillation to a streaming model is technically interesting, but the absence of any numerical metrics, error bars, or dataset statistics in the abstract prevents assessment of whether the central claims are supported.
major comments (3)
- [Abstract] Abstract: The central claim that 'LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference' is asserted without any quantitative metrics, ablation tables, error bars, or description of how LPM-Bench was constructed or how data exclusions were handled. This directly undermines evaluation of the headline result.
- [Abstract] Abstract (dataset construction paragraph): The multimodal dataset is built via 'strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction,' yet no details are supplied on filtering criteria, diversity statistics, train/test split methodology, or controls for test-set contamination. Without these, it is impossible to determine whether the reported LPM-Bench numbers reflect genuine generalization or artifacts of the custom data pipeline.
- [Abstract] Abstract (distillation paragraph): The claim that distillation from the 17B Base Diffusion Transformer into the causal streaming Online LPM 'preserves' the same metrics is stated without any comparative numbers, degradation analysis, or latency/quality trade-off measurements. This step is load-bearing for the real-time claim and requires explicit verification.
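The comparison the third comment asks for is simple to report once Base LPM and Online LPM are scored on the same benchmark axes: a per-dimension relative degradation. A sketch with placeholder numbers (the dimension names and scores below are illustrative, not results from the paper):

```python
def distillation_degradation(base_scores, online_scores):
    """Relative quality drop from the Base model to the distilled Online
    model on each benchmark dimension. Positive values mean the distilled
    model is worse; near-zero values support the 'preserves' claim."""
    report = {}
    for dim, base in base_scores.items():
        online = online_scores[dim]
        report[dim] = (base - online) / base  # fraction of quality lost
    return report

# hypothetical scores, for illustration only
base = {"expressiveness": 0.90, "identity_stability": 0.95, "lip_sync": 0.88}
online = {"expressiveness": 0.87, "identity_stability": 0.94, "lip_sync": 0.86}
report = distillation_degradation(base, online)
print({k: round(v, 3) for k, v in report.items()})
```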
minor comments (1)
- [Abstract] Abstract: The phrase 'performance trilemma' is introduced without a formal definition or explicit metrics for each of the three axes (expressiveness, real-time inference, identity stability).
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the abstract should be more self-contained with quantitative support and dataset details to allow immediate assessment of the claims. We have revised the abstract accordingly while preserving its brevity. Point-by-point responses are provided below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference' is asserted without any quantitative metrics, ablation tables, error bars, or description of how LPM-Bench was constructed or how data exclusions were handled. This directly undermines evaluation of the headline result.
Authors: We agree that the original abstract was insufficiently quantitative. In the revised manuscript we have updated the abstract to include key quantitative results from LPM-Bench (comparative scores on expressiveness, identity stability, and latency against baselines) together with a concise description of LPM-Bench construction and data-exclusion protocols. Full ablation tables, error bars, and methodological details remain in the main text and supplementary material. This revision makes the central claim directly verifiable from the abstract. revision: yes
-
Referee: [Abstract] Abstract (dataset construction paragraph): The multimodal dataset is built via 'strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction,' yet no details are supplied on filtering criteria, diversity statistics, train/test split methodology, or controls for test-set contamination. Without these, it is impossible to determine whether the reported LPM-Bench numbers reflect genuine generalization or artifacts of the custom data pipeline.
Authors: We acknowledge the need for greater transparency on dataset construction even within the abstract. The revised abstract now specifies the filtering criteria (resolution, duration, and quality thresholds), provides high-level diversity statistics (total video hours and number of identities), describes the train/test split methodology (identity-disjoint partitioning), and notes controls for test-set contamination. Complete statistics and implementation details are given in Section 3 of the paper. revision: yes
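An identity-disjoint partition of the kind the response describes can be made deterministic by hashing the identity key, so every clip of a given person lands on the same side of the split regardless of processing order. A minimal sketch (the clip representation and field names are hypothetical):

```python
import hashlib

def identity_disjoint_split(clips, test_frac=0.2):
    """Partition (identity_id, clip_id) pairs so no identity appears in
    both train and test -- the contamination control the authors cite.
    Hashing the identity makes the assignment deterministic and
    automatically keeps all of one person's clips together."""
    train, test = [], []
    for identity, clip in clips:
        h = int(hashlib.sha256(identity.encode()).hexdigest(), 16)
        bucket = test if (h % 100) < test_frac * 100 else train
        bucket.append((identity, clip))
    return train, test

clips = [("alice", 1), ("alice", 2), ("bob", 1), ("carol", 1), ("carol", 2)]
train, test = identity_disjoint_split(clips)
train_ids = {i for i, _ in train}
test_ids = {i for i, _ in test}
assert train_ids.isdisjoint(test_ids)  # no identity leaks across the split
```

A random per-clip split, by contrast, would place different clips of the same person on both sides and inflate identity-consistency scores.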
-
Referee: [Abstract] Abstract (distillation paragraph): The claim that distillation from the 17B Base Diffusion Transformer into the causal streaming Online LPM 'preserves' the same metrics is stated without any comparative numbers, degradation analysis, or latency/quality trade-off measurements. This step is load-bearing for the real-time claim and requires explicit verification.
Authors: We agree that the abstract must explicitly verify the distillation outcome. The revised abstract now states that the Online LPM retains performance comparable to the Base LPM across LPM-Bench dimensions while achieving real-time inference, and it references the degradation analysis and latency-quality trade-offs. The full comparative numbers, degradation study, and measurements are provided in Section 4 and the supplementary material. revision: yes
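The real-time part of this claim reduces to one measurable quantity: the real-time factor, wall-clock generation time divided by the duration of video produced, where RTF ≤ 1.0 means the generator keeps up with playback. A minimal measurement harness (the 25 fps rate and the dummy per-frame generator are assumptions for illustration):

```python
import time

def real_time_factor(generate_fn, n_frames, fps=25):
    """Measure RTF = generation wall-clock time / video duration.
    `generate_fn` stands in for producing one frame (or chunk) with the
    streaming model; values below 1.0 indicate real-time capability."""
    start = time.perf_counter()
    for _ in range(n_frames):
        generate_fn()
    elapsed = time.perf_counter() - start
    return elapsed / (n_frames / fps)

# toy generator taking ~1 ms per frame, versus the 40 ms/frame budget at 25 fps
rtf = real_time_factor(lambda: time.sleep(0.001), n_frames=10)
print(f"RTF = {rtf:.3f}")
```

Reporting RTF alongside the quality metrics, on stated hardware, is what would make the latency/quality trade-off of the distilled model concrete.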
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper describes an empirical pipeline: construction of a custom multimodal dataset via filtering/pairing/extraction steps, training of a 17B Diffusion Transformer (Base LPM), distillation into a causal streaming Online LPM, and evaluation on the newly proposed LPM-Bench. No mathematical equations, first-principles derivations, or parameter-fitting steps are presented that reduce a claimed prediction or result to the inputs by construction. The SOTA claims are experimental performance measurements on the authors' benchmark rather than self-definitional outputs or fitted quantities renamed as predictions. No self-citations appear as load-bearing justifications for uniqueness or ansatz choices in the provided text. The central claims therefore retain independent empirical content and do not exhibit any of the enumerated circularity patterns.