ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation
Pith reviewed 2026-06-27 10:45 UTC · model grok-4.3
The pith
Argus converts multiple identity views into a synchronized dynamic memory that keeps generated video subjects recognizable across motion and viewpoint changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that Stacked Multi-View Identity Mosaic Injection (SMII) converts MLLM-selected multi-view evidence into a 3x3 mosaic, synchronizes it with the current diffusion time, and injects it as negative-time read-only memory in native token space. This is said to turn identity references into a compact dynamic distribution that remains disentangled from pose, lighting, and background, and thereby to enable subject preservation across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and condition conflicts.
What carries the argument
Stacked Multi-View Identity Mosaic Injection (SMII), which stacks selected identity views into a mosaic, synchronizes it with diffusion time, and injects it as read-only memory to form a dynamic identity distribution.
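The three steps named above (stack, synchronize, inject) can be sketched as follows. This is one possible reading, not the authors' implementation: the page gives no equations, so every function name here is hypothetical. The sketch assumes that "synchronize" means noising the mosaic to the same level t as the video latent via a linear interpolation, that "negative-time" means giving memory frames negative temporal position indices, and that "read-only" means video tokens may attend to memory tokens while memory tokens never attend to video tokens.

```python
import numpy as np

def build_mosaic(views):
    """Tile nine equally sized (H, W, C) views into one (3H, 3W, C) mosaic."""
    if len(views) != 9:
        raise ValueError("expected exactly 9 views for a 3x3 mosaic")
    rows = [np.concatenate(views[r * 3:(r + 1) * 3], axis=1) for r in range(3)]
    return np.concatenate(rows, axis=0)

def sync_to_time(mosaic, t, rng):
    """Noise the mosaic to level t in [0, 1].

    ASSUMPTION: linear interpolation between data and Gaussian noise;
    the actual synchronization schedule is not given on this page.
    """
    noise = rng.standard_normal(mosaic.shape)
    return (1.0 - t) * mosaic + t * noise

def read_only_mask(n_mem, n_vid):
    """Boolean mask[q, k], True where query q may attend to key k.

    Token order is [memory, video]. Memory queries see only memory keys,
    so video content cannot write into the memory (ASSUMED meaning of
    "read-only"); video queries see everything.
    """
    n = n_mem + n_vid
    mask = np.ones((n, n), dtype=bool)
    mask[:n_mem, n_mem:] = False
    return mask

def temporal_positions(n_mem_frames, n_vid_frames):
    """Memory frames get indices -n..-1, video frames 0..T-1 (ASSUMED)."""
    return np.concatenate([np.arange(-n_mem_frames, 0), np.arange(n_vid_frames)])
```

Under this reading, the mask and the position indices are the only things that distinguish the memory tokens from ordinary video tokens, which is the kind of token-space operation the referee asks the authors to spell out.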
If this is right
- Subject identity stays consistent under large yaw angles and first-frame occlusions without paired subject-video training data.
- Dynamic memory injection plus counterfactual self-supervision improves robustness to expression shifts and condition conflicts.
- The released HardID-Celeb benchmark together with YawScore and OccScore supply concrete metrics for testing identity stress.
- No-cross-pair training and temporal identity annealing allow large-scale self-supervision while avoiding identity leakage.
- The overall framework demonstrates that converting point references into synchronized dynamic distributions outperforms single-image adapters.
Where Pith is reading between the lines
- The same mosaic-injection pattern could be tested in other diffusion pipelines where multiple references must remain disentangled from scene variables.
- If the read-only memory mechanism generalizes, it might reduce reliance on large paired datasets for identity-consistent generation in adjacent tasks such as image-to-video or editing.
- Extending the MLLM director to non-human subjects would test whether the dynamic-distribution benefit holds outside face-centric video generation.
Load-bearing premise
MLLM-selected multi-view mosaics can be synchronized with diffusion timesteps and injected as negative-time read-only memory without becoming entangled with pose, lighting, or background statistics.
What would settle it
If videos generated by Argus on the HardID-Celeb benchmark show no gains over baselines in YawScore or OccScore, the claim that the injected mosaic creates a disentangled dynamic identity distribution would be falsified.
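One way to make "no gains over baselines" concrete is a paired bootstrap over per-video score differences. The sketch below is an assumption about how such a check could be run; the page does not specify a statistical test, and the function name `paired_bootstrap_ci` is hypothetical. It takes per-video scores for two systems on the same videos and returns the mean difference with a percentile confidence interval.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Mean paired difference (a - b) with a percentile bootstrap interval.

    scores_a and scores_b are per-video scores for two systems evaluated
    on the same videos, in the same order.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("scores must be 1-D arrays of equal length")
    diff = a - b
    rng = np.random.default_rng(seed)
    # Resample video indices with replacement, once per bootstrap replicate.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)
```

If the interval for both YawScore and OccScore contains zero, the comparison would not support a gain. A positive interval would show a gain but would not by itself attribute it to disentanglement rather than to reference volume or MLLM curation.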
Original abstract
Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3x3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Argus, a Wan-based framework for subject-preserving video generation that replaces the point-reference paradigm with Stacked Multi-View Identity Mosaic Injection (SMII). SMII uses an MLLM to select identity evidence, forms a 3x3 mosaic, synchronizes it to diffusion timesteps, and injects it as negative-time read-only memory in native token space to produce a compact dynamic identity distribution. Supporting components include an MLLM Identity Director for conflict resolution, no-cross-pair counterfactual self-supervision, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance. The paper releases the HardID-Celeb benchmark with new YawScore and OccScore metrics and reports SOTA results on OpenS2V-Eval Human-Domain (64.38 Total, 71.86 FaceSim) and HardID-Celeb (76.80 FaceSim, +12.60 YawScore, +15.10 OccScore over baselines).
Significance. If the core mechanism is shown to produce a disentangled dynamic identity distribution rather than simply increasing reference volume, the work would advance subject-consistent video synthesis by addressing viewpoint, occlusion, and condition-conflict robustness without paired supervision. The public HardID-Celeb benchmark and the two new stress metrics constitute a clear community contribution. The reported numerical gains on named benchmarks are substantial, but their attribution to the claimed dynamic-distribution effect versus MLLM curation or multi-reference volume remains to be isolated.
major comments (2)
- [§3] §3 (SMII description) and the Identity Director paragraph: the central claim that mosaic injection in negative-time read-only memory converts references into a compact dynamic distribution independent of pose, lighting, background, and camera statistics is not supported by an equation, pseudocode, or ablation that isolates the disentanglement effect from standard multi-reference cross-attention or simple concatenation. Without such evidence the reported FaceSim, YawScore, and OccScore gains could arise from increased reference count or MLLM selection rather than the claimed mechanism.
- [§4.2, Table 2] §4.2 and Table 2 (HardID-Celeb results): the 12.60-point YawScore and 15.10-point OccScore improvements are load-bearing for the “highly effective” conclusion, yet the manuscript does not report whether the strongest baselines were re-implemented with identical MLLM curation or reference volume; any mismatch would undermine the attribution to SMII and counterfactual training.
minor comments (2)
- [Abstract, §3] The abstract and §3 use “negative-time read-only memory” without defining the precise token-space operation or the synchronization schedule with diffusion timesteps; a short algorithmic box would improve reproducibility.
- [§4.1] OpenS2V-Eval and HardID-Celeb metric definitions (NexusScore, NaturalScore) are referenced but not restated; a one-paragraph appendix definition would aid readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the contributions of the HardID-Celeb benchmark and the new stress metrics. We address each major comment below with clarifications and proposed revisions where appropriate.
-
Referee: [§3] §3 (SMII description) and the Identity Director paragraph: the central claim that mosaic injection in negative-time read-only memory converts references into a compact dynamic distribution independent of pose, lighting, background, and camera statistics is not supported by an equation, pseudocode, or ablation that isolates the disentanglement effect from standard multi-reference cross-attention or simple concatenation. Without such evidence the reported FaceSim, YawScore, and OccScore gains could arise from increased reference count or MLLM selection rather than the claimed mechanism.
Authors: We agree that an explicit isolating ablation and pseudocode would strengthen the attribution of the dynamic-distribution effect. Section 3 describes the synchronization and negative-time read-only injection process in native token space, and Table 2 ablations compare SMII against multi-reference baselines; however, these do not fully isolate disentanglement from reference volume. We will add pseudocode for the mosaic synchronization and injection steps plus a targeted ablation that holds reference count and MLLM selection fixed while varying only the injection mechanism.
revision: yes
-
Referee: [§4.2, Table 2] §4.2 and Table 2 (HardID-Celeb results): the 12.60-point YawScore and 15.10-point OccScore improvements are load-bearing for the “highly effective” conclusion, yet the manuscript does not report whether the strongest baselines were re-implemented with identical MLLM curation or reference volume; any mismatch would undermine the attribution to SMII and counterfactual training.
Authors: The reported numbers use the original baseline implementations and published settings. To ensure attribution is not confounded by curation differences, we will re-run the strongest baselines with identical MLLM-selected references and reference volume, and update Table 2 and the corresponding text in the revision.
revision: yes
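The controlled comparison discussed in these responses can be laid out as a small configuration grid. This is a hedged sketch, not the authors' experimental code: the class name, the mechanism labels, and the default of nine references (taken from the 3x3 mosaic) are all assumptions made for illustration. The point is that every run shares the same reference count and selection method, so any score difference can only come from the injection mechanism.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationRun:
    injection: str        # how references enter the model (hypothetical labels)
    num_references: int   # held fixed across runs
    selection: str        # reference selection method, held fixed across runs

# Hypothetical labels: the proposed mechanism plus the two alternatives
# named in the referee comment (cross-attention and simple concatenation).
INJECTIONS = (
    "smii_negative_time_memory",
    "multi_ref_cross_attention",
    "token_concat",
)

def isolating_grid(num_references=9, selection="mllm"):
    """One run per injection mechanism with all other factors fixed."""
    return [AblationRun(inj, num_references, selection) for inj in INJECTIONS]

def differs_only_in_injection(runs):
    """True if the runs share every setting except the injection mechanism."""
    fixed = {(r.num_references, r.selection) for r in runs}
    mechanisms = {r.injection for r in runs}
    return len(fixed) == 1 and len(mechanisms) == len(runs)
```

Running `differs_only_in_injection(isolating_grid())` is a quick guard against accidentally changing the reference budget between arms of the ablation.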
Circularity Check
No significant circularity; empirical benchmark results with no load-bearing derivation chain
The paper reports empirical SOTA performance on OpenS2V-Eval Human-Domain (64.38 Total Score, 71.86 FaceSim) and HardID-Celeb (76.80 FaceSim) using the introduced SMII method, MLLM Identity Director, no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance. The abstract supplies no equations, fitted parameters renamed as predictions, or self-citations whose load-bearing premise reduces to the current work. Claims rest on external benchmark comparisons and described architectural choices rather than any self-definitional, uniqueness-imported, or ansatz-smuggled reduction. The derivation chain is therefore self-contained against the stated evaluation metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: an MLLM can reliably select informative identity moments and resolve conflicts among text, first-frame, and identity references.
invented entities (1)
- Stacked Multi-View Identity Mosaic: no independent evidence
Forward citations
Cited by 5 Pith papers
- OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation
  DRIVE-CHOREO uses three LLM agents to create a unified position-aware token sequence co-compressed with multi-view video, achieving SOTA BEV mAP of 21.6 and +2.4 NDS improvement on nuScenes.
- OrthoMotion: Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention
  OrthoMotion disentangles camera and subject motion in video generation by splitting attention into algebraically complementary geometric (RoPE rotation) and semantic (gated value) channels driven to orthogonality by a...
- ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number
  ParaScale extracts a gauge-invariant Parallax Number from a reference video and re-realizes the same parallax against the target scene's depth map to achieve scale-calibrated camera motion transfer.
- OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data
  OmniDirector introduces a grid-based camera representation and hierarchical prompt agent for multi-shot camera cloning in video diffusion models trained on million-scale unpaired data.
- TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning
  TRIDENT is a MARL framework using Richardson-Romberg gradient correction, Lyapunov-constrained trust-region updates, and a physics-informed residual critic that claims O(1/sqrt(K)) convergence to constrained Nash equi...