pith. sign in

arxiv: 2606.11670 · v1 · pith:ZGXYBSQXnew · submitted 2026-06-10 · 💻 cs.CV · cs.AI

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

Pith reviewed 2026-06-27 10:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords subject-preserving video generationidentity preservationmulti-view mosaic injectiondiffusion modelscounterfactual self-supervisionface similarity metricsvideo synthesis
0
0 comments X

The pith

Argus converts multiple identity views into a synchronized dynamic memory that keeps generated video subjects recognizable across motion and viewpoint changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that subject-preserving video generation fails when identity is collapsed into one static reference image entangled with pose, lighting, and background. It introduces Stacked Multi-View Identity Mosaic Injection to select multiple views, stack them into a 3x3 mosaic, synchronize the mosaic with diffusion time steps, and inject it as negative-time read-only memory. This converts identity into a compact dynamic distribution that stays separate from other scene factors. An MLLM Identity Director resolves condition conflicts while counterfactual training and self-likeness guidance add robustness without paired data. The approach yields higher face similarity and robustness scores on both standard and new identity-stress benchmarks.

Core claim

The paper claims that Stacked Multi-View Identity Mosaic Injection converts MLLM-selected multi-view evidence into a 3x3 mosaic, synchronizes it with the current diffusion time, and injects it as negative-time read-only memory in native token space, turning identity references into a compact dynamic distribution that remains disentangled from pose, lighting, and background and thereby enables subject preservation across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and condition conflicts.

What carries the argument

Stacked Multi-View Identity Mosaic Injection (SMII), which stacks selected identity views into a mosaic, synchronizes it with diffusion time, and injects it as read-only memory to form a dynamic identity distribution.

If this is right

  • Subject identity stays consistent under large yaw angles and first-frame occlusions without paired subject-video training data.
  • Dynamic memory injection plus counterfactual self-supervision improves robustness to expression shifts and condition conflicts.
  • The released HardID-Celeb benchmark together with YawScore and OccScore supply concrete metrics for testing identity stress.
  • No-cross-pair training and temporal identity annealing allow large-scale self-supervision while avoiding identity leakage.
  • The overall framework demonstrates that converting point references into synchronized dynamic distributions outperforms single-image adapters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mosaic-injection pattern could be tested in other diffusion pipelines where multiple references must remain disentangled from scene variables.
  • If the read-only memory mechanism generalizes, it might reduce reliance on large paired datasets for identity-consistent generation in adjacent tasks such as image-to-video or editing.
  • Extending the MLLM director to non-human subjects would test whether the dynamic-distribution benefit holds outside face-centric video generation.

Load-bearing premise

The premise that MLLM-selected multi-view mosaics can be synchronized with diffusion time steps and injected as negative-time read-only memory without becoming entangled with pose, lighting, or background statistics.

What would settle it

If videos generated by Argus on the HardID-Celeb benchmark show no gains over baselines in YawScore or OccScore, the claim that the injected mosaic creates a disentangled dynamic identity distribution would be falsified.

Figures

Figures reproduced from arXiv: 2606.11670 by Chengzhuo Tong, Jiwen Liu, Pengfei Wan, Xiaoqiang Liu, Yuanxing Zhang, Yufei Liu, Yulong Xu, Zijie Meng.

Figure 1
Figure 1. Figure 1: Argus overview. A multimodal large language model (MLLM) Identity Director compiles dynamic identity evidence from reference images/videos. Stacked Multi-View Identity Mosaic Injection (SMII) arranges selected identity observations into a 3 ×3 stacked mosaic whose cells may become n latent frames after VAE compression, synchronizes the mosaic latent with the current diffusion time, and injects it as negati… view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative comparison under identity-stress scenarios. (a) Large-yaw generation. Baselines [21, 22, 29, 54] suffer from side-face identity loss, geometric instability, or over-smoothed facial texture. Argus preserves sharper identity details and dynamic likeness during turning. (b) First-frame occlusion. Baselines either hallucinate the hidden face without identity injection [48], degrade under weak first… view at source ↗
Figure 3
Figure 3. Figure 3: Ablation on the MLLM Director and ASLG. The comparison shows that the MLLM Director plays a clear role in preserving fine-grained identity details, while ASLG significantly improves identity-specific texture and skin quality [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3*3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Argus, a Wan-based framework for subject-preserving video generation that replaces the point-reference paradigm with Stacked Multi-View Identity Mosaic Injection (SMII). SMII uses an MLLM to select identity evidence, forms a 3x3 mosaic, synchronizes it to diffusion timesteps, and injects it as negative-time read-only memory in native token space to produce a compact dynamic identity distribution. Supporting components include an MLLM Identity Director for conflict resolution, no-cross-pair counterfactual self-supervision, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance. The paper releases the HardID-Celeb benchmark with new YawScore and OccScore metrics and reports SOTA results on OpenS2V-Eval Human-Domain (64.38 Total, 71.86 FaceSim) and HardID-Celeb (76.80 FaceSim, +12.60 YawScore, +15.10 OccScore over baselines).

Significance. If the core mechanism is shown to produce a disentangled dynamic identity distribution rather than simply increasing reference volume, the work would advance subject-consistent video synthesis by addressing viewpoint, occlusion, and condition-conflict robustness without paired supervision. The public HardID-Celeb benchmark and the two new stress metrics constitute a clear community contribution. The reported numerical gains on named benchmarks are substantial, but their attribution to the claimed dynamic-distribution effect versus MLLM curation or multi-reference volume remains to be isolated.

major comments (2)
  1. [§3] §3 (SMII description) and the Identity Director paragraph: the central claim that mosaic injection in negative-time read-only memory converts references into a compact dynamic distribution independent of pose, lighting, background, and camera statistics is not supported by an equation, pseudocode, or ablation that isolates the disentanglement effect from standard multi-reference cross-attention or simple concatenation. Without such evidence the reported FaceSim, YawScore, and OccScore gains could arise from increased reference count or MLLM selection rather than the claimed mechanism.
  2. [§4.2, Table 2] §4.2 and Table 2 (HardID-Celeb results): the 12.60-point YawScore and 15.10-point OccScore improvements are load-bearing for the “highly effective” conclusion, yet the manuscript does not report whether the strongest baselines were re-implemented with identical MLLM curation or reference volume; any mismatch would undermine the attribution to SMII and counterfactual training.
minor comments (2)
  1. [Abstract, §3] The abstract and §3 use “negative-time read-only memory” without defining the precise token-space operation or the synchronization schedule with diffusion timesteps; a short algorithmic box would improve reproducibility.
  2. [§4.1] OpenS2V-Eval and HardID-Celeb metric definitions (NexusScore, NaturalScore) are referenced but not restated; a one-paragraph appendix definition would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the contributions of the HardID-Celeb benchmark and the new stress metrics. We address each major comment below with clarifications and proposed revisions where appropriate.

read point-by-point responses
  1. Referee: [§3] §3 (SMII description) and the Identity Director paragraph: the central claim that mosaic injection in negative-time read-only memory converts references into a compact dynamic distribution independent of pose, lighting, background, and camera statistics is not supported by an equation, pseudocode, or ablation that isolates the disentanglement effect from standard multi-reference cross-attention or simple concatenation. Without such evidence the reported FaceSim, YawScore, and OccScore gains could arise from increased reference count or MLLM selection rather than the claimed mechanism.

    Authors: We agree that an explicit isolating ablation and pseudocode would strengthen the attribution of the dynamic-distribution effect. Section 3 describes the synchronization and negative-time read-only injection process in native token space, and Table 2 ablations compare SMII against multi-reference baselines; however, these do not fully isolate disentanglement from reference volume. We will add pseudocode for the mosaic synchronization and injection steps plus a targeted ablation that holds reference count and MLLM selection fixed while varying only the injection mechanism. revision: yes

  2. Referee: [§4.2, Table 2] §4.2 and Table 2 (HardID-Celeb results): the 12.60-point YawScore and 15.10-point OccScore improvements are load-bearing for the “highly effective” conclusion, yet the manuscript does not report whether the strongest baselines were re-implemented with identical MLLM curation or reference volume; any mismatch would undermine the attribution to SMII and counterfactual training.

    Authors: The reported numbers use the original baseline implementations and published settings. To ensure attribution is not confounded by curation differences, we will re-run the strongest baselines (with identical MLLM-selected references and reference volume) and update Table 2 and the corresponding text in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results with no load-bearing derivation chain

full rationale

The paper reports empirical SOTA performance on OpenS2V-Eval Human-Domain (64.38 Total Score, 71.86 FaceSim) and HardID-Celeb (76.80 FaceSim) using the introduced SMII method, MLLM Identity Director, no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance. The abstract supplies no equations, fitted parameters renamed as predictions, or self-citations whose load-bearing premise reduces to the current work. Claims rest on external benchmark comparisons and described architectural choices rather than any self-definitional, uniqueness-imported, or ansatz-smuggled reduction. The derivation chain is therefore self-contained against the stated evaluation metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full manuscript text was referenced but not supplied, so free parameters, axioms, and invented entities cannot be exhaustively audited.

axioms (1)
  • domain assumption An MLLM can reliably select informative identity moments and resolve conflicts among text, first-frame, and identity references.
    Invoked when describing the Identity Director component.
invented entities (1)
  • Stacked Multi-View Identity Mosaic no independent evidence
    purpose: Convert multiple identity references into a compact dynamic distribution injected as negative-time read-only memory.
    New construct introduced to replace single-reference or adapter paradigms.

pith-pipeline@v0.9.1-grok · 5895 in / 1573 out tokens · 30829 ms · 2026-06-27T10:45:08.733258+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

    cs.CV 2026-06 unverdicted novelty 7.0

    DRIVE-CHOREO uses three LLM agents to create a unified position-aware token sequence co-compressed with multi-view video, achieving SOTA BEV mAP of 21.6 and +2.4 NDS improvement on nuScenes.

  2. OrthoMotion:Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention

    cs.CV 2026-06 unverdicted novelty 6.0

    OrthoMotion disentangles camera and subject motion in video generation by splitting attention into algebraically complementary geometric (RoPE rotation) and semantic (gated value) channels driven to orthogonality by a...

  3. ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

    cs.CV 2026-06 unverdicted novelty 6.0

    ParaScale extracts a gauge-invariant Parallax Number from a reference video and re-realizes the same parallax against the target scene's depth map to achieve scale-calibrated camera motion transfer.

  4. OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

    cs.CV 2026-06 unverdicted novelty 6.0

    OmniDirector introduces a grid-based camera representation and hierarchical prompt agent for multi-shot camera cloning in video diffusion models trained on million-scale unpaired data.

  5. TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

    cs.LG 2026-06 unverdicted novelty 5.0

    TRIDENT is a MARL framework using Richardson-Romberg gradient correction, Lyapunov-constrained trust-region updates, and a physics-informed residual critic that claims O(1/sqrt(K)) convergence to constrained Nash equi...

Reference graph

Works this paper leans on

64 extracted references · 1 canonical work pages · cited by 5 Pith papers

  1. [1]

    Self-rectifying diffusion sampling with perturbed-attention guidance

    Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. InEuropean Conference on Computer Vision, pages 1–17. Springer, 2024

  2. [2]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models.arXiv preprint arXiv:2405.04233, 2024

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, et al. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models.arXiv preprint arXiv:2405.04233, 2024

  3. [3]

    Yolo- world: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo- world: Real-time open-vocabulary object detection. InProc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2024

  4. [4]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4685–4694. IEEE, 2019

  5. [5]

    Cinema: Coherent multi-subject video generation via mllm-based guidance.arXiv preprint arXiv:2503.10391, 2025

    Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, and Chongyang Ma. Cinema: Coherent multi-subject video generation via mllm-based guidance.arXiv preprint arXiv:2503.10391, 2025

  6. [6]

    Magref: Masked guidance for any- reference video generation with subject disentanglement.arXiv preprint arXiv:2505.23742, 2025

    Yufan Deng, Yuanyang Yin, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Shenghai Yuan, Yiding Yang, Angtian Wang, Bo Liu, Haibin Huang, et al. Magref: Masked guidance for any- reference video generation with subject disentanglement.arXiv preprint arXiv:2505.23742, 2025

  7. [7]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  8. [8]

    Skyreels-a2: Compose anything in video diffusion transformers.arXiv preprint arXiv:2504.02436, 2025

    Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. Skyreels-a2: Compose anything in video diffusion transformers.arXiv preprint arXiv:2504.02436, 2025

  9. [9]

    Ingredients: Blending custom photos with video diffusion transformers.arXiv preprint arXiv:2501.01790, 2025

    Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, and Mingyuan Fan. Ingredients: Blending custom photos with video diffusion transformers.arXiv preprint arXiv:2501.01790, 2025

  10. [10]

    An image is worth one word: Personalizing text-to-image generation using textual inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. InICLR, 2023

  11. [11]

    Identity-preserving text-to-video generation via training-free prompt, image, and guidance enhancement

    Jiayi Gao, Changcheng Hua, Qingchao Chen, Yuxin Peng, and Yang Liu. Identity-preserving text-to-video generation via training-free prompt, image, and guidance enhancement. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13751–13757, 2025

  12. [12]

    Mochi 1.https://github.com/genmoai/models, 2024

    Genmo Team. Mochi 1.https://github.com/genmoai/models, 2024

  13. [13]

    Id-animator: Zero-shot identity-preserving human video generation.arXiv preprint arXiv:2404.15275, 2024

    Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id-animator: Zero-shot identity-preserving human video generation.arXiv preprint arXiv:2404.15275, 2024. 12

  14. [14]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021

  15. [15]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021

  16. [16]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  17. [17]

    Hunyuancustom: A multimodal-driven architecture for customized video generation, 2025

    Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video generation, 2025. URLhttps://arxiv.org/abs/2505.04512

  18. [18]

    Curricularface: Adaptive curriculum learning loss for deep face recognition

    Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, and Feiyue Huang. Curricularface: Adaptive curriculum learning loss for deep face recognition. In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5900–5909. IEEE Computer Society, 2020

  19. [19]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  20. [20]

    doi:10.1109/TPAMI.2025.3633890 , url =

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelli...

  21. [21]

    Vace: All- in-one video creation and editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All- in-one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025

  22. [22]

    Jimeng Video Generation System.https://jimeng.jianying.com/, 2025

    Jimeng Team. Jimeng Video Generation System.https://jimeng.jianying.com/, 2025

  23. [24]

    Guiding a diffusion model with a bad version of itself.Advances in Neural Information Processing Systems, 37:52996–53021, 2024

    Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself.Advances in Neural Information Processing Systems, 37:52996–53021, 2024

  24. [25]

    Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  25. [26]

    Bindweave: Subject-consistent video generation via cross-modal integration

    Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, and Zehuan Yuan. Bindweave: Subject-consistent video generation via cross-modal integration. InICLR, 2026. 13

  26. [27]

    Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024

  27. [28]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023

  28. [29]

    Phantom: Subject-consistent video generation via cross-modal alignment.arXiv preprint arXiv:2502.11079, 2025

    Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment.arXiv preprint arXiv:2502.11079, 2025

  29. [30]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55. Springer, 2024

  30. [31]

    Evalcrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024

  31. [32]

    Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023

    Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023

  32. [33]

    Magic-me: Identity-specific video customized diffusion

    Ze Ma, Daquan Zhou, Xue-She Wang, Chun-Hsiao Yeh, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, and Jiashi Feng. Magic-me: Identity-specific video customized diffusion. InEuropean Conference on Computer Vision, pages 19–37. Springer, 2024

  33. [34]

    Hailuo ai: Text-to-video generation platform.https://hailuoai.video/, 2024

    MiniMax. Hailuo ai: Text-to-video generation platform.https://hailuoai.video/, 2024. Accessed: 2026-05-04

  34. [35]

    Dreamdance: Animating human images by enriching 3d geometry cues from 2d poses

    Yatian Pang, Bin Zhu, Bin Lin, Mingzhe Zheng, Francis EH Tay, Ser-Nam Lim, Harry Yang, and Li Yuan. Dreamdance: Animating human images by enriching 3d geometry cues from 2d poses. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14039–14050, 2025

  35. [36]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  36. [37]

    Dreambench++: A human-aligned benchmark for personalized image generation

    Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://dreambenchplus.github.io/

  37. [38]

    Pika 2.1: Ai video generation model

    Pika Labs. Pika 2.1: Ai video generation model. https://pika.art/, 2025. Accessed: 2026-05-04

  38. [39]

    Qwen3-vl technical report, 2025

    Qwen Team, Alibaba Group. Qwen3-vl technical report, 2025. URLhttps://arxiv.org/ abs/2511.21631

  39. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 14

  40. [41]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  41. [42]

    Nano Banana 2: Combining Pro capabilities with lightning- fast speed

    Naina Raisinghani. Nano Banana 2: Combining Pro capabilities with lightning- fast speed. https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/, February 2026. Google Blog; accessed 2026-05-07

  42. [43]

    Sam 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. InThe Thirteenth International Conference on Learning Representations, 2025

  43. [44]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

  44. [45]

    World-grounded human motion recovery via gravity-view coordinates

    Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia Conference Proceedings, 2024

  45. [46]

    Step-video-t2v technical report: The practice, challenges, and future of video foundation model, 2025

    Step-Video Team. Step-video-t2v technical report: The practice, challenges, and future of video foundation model, 2025. URLhttps://arxiv.org/abs/2502.10248

  46. [47]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  47. [48]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

    Team Wan. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  48. [49]

    Magi-1: Autoregressive video generation at scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

  49. [50]

    Echovideo: Identity-preserving human video generation by multimodal feature fusion.arXiv preprint arXiv:2501.13452, 2025

    Jiangchuan Wei, Shiyue Yan, Wenfeng Lin, Boyuan Liu, Renjie Chen, and Mingyu Guo. Echovideo: Identity-preserving human video generation by multimodal feature fusion.arXiv preprint arXiv:2501.13452, 2025

  50. [51]

    Towards a better metric for text-to-video generation.arXiv preprint arXiv:2401.07781, 2024

    Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, et al. Towards a better metric for text-to-video generation.arXiv preprint arXiv:2401.07781, 2024

  51. [52]

    Videomaker: Zero-shot customized video generation with the inherent force of video diffusion models.arXiv preprint arXiv:2412.19645, 2024

    Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, and Xi Li. Videomaker: Zero-shot customized video generation with the inherent force of video diffusion models.arXiv preprint arXiv:2412.19645, 2024

  52. [53]

    Easyanimate: High-performance video generation framework with hybrid windows attention and reward backpropagation

    Jiaqi Xu, Kunzhe Huang, Xinyi Zou, Yunkuo Chen, Bo Liu, Mengli Cheng, Jun Huang, and Xing Shi. Easyanimate: High-performance video generation framework with hybrid windows attention and reward backpropagation. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10925–10934, 2025

  53. [54]

    Stand-in: A lightweight and plug-and-play identity control for video generation

    Bowen Xue, Zheng-Peng Duan, Qixin Yan, Wenjing Wang, Hao Liu, Chun-Le Guo, Chongyi Li, Chen Li, and Jing Lyu. Stand-in: A lightweight and plug-and-play identity control for video generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2026. 15

  54. [55]

    Effective whole-body pose estimation with two-stages distillation

    Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective whole-body pose estimation with two-stages distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023

  55. [56]

    Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  56. [57]

    Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

  57. [58]

    Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

    Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Rui- Jie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

  58. [59]

    Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation

    Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Chongyang Ma, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  59. [60]

    Identity-preserving text-to-video generation by frequency decomposition

    Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12978–12988, 2025

  60. [61]

    Gme: improving universal multimodal retrieval by multimodal llms.arXiv preprint arXiv:2412.16855, 2024

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Gme: improving universal multimodal retrieval by multimodal llms.arXiv preprint arXiv:2412.16855, 2024

  61. [62]

    Fantasyid: Face knowledge enhanced id-preserving video generation.arXiv preprint arXiv:2502.13995, 2025

    Yunpeng Zhang, Qiang Wang, Fan Jiang, Yaqi Fan, Mu Xu, and Yonggang Qi. Fantasyid: Face knowledge enhanced id-preserving video generation.arXiv preprint arXiv:2502.13995, 2025

  62. [63]

    Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024

  63. [64]

    Concat-id: Towards universal identity-preserving video synthesis

    Yong Zhong, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, and Chongxuan Li. Concat-id: Towards universal identity-preserving video synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1906–1915, 2025

  64. [65]

    Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

    Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024. 16