pith. machine review for the scientific record.

arxiv: 2605.03652 · v3 · submitted 2026-05-05 · 💻 cs.CV · cs.AI

Recognition: no theorem link

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

Tencent HY Team

Pith reviewed 2026-05-12 03:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords anime video generation · artistic conventions · production knowledge system · deformation-aware optimization · video generation model · preference optimization · stylistic motion

The pith

AniMatrix generates anime videos by encoding artistic conventions as controllable production variables instead of physical realism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard video models internalize physical laws as their core prior, but anime routinely violates them through deliberate effects such as smears, impact frames, and chibi shifts that have no single underlying physics. AniMatrix instead treats anime as a structured set of production choices: it builds a taxonomy of variables covering style, motion, camera, and effects, then infers those choices directly from reference frames. Dual-channel injection keeps the categorical directives precise, a curriculum gradually shifts the model toward expressive, non-physical motion, and a specialized optimizer separates intentional art from unwanted collapse. If the approach holds, generated anime sequences can match the medium's own standards of correctness as judged by professional animators rather than defaulting to photorealistic approximations.

Core claim

AniMatrix targets artistic rather than physical correctness in anime video generation through a dual-channel conditioning mechanism, a Production Knowledge System taxonomy of controllable variables, AniCaption inference, a style-motion-deformation curriculum, and deformation-aware preference optimization, achieving first place on four of five production dimensions in evaluations by professional animators.

What carries the argument

The Production Knowledge System, a structured taxonomy of anime production variables (Style, Motion, Camera, VFX) that is preserved by a trainable tag encoder and injected through cross-attention for fine control plus AdaLN for global enforcement.
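As a concrete reading of that mechanism, here is a minimal sketch of the dual-path injection, assuming hypothetical module names, a mean-pooled tag vector for the AdaLN path, and standard PyTorch attention; the paper's exact parameterization is not specified in this summary:

```python
import torch
import torch.nn as nn

class TagEncoder(nn.Module):
    """Trainable encoder that preserves field-value structure: each production
    tag is an embedded (field, value) pair, e.g. (Style, sakuga)."""
    def __init__(self, n_fields: int, n_values: int, d: int):
        super().__init__()
        self.field = nn.Embedding(n_fields, d)
        self.value = nn.Embedding(n_values, d)
        self.proj = nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d))

    def forward(self, fields, values):
        # fields, values: (B, T_tag) integer tensors -> (B, T_tag, d) tag tokens
        return self.proj(self.field(fields) + self.value(values))

class DualPathBlock(nn.Module):
    """One transformer block with dual-path injection: cross-attention over the
    concatenated [tags ; text] sequence (fine-grained control) plus AdaLN
    scale/shift derived from a pooled tag vector (global enforcement)."""
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d, elementwise_affine=False)
        self.xattn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.to_mod = nn.Linear(d, 2 * d)  # -> (scale, shift)

    def forward(self, x, tag_tokens, text_tokens):
        scale, shift = self.to_mod(tag_tokens.mean(dim=1)).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale[:, None]) + shift[:, None]  # Path 2: AdaLN
        ctx = torch.cat([tag_tokens, text_tokens], dim=1)         # Path 1 context
        out, _ = self.xattn(h, ctx, ctx)                          # cross-attention
        return x + out

# toy shapes: 2 clips, 4 tags, 16 text tokens, 64 latent tokens, width 128
enc = TagEncoder(n_fields=8, n_values=100, d=128)
tags = enc(torch.randint(8, (2, 4)), torch.randint(100, (2, 4)))
y = DualPathBlock(128)(torch.randn(2, 64, 128), tags, torch.randn(2, 16, 128))
print(y.shape)  # torch.Size([2, 64, 128])
```

Keeping the tags in their own trainable encoder, rather than flattening them into the narrative prompt, is what lets the categorical directives stay undiluted by open-ended text.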

If this is right

  • Video generation can override an embedded physics prior when the target domain uses deliberate violations of realism as its defining language.
  • Categorical production directives remain effective when kept separate from open-ended narrative text through dual-path injection.
  • Deformation-aware rewards allow training to reward expressive anime motion while penalizing only pathological artifacts.
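On the third point, a hedged sketch of what "deformation-aware" preference optimization could look like: a DPO-style objective whose preference pairs are chosen by a reward that credits expressive deformation and debits collapse. The pairing rule and the alpha weight are illustrative assumptions, not the paper's stated method:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard DPO objective over (winner, loser) clip log-likelihoods under
    the trained model and a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def pick_pair(artistry, collapse, alpha: float = 1.0):
    """Hypothetical deformation-aware pairing: score = artistry - alpha*collapse,
    so a clip with bold smears but no breakage outranks a physically 'correct'
    but flat one, and a collapsed clip becomes the loser."""
    score = artistry - alpha * collapse
    order = torch.argsort(score, descending=True)
    return order[0], order[-1]  # batch indices of winner and loser

# toy usage: four candidate clips for one prompt, scored by the reward model
artistry = torch.tensor([0.9, 0.4, 0.8, 0.2])  # expressive-motion score
collapse = torch.tensor([0.1, 0.0, 0.9, 0.8])  # pathological-artifact score
w, l = pick_pair(artistry, collapse)           # -> clip 0 beats clip 3
print(dpo_loss(torch.tensor(-1.0), torch.tensor(-2.0),
               torch.tensor(-1.2), torch.tensor(-1.8)))
```

The two-score split is the operative idea: a single "quality" scalar would be forced to punish smears and impact frames along with genuine artifacts.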

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same taxonomy-plus-curriculum pattern could be adapted for other stylized domains such as Western cartoon or comic-strip video generation.
  • Expanding the curriculum to handle intra-shot style shifts would test whether the method scales beyond single-style clips.
  • Public release of the taxonomy and reward model would let independent groups build anime-specific benchmarks that measure artistic fidelity directly.

Load-bearing premise

The Production Knowledge System taxonomy and AniCaption inference capture the full range of intentional artistic conventions without selection bias or mistaking stylistic choices for errors.

What would settle it

Evaluating the model on anime sequences that use artistic conventions outside the defined taxonomy and checking whether it reverts to flattened or physically realistic outputs.

Figures

Figures reproduced from arXiv: 2605.03652 by Tencent HY Team.

Figure 1: The Industrial Production Taxonomy T = S × M × C × V. Every clip is mapped to a coordinate in this four-axis production-variable space—Style (rendering paradigm and motion dialect), Motion (performance semantics and kinetic intensity), Camera (cinematographic framing and choreography), and VFX (anime-specific symbolic and technical effects)—forming a structured, navigable control space that the model canno…

Figure 2: Overview of the Creator-Language Dual-Channel Conditioning architecture. Production tags are encoded by a trainable Tag Transformer via field–value decomposition, while free-form directives pass through a frozen umT5-XXL encoder. The two representations are injected into the MoE DiT through complementary pathways: concatenated sequences via cross-attention (Path 1) for fine-grained spatial/temporal control…

Figure 3: Qualitative comparison on two prompts at opposite extremes of the artistic-control spectrum (rows: AniMatrix, Wan2.2, Seedance-Pro 1.0; columns: temporally ordered samples). Example 1 (top, sakuga). A character lunges forward in a low stance, trailed by straight energy beams across the night sky. AniMatrix renders the lunge with crisp straight beams; Wan2.2 collapses the beams into deformed smears with mot…

Figure 4: Compact excerpt of a structured caption highlighting three distinguishing design choices: (i) the temporally ordered motion array uses cross-references such as <subject_0> to subjects; (ii) the AnimeVisualEffects field carries the three-level VFX hierarchy (type/sub_type/sub_sub_type); (iii) global style and camera tags are kept separate from per-entity annotations. { "subjects": [ { "idx": 0, "TYPES": {"t…

Figure 5: Full structured caption for a single clip, expanded from the excerpt in Figure 4.

Figure 6: Compact excerpt of the three-section natural-language directive rewritten from the structured caption.

Figure 7: Full natural-language rewriting output for the structured caption in Figure 5.
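Read together, Figures 4–7 describe a structured caption that is then rewritten into a natural-language directive. Below is a hedged reconstruction of the record shape implied by the Figure 4 excerpt; every value, and every field name beyond subjects, idx, TYPES, and AnimeVisualEffects, is invented for illustration:

```python
# Illustrative only: mirrors the three design choices called out in Figure 4.
caption = {
    "subjects": [
        {"idx": 0, "TYPES": {"type": "character"}},  # per-entity annotations
    ],
    # (i) temporally ordered motion array, cross-referencing subjects by index
    "motions": [
        {"order": 0, "action": "<subject_0> lunges forward in a low stance"},
        {"order": 1, "action": "<subject_0> trails straight energy beams"},
    ],
    # (ii) three-level VFX hierarchy: type / sub_type / sub_sub_type
    "AnimeVisualEffects": [
        {"type": "kinetic", "sub_type": "beam", "sub_sub_type": "straight"},
    ],
    # (iii) global style and camera tags kept separate from per-entity fields
    "style": {"rendering": "sakuga"},
    "camera": {"framing": "low-angle"},
}
```

AniCaption infers a record of this kind from pixels; the natural-language rewriting (Figures 6–7) then verbalizes it for the frozen text encoder.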
Original abstract

Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We are preparing accompanying resources for public release to support reproducibility and follow-up research.
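The abstract's second step, the style-motion-deformation curriculum, amounts to a data-mixing (or objective-weighting) schedule that starts near-physical and ends fully expressive. A minimal sketch of one such schedule follows; the phase shapes and boundaries are assumptions, since the paper's actual ratios are not given here:

```python
def curriculum_weights(step: int, total_steps: int):
    """Hypothetical three-phase mixing schedule: style-dominated clips early,
    stylized motion mid-training, strong intentional deformation late."""
    p = step / total_steps
    w_style = max(0.0, 1.0 - 2.0 * p)     # fades out by mid-training
    w_motion = 1.0 - abs(2.0 * p - 1.0)   # peaks at the midpoint
    w_deform = max(0.0, 2.0 * p - 1.0)    # ramps in over the second half
    total = w_style + w_motion + w_deform
    return w_style / total, w_motion / total, w_deform / total

for step in (0, 2500, 5000, 7500, 10000):
    print(step, [round(w, 2) for w in curriculum_weights(step, 10000)])
# 0 -> style only; 5000 -> motion-dominated; 10000 -> deformation only
```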

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AniMatrix, a video generation model for anime that prioritizes artistic conventions (smears, impact frames, chibi shifts) over physical realism. It uses a Production Knowledge System taxonomy of production variables (Style, Motion, Camera, VFX), AniCaption to infer these as directives from pixels, a dual-channel conditioning mechanism (trainable tag encoder + frozen T5 with cross-attention and AdaLN), a three-step transition via style-motion-deformation curriculum, and deformation-aware preference optimization with a domain-specific reward model. On a human evaluation scored by professional animators across five production dimensions, AniMatrix ranks first on four, with largest gains over Seedance-Pro 1.0 of +0.70 (+22.4%) on Prompt Understanding and +0.55 (+16.9%) on Artistic Motion.

Significance. If the human evaluation holds under scrutiny, the work offers a concrete framework for domain-specific generative modeling that overrides physics-biased priors with structured artistic knowledge. The dual-channel injection, curriculum transition, and reward model separation of intentional deformation from collapse are technically interesting contributions that could generalize to other stylized domains. The explicit taxonomy and inference pipeline provide a reproducible starting point for follow-up, though significance is limited by the absence of independent validation for the core ontology.

major comments (2)
  1. [§4 (Human Evaluation)] The central claim that AniMatrix ranks first on four of five dimensions with specific gains (+0.70 on Prompt Understanding, +0.55 on Artistic Motion) rests on animator scores, yet the manuscript provides no details on the number of professional animators, prompt selection criteria, inter-rater agreement, or statistical significance tests. These omissions are load-bearing because the reported margins cannot be interpreted without them.
  2. [§3.1 (Production Knowledge System) and §3.3 (Deformation-aware Preference Optimization)] The same taxonomy is used to define AniCaption training directives, the reward model that distinguishes art from collapse, and the five-dimensional evaluation rubric. No independent validation, coverage study on held-out clips, or ablation comparing against an external rubric is described, so the performance margins risk reflecting alignment with the taxonomy's own priors rather than superior artistic fidelity.
minor comments (2)
  1. [Abstract] The abstract states that accompanying resources are being prepared for release but provides no link, repository, or timeline; adding this would strengthen reproducibility claims.
  2. [§3.2 (Dual-channel Conditioning)] Notation for the dual-path injection (cross-attention vs. AdaLN) could be clarified with a small diagram or explicit equations showing how categorical directives are preserved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address each of the major comments below and commit to revisions that will enhance the clarity and rigor of the manuscript.

Point-by-point responses
  1. Referee: [§4 (Human Evaluation)] The central claim that AniMatrix ranks first on four of five dimensions with specific gains (+0.70 on Prompt Understanding, +0.55 on Artistic Motion) rests on animator scores, yet the manuscript provides no details on the number of professional animators, prompt selection criteria, inter-rater agreement, or statistical significance tests. These omissions are load-bearing because the reported margins cannot be interpreted without them.

    Authors: We concur that the human evaluation results require additional context to be fully interpretable. The revised manuscript will incorporate details on the number of professional animators who participated in the study, the criteria used for selecting the evaluation prompts, measures of inter-rater agreement, and the statistical tests performed to assess the significance of the observed differences. These enhancements will be added to Section 4, ensuring that the reported gains can be properly evaluated. revision: yes

  2. Referee: [§3.1 (Production Knowledge System) and §3.3 (Deformation-aware Preference Optimization)] The same taxonomy is used to define AniCaption training directives, the reward model that distinguishes art from collapse, and the five-dimensional evaluation rubric. No independent validation, coverage study on held-out clips, or ablation comparing against an external rubric is described, so the performance margins risk reflecting alignment with the taxonomy's own priors rather than superior artistic fidelity.

    Authors: The referee raises a valid point regarding the potential for circularity in our use of the Production Knowledge System taxonomy across training, optimization, and evaluation. To mitigate this concern, the revised manuscript will include an expanded discussion in Section 3.1 on the independent derivation of the taxonomy from established anime production literature and expert consultation. Additionally, we will provide a coverage study on held-out clips and an ablation experiment contrasting our rubric with an external one. These additions aim to demonstrate that the performance improvements stem from the model's ability to capture artistic conventions rather than mere alignment with the taxonomy. revision: yes

Circularity Check

0 steps flagged

No circularity; performance claims rest on external human evaluation with no self-referential derivations.

Full rationale

The paper presents a descriptive architecture (dual-channel conditioning, style-motion-deformation curriculum, deformation-aware preference optimization) and reports results exclusively via external professional-animator rankings on five production dimensions. No equations, first-principles derivations, fitted-parameter predictions, or self-citation chains are described that reduce any claimed output to inputs by construction. The Production Knowledge System taxonomy and AniCaption are used for training directives, but the evaluation scores are independently collected human judgments rather than quantities defined or fitted within the paper itself, so the headline result rests on external benchmarks rather than on the paper's own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond standard deep-learning components; the Production Knowledge System and AniCaption are engineering constructs rather than new physical postulates.

pith-pipeline@v0.9.0 · 5589 in / 1217 out tokens · 34439 ms · 2026-05-12T03:35:32.117982+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 11 internal anchors

  1. [1]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    ByteDance Seed Team. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

  2. [2]

    Video Generation Models as World Simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. OpenAI Technical Report

  3. [3]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  4. [4]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  5. [5]

    Kling-Omni Technical Report

    Kling Team. Kling-omni technical report. arXiv preprint arXiv:2512.16776, 2025

  6. [6]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  7. [7]

    Seedance 1.5 Pro: A Native Audio-Visual Joint Generation Foundation Model

    Team Seedance. Seedance 1.5 pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507, 2025

  8. [8]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025

  9. [9]

    SkyReels-V4: Multi-Modal Video-Audio Generation, Inpainting and Editing Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, et al. Skyreels-v4: Multi-modal video-audio generation, inpainting and editing model. arXiv preprint arXiv:2602.21818, 2026

  10. [10]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020

  11. [11]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022

  12. [12]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023

  13. [13]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22563–22575, 2023

  14. [14]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023

  15. [15]

    The Illusion of Life: Disney Animation

    Frank Thomas and Ollie Johnston. The Illusion of Life: Disney Animation. Walt Disney Productions, 1981

  16. [16]

    AniSora: Exploring the Frontiers of Animation Video Generation in the Sora Era

    Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Jing Liu, Chao Xu, Siqi Wang, Yidi Wu, Bingwen Zhu, Xinwen Zhang, et al. Anisora: Exploring the frontiers of animation video generation in the sora era. arXiv preprint arXiv:2412.10255, 2024

  17. [17]

    Aligning Anime Video Generation with Human Feedback

    Bingwen Zhu, Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Yidi Wu, Huyang Sun, and Zuxuan Wu. Aligning anime video generation with human feedback. arXiv preprint arXiv:2504.10044, 2025

  18. [18]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  19. [19]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning (ICML), pages 41–48, 2009

  20. [20]

    Denoising task difficulty-based curriculum for training diffusion models

    Jin-Young Kim, Hyojun Go, Soonwoo Kwon, and Hyun-Gyoon Kim. Denoising task difficulty-based curriculum for training diffusion models. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025

  21. [21]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

  22. [22]

    VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

    Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. VideoDPO: Omni-preference alignment for video diffusion generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8009–8019, 2025. doi: 10.1109/CVPR52734.2025.00750

  23. [23]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021

  24. [24]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021

  25. [25]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 8780–8794, 2021

  26. [26]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  27. [27]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Liang, Wayne Liao, Tong Zhao, Yuxin Wu, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  28. [28]

    Vidu: A Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

    Fan Bao, Chendong Zhao, Guanbin Hao, Shanchuan Cao, Zhanzhan Liu, Zhaolong Zhang, Hanwang Li, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024

  29. [29]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-Wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13320–...

  30. [30]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  31. [31]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

  32. [32]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024

  33. [33]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2019

  34. [34]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  35. [35]

    VBench: Comprehensive Benchmark Suite for Video Generative Models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  36. [36]

    ToonCrafter: Generative Cartoon Interpolation

    Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Tooncrafter: Generative cartoon interpolation. ACM Transactions on Graphics, 43(6):1–11, 2024

  37. [37]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024

  38. [38]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022

  39. [39]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  40. [40]

    TabTransformer: Tabular Data Modeling Using Contextual Embeddings

    Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. Tabtransformer: Tabular data modeling using contextual embeddings. arXiv preprint arXiv:2012.06678, 2020

  41. [41]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorber, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML), 2024

  42. [42]

    FiLM: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  43. [43]

    Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018

  44. [44]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  45. [45]

    Compositional Visual Generation with Composable Diffusion Models

    Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 423–439, 2022

  46. [46]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8228–8238, 2024

  47. [47]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 8748–8763, 2021

  48. [48]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6613–6623, 2024

  49. [49]

    Seedance 2.0: Advancing Video Generation for World Complexity

    ByteDance Seed. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026

  50. [50]

    Pyscenedetect: Video scene cut detection and analysis tool

    Brandon Castellano. Pyscenedetect: Video scene cut detection and analysis tool. https://github.com/Breakthrough/PySceneDetect, 2020

  51. [51]

    Transnet v2: An effective deep network architecture for fast shot transition detection

    Tomáš Souček and Jakub Lokoč. TransNet V2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838, 2020

  52. [52]

    OpenCV: Open Source Computer Vision Library

    OpenCV Developers. OpenCV: Open source computer vision library. https://opencv.org/, 2024

  53. [53]

    YOLOX: Exceeding YOLO Series in 2021

    Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430, 2021

  54. [54]

    Unifying Flow, Stereo and Depth Estimation

    Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

  55. [55]

    VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590, 2025

  56. [56]

    PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

    Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J. Guibas. PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19855–19865, 2023
