pith. machine review for the scientific record.

arxiv: 2605.03652 · v3 · submitted 2026-05-05 · 💻 cs.CV · cs.AI

Recognition: no theorem link

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

Tencent HY Team

Pith reviewed 2026-05-12 03:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords anime video generation · artistic conventions · production knowledge system · deformation-aware optimization · video generation model · preference optimization · stylistic motion

The pith

AniMatrix generates anime videos by encoding artistic conventions as controllable production variables instead of physical realism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard video models internalize physical laws as their core prior, but anime routinely violates them through deliberate effects such as smears, impact frames, and chibi shifts that have no single underlying physics. AniMatrix instead treats anime as a structured set of production choices: it builds a taxonomy of variables covering style, motion, camera, and effects, then infers those choices directly from reference frames. Dual-channel injection keeps the categorical directives precise, a curriculum gradually shifts the model toward expressive, non-physical motion, and a specialized optimizer separates intentional art from unwanted collapse. If the approach holds, generated anime sequences can match the medium's own standards of correctness as judged by professional animators rather than defaulting to photorealistic approximations.

Core claim

AniMatrix targets artistic rather than physical correctness in anime video generation through a dual-channel conditioning mechanism, a Production Knowledge System taxonomy of controllable variables, AniCaption inference, a style-motion-deformation curriculum, and deformation-aware preference optimization, achieving first place on four of five production dimensions in evaluations by professional animators.

What carries the argument

The Production Knowledge System, a structured taxonomy of anime production variables (Style, Motion, Camera, VFX) that is preserved by a trainable tag encoder and injected through cross-attention for fine control plus AdaLN for global enforcement.
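As a concrete reading of that mechanism, here is a minimal sketch of the dual-path injection, assuming hypothetical module names, a mean-pooled tag vector for the AdaLN path, and standard PyTorch attention; the paper's exact parameterization is not specified in this summary:

```python
import torch
import torch.nn as nn

class TagEncoder(nn.Module):
    """Trainable encoder that preserves field-value structure: each production
    tag is an embedded (field, value) pair, e.g. (Style, sakuga)."""
    def __init__(self, n_fields: int, n_values: int, d: int):
        super().__init__()
        self.field = nn.Embedding(n_fields, d)
        self.value = nn.Embedding(n_values, d)
        self.proj = nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d))

    def forward(self, fields, values):
        # fields, values: (B, T_tag) integer tensors -> (B, T_tag, d) tag tokens
        return self.proj(self.field(fields) + self.value(values))

class DualPathBlock(nn.Module):
    """One transformer block with dual-path injection: cross-attention over the
    concatenated [tags ; text] sequence (fine-grained control) plus AdaLN
    scale/shift derived from a pooled tag vector (global enforcement)."""
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d, elementwise_affine=False)
        self.xattn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.to_mod = nn.Linear(d, 2 * d)  # -> (scale, shift)

    def forward(self, x, tag_tokens, text_tokens):
        scale, shift = self.to_mod(tag_tokens.mean(dim=1)).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale[:, None]) + shift[:, None]  # Path 2: AdaLN
        ctx = torch.cat([tag_tokens, text_tokens], dim=1)         # Path 1 context
        out, _ = self.xattn(h, ctx, ctx)                          # cross-attention
        return x + out

# toy shapes: 2 clips, 4 tags, 16 text tokens, 64 latent tokens, width 128
enc = TagEncoder(n_fields=8, n_values=100, d=128)
tags = enc(torch.randint(8, (2, 4)), torch.randint(100, (2, 4)))
y = DualPathBlock(128)(torch.randn(2, 64, 128), tags, torch.randn(2, 16, 128))
print(y.shape)  # torch.Size([2, 64, 128])
```

Keeping the tags in their own trainable encoder, rather than flattening them into the narrative prompt, is what lets the categorical directives stay undiluted by open-ended text.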

If this is right

  • Video generation can override an embedded physics prior when the target domain uses deliberate violations of realism as its defining language.
  • Categorical production directives remain effective when kept separate from open-ended narrative text through dual-path injection.
  • Deformation-aware rewards allow training to reward expressive anime motion while penalizing only pathological artifacts.
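On the third point, a hedged sketch of what "deformation-aware" preference optimization could look like: a DPO-style objective whose preference pairs are chosen by a reward that credits expressive deformation and debits collapse. The pairing rule and the alpha weight are illustrative assumptions, not the paper's stated method:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard DPO objective over (winner, loser) clip log-likelihoods under
    the trained model and a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def pick_pair(artistry, collapse, alpha: float = 1.0):
    """Hypothetical deformation-aware pairing: score = artistry - alpha*collapse,
    so a clip with bold smears but no breakage outranks a physically 'correct'
    but flat one, and a collapsed clip becomes the loser."""
    score = artistry - alpha * collapse
    order = torch.argsort(score, descending=True)
    return order[0], order[-1]  # batch indices of winner and loser

# toy usage: four candidate clips for one prompt, scored by the reward model
artistry = torch.tensor([0.9, 0.4, 0.8, 0.2])  # expressive-motion score
collapse = torch.tensor([0.1, 0.0, 0.9, 0.8])  # pathological-artifact score
w, l = pick_pair(artistry, collapse)           # -> clip 0 beats clip 3
print(dpo_loss(torch.tensor(-1.0), torch.tensor(-2.0),
               torch.tensor(-1.2), torch.tensor(-1.8)))
```

The two-score split is the operative idea: a single "quality" scalar would be forced to punish smears and impact frames along with genuine artifacts.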

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same taxonomy-plus-curriculum pattern could be adapted for other stylized domains such as Western cartoon or comic-strip video generation.
  • Expanding the curriculum to handle intra-shot style shifts would test whether the method scales beyond single-style clips.
  • Public release of the taxonomy and reward model would let independent groups build anime-specific benchmarks that measure artistic fidelity directly.

Load-bearing premise

The Production Knowledge System taxonomy and AniCaption inference capture the full range of intentional artistic conventions without selection bias or mistaking stylistic choices for errors.

What would settle it

Evaluating the model on anime sequences that use artistic conventions outside the defined taxonomy and checking whether it reverts to flattened or physically realistic outputs.

Figures

Figures reproduced from arXiv: 2605.03652 by Tencent HY Team.

Figure 1: The Industrial Production Taxonomy T = S × M × C × V. Every clip is mapped to a coordinate in this four-axis production-variable space—Style (rendering paradigm and motion dialect), Motion (performance semantics and kinetic intensity), Camera (cinematographic framing and choreography), and VFX (anime-specific symbolic and technical effects)—forming a structured, navigable control space that the model canno…

Figure 2: Overview of the Creator-Language Dual-Channel Conditioning architecture. Production tags are encoded by a trainable Tag Transformer via field–value decomposition, while free-form directives pass through a frozen umT5-XXL encoder. The two representations are injected into the MoE DiT through complementary pathways: concatenated sequences via cross-attention (Path 1) for fine-grained spatial/temporal control…

Figure 3: Qualitative comparison on two prompts at opposite extremes of the artistic-control spectrum (rows: AniMatrix, Wan2.2, Seedance-Pro 1.0; columns: temporally ordered samples). Example 1 (top, sakuga). A character lunges forward in a low stance, trailed by straight energy beams across the night sky. AniMatrix renders the lunge with crisp straight beams; Wan2.2 collapses the beams into deformed smears with mot…

Figure 4: Compact excerpt of a structured caption highlighting three distinguishing design choices: (i) the temporally ordered motion array uses cross-references such as <subject_0> to subjects; (ii) the AnimeVisualEffects field carries the three-level VFX hierarchy (type/sub_type/sub_sub_type); (iii) global style and camera tags are kept separate from per-entity annotations. { "subjects": [ { "idx": 0, "TYPES": {"t…

Figure 5: Full structured caption for a single clip, expanded from the excerpt in Figure 4.

Figure 6: Compact excerpt of the three-section natural-language directive rewritten from the structured caption.

Figure 7: Full natural-language rewriting output for the structured caption in Figure 5.
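Read together, Figures 4–7 describe a structured caption that is then rewritten into a natural-language directive. Below is a hedged reconstruction of the record shape implied by the Figure 4 excerpt; every value, and every field name beyond subjects, idx, TYPES, and AnimeVisualEffects, is invented for illustration:

```python
# Illustrative only: mirrors the three design choices called out in Figure 4.
caption = {
    "subjects": [
        {"idx": 0, "TYPES": {"type": "character"}},  # per-entity annotations
    ],
    # (i) temporally ordered motion array, cross-referencing subjects by index
    "motions": [
        {"order": 0, "action": "<subject_0> lunges forward in a low stance"},
        {"order": 1, "action": "<subject_0> trails straight energy beams"},
    ],
    # (ii) three-level VFX hierarchy: type / sub_type / sub_sub_type
    "AnimeVisualEffects": [
        {"type": "kinetic", "sub_type": "beam", "sub_sub_type": "straight"},
    ],
    # (iii) global style and camera tags kept separate from per-entity fields
    "style": {"rendering": "sakuga"},
    "camera": {"framing": "low-angle"},
}
```

AniCaption infers a record of this kind from pixels; the natural-language rewriting (Figures 6–7) then verbalizes it for the frozen text encoder.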
Original abstract

Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We are preparing accompanying resources for public release to support reproducibility and follow-up research.
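The abstract's second step, the style-motion-deformation curriculum, amounts to a data-mixing (or objective-weighting) schedule that starts near-physical and ends fully expressive. A minimal sketch of one such schedule follows; the phase shapes and boundaries are assumptions, since the paper's actual ratios are not given here:

```python
def curriculum_weights(step: int, total_steps: int):
    """Hypothetical three-phase mixing schedule: style-dominated clips early,
    stylized motion mid-training, strong intentional deformation late."""
    p = step / total_steps
    w_style = max(0.0, 1.0 - 2.0 * p)     # fades out by mid-training
    w_motion = 1.0 - abs(2.0 * p - 1.0)   # peaks at the midpoint
    w_deform = max(0.0, 2.0 * p - 1.0)    # ramps in over the second half
    total = w_style + w_motion + w_deform
    return w_style / total, w_motion / total, w_deform / total

for step in (0, 2500, 5000, 7500, 10000):
    print(step, [round(w, 2) for w in curriculum_weights(step, 10000)])
# 0 -> style only; 5000 -> motion-dominated; 10000 -> deformation only
```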

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AniMatrix, a video generation model for anime that prioritizes artistic conventions (smears, impact frames, chibi shifts) over physical realism. It uses a Production Knowledge System taxonomy of production variables (Style, Motion, Camera, VFX), AniCaption to infer these as directives from pixels, a dual-channel conditioning mechanism (trainable tag encoder + frozen T5 with cross-attention and AdaLN), a three-step transition via style-motion-deformation curriculum, and deformation-aware preference optimization with a domain-specific reward model. On a human evaluation scored by professional animators across five production dimensions, AniMatrix ranks first on four, with largest gains over Seedance-Pro 1.0 of +0.70 (+22.4%) on Prompt Understanding and +0.55 (+16.9%) on Artistic Motion.

Significance. If the human evaluation holds under scrutiny, the work offers a concrete framework for domain-specific generative modeling that overrides physics-biased priors with structured artistic knowledge. The dual-channel injection, curriculum transition, and reward model separation of intentional deformation from collapse are technically interesting contributions that could generalize to other stylized domains. The explicit taxonomy and inference pipeline provide a reproducible starting point for follow-up, though significance is limited by the absence of independent validation for the core ontology.

major comments (2)
  1. [§4 (Human Evaluation)] The central claim that AniMatrix ranks first on four of five dimensions with specific gains (+0.70 on Prompt Understanding, +0.55 on Artistic Motion) rests on animator scores, yet the manuscript provides no details on the number of professional animators, prompt selection criteria, inter-rater agreement, or statistical significance tests. These omissions are load-bearing because the reported margins cannot be interpreted without them.
  2. [§3.1 (Production Knowledge System) and §3.3 (Deformation-aware Preference Optimization)] The same taxonomy is used to define AniCaption training directives, the reward model that distinguishes art from collapse, and the five-dimensional evaluation rubric. No independent validation, coverage study on held-out clips, or ablation comparing against an external rubric is described, so the performance margins risk reflecting alignment with the taxonomy's own priors rather than superior artistic fidelity.
minor comments (2)
  1. [Abstract] The abstract states that accompanying resources are being prepared for release but provides no link, repository, or timeline; adding this would strengthen reproducibility claims.
  2. [§3.2 (Dual-channel Conditioning)] Notation for the dual-path injection (cross-attention vs. AdaLN) could be clarified with a small diagram or explicit equations showing how categorical directives are preserved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address each of the major comments below and commit to revisions that will enhance the clarity and rigor of the manuscript.

Point-by-point responses
  1. Referee: [§4 (Human Evaluation)] The central claim that AniMatrix ranks first on four of five dimensions with specific gains (+0.70 on Prompt Understanding, +0.55 on Artistic Motion) rests on animator scores, yet the manuscript provides no details on the number of professional animators, prompt selection criteria, inter-rater agreement, or statistical significance tests. These omissions are load-bearing because the reported margins cannot be interpreted without them.

    Authors: We concur that the human evaluation results require additional context to be fully interpretable. The revised manuscript will incorporate details on the number of professional animators who participated in the study, the criteria used for selecting the evaluation prompts, measures of inter-rater agreement, and the statistical tests performed to assess the significance of the observed differences. These enhancements will be added to Section 4, ensuring that the reported gains can be properly evaluated. revision: yes

  2. Referee: [§3.1 (Production Knowledge System) and §3.3 (Deformation-aware Preference Optimization)] The same taxonomy is used to define AniCaption training directives, the reward model that distinguishes art from collapse, and the five-dimensional evaluation rubric. No independent validation, coverage study on held-out clips, or ablation comparing against an external rubric is described, so the performance margins risk reflecting alignment with the taxonomy's own priors rather than superior artistic fidelity.

    Authors: The referee raises a valid point regarding the potential for circularity in our use of the Production Knowledge System taxonomy across training, optimization, and evaluation. To mitigate this concern, the revised manuscript will include an expanded discussion in Section 3.1 on the independent derivation of the taxonomy from established anime production literature and expert consultation. Additionally, we will provide a coverage study on held-out clips and an ablation experiment contrasting our rubric with an external one. These additions aim to demonstrate that the performance improvements stem from the model's ability to capture artistic conventions rather than mere alignment with the taxonomy. revision: yes

Circularity Check

0 steps flagged

No circularity; performance claims rest on external human evaluation with no self-referential derivations.

Full rationale

The paper presents a descriptive architecture (dual-channel conditioning, style-motion-deformation curriculum, deformation-aware preference optimization) and reports results exclusively via external professional-animator rankings on five production dimensions. No equations, first-principles derivations, fitted-parameter predictions, or self-citation chains are described that reduce any claimed output to inputs by construction. The Production Knowledge System taxonomy and AniCaption are used for training directives, but the evaluation scores are independently collected human judgments rather than quantities defined or fitted within the paper itself, so the headline result rests on external benchmarks rather than on the paper's own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond standard deep-learning components; the Production Knowledge System and AniCaption are engineering constructs rather than new physical postulates.

pith-pipeline@v0.9.0 · 5589 in / 1217 out tokens · 34439 ms · 2026-05-12T03:35:32.117982+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 11 internal anchors

  1. [1]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    ByteDance Seed Team. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

  2. [2]

    Video Generation Models as World Simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. OpenAI Technical Report

  3. [3]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  4. [4]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  5. [5]

    Kling-Omni Technical Report

    Kling Team. Kling-omni technical report. arXiv preprint arXiv:2512.16776, 2025

  6. [6]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  7. [7]

    Seedance 1.5 Pro: A Native Audio-Visual Joint Generation Foundation Model

    Team Seedance. Seedance 1.5 pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507, 2025

  8. [8]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025

  9. [9]

    SkyReels-V4: Multi-Modal Video-Audio Generation, Inpainting and Editing Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, et al. Skyreels-v4: Multi-modal video-audio generation, inpainting and editing model. arXiv preprint arXiv:2602.21818, 2026

  10. [10]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020

  11. [11]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022

  12. [12]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023

  13. [13]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22563–22575, 2023

  14. [14]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023

  15. [15]

    The Illusion of Life: Disney Animation

    Frank Thomas and Ollie Johnston. The Illusion of Life: Disney Animation. Walt Disney Productions, 1981

  16. [16]

    AniSora: Exploring the Frontiers of Animation Video Generation in the Sora Era

    Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Jing Liu, Chao Xu, Siqi Wang, Yidi Wu, Bingwen Zhu, Xinwen Zhang, et al. Anisora: Exploring the frontiers of animation video generation in the sora era. arXiv preprint arXiv:2412.10255, 2024

  17. [17]

    Aligning Anime Video Generation with Human Feedback

    Bingwen Zhu, Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Yidi Wu, Huyang Sun, and Zuxuan Wu. Aligning anime video generation with human feedback. arXiv preprint arXiv:2504.10044, 2025

  18. [18]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  19. [19]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning (ICML), pages 41–48, 2009

  20. [20]

    Denoising task difficulty-based curriculum for training diffusion models

    Jin-Young Kim, Hyojun Go, Soonwoo Kwon, and Hyun-Gyoon Kim. Denoising task difficulty-based curriculum for training diffusion models. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025

  21. [21]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

  22. [22]

    VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

    Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. VideoDPO: Omni-preference alignment for video diffusion generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8009–8019, 2025. doi: 10.1109/CVPR52734.2025.00750

  23. [23]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021

  24. [24]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021

  25. [25]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 8780–8794, 2021

  26. [26]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  27. [27]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Liang, Wayne Liao, Tong Zhao, Yuxin Wu, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  28. [28]

    Vidu: A Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

    Fan Bao, Chendong Zhao, Guanbin Hao, Shanchuan Cao, Zhanzhan Liu, Zhaolong Zhang, Hanwang Li, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024

  29. [29]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-Wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13320–...

  30. [30]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  31. [31]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

  32. [32]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024

  33. [33]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2019

  34. [34]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  35. [35]

    VBench: Comprehensive Benchmark Suite for Video Generative Models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  36. [36]

    ToonCrafter: Generative Cartoon Interpolation

    Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Tooncrafter: Generative cartoon interpolation. ACM Transactions on Graphics, 43(6):1–11, 2024

  37. [37]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024

  38. [38]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022

  39. [39]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  40. [40]

    TabTransformer: Tabular Data Modeling Using Contextual Embeddings

    Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. Tabtransformer: Tabular data modeling using contextual embeddings. arXiv preprint arXiv:2012.06678, 2020

  41. [41]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorber, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML), 2024

  42. [42]

    FiLM: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  43. [43]

    Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018

  44. [44]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  45. [45]

    Compositional Visual Generation with Composable Diffusion Models

    Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 423–439, 2022

  46. [46]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8228–8238, 2024

  47. [47]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 8748–8763, 2021

  48. [48]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6613–6623, 2024

  49. [49]

    Seedance 2.0: Advancing Video Generation for World Complexity

    ByteDance Seed. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026

  50. [50]

    Pyscenedetect: Video scene cut detection and analysis tool

    Brandon Castellano. Pyscenedetect: Video scene cut detection and analysis tool. https://github.com/Breakthrough/PySceneDetect, 2020

  51. [51]

    Transnet v2: An effective deep network architecture for fast shot transition detection

    Tomáš Souček and Jakub Lokoč. TransNet V2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838, 2020

  52. [52]

    OpenCV: Open Source Computer Vision Library

    OpenCV Developers. OpenCV: Open source computer vision library. https://opencv.org/, 2024

  53. [53]

    YOLOX: Exceeding YOLO Series in 2021

    Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430, 2021

  54. [54]

    Unifying Flow, Stereo and Depth Estimation

    Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

  55. [55]

    VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590, 2025

  56. [56]

    PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

    Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J. Guibas. PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19855–19865, 2023
