pith. machine review for the scientific record.

arxiv: 2605.01725 · v2 · submitted 2026-05-03 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Motion-Aware Caching for Efficient Autoregressive Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords autoregressive video generation · motion-aware caching · denoising acceleration · cache reuse · video synthesis · diffusion models

The pith

MotionCache uses inter-frame differences to dynamically skip denoising steps for static pixels in autoregressive video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard cache reuse in autoregressive video models skips too coarsely: it ignores that high-motion pixels need extra denoising steps to stop error buildup. It introduces MotionCache, which treats simple inter-frame differences as a stand-in for actual pixel motion and varies cache update rates token by token after an initial warm-up phase. Experiments on SkyReels-V2 and MAGI-1 show this yields large speed gains while holding generation quality nearly constant. A sympathetic reader would care because sequential denoising is the main bottleneck for making long videos practical.

Core claim

MotionCache links cache errors directly to residual instability and then applies a coarse-to-fine motion-weighted reuse scheme: an early warm-up phase secures semantic coherence, after which inter-frame differences dictate per-token denoising frequencies so that high-motion regions receive more updates and static regions receive fewer.
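
Concretely, the proxy-to-frequency mapping could look like the minimal sketch below (PyTorch; the patch size, interval bounds, and min-max normalization are illustrative assumptions, not values taken from the paper):

    import torch
    import torch.nn.functional as F

    def motion_update_intervals(prev_frame, curr_frame, patch=16,
                                min_interval=1, max_interval=4):
        # Per-pixel inter-frame difference, averaged over channels: (H, W).
        diff = (curr_frame - prev_frame).abs().mean(dim=0)
        # Pool pixel differences into one motion score per token (patch).
        scores = F.avg_pool2d(diff[None, None], kernel_size=patch).flatten()
        # Normalize to [0, 1]; the epsilon guards an all-static frame pair.
        m = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
        # High motion -> short interval (frequent updates); static -> long.
        intervals = max_interval - m * (max_interval - min_interval)
        return intervals.round().long()

Tokens whose interval rounds to 1 would be recomputed at every denoising step; near-static tokens would be refreshed only every max_interval steps.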

What carries the argument

MotionCache, a framework that treats inter-frame differences as a lightweight proxy for pixel-level motion and uses them to set dynamic cache update frequencies during iterative denoising.
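
Schematically, such per-token frequencies could drive cache reuse during denoising as below; model.features(..., tokens=...) and model.step(...) are hypothetical hooks standing in for whatever partial-computation interface the model exposes, not the authors' released API:

    def denoise_with_cache(model, latents, timesteps, intervals, warmup=4):
        # Warm-up: recompute every token to establish semantic coherence.
        # Afterwards a token is refreshed only when its interval elapses;
        # otherwise its cached feature is reused.
        cache = None
        for s, t in enumerate(timesteps):
            if s < warmup or cache is None:
                feats = model.features(latents, t)      # full forward pass
            else:
                refresh = (s % intervals == 0)          # per-token bool mask
                feats = cache.clone()
                feats[refresh] = model.features(latents, t, tokens=refresh)
            cache = feats
            latents = model.step(latents, feats, t)     # one denoising update
        return latents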

If this is right

  • Produces 6.28 times faster generation on SkyReels-V2 while dropping VBench score by only 1 percent.
  • Produces 1.64 times faster generation on MAGI-1 while dropping VBench score by only 0.01 percent.
  • Preserves output quality by allocating more denoising steps exactly where motion is high.
  • Requires only an initial warm-up phase before motion-weighted reuse begins.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same difference-based proxy could be tested on other autoregressive tasks such as audio waveform generation where local change signals computational need.
  • Hardware energy use for long video synthesis would drop in proportion to the observed speedups if the method generalizes.
  • Extending the warm-up length or combining it with learned motion estimators might further reduce quality variance on complex scenes.

Load-bearing premise

Inter-frame differences accurately indicate which pixels have enough motion to need extra denoising steps to prevent error accumulation.
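
One direct probe of this premise would be a rank correlation between the proxy and an oracle error signal; the sketch below assumes both have been gathered offline as per-token arrays and is not drawn from the paper's code:

    import numpy as np

    def proxy_error_rank_correlation(diff_scores, token_errors):
        # diff_scores: per-token inter-frame differences (the proxy).
        # token_errors: per-token cache error measured against a full,
        # uncached denoising run (the oracle).
        rd = np.argsort(np.argsort(diff_scores))
        re_ = np.argsort(np.argsort(token_errors))
        n = len(rd)
        # Spearman's rho, no-ties form; a value near 1 supports the premise,
        # a value near 0 undermines it.
        return 1 - 6 * np.sum((rd - re_) ** 2) / (n * (n ** 2 - 1))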

What would settle it

Run full denoising versus MotionCache on the same high-motion video clip and measure whether VBench or perceptual scores drop more than the reported 1 percent when the motion proxy is applied.
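
As a rough stand-in for the VBench comparison, one could measure how far the cached output drifts from the full run on the same clip. A minimal PSNR-gap check, assuming frame tensors scaled to [0, 1]:

    import torch

    def psnr_gap(full_frames, cached_frames):
        # Mean-squared error between the two renderings of the same clip.
        mse = torch.mean((full_frames - cached_frames) ** 2)
        # PSNR for [0, 1] inputs; a sharp drop on high-motion clips would
        # falsify the claim that quality is preserved.
        return (10 * torch.log10(1.0 / (mse + 1e-12))).item()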

read the original abstract

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $\textbf{6.28}\times$ and $\textbf{1.64}\times$ respectively, while effectively preserving generation quality (VBench: $1\%\downarrow$ and $0.01\%\downarrow$ respectively). The code is available at https://github.com/ywlq/MotionCache.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MotionCache, a motion-aware caching framework for autoregressive video generation models. It uses inter-frame pixel differences as a lightweight proxy for per-token motion to dynamically decide cache reuse during iterative denoising, after an initial warm-up phase for semantic coherence. The central empirical claims are speedups of 6.28× on SkyReels-V2 and 1.64× on MAGI-1 with negligible quality degradation (VBench drops of 1% and 0.01%). The approach is presented as an engineering improvement that exploits the insight that high-motion pixels require more denoising steps to avoid error accumulation.

Significance. If the inter-frame proxy reliably bounds residual instability, the method could enable practical long-video autoregressive generation at lower cost. The concrete speed/quality numbers on two models and the public code release strengthen the contribution. However, the absence of a derivation for the cache-error-to-instability link and the lack of ablations against stronger motion signals (optical flow, oracle error) make it difficult to assess whether the reported gains generalize beyond the tested videos or are artifacts of low-motion content.

major comments (3)
  1. [Abstract and §3] The claim that cache errors are linked to residual instability is asserted without a derivation, error bounds, or analysis of how inter-frame difference magnitude maps to the number of denoising steps needed to prevent accumulation. This link is load-bearing for the proxy's validity.
  2. [§4 (Experiments)] No ablation is reported that replaces the inter-frame-difference proxy with an oracle motion signal (e.g., optical flow or ground-truth per-token error) and re-measures both speedup and VBench. Without this, it is impossible to confirm that the 6.28× figure is not an artifact of the specific test set's motion statistics.
  3. [§4.2 (results tables)] The reported VBench deltas lack error bars, standard deviations across seeds, or multiple runs, making it hard to judge whether the 1% and 0.01% drops are statistically distinguishable from noise.
minor comments (2)
  1. [§3.2] Notation for the motion weight and update frequency is introduced without a compact equation; a single displayed equation would improve clarity.
  2. [Abstract] The GitHub link is given but the README does not specify the exact command lines or random seeds used to reproduce the SkyReels-V2 6.28× figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each point below and describe the planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] The claim that cache errors are linked to residual instability is asserted without a derivation, error bounds, or analysis of how inter-frame difference magnitude maps to the number of denoising steps needed to prevent accumulation. This link is load-bearing for the proxy's validity.

    Authors: We acknowledge that the theoretical formalization in §3 provides a motivational link rather than a rigorous derivation with explicit error bounds. The connection is based on the principle that residual errors in high-motion regions propagate more rapidly in the autoregressive setting due to the iterative nature of denoising. In the revised manuscript, we will expand this section with a more detailed heuristic analysis, including how inter-frame difference magnitudes correlate with the number of necessary denoising steps, supported by additional visualizations of error accumulation. revision: partial

  2. Referee: [§4 (Experiments)] No ablation is reported that replaces the inter-frame-difference proxy with an oracle motion signal (e.g., optical flow or ground-truth per-token error) and re-measures both speedup and VBench. Without this, it is impossible to confirm that the 6.28× figure is not an artifact of the specific test set's motion statistics.

    Authors: We agree that comparing against stronger motion signals would provide valuable insight. Computing an oracle per-token error is expensive because it requires full denoising without caching. We will include an additional ablation using optical flow as a proxy in the revised experiments section, evaluating its impact on speedup and quality metrics on the same test sets (a minimal flow-based proxy is sketched after these responses). revision: yes

  3. Referee: [§4.2 (results tables)] The reported VBench deltas lack error bars, standard deviations across seeds, or multiple runs, making it hard to judge whether the 1% and 0.01% drops are statistically distinguishable from noise.

    Authors: This is a valid point. The original experiments were run with a single seed due to the high computational cost of video generation. In the revision, we will perform multiple runs with different random seeds and report the mean VBench scores along with standard deviations to better assess the statistical significance of the observed quality changes. revision: yes
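
For reference, the promised ablation could swap the raw difference for a dense-flow magnitude pooled to token granularity. This sketch uses OpenCV's Farneback flow with generic default parameters, not settings from the paper:

    import cv2
    import numpy as np

    def flow_magnitude_proxy(prev_gray, curr_gray, patch=16):
        # Dense Farneback flow between consecutive grayscale (uint8) frames.
        # Positional args: pyr_scale, levels, winsize, iterations, poly_n,
        # poly_sigma, flags -- standard defaults, not tuned values.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)              # per-pixel speed (H, W)
        h, w = mag.shape
        mag = mag[:h - h % patch, :w - w % patch]       # crop to the patch grid
        tokens = mag.reshape(h // patch, patch,
                             w // patch, patch).mean(axis=(1, 3))
        return tokens.flatten()                         # per-token motion signal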

Circularity Check

0 steps flagged

No significant circularity; empirical engineering with external validation

full rationale

The paper presents MotionCache as an empirical caching heuristic that uses inter-frame pixel differences as a lightweight proxy for per-token denoising needs. Speedup claims (6.28× on SkyReels-V2, 1.64× on MAGI-1) are measured against external models and reported via VBench scores on held-out video sets. No equations, fitted parameters, or self-citations are shown to reduce the reported gains to quantities defined inside the paper itself. The stated theoretical link between cache error and residual instability is presented as motivation rather than a closed derivation that forces the outcome by construction. The method therefore remains an engineering improvement whose performance is independently falsifiable on new models and datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that inter-frame differences correlate with required denoising steps and that a coarse-to-fine schedule can be safely applied without quality collapse.

axioms (1)
  • domain assumption: Inter-frame differences serve as a reliable proxy for pixel-level motion that dictates denoising frequency.
    Invoked to justify motion-weighted cache reuse; no independent validation of the proxy's strength is shown in the abstract.

pith-pipeline@v0.9.0 · 5545 in / 1142 out tokens · 44015 ms · 2026-05-15T07:12:13.659844+00:00 · methodology

discussion (0)

