pith. machine review for the scientific record.

arxiv: 2605.01725 · v2 · submitted 2026-05-03 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Motion-Aware Caching for Efficient Autoregressive Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords autoregressive video generation · motion-aware caching · denoising acceleration · cache reuse · video synthesis · diffusion models

The pith

MotionCache uses inter-frame differences to dynamically skip denoising steps for static pixels in autoregressive video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard cache reuse in autoregressive video models skips too coarsely: it ignores that high-motion pixels need extra denoising steps to stop error buildup. It introduces MotionCache, which treats simple inter-frame differences as a stand-in for actual pixel motion and varies cache update rates token by token after an initial warm-up phase. Experiments on SkyReels-V2 and MAGI-1 show this yields large speed gains while holding generation quality nearly constant. A sympathetic reader would care because sequential denoising is the main bottleneck for making long videos practical.

Core claim

MotionCache links cache errors directly to residual instability and then applies a coarse-to-fine motion-weighted reuse scheme: an early warm-up phase secures semantic coherence, after which inter-frame differences dictate per-token denoising frequencies so that high-motion regions receive more updates and static regions receive fewer.
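
Concretely, the proxy-to-frequency mapping could look like the minimal sketch below (PyTorch; the patch size, interval bounds, and min-max normalization are illustrative assumptions, not values taken from the paper):

    import torch
    import torch.nn.functional as F

    def motion_update_intervals(prev_frame, curr_frame, patch=16,
                                min_interval=1, max_interval=4):
        # Per-pixel inter-frame difference, averaged over channels: (H, W).
        diff = (curr_frame - prev_frame).abs().mean(dim=0)
        # Pool pixel differences into one motion score per token (patch).
        scores = F.avg_pool2d(diff[None, None], kernel_size=patch).flatten()
        # Normalize to [0, 1]; the epsilon guards an all-static frame pair.
        m = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
        # High motion -> short interval (frequent updates); static -> long.
        intervals = max_interval - m * (max_interval - min_interval)
        return intervals.round().long()

Tokens whose interval rounds to 1 would be recomputed at every denoising step; near-static tokens would be refreshed only every max_interval steps.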

What carries the argument

MotionCache, a framework that treats inter-frame differences as a lightweight proxy for pixel-level motion and uses them to set dynamic cache update frequencies during iterative denoising.
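
Schematically, such per-token frequencies could drive cache reuse during denoising as below; model.features(..., tokens=...) and model.step(...) are hypothetical hooks standing in for whatever partial-computation interface the model exposes, not the authors' released API:

    def denoise_with_cache(model, latents, timesteps, intervals, warmup=4):
        # Warm-up: recompute every token to establish semantic coherence.
        # Afterwards a token is refreshed only when its interval elapses;
        # otherwise its cached feature is reused.
        cache = None
        for s, t in enumerate(timesteps):
            if s < warmup or cache is None:
                feats = model.features(latents, t)      # full forward pass
            else:
                refresh = (s % intervals == 0)          # per-token bool mask
                feats = cache.clone()
                feats[refresh] = model.features(latents, t, tokens=refresh)
            cache = feats
            latents = model.step(latents, feats, t)     # one denoising update
        return latents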

If this is right

  • Produces 6.28 times faster generation on SkyReels-V2 while dropping VBench score by only 1 percent.
  • Produces 1.64 times faster generation on MAGI-1 while dropping VBench score by only 0.01 percent.
  • Preserves output quality by allocating more denoising steps exactly where motion is high.
  • Requires only an initial warm-up phase before motion-weighted reuse begins.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same difference-based proxy could be tested on other autoregressive tasks such as audio waveform generation where local change signals computational need.
  • Hardware energy use for long video synthesis would drop in proportion to the observed speedups if the method generalizes.
  • Extending the warm-up length or combining it with learned motion estimators might further reduce quality variance on complex scenes.

Load-bearing premise

Inter-frame differences accurately indicate which pixels have enough motion to need extra denoising steps to prevent error accumulation.
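
One direct probe of this premise would be a rank correlation between the proxy and an oracle error signal; the sketch below assumes both have been gathered offline as per-token arrays and is not drawn from the paper's code:

    import numpy as np

    def proxy_error_rank_correlation(diff_scores, token_errors):
        # diff_scores: per-token inter-frame differences (the proxy).
        # token_errors: per-token cache error measured against a full,
        # uncached denoising run (the oracle).
        rd = np.argsort(np.argsort(diff_scores))
        re_ = np.argsort(np.argsort(token_errors))
        n = len(rd)
        # Spearman's rho, no-ties form; a value near 1 supports the premise,
        # a value near 0 undermines it.
        return 1 - 6 * np.sum((rd - re_) ** 2) / (n * (n ** 2 - 1))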

What would settle it

Run full denoising versus MotionCache on the same high-motion video clip and measure whether VBench or perceptual scores drop more than the reported 1 percent when the motion proxy is applied.
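
As a rough stand-in for the VBench comparison, one could measure how far the cached output drifts from the full run on the same clip. A minimal PSNR-gap check, assuming frame tensors scaled to [0, 1]:

    import torch

    def psnr_gap(full_frames, cached_frames):
        # Mean-squared error between the two renderings of the same clip.
        mse = torch.mean((full_frames - cached_frames) ** 2)
        # PSNR for [0, 1] inputs; a sharp drop on high-motion clips would
        # falsify the claim that quality is preserved.
        return (10 * torch.log10(1.0 / (mse + 1e-12))).item()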

read the original abstract

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $\textbf{6.28}\times$ and $\textbf{1.64}\times$ respectively, while effectively preserving generation quality (VBench: $1\%\downarrow$ and $0.01\%\downarrow$ respectively). The code is available at https://github.com/ywlq/MotionCache.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MotionCache, a motion-aware caching framework for autoregressive video generation models. It uses inter-frame pixel differences as a lightweight proxy for per-token motion to dynamically decide cache reuse during iterative denoising, after an initial warm-up phase for semantic coherence. The central empirical claims are speedups of 6.28× on SkyReels-V2 and 1.64× on MAGI-1 with negligible quality degradation (VBench drops of 1% and 0.01%). The approach is presented as an engineering improvement that exploits the insight that high-motion pixels require more denoising steps to avoid error accumulation.

Significance. If the inter-frame proxy reliably bounds residual instability, the method could enable practical long-video autoregressive generation at lower cost. The concrete speed/quality numbers on two models and the public code release strengthen the contribution. However, the absence of a derivation for the cache-error-to-instability link and the lack of ablations against stronger motion signals (optical flow, oracle error) make it difficult to assess whether the reported gains generalize beyond the tested videos or are artifacts of low-motion content.

major comments (3)
  1. [Abstract and §3] The claim that cache errors are linked to residual instability is asserted without a derivation, error bounds, or analysis of how inter-frame difference magnitude maps to the number of denoising steps needed to prevent accumulation. This link is load-bearing for the proxy's validity.
  2. [§4 (Experiments)] No ablation is reported that replaces the inter-frame-difference proxy with an oracle motion signal (e.g., optical flow or ground-truth per-token error) and re-measures both speedup and VBench. Without this, it is impossible to confirm that the 6.28× figure is not an artifact of the specific test set's motion statistics.
  3. [§4.2 (results tables)] The reported VBench deltas lack error bars, standard deviations across seeds, or multiple runs, making it hard to judge whether the 1% and 0.01% drops are statistically distinguishable from noise.
minor comments (2)
  1. [§3.2] Notation for the motion weight and update frequency is introduced without a compact equation; a single displayed equation would improve clarity.
  2. [Abstract] The GitHub link is given but the README does not specify the exact command lines or random seeds used to reproduce the SkyReels-V2 6.28× figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each point below and describe the planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] The claim that cache errors are linked to residual instability is asserted without a derivation, error bounds, or analysis of how inter-frame difference magnitude maps to the number of denoising steps needed to prevent accumulation. This link is load-bearing for the proxy's validity.

    Authors: We acknowledge that the theoretical formalization in §3 provides a motivational link rather than a rigorous derivation with explicit error bounds. The connection is based on the principle that residual errors in high-motion regions propagate more rapidly in the autoregressive setting due to the iterative nature of denoising. In the revised manuscript, we will expand this section with a more detailed heuristic analysis, including how inter-frame difference magnitudes correlate with the number of necessary denoising steps, supported by additional visualizations of error accumulation. revision: partial

  2. Referee: [§4 (Experiments)] No ablation is reported that replaces the inter-frame-difference proxy with an oracle motion signal (e.g., optical flow or ground-truth per-token error) and re-measures both speedup and VBench. Without this, it is impossible to confirm that the 6.28× figure is not an artifact of the specific test set's motion statistics.

    Authors: We agree that comparing against stronger motion signals would provide valuable insight. Computing an oracle per-token error is expensive because it requires full denoising without caching. We will include an additional ablation using optical flow as a proxy in the revised experiments section, evaluating its impact on speedup and quality metrics on the same test sets (a minimal flow-based proxy is sketched after these responses). revision: yes

  3. Referee: [§4.2 (results tables)] The reported VBench deltas lack error bars, standard deviations across seeds, or multiple runs, making it hard to judge whether the 1% and 0.01% drops are statistically distinguishable from noise.

    Authors: This is a valid point. The original experiments were run with a single seed due to the high computational cost of video generation. In the revision, we will perform multiple runs with different random seeds and report the mean VBench scores along with standard deviations to better assess the statistical significance of the observed quality changes. revision: yes
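
For reference, the promised ablation could swap the raw difference for a dense-flow magnitude pooled to token granularity. This sketch uses OpenCV's Farneback flow with generic default parameters, not settings from the paper:

    import cv2
    import numpy as np

    def flow_magnitude_proxy(prev_gray, curr_gray, patch=16):
        # Dense Farneback flow between consecutive grayscale (uint8) frames.
        # Positional args: pyr_scale, levels, winsize, iterations, poly_n,
        # poly_sigma, flags -- standard defaults, not tuned values.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)              # per-pixel speed (H, W)
        h, w = mag.shape
        mag = mag[:h - h % patch, :w - w % patch]       # crop to the patch grid
        tokens = mag.reshape(h // patch, patch,
                             w // patch, patch).mean(axis=(1, 3))
        return tokens.flatten()                         # per-token motion signal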

Circularity Check

0 steps flagged

No significant circularity; empirical engineering with external validation

full rationale

The paper presents MotionCache as an empirical caching heuristic that uses inter-frame pixel differences as a lightweight proxy for per-token denoising needs. Speedup claims (6.28× on SkyReels-V2, 1.64× on MAGI-1) are measured against external models and reported via VBench scores on held-out video sets. No equations, fitted parameters, or self-citations are shown to reduce the reported gains to quantities defined inside the paper itself. The stated theoretical link between cache error and residual instability is presented as motivation rather than a closed derivation that forces the outcome by construction. The method therefore remains an engineering improvement whose performance is independently falsifiable on new models and datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that inter-frame differences correlate with required denoising steps and that a coarse-to-fine schedule can be safely applied without quality collapse.

axioms (1)
  • domain assumption: Inter-frame differences serve as a reliable proxy for pixel-level motion that dictates denoising frequency.
    Invoked to justify motion-weighted cache reuse; no independent validation of the proxy's strength is shown in the abstract.

pith-pipeline@v0.9.0 · 5545 in / 1142 out tokens · 44015 ms · 2026-05-15T07:12:13.659844+00:00 · methodology

discussion (0)

