Motion-Aware Caching for Efficient Autoregressive Video Generation
Pith reviewed 2026-05-15 07:12 UTC · model grok-4.3
The pith
MotionCache uses inter-frame differences to dynamically skip denoising steps for static pixels in autoregressive video generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MotionCache links cache errors directly to residual instability and then applies a coarse-to-fine motion-weighted reuse scheme: an early warm-up phase secures semantic coherence, after which inter-frame differences dictate per-token denoising frequencies so that high-motion regions receive more updates and static regions receive fewer.
What carries the argument
MotionCache, a framework that treats inter-frame differences as a lightweight proxy for pixel-level motion and uses them to set dynamic cache update frequencies during iterative denoising.
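The reuse rule can be sketched in a few lines. This is a toy pure-NumPy illustration, not the paper's implementation: the function name `update_mask` and the `warmup_steps` / `max_period` parameters are assumptions made here. After warm-up, each token's update period shrinks as its inter-frame difference grows, so high-motion tokens are recomputed often and static tokens reuse the cache.

```python
import numpy as np

def update_mask(prev_frame, cur_frame, step, warmup_steps=4, max_period=8):
    """Toy sketch of motion-weighted cache reuse (illustrative names, not
    the paper's API). Tokens are pixels here for simplicity.

    Returns a boolean mask: True = recompute this token at this step,
    False = reuse its cached value.
    """
    if step < warmup_steps:
        # Warm-up phase: recompute every token to establish semantic coherence.
        return np.ones_like(cur_frame, dtype=bool)
    # Inter-frame difference as a lightweight proxy for per-token motion.
    motion = np.abs(cur_frame - prev_frame)
    m = motion / (motion.max() + 1e-8)  # normalize to [0, 1]
    # High motion -> short update period (frequent recompute);
    # static regions -> long period (aggressive cache reuse).
    period = np.clip((max_period * (1.0 - m)).astype(int), 1, max_period)
    return (step % period) == 0
```

Any monotone map from motion magnitude to update period would fit the same scheme; the linear one above is just the simplest choice.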
If this is right
- Delivers a 6.28× speedup on SkyReels-V2 while lowering the VBench score by only 1 percent.
- Delivers a 1.64× speedup on MAGI-1 while lowering the VBench score by only 0.01 percent.
- Preserves output quality by allocating more denoising steps exactly where motion is high.
- Requires only an initial warm-up phase before motion-weighted reuse begins.
Where Pith is reading between the lines
- The same difference-based proxy could be tested on other autoregressive tasks such as audio waveform generation where local change signals computational need.
- Hardware energy use for long video synthesis would drop in proportion to the observed speedups if the method generalizes.
- Extending the warm-up length or combining it with learned motion estimators might further reduce quality variance on complex scenes.
Load-bearing premise
Inter-frame differences accurately indicate which pixels have enough motion to need extra denoising steps to prevent error accumulation.
What would settle it
Run full denoising versus MotionCache on the same high-motion video clip and measure whether VBench or perceptual scores drop more than the reported 1 percent when the motion proxy is applied.
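As a stand-in for that check, a simple distortion metric can quantify the gap between the full-denoising output and the cached output on the same clip. PSNR is used here only because it fits in a few lines; VBench itself is a full benchmark suite, not a single formula.

```python
import numpy as np

def psnr(reference, candidate, peak=1.0):
    """Peak signal-to-noise ratio between a full-denoising reference frame
    and a cached-generation frame, both scaled to [0, peak]."""
    mse = np.mean((np.asarray(reference) - np.asarray(candidate)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

A per-frame PSNR (or LPIPS) curve over the clip would also show whether degradation concentrates in high-motion segments, which is exactly where the proxy is supposed to spend its extra denoising steps.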
Original abstract
Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $\textbf{6.28}\times$ and $\textbf{1.64}\times$ respectively, while effectively preserving generation quality (VBench: $1\%\downarrow$ and $0.01\%\downarrow$ respectively). The code is available at https://github.com/ywlq/MotionCache.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MotionCache, a motion-aware caching framework for autoregressive video generation models. It uses inter-frame pixel differences as a lightweight proxy for per-token motion to dynamically decide cache reuse during iterative denoising, after an initial warm-up phase for semantic coherence. The central empirical claims are speedups of 6.28× on SkyReels-V2 and 1.64× on MAGI-1 with negligible quality degradation (VBench drops of 1% and 0.01%). The approach is presented as an engineering improvement that exploits the insight that high-motion pixels require more denoising steps to avoid error accumulation.
Significance. If the inter-frame proxy reliably bounds residual instability, the method could enable practical long-video autoregressive generation at lower cost. The concrete speed/quality numbers on two models and the public code release strengthen the contribution. However, the absence of a derivation for the cache-error-to-instability link and the lack of ablations against stronger motion signals (optical flow, oracle error) make it difficult to assess whether the reported gains generalize beyond the tested videos or are artifacts of low-motion content.
major comments (3)
- [Abstract and §3] Abstract and §3 (theoretical formalization): the claim that cache errors are linked to residual instability is asserted without a derivation, error bounds, or analysis of how inter-frame difference magnitude maps to the exact number of denoising steps needed to prevent accumulation. This link is load-bearing for the proxy's validity.
- [§4 (Experiments)] §4 (Experiments): no ablation is reported that replaces the inter-frame-difference proxy with an oracle motion signal (e.g., optical flow or ground-truth per-token error) and re-measures both speedup and VBench. Without this, it is impossible to confirm that the 6.28× figure is not an artifact of the specific test set's motion statistics.
- [§4.2 (results tables)] §4.2 (results tables): the reported VBench deltas lack error bars, standard deviations across seeds, or multiple runs, making it hard to judge whether the 1% and 0.01% drops are statistically distinguishable from noise.
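The requested ablation amounts to swapping the motion proxy while holding the cache machinery fixed. A hypothetical harness is sketched below (all names are illustrative; `quality_fn` stands in for VBench, and an optical-flow proxy would slot in as just another entry in `proxies`):

```python
import numpy as np

def diff_proxy(prev, cur, thresh=0.1):
    """Inter-frame-difference proxy: mark tokens whose change exceeds
    thresh for recomputation; everything else reuses the cache."""
    return np.abs(cur - prev) > thresh

def run_ablation(frames, proxies, quality_fn):
    """Swap only the motion proxy; record an idealized speedup (cost taken
    as proportional to the fraction of recomputed tokens) and a quality
    score for each proxy, so the proxies are directly comparable."""
    results = {}
    for name, proxy in proxies.items():
        masks = [proxy(frames[i - 1], frames[i]) for i in range(1, len(frames))]
        recompute_frac = float(np.mean([m.mean() for m in masks]))
        results[name] = {
            "speedup": 1.0 / max(recompute_frac, 1e-8),
            "quality": quality_fn(masks),
        }
    return results
```

If the difference proxy and an optical-flow proxy land at similar (speedup, quality) points across test sets with varied motion statistics, the 6.28× figure is less likely to be an artifact of low-motion content.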
minor comments (2)
- [§3.2] Notation for the motion weight and update frequency is introduced without a compact equation; a single displayed equation would improve clarity.
- [Abstract] The GitHub link is given but the README does not specify the exact command lines or random seeds used to reproduce the SkyReels-V2 6.28× figure.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. We address each point below and describe the planned revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract and §3] Abstract and §3 (theoretical formalization): the claim that cache errors are linked to residual instability is asserted without a derivation, error bounds, or analysis of how inter-frame difference magnitude maps to the exact number of denoising steps needed to prevent accumulation. This link is load-bearing for the proxy's validity.
Authors: We acknowledge that the theoretical formalization in §3 provides a motivational link rather than a rigorous derivation with explicit error bounds. The connection is based on the principle that residual errors in high-motion regions propagate more rapidly in the autoregressive setting due to the iterative nature of denoising. In the revised manuscript, we will expand this section with a more detailed heuristic analysis, including how inter-frame difference magnitudes correlate with the number of necessary denoising steps, supported by additional visualizations of error accumulation. revision: partial
- Referee: [§4 (Experiments)] §4 (Experiments): no ablation is reported that replaces the inter-frame-difference proxy with an oracle motion signal (e.g., optical flow or ground-truth per-token error) and re-measures both speedup and VBench. Without this, it is impossible to confirm that the 6.28× figure is not an artifact of the specific test set's motion statistics.
Authors: We agree that comparing against stronger motion signals would provide valuable insight. Computing an oracle per-token error is computationally expensive as it requires full denoising without caching. We will include an additional ablation study using optical flow as a proxy in the revised experiments section, evaluating its impact on speedup and quality metrics on the same test sets. revision: yes
- Referee: [§4.2 (results tables)] §4.2 (results tables): the reported VBench deltas lack error bars, standard deviations across seeds, or multiple runs, making it hard to judge whether the 1% and 0.01% drops are statistically distinguishable from noise.
Authors: This is a valid point. The original experiments were run with a single seed due to the high computational cost of video generation. In the revision, we will perform multiple runs with different random seeds and report the mean VBench scores along with standard deviations to better assess the statistical significance of the observed quality changes. revision: yes
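The promised aggregation is straightforward; the one subtlety is reporting a sample standard deviation so the 1% and 0.01% deltas can be judged against seed-to-seed noise. A minimal sketch (the score values in the usage note are invented for illustration):

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation of per-seed benchmark scores.
    A single run has no spread, so its std is reported as 0.0."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std
```

A reported quality delta is then distinguishable from noise only when it clearly exceeds the std (or, better, a confidence interval computed over the seeds).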
Circularity Check
No significant circularity; empirical engineering with external validation
full rationale
The paper presents MotionCache as an empirical caching heuristic that uses inter-frame pixel differences as a lightweight proxy for per-token denoising needs. Speedup claims (6.28× on SkyReels-V2, 1.64× on MAGI-1) are measured against external models and reported via VBench scores on held-out video sets. No equations, fitted parameters, or self-citations are shown to reduce the reported gains to quantities defined inside the paper itself. The stated theoretical link between cache error and residual instability is presented as motivation rather than a closed derivation that forces the outcome by construction. The method therefore remains an engineering improvement whose performance is independently falsifiable on new models and datasets.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: inter-frame differences serve as a reliable proxy for pixel-level motion that dictates denoising frequency.