{"total":14,"items":[{"citing_arxiv_id":"2605.23381","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation","primary_cat":"cs.CV","submitted_at":"2026-05-22T08:50:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22015","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration","primary_cat":"cs.CV","submitted_at":"2026-05-21T05:23:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ORBIS uses output-guided token reduction and DATM to achieve 2x higher token reduction than AsymRnR, with up to 4.5x speedup and 79.3% energy savings versus A100 GPU for video DiT models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01725","ref_index":46,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Motion-Aware Caching for Efficient Autoregressive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-03T05:49:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and 
MAGI-1.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586-595, 2018. [45] Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, et al. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation.arXiv preprint arXiv:2406.02540, 2024. [46] Xuanlei Zhao, Xiaolong Jin, Kai Wang, and Yang You. Real-time video generation with pyramid attention broadcast. arXiv preprint arXiv:2408.12588, 2024. [47] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412."},{"citing_arxiv_id":"2604.24447","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment","primary_cat":"cs.RO","submitted_at":"2026-04-27T13:12:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20470","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-04-22T11:56:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DynamicRad 
achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with a semantic motion router.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18348","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-20T14:43:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22046-22055, 2025. 1 [60] Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video genera- tion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2063-2073, 2025. 2 [61] Xuanlei Zhao, Xiaolong Jin, Kai Wang, and Yang You. Real-time video generation with pyramid attention broad- cast.arXiv preprint arXiv:2408.12588, 2024. 2 [62] Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and Jifeng Dai. An empirical study of spatial attention mecha- nisms in deep networks. 
In Proceedings of the IEEE/CVF international conference on computer vision, pages 6688-"},{"citing_arxiv_id":"2604.15911","ref_index":195,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Video Diffusion Models: Advancements and Challenges","primary_cat":"cs.CV","submitted_at":"2026-04-17T10:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"whereas AVDM2 [194] and MagicDistillation [107] strengthen the DMD line. The trend is to retain DMD-level compression while reducing optimization brittleness. Specific Applications of Non-Streaming Distribution Distillation. Beyond generic video synthesis, non-streaming distribution distillation has been adapted to restoration and enhancement tasks such as FlashVSR [195] and GFix [123], human-centric animation with DiffusionTalker [13], and controllable manipulation with EquiVDM [74]. This diversity suggests that once distribution matching is stabilized, it transfers naturally to deployment settings that demand very low-step generation. 
3.2.2 Streaming Distribution Distillation.Streaming video generation, characterized by the sequential synthesis"},{"citing_arxiv_id":"2604.16492","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LayerCache: Exploiting Layer-wise Velocity Heterogeneity for Efficient Flow Matching Inference","primary_cat":"cs.CV","submitted_at":"2026-04-13T15:44:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LayerCache enables per-layer-group caching in flow matching models via adaptive JVP span selection and greedy 3D scheduling, delivering 1.37x speedup with PSNR 37.46 dB, SSIM 0.9834, and LPIPS 0.0178 on Qwen-Image.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02979","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation","primary_cat":"cs.CV","submitted_at":"2026-04-03T11:34:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"max( ¯𝑒𝑡−1 ,1+max{𝑗:𝑚 𝑡,𝑗 =1}),otherwise, (8) where ⊮[·] denotes the indicator function, which equals 1 when the enclosed condition is true and 0 otherwise. 𝑚𝑡,𝑗 is the scheduler- derived update mask indicating whether latent slot 𝑗 both advances at iteration 𝑡 and has not yet reached the terminal state. 
The result- ing scheduler-facing valid interval is: 𝑉𝑡 =[max( ¯𝑒𝑡 −𝐵,0), ¯𝑒𝑡 ).(9) Relative to the baseline window length 𝐵𝑡 =min(𝐵, 𝐹) without selective computation, the per-step cost scales primarily with |𝑉𝑡 |, while the overhead of extracting the interval remains lower order. Because ¯𝑒𝑡 is monotone, the compute interval remains a safe suffix of the scheduler-valid interval throughout the rollout. Selective computation therefore leaves the AR schedule itself unchanged, as it replaces the default full-length interval with a mask-derived compute interval. This mechanism is orthogonal to the predictive extension of cache reuse, so a step can be processed on a reduced interval. The detailed cost relation is provided in the supplementary material. 3.5 Error Analysis and Stability Controls Error propagation analysis.To understand how prediction er- rors affect the output, we analyze how errors propagate through the denoising trajectory. When SCOPE usesPredictmode, the predicted velocity ˆ𝑣𝑘 differs from the true velocity𝑣𝑘, and this error compounds over multiple steps. SCOPE MM '26, October 26-30, 2026, Melbourne, Australia Table 1: Main comparison on SkyReels-V2 DF-1.3B and MAGI-1 4.5B-distill, with all results measured on a single NVIDIA A800 80GB GPU. LPIPS, SSIM, and PSNR are computed against the Original output as reference. Best and second-best accelerated results are shown in bold and underlined, respectively. Original is excluded from ranking. 
Model Method Efficiency Visual Quality Time (s)↓Speedup↑FLOPs (P)↓Speedup↑VBench↑LPIPS↓SSIM↑PSNR↑ SkyReels-V2 DF-1.3B Original 1452.41 1.00×245.01 1.00×81.51% 0.0000 1.0000∞ Δ-DiT 1223.01 1.19×201.60 1.22×74.57"},{"citing_arxiv_id":"2602.05449","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching","primary_cat":"cs.CV","submitted_at":"2026-02-05T08:45:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24527","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training Agents Inside of Scalable World Models","primary_cat":"cs.AI","submitted_at":"2025-09-29T09:42:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.03603","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","primary_cat":"cs.CV","submitted_at":"2024-12-03T23:52:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HunyuanVideo presents a 13B-parameter open-source video generative 
model with integrated data, architecture, training, and inference systems whose professional evaluations show it outperforming prior SOTA models including Runway Gen-3 and Luma 1.6.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.13720","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Movie Gen: A Cast of Media Foundation Models","primary_cat":"cs.CV","submitted_at":"2024-10-17T16:22:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.14430","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference","primary_cat":"cs.CV","submitted_at":"2024-05-23T11:00:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PipeFusion applies patch partitioning and pipeline parallelism with one-step stale feature reuse to reduce communication overhead in DiT inference, reporting SOTA results on 8x L40 GPUs for Pixart, SD3, and Flux.1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}