{"total":23,"items":[{"citing_arxiv_id":"2606.00793","ref_index":91,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MBench: A Comprehensive Benchmark on Memory Capability for Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-30T16:17:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30431","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution","primary_cat":"cs.CV","submitted_at":"2026-05-28T18:00:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30351","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VideoMLA applies multi-head latent attention with 3D-RoPE decoupling to autoregressive video diffusion, delivering 92.7% KV memory reduction while matching short-horizon baselines and leading long-horizon VBench 
scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30349","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaState: Self-Evolving Anchors for Streaming Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AdaState replaces the static first-frame KV anchor with an evolving hidden latent that the model denoises alongside content, treating time as relative to enable recurrence and richer dynamics in streaming video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23458","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"One-Forcing: Towards Stable One-Step Autoregressive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-22T10:16:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"One-Forcing augments DMD with a GAN loss to enable stable one-step causal autoregressive video generation, reporting a VBench score of 83.76 as SOTA among one-step methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21072","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Q-ARVD: Quantizing Autoregressive Video Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T11:58:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20910","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching","primary_cat":"cs.CV","submitted_at":"2026-05-20T08:55:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlowLong generates videos several times longer than native model windows by blending adjacent predictions with Tweedie matching to enforce manifold and temporal consistency while using stochastic noise injection early and deterministic sampling later.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19957","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks","primary_cat":"cs.CV","submitted_at":"2026-05-19T15:10:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18233","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos","primary_cat":"cs.CV","submitted_at":"2026-05-18T11:28:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA 
temporal consistency in infinite-frame video generation on VBench and NarrLV.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16003","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-15T14:33:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15824","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization","primary_cat":"cs.CV","submitted_at":"2026-05-15T10:25:06+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15199","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:59:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of 
+2.33.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14487","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity","primary_cat":"cs.CV","submitted_at":"2026-05-14T07:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13111","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-13T07:23:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12496","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and 
distillation to few-step inference.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"↑Inter-Shot Cons.↑ SCA↑ Subject Background HoloCine Bidirectional 0.5842 0.2050 0.9728 0.9711 0.6821 0.9694 MultiShotMaster Bidirectional 0.5811 0.2046 0.9626 0.9671 0.6530 0.9678 Ours Causal, 4step 0.6194 0.2004 0.9823 0.9752 0.6608 0.9883 Comparisons. We first compare with autoregressive long-video generation methods, including Self-Forcing [17], Infinity-RoPE [52], LongLive [51], MemFlow [21], and ShotStream [30]. These methods extend generation through causal rollout, KV caching, or long-context positional extrapolation, but most of them are primarily designed for short-context continuation. As shown in Tab. 1 and Fig. 3, they often produce locally smooth videos that remain semantically static, repeating similar"},{"citing_arxiv_id":"2605.06051","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control","primary_cat":"cs.CV","submitted_at":"2026-05-07T11:36:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15911","ref_index":161,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Video Diffusion Models: Advancements and Challenges","primary_cat":"cs.CV","submitted_at":"2026-04-17T10:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that groups efficient video diffusion methods into four 
paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Overview of the Self-Forcing algorithm. This framework serves as the foundation for various real-time video generation methods. At its core, the algorithm leverages a causal model for video synthesis and subsequently employs standard DMD to facilitate few-step generation, thereby enabling the real-time, streaming output of video chunks. Forcing [89], LongLive [151], and VideoSSM [161]. Thus, full-history attention is no longer viable, but windowing alone still cannot guarantee long-range recall. Causal Rollout Optimization. This direction focuses on improving the autoregressive rollout itself so that exposure drift accumulates more slowly over time. Rolling Forcing [79] introduces a rolling-window joint denoising scheme with progressively increasing noise levels within the window, thereby adding a form of local"},{"citing_arxiv_id":"2604.10103","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation","primary_cat":"cs.CV","submitted_at":"2026-04-11T08:54:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"cal attention mechanism effectively eliminates redundant short-range correlations, while the linear temporal pathway ensures the preservation of globally aggregated temporal history. 
This dual approach reallocates attention capacity towards the most informative dependencies over time. 8 R. Li et al. 3.3 Relative RoPE Previous work clamps the temporal RoPE index at inference time [51]. While this strategy can alleviate extrapolation errors in simple cases and provide unbounded generation capacity, it cannot provide stable long-duration streaming. To better address these issues, we propose to incorporate relative RoPE directly into the training process. Specifically, we impose a cap on the maximum temporal RoPE index during both training and inference, and represent temporal"},{"citing_arxiv_id":"2604.06939","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis","primary_cat":"cs.CV","submitted_at":"2026-04-08T11:03:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressive video synthesis.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Another critical challenge involves breaking the fixed temporal horizon of 3D-RoPE, which restricts AR models to a pre-trained maximum frame count. Self-Forcing++ [6] extended training horizons via long rollouts but incurred prohibitive computational costs while remaining constrained by absolute ROPE time location indices. Conversely, Infinity-RoPE [27] introduced a training-free Block-Relativistic RoPE, reformulating temporal encoding as a moving reference frame to enable infinite-horizon generation. 
However, as a purely inference-time adaptation, it inherits the base model's semantic drift due to the lack of training-stage semantic anchoring. Interactive and Controllable Long Video Generation."},{"citing_arxiv_id":"2604.03118","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-03T15:43:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"distribution: Recent one-step video methods [20,21] operationalize distribution matching via adversarial post-training, while POSE [5] introduces a phased equilibrium procedure to stabilize adversarial one-step distillation for video models. Autoregressive Video Generation enables streaming and interactivity via causal factorization, and can be grouped into training-based [4,13,23,26,34,39,43] and training-free [40,45] approaches. 
Training-based methods close the train-test gap of long rollouts beyond standard teacher forcing: Self Forcing [13] directly trains on autoregressive self-rollouts so each step conditions on previously generated context, reducing exposure bias; Causal Forcing [47] further pinpoints the bidirectional-to-causal distillation gap and uses an autoregressive teacher for"},{"citing_arxiv_id":"2603.28489","ref_index":137,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms","primary_cat":"eess.IV","submitted_at":"2026-03-30T14:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"frequency-only scaling, this maintains better imaging quality at larger extrapolation limits. 3) From Long to Infinite: To enable effectively infinite simulation, Infinity-RoPE [136] proposes Block-Relativistic RoPE, rotating new latent blocks relative to a moving local reference frame. This shifts from \"extending a window\" to a \"sliding world\" paradigm. Related works like FreeNoise [137] and Align your Latents [10] explore complementary tuning-free noise and attention rescheduling strategies. E. Discussion Despite significant advances, existing efficient architectures face fundamental trade-offs between computational cost and spatiotemporal/causal integrity. 
Specifically, hierarchical compression often sacrifices long-term semantic consistency for"},{"citing_arxiv_id":"2602.07775","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-02-08T02:16:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"adopt bidirectional attentions [71] and denoise all frames simultaneously. Therefore, though impressive, the generated videos are generally limited to short clips. In contrast, AR models [1,10,75,87-89] can in principle, infinitely predict next-state conditioned on prior ones. To marry the best of both paradigms, a rapidly growing number of AR video diffusion models [11,12,16,18,25-27,37,38,42,48,55,62,63,66,72,74,77,84,93-95,97,98,101,102,106,107,109,111,114] have emerged. Earlier methods, e.g., NOVA [17], SkyReels-V2 [13], and MAGI-1 [86] still rely on inefficient multi-step denoising in each AR generation step. Recently, Pyramid Flow [45] and CausVid [103-105] adopt few-step generation, making AR video generation temporally efficient. 
However, as the cached history grows longer, the demand of computational resources grows dramatically, which significantly con-"},{"citing_arxiv_id":"2512.04678","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation","primary_cat":"cs.CV","submitted_at":"2025-12-04T11:12:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}