Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Dongman Lee; Qing Yin; Tianhao Chen; Xiangbo Gao; Xinghao Chen; Yuheng Wu; Zhengzhong Tu

arXiv: 2605.14382 · v3 · pith:WAPKHKF7new · submitted 2026-05-14 · 💻 cs.CV · cs.GR · cs.MM

Pith reviewed 2026-05-21 09:08 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.MM

keywords autoregressive video generation · trust region · temporal consistency · interactive video · teacher distillation · latent delta · conditional bias · video modeling

The pith

Delta Forcing constrains unreliable teacher guidance within an adaptive trust region estimated from latent trajectory deltas to reduce drift while keeping reactivity in autoregressive video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the tension between quick adaptation to new events and long-term visual stability in real-time autoregressive video models. It traces persistent drift to conditional bias, where the teacher supplies locally aligned but trajectory-agnostic signals that push generation into inconsistent modes. Delta Forcing, drawing on trust-region ideas, measures consistency via the latent difference between teacher and generator paths and uses that to limit how far the teacher can steer the output. A reader would care because successful control of this bias would let models sustain coherent video over extended horizons in interactive settings such as content creation and world simulation.

Core claim

Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories and places unreliable teacher supervision inside an adaptive trust region, balancing that supervision against a monotonic continuity objective so that teacher-induced shifts are suppressed while responsiveness to new events is retained.

What carries the argument

Delta Forcing, the mechanism that computes an adaptive trust region from latent deltas between teacher and generator trajectories to modulate teacher supervision against a continuity objective.
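
The material reviewed here does not give the paper's equations, so the following is only a minimal sketch of one possible reading of the mechanism. It assumes the "latent delta" is the frame-to-frame change of a latent vector, that teacher and generator deltas are compared with a squared L2 gap, and that the gap is mapped to a weight with an exponential kernel. The function names, the kernel, and the `temperature` parameter are all hypothetical and are not taken from the paper.

```python
import math

def latent_delta(prev, curr):
    """Frame-to-frame change of a latent vector (curr - prev)."""
    return [c - p for p, c in zip(prev, curr)]

def trust_weight(teacher_delta, generator_delta, temperature=1.0):
    """Map teacher/generator delta disagreement to a weight in (0, 1].

    Assumption: exponential kernel on the squared L2 gap. The paper's
    actual consistency estimator is not stated in the reviewed material.
    """
    gap = sum((t - g) ** 2 for t, g in zip(teacher_delta, generator_delta))
    return math.exp(-gap / temperature)

def blended_loss(teacher_loss, continuity_loss, weight):
    """Lean on teacher supervision when trusted, on continuity otherwise."""
    return weight * teacher_loss + (1.0 - weight) * continuity_loss

# Toy example: the generator moves slightly; one teacher step agrees with
# that motion, the other jumps to a different mode.
prev = [0.0, 0.0]
gen_step = latent_delta(prev, [0.1, 0.0])
w_aligned = trust_weight(latent_delta(prev, [0.1, 0.05]), gen_step)
w_shifted = trust_weight(latent_delta(prev, [2.0, -1.5]), gen_step)
print(w_aligned, w_shifted)
```

In this toy setup the aligned teacher step keeps a weight close to 1, while the large teacher-induced shift is weighted close to 0, so the blended loss falls back on the continuity term. A soft weight is used rather than a hard threshold because the source describes the trust region as adaptive.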

If this is right

Autoregressive generators distilled from bidirectional teachers exhibit less persistent drift after streaming long tuning.
Interactive video outputs maintain temporal coherence across extended sequences even when input conditions evolve.
The balance between teacher supervision and continuity objective reduces mode collapse toward locally valid but globally inconsistent trajectories.
Event reactivity is preserved because the trust region adapts rather than applying a fixed restriction on teacher influence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-delta trust region idea could transfer to autoregressive generation in other modalities where teacher models create similar consistency-reactivity trade-offs.
Explicitly tracking trajectory deltas might offer a general diagnostic for when distillation introduces bias in sequential models.
Scaling the trust region size with sequence length or event complexity could be a direct next step for longer-horizon applications.

Load-bearing premise

The latent delta between teacher and generator trajectories supplies a trustworthy signal of transition consistency that can limit harmful teacher shifts without impairing the model's response to fresh events.

What would settle it

A controlled test in which videos generated with Delta Forcing display measurably higher long-horizon consistency scores after abrupt condition changes than baseline methods, while reaction speed to new inputs remains comparable.
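
A test of this kind needs two numbers per video: a consistency score after the condition change and a reaction speed. The sketch below shows one way such numbers could be computed from per-frame feature vectors. It is an illustration under assumptions, not the paper's evaluation protocol: the cosine-similarity consistency score, the `threshold` value, and the idea of a `target` embedding for the new condition are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def post_switch_consistency(frames, switch_idx):
    """Mean cosine similarity of consecutive frame features after the switch."""
    sims = [cosine(frames[i], frames[i + 1])
            for i in range(switch_idx, len(frames) - 1)]
    return sum(sims) / len(sims) if sims else float("nan")

def reaction_latency(frames, switch_idx, target, threshold=0.8):
    """Frames elapsed after the switch until a frame aligns with the new
    condition embedding; None if alignment is never reached."""
    for i in range(switch_idx, len(frames)):
        if cosine(frames[i], target) >= threshold:
            return i - switch_idx
    return None

# Toy trajectory: the condition changes at frame 2 and the output rotates
# from the old direction [1, 0] toward the new one [0, 1].
frames = [[1.0, 0.0], [1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.0, 1.0]]
consistency = post_switch_consistency(frames, switch_idx=2)
latency = reaction_latency(frames, switch_idx=2, target=[0.0, 1.0])
print(consistency, latency)
```

Comparing a method against baselines would then mean checking that `consistency` is higher over a long horizon while `latency` stays about the same.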

Figures

Figures reproduced from arXiv: 2605.14382 by Dongman Lee, Qing Yin, Tianhao Chen, Xiangbo Gao, Xinghao Chen, Yuheng Wu, Zhengzhong Tu.

**Figure 1.** Left: Under evolving events, the frozen teacher, biased toward certain patterns, remains condition-aware but trajectory-agnostic, inducing conditional bias that deviates from the historical trajectory. Right: Decoding both the real teacher model (i.e., Wan2.1-14B-T2V [1]) and generator (MemFlow [16]) shows that the generator’s drift closely follows these teacher-induced shifts.

**Figure 2.** (a) Standard DMD fails to handle condition changes. (b) Streaming Long Tuning improves interactivity but still suffers from biased guidance. (c) Our method enforces transition consistency to mitigate conditional bias and preserve temporal coherence. A complementary line of work extends AR video generation to interactive settings, where conditions evolve dynamically and the model must adapt to each new…

**Figure 3.** Qualitative results. Each 10s segment corresponds to one event and the full event prompts…

**Figure 4.** Ablation study. Without adaptive trust regions (Design 2): we then remove the adaptive trust-region weight w_k from the original DMD loss, so that teacher supervision is no longer selectively suppressed according to its reliability. As shown in…

**Figure 5.** Latent trajectory visualization via PCA under multi-event prompt switching. We project frame-wise denoised latent features (before VAE decoding) into a two-dimensional PCA space and connect them in temporal order. Different colors denote different interaction segments. Left exhibits short and narrow transitions across prompt switches, indicating insufficient semantic displacement despite changed conditions…
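
The projection described in this caption can be sketched with a small, dependency-free PCA. This is an illustration only, assuming each frame's latent has already been flattened to a plain list of floats; the helper names are hypothetical, and a real pipeline would more likely use a numerical library. The principal axes are found by power iteration, with each new axis kept orthogonal to the ones already found.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _orthonormalize(v, basis):
    """Remove components along `basis`, then scale to unit length.
    Returns None if nothing is left (degenerate direction)."""
    for b in basis:
        c = _dot(v, b)
        v = [x - c * y for x, y in zip(v, b)]
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v] if norm > 1e-12 else None

def _principal_axis(cov, basis, iters=200):
    """Power iteration for the dominant eigenvector orthogonal to `basis`.
    Limitation: assumes the fixed start vector is not parallel to `basis`."""
    v = _orthonormalize([1.0 / (i + 1) for i in range(len(cov))], basis)
    for _ in range(iters):
        w = _orthonormalize([_dot(row, v) for row in cov], basis)
        if w is None:  # no variance left in the remaining directions
            break
        v = w
    return v

def pca_2d(latents):
    """Project per-frame latent vectors onto their top two principal axes."""
    n = len(latents)
    mean = [sum(col) / n for col in zip(*latents)]
    centered = [[x - m for x, m in zip(row, mean)] for row in latents]
    d = len(mean)
    cov = [[sum(r[i] * r[j] for r in centered) / max(n - 1, 1)
            for j in range(d)] for i in range(d)]
    axes = []
    for _ in range(2):
        axes.append(_principal_axis(cov, axes))
    return [[_dot(row, axis) for axis in axes] for row in centered]

# Toy trajectory: five frames moving along one straight line in 3-D.
points = pca_2d([[float(t), float(t), 0.0] for t in range(5)])
for xy in points:
    print(xy)
```

Connecting the returned 2-D points in frame order, and coloring them by interaction segment, gives the kind of trajectory plot the caption describes.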

**Figure 6.** Extended latent trajectory comparison. Each row shows one example under the same multi-event prompt schedule, comparing three baselines (columns 1–3) against Delta Forcing (column 4). Red arrows highlight segments where Delta Forcing exhibits compact within-interaction clusters connected by smooth cross-interaction transitions, consistent with the desirable properties established in Section A.1.

**Figure 7.** User study interface.

D Social Impact. Delta Forcing aims to improve interactive real-time video generation by enhancing long-horizon stability and responsiveness under dynamically changing event conditions. This capability can benefit creative workflows in areas such as short-form content creation, filmmaking, game development, virtual environments, and world-model-based simulation, where users require con…

Original abstract

Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.
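
For readers unfamiliar with the method the abstract cites as inspiration, the constrained update of Trust Region Policy Optimization [18] is commonly written as below. This is the standard reinforcement-learning formulation, shown for context only; how Delta Forcing maps its latent deltas onto the constraint is not specified in the material reviewed here.

```latex
\max_{\theta}\;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\left[
  \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,
  A^{\pi_{\theta_{\text{old}}}}(s, a)
\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}
\left[
  D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)
  \right)
\right] \le \delta
```

Here $\pi_{\theta_{\text{old}}}$ is the current policy, $A$ is its advantage function, and $\delta$ is the trust-region radius that bounds how far a single update may move the policy.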

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Delta Forcing gives a practical way to limit teacher drift in autoregressive video models using latent deltas, but the supporting details are still thin.

The main thing to know is that this paper proposes Delta Forcing as a way to keep autoregressive video generators from drifting after new events by using the latent difference between teacher and generator trajectories to set an adaptive limit on teacher influence. It pairs that with a monotonic continuity term so the model stays responsive without going off track over long sequences. The construction is presented as a direct application of a trust-region idea to the distillation-plus-streaming-tuning pipeline that most current interactive video work relies on. That specific delta-based steering rule does not appear in the earlier papers they cite, so the mechanism itself counts as the novel piece. The motivation section does a clear job of naming the conditional bias problem and why standard teacher supervision can produce locally valid but globally inconsistent frames. If the experiments hold up under scrutiny, the approach is simple enough that groups already running similar models could test it quickly.

The main soft spot is that the abstract and available description stay high-level on the exact loss, the threshold logic, and the controls. There are no visible ablations on how sensitive results are to the delta estimation method or on whether the continuity objective ever over-constrains event response. Without those checks it is hard to know how much of the reported consistency gain comes from the new term versus other tuning choices. The stress-test note did not turn up an internal contradiction, which is reassuring, but the lack of equations and dataset specifics still leaves the central claim under-supported for now.

This work is aimed at researchers who tune or distill autoregressive video models for real-time or interactive settings. Anyone already working on stability fixes in that subfield would get immediate value from seeing the trust-region framing and could decide whether to try the delta estimate themselves.
The paper is coherent enough on its own terms and makes a concrete proposal with claimed gains, so it deserves a serious referee to examine the implementation and controls rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The paper claims that existing distillation and streaming long tuning methods for autoregressive video generators suffer from persistent drift after condition changes due to conditional bias in teacher supervision. It proposes Delta Forcing, which adapts the trust-region concept from TRPO to estimate transition consistency via the latent delta between teacher and generator trajectories. This delta is used to adaptively constrain unreliable teacher guidance within a trust region while adding a monotonic continuity objective, thereby suppressing teacher-induced shifts without harming reactivity to new events. The authors report that extensive experiments show significant gains in consistency while preserving event responsiveness.

Significance. If the central mechanism holds, the work offers a lightweight, interpretable steering method for long-horizon autoregressive video models in interactive settings. By directly importing a trust-region constraint from reinforcement learning and grounding it in observable latent deltas, the approach could provide a practical alternative to heavier fine-tuning regimes and help stabilize generation without sacrificing responsiveness.

minor comments (3)

The abstract and method description would benefit from an explicit equation or pseudocode block showing how the latent delta is computed, how the trust-region threshold is set, and how the continuity objective is formulated and combined with the teacher loss.
Experimental section should include ablation studies isolating the contribution of the delta-based trust region versus the continuity objective, together with quantitative metrics and error bars for both consistency and reactivity on the reported datasets.
Clarify whether the method introduces any additional hyperparameters beyond the trust-region radius and, if so, how they are chosen or shown to be robust.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work and the recommendation for minor revision. The referee accurately captures the core problem of conditional bias in teacher supervision for autoregressive video generators and the trust-region-inspired mechanism of Delta Forcing. Since the report lists no specific major comments, we have no individual points to address.

Circularity Check

0 steps flagged

No significant circularity; proposal is externally inspired

The paper introduces Delta Forcing as a framework inspired by Trust Region Policy Optimization (TRPO) to constrain teacher supervision using latent deltas between trajectories. The central construction estimates transition consistency from these deltas and balances it against a continuity objective, but this is framed as a new adaptive mechanism rather than a quantity derived from or equivalent to quantities already defined inside the paper. No equations reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. The derivation remains self-contained against external benchmarks like TRPO, with the reader's assessment of minor (score 2) circularity risk aligning with the absence of any quoted reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate concrete free parameters, axioms, or invented entities; the method appears to introduce a new balancing objective but its exact parameterization is not stated.

pith-pipeline@v0.9.0 · 5728 in / 1049 out tokens · 46237 ms · 2026-05-21T09:08:47.659811+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

“Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective.”

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

“Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a reliability-aware framework that introduces a delta-based mechanism to modulate supervision online.”

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 15 internal anchors

[1] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025.
[2] B. Wu, C. Zou, C. Li, D. Huang, F. Yang, H. Tan, J. Peng, J. Wu, J. Xiong, J. Jiang et al., “Hunyuanvideo 1.5 technical report,” arXiv preprint arXiv:2511.18870, 2025.
[3] Y. HaCohen, N. Chiprut, B. Brazowski, D. Shalem, D. Moshe, E. Richardson, E. Levin, G. Shiran, N. Zabari, O. Gordon et al., “Ltx-video: Realtime video latent diffusion,” arXiv preprint arXiv:2501.00103, 2024.
[4] T. Seedance, D. Chen, L. Chen, X. Chen, Y. Chen, Z. Chen, Z. Chen, F. Cheng, T. Cheng, Y. Cheng et al., “Seedance 2.0: Advancing video generation for world complexity,” arXiv preprint arXiv:2604.14148, 2026.
[5] K. Team, J. Chen, Y. Ci, X. Du, Z. Feng, K. Gai, S. Guo, F. Han, J. He, K. He et al., “Kling-omni technical report,” arXiv preprint arXiv:2512.16776, 2025.
[6] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman et al., “Video generation models as world simulators,” OpenAI Blog, vol. 1, no. 8, p. 1, 2024.
[7] B. Lin, Y. Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y. Ye, S. Yuan, L. Chen et al., “Open-sora plan: Open-source large video generation model,” arXiv preprint arXiv:2412.00131, 2024.
[8] F. Bao, C. Xiang, G. Yue, G. He, H. Zhu, K. Zheng, M. Zhao, S. Liu, Y. Wang, and J. Zhu, “Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models,” arXiv preprint arXiv:2405.04233, 2024.
[9] B. Chen, D. M. Monso, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann, “Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion,” Dec. 2024.
[10] X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman, “Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion,” Nov. 2025.
[11] T. Yin, Q. Zhang, R. Zhang, W. T. Freeman, F. Durand, E. Shechtman, and X. Huang, “From Slow Bidirectional to Fast Autoregressive Video Diffusion Models,” Sep. 2025.
[12] H. Zhu, M. Zhao, G. He, H. Su, C. Li, and J. Zhu, “Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation,” Feb. 2026.
[13] T. Yin, M. Gharbi, R. Zhang, E. Shechtman, F. Durand, W. T. Freeman, and T. Park, “One-step Diffusion with Distribution Matching Distillation,” Oct. 2024.
[14] T. Yin, M. Gharbi, T. Park, R. Zhang, E. Shechtman, F. Durand, and W. T. Freeman, “Improved distribution matching distillation for fast image synthesis,” Advances in neural information processing systems, vol. 37, pp. 47455–47487, 2024.
[15] S. Yang, W. Huang, R. Chu, Y. Xiao, Y. Zhao, X. Wang, M. Li, E. Xie, Y. Chen, Y. Lu, S. Han, and Y. Chen, “LongLive: Real-time Interactive Long Video Generation,” Oct. 2025.
[16] S. Ji, X. Chen, S. Yang, X. Tao, P. Wan, and H. Zhao, “MemFlow: Flowing adaptive memory for consistent and efficient long video narratives,” Dec. 2025.
[17] J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y. Ban, and C.-J. Hsieh, “Self-forcing++: Towards minute-scale high-quality video generation,” arXiv preprint arXiv:2510.02283, 2025.
[18] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning. PMLR, 2015, pp. 1889–1897.
[19] J. Huang, Z. Ye, X. Hu, T. He, G. Zhang, S. Shi, J. Bian, and L. Jiang, “Live: Long-horizon interactive video world modeling,” arXiv preprint arXiv:2602.03747, 2026.
[20] S. Chen, C. Wei, S. Sun, P. Nie, K. Zhou, G. Zhang, M.-H. Yang, and W. Chen, “Context forcing: Consistent autoregressive video generation with long context,” arXiv preprint arXiv:2602.06028, 2026.
[21] K. Liu, W. Hu, J. Xu, Y. Shan, and S. Lu, “Rolling forcing: Autoregressive long video diffusion in real time,” Sep. 2025.
[22] Y. Lu, Y. Zeng, H. Li, H. Ouyang, Q. Wang, K. L. Cheng, J. Zhu, H. Cao, Z. Zhang, X. Zhu, Y. Shen, and M. Zhang, “Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation,” Dec. 2025.
[23] Y. Yang, T. Zhang, W. Huang, J. Chen, B. Wu, X. He, D. Cai, B. Li, and P.-T. Jiang, “Anchor forcing: Anchor memory and tri-region rope for interactive streaming video diffusion,” arXiv preprint arXiv:2603.13405, 2026.
[24] J. Chen, C. Bai, X. Xue, M. Xu et al., “Grounded forcing: Bridging time-independent semantics and proximal dynamics in autoregressive video synthesis,” arXiv preprint arXiv:2604.06939, 2026.
[25] J. Liu, X. Liu, K. Mei, Y. Wen, Ming-Hsuan Yang, and W. Liu, “Streaming autoregressive video generation via diagonal distillation,” 2026. [Online]. Available: https://arxiv.org/abs/2603.09488
[26] K. Zou, D. Zheng, H. Liu, T. Hang, B. Liu, and N. Yu, “Hiar: Efficient autoregressive long video generation via hierarchical denoising,” arXiv preprint arXiv:2603.08703, 2026.
[27] G. Chen, D. Lin, J. Yang, C. Lin, J. Zhu, M. Fan, H. Zhang, S. Chen, Z. Chen, C. Ma et al., “Skyreels-v2: infinite-length film generative model (2025),” URL https://arxiv.org/abs/2504.13074
[28] H. Teng, H. Jia, L. Sun, L. Li, M. Li, M. Tang, S. Han, T. Zhang, W. Zhang, W. Luo et al., “Magi-1: Autoregressive video generation at scale,” arXiv preprint arXiv:2505.13211, 2025.
[29] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
[30] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski, “DINOv3,” 2025. [Online]. Available: https://...
[31] Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu, “VBench: Comprehensive benchmark suite for video generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[32] D. Zheng, Z. Huang, H. Liu, K. Zou, Y. He, F. Zhang, Y. Zhang, J. He, W.-S. Zheng, Y. Qiao, and Z. Liu, “VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness,” arXiv preprint arXiv:2503.21755, 2025.
[33] Z. Huang, F. Zhang, X. Xu, Y. He, J. Yu, Z. Dong, Q. Ma, N. Chanpaisit, C. Si, Y. Jiang, Y. Wang, X. Chen, Y.-C. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu, “VBench++: Comprehensive and versatile benchmark suite for video generative models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[34] B. Zhang, P. Zhang, X. Dong, Y. Zang, and J. Wang, “Long-clip: Unlocking the long-text capability of clip,” arXiv preprint arXiv:2403.15378, 2024.
[35] J. Liu, G. Liu, J. Liang, Z. Yuan, X. Liu, M. Zheng, X. Wu, Q. Wang, M. Xia, X. Wang et al., “Improving video generation with human feedback,” arXiv preprint arXiv:2501.13918, 2025.
[36] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[37] D. Jiang, D. Liu, Z. Wang, Q. Wu, L. Li, H. Li, X. Jin, D. Liu, C. Lu, Z. Li et al., “Distribution matching distillation meets reinforcement learning,” arXiv preprint arXiv:2511.13649, 2025.
[38] L. Bai, Z. Zhou, S. Shao, W. Zhong, S. Yang, S. Chen, B. Chen, and Z. Xie, “Optimizing few-step generation with adaptive matching distillation,” arXiv preprint arXiv:2602.07345, 2026.
[39] L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008. [Online]. Available: http://jmlr.org/papers/v9/vandermaaten08a.html
[40] L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” ArXiv e-prints, Feb. 2018.

Appendix A Motivation Study via Latent Trajectory Visualization

To supplement our motivation analysis, we provide a latent-space diagnostic that reveals how existing interactive streaming video generation metho...

[1] [1]

Wan: Open and Advanced Large-Scale Video Generative Models

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yanget al., “Wan: Open and advanced large-scale video generative models,”arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

HunyuanVideo 1.5 Technical Report

B. Wu, C. Zou, C. Li, D. Huang, F. Yang, H. Tan, J. Peng, J. Wu, J. Xiong, J. Jianget al., “Hunyuanvideo 1.5 technical report,”arXiv preprint arXiv:2511.18870, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

LTX-Video: Realtime Video Latent Diffusion

Y. HaCohen, N. Chiprut, B. Brazowski, D. Shalem, D. Moshe, E. Richardson, E. Levin, G. Shiran, N. Zabari, O. Gordonet al., “Ltx-video: Realtime video latent diffusion,”arXiv preprint arXiv:2501.00103, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Seedance 2.0: Advancing Video Generation for World Complexity

T. Seedance, D. Chen, L. Chen, X. Chen, Y. Chen, Z. Chen, Z. Chen, F. Cheng, T. Cheng, Y. Chenget al., “Seedance 2.0: Advancing video generation for world complexity,”arXiv preprint arXiv:2604.14148, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Kling-Omni Technical Report

K. Team, J. Chen, Y. Ci, X. Du, Z. Feng, K. Gai, S. Guo, F. Han, J. He, K. Heet al., “Kling-omni technical report,”arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Video generation models as world simulators,

T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman et al., “Video generation models as world simulators,”OpenAI Blog, vol. 1, no. 8, p. 1, 2024

work page 2024

[7] [7]

Open-Sora Plan: Open-Source Large Video Generation Model

B. Lin, Y. Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y. Ye, S. Yuan, L. Chenet al., “Open-sora plan: Open-source large video generation model,”arXiv preprint arXiv:2412.00131, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

F. Bao, C. Xiang, G. Yue, G. He, H. Zhu, K. Zheng, M. Zhao, S. Liu, Y. Wang, and J. Zhu, “Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models,”arXiv preprint arXiv:2405.04233, 2024

work page arXiv 2024

[9] [9]

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion,

B. Chen, D. M. Monso, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann, “Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion,” Dec. 2024

work page 2024

[10] [10]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion,

X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman, “Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion,” Nov. 2025

work page 2025

[11] [11]

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models,

T. Yin, Q. Zhang, R. Zhang, W. T. Freeman, F. Durand, E. Shechtman, and X. Huang, “From Slow Bidirectional to Fast Autoregressive Video Diffusion Models,” Sep. 2025

work page 2025

[12] [12]

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation,

H. Zhu, M. Zhao, G. He, H. Su, C. Li, and J. Zhu, “Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation,” Feb. 2026

[13]

One-step Diffusion with Distribution Matching Distillation,

T. Yin, M. Gharbi, R. Zhang, E. Shechtman, F. Durand, W. T. Freeman, and T. Park, “One-step Diffusion with Distribution Matching Distillation,” Oct. 2024

[14]

Improved distribution matching distillation for fast image synthesis,

T. Yin, M. Gharbi, T. Park, R. Zhang, E. Shechtman, F. Durand, and W. T. Freeman, “Improved distribution matching distillation for fast image synthesis,” Advances in neural information processing systems, vol. 37, pp. 47455–47487, 2024

[15]

LongLive: Real-time Interactive Long Video Generation,

S. Yang, W. Huang, R. Chu, Y. Xiao, Y. Zhao, X. Wang, M. Li, E. Xie, Y. Chen, Y. Lu, S. Han, and Y. Chen, “LongLive: Real-time Interactive Long Video Generation,” Oct. 2025

[16]

MemFlow: Flowing adaptive memory for consistent and efficient long video narratives,

S. Ji, X. Chen, S. Yang, X. Tao, P. Wan, and H. Zhao, “MemFlow: Flowing adaptive memory for consistent and efficient long video narratives,” Dec. 2025

[17]

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y. Ban, and C.-J. Hsieh, “Self-forcing++: Towards minute-scale high-quality video generation,” arXiv preprint arXiv:2510.02283, 2025

[18]

Trust region policy optimization,

J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning. PMLR, 2015, pp. 1889–1897

[19]

Live: Long-horizon interactive video world modeling,

J. Huang, Z. Ye, X. Hu, T. He, G. Zhang, S. Shi, J. Bian, and L. Jiang, “Live: Long-horizon interactive video world modeling,” arXiv preprint arXiv:2602.03747, 2026

[20]

Context forcing: Consistent autoregressive video generation with long context,

S. Chen, C. Wei, S. Sun, P. Nie, K. Zhou, G. Zhang, M.-H. Yang, and W. Chen, “Context forcing: Consistent autoregressive video generation with long context,” arXiv preprint arXiv:2602.06028, 2026

[21]

Rolling forcing: Autoregressive long video diffusion in real time,

K. Liu, W. Hu, J. Xu, Y. Shan, and S. Lu, “Rolling forcing: Autoregressive long video diffusion in real time,” Sep. 2025

[22]

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation,

Y. Lu, Y. Zeng, H. Li, H. Ouyang, Q. Wang, K. L. Cheng, J. Zhu, H. Cao, Z. Zhang, X. Zhu, Y. Shen, and M. Zhang, “Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation,” Dec. 2025

[23]

Anchor forcing: Anchor memory and tri-region rope for interactive streaming video diffusion,

Y. Yang, T. Zhang, W. Huang, J. Chen, B. Wu, X. He, D. Cai, B. Li, and P.-T. Jiang, “Anchor forcing: Anchor memory and tri-region rope for interactive streaming video diffusion,” arXiv preprint arXiv:2603.13405, 2026

[24]

Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

J. Chen, C. Bai, X. Xue, M. Xu et al., “Grounded forcing: Bridging time-independent semantics and proximal dynamics in autoregressive video synthesis,” arXiv preprint arXiv:2604.06939, 2026

[25]

Streaming autoregressive video generation via diagonal distillation,

J. Liu, X. Liu, K. Mei, Y. Wen, M.-H. Yang, and W. Liu, “Streaming autoregressive video generation via diagonal distillation,” 2026. [Online]. Available: https://arxiv.org/abs/2603.09488

[26]

Hiar: Efficient autoregressive long video generation via hierarchical denoising,

K. Zou, D. Zheng, H. Liu, T. Hang, B. Liu, and N. Yu, “Hiar: Efficient autoregressive long video generation via hierarchical denoising,” arXiv preprint arXiv:2603.08703, 2026

[27]

SkyReels-V2: Infinite-length Film Generative Model

G. Chen, D. Lin, J. Yang, C. Lin, J. Zhu, M. Fan, H. Zhang, S. Chen, Z. Chen, C. Ma et al., “Skyreels-v2: infinite-length film generative model,” 2025. [Online]. Available: https://arxiv.org/abs/2504.13074

[28]

MAGI-1: Autoregressive Video Generation at Scale

H. Teng, H. Jia, L. Sun, L. Li, M. Li, M. Tang, S. Han, T. Zhang, W. Zhang, W. Luo et al., “Magi-1: Autoregressive video generation at scale,” arXiv preprint arXiv:2505.13211, 2025

[29]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

[30]

DINOv3

O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski, “DINOv3,” 2025. [Online]. Available: https://...

[31]

VBench: Comprehensive benchmark suite for video generative models,

Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu, “VBench: Comprehensive benchmark suite for video generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

[32]

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

D. Zheng, Z. Huang, H. Liu, K. Zou, Y. He, F. Zhang, Y. Zhang, J. He, W.-S. Zheng, Y. Qiao, and Z. Liu, “VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness,” arXiv preprint arXiv:2503.21755, 2025

[33]

VBench++: Comprehensive and versatile benchmark suite for video generative models,

Z. Huang, F. Zhang, X. Xu, Y. He, J. Yu, Z. Dong, Q. Ma, N. Chanpaisit, C. Si, Y. Jiang, Y. Wang, X. Chen, Y.-C. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu, “VBench++: Comprehensive and versatile benchmark suite for video generative models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

[34]

Long-clip: Unlocking the long-text capability of clip,

B. Zhang, P. Zhang, X. Dong, Y. Zang, and J. Wang, “Long-clip: Unlocking the long-text capability of clip,” arXiv preprint arXiv:2403.15378, 2024

[35]

Improving Video Generation with Human Feedback

J. Liu, G. Liu, J. Liang, Z. Yuan, X. Liu, M. Zheng, X. Wu, Q. Wang, M. Xia, X. Wang et al., “Improving video generation with human feedback,” arXiv preprint arXiv:2501.13918, 2025

[36]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

[37]

Distribution matching distillation meets reinforcement learning,

D. Jiang, D. Liu, Z. Wang, Q. Wu, L. Li, H. Li, X. Jin, D. Liu, C. Lu, Z. Li et al., “Distribution matching distillation meets reinforcement learning,” arXiv preprint arXiv:2511.13649, 2025

[38]

Optimizing few-step generation with adaptive matching distillation,

L. Bai, Z. Zhou, S. Shao, W. Zhong, S. Yang, S. Chen, B. Chen, and Z. Xie, “Optimizing few-step generation with adaptive matching distillation,” arXiv preprint arXiv:2602.07345, 2026

[39]

Visualizing data using t-sne,

L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008. [Online]. Available: http://jmlr.org/papers/v9/vandermaaten08a.html

[40]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,

L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” ArXiv e-prints, Feb. 2018.

Appendix A Motivation Study via Latent Trajectory Visualization

To supplement our motivation analysis, we provide a latent-space diagnostic that reveals how existing interactive streaming video generation metho...