pith. machine review for the scientific record.

arxiv: 2604.04913 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 3 Lean theorem links

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords delta tokens · generative world models · video forecasting · vision foundation models · efficient tokenization · multi-hypothesis prediction · dense forecasting · token reduction

The pith

Delta tokens from VFM feature differences enable generative world models to forecast diverse futures using 35× fewer parameters and 2,000× fewer FLOPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a tokenizer converting the feature difference between consecutive video frames in a vision foundation model into one continuous delta token can power a generative model for diverse future prediction. This collapses three-dimensional spatio-temporal video data into a one-dimensional temporal sequence, a 1,024× token reduction for 512×512 frames. The compact form makes parallel multi-hypothesis training feasible: many candidate futures are generated together and only the best match is used for supervision, yielding varied outputs from a single forward pass. Readers should care because anticipating multiple plausible futures supports planning in uncertain environments such as robotics or driving, yet prior generative world models have been too expensive for broad use. If correct, the approach makes high-fidelity generative forecasting computationally practical.

Core claim

DeltaTok encodes the VFM feature difference between consecutive frames into a single continuous delta token, allowing DeltaWorld to model video as a 1D temporal sequence of these tokens and generate multiple plausible future states through efficient multi-hypothesis training and inference.

What carries the argument

The DeltaTok tokenizer, which reduces each pair of consecutive VFM feature maps to one continuous delta token and thereby converts video into a 1D temporal token sequence.
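The interface is concrete enough to sketch. Below is a minimal, hypothetical rendering of what Figure 3 describes: an encoder maps the frozen-VFM patch tokens of frames t−1 and t to one delta token z_t; a decoder reconstructs frame t's features from frame t−1 and that token; training uses MSE. All dimensions, depths, and the plain PyTorch Transformer used as a stand-in for the paper's ViTs are assumptions, not the released implementation.

```python
# Hypothetical sketch of the DeltaTok interface, NOT the authors' code.
import torch
import torch.nn as nn

class DeltaTokSketch(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), depth)
        self.decoder = nn.TransformerEncoder(layer(), depth)
        # Learned query whose output slot becomes the single delta token.
        self.delta_query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)

    def encode(self, x_prev, x_curr):
        # x_prev, x_curr: (B, N, dim) frozen-VFM patch tokens for frames t-1, t.
        q = self.delta_query.expand(x_prev.shape[0], -1, -1)
        h = self.encoder(torch.cat([q, x_prev, x_curr], dim=1))
        return h[:, 0]                       # (B, dim): the delta token z_t

    def decode(self, x_prev, z):
        h = self.decoder(torch.cat([z.unsqueeze(1), x_prev], dim=1))
        return h[:, 1:]                      # (B, N, dim): reconstruction of x_t

model = DeltaTokSketch()
# e.g. 32x32 = 1,024 patch tokens per frame (a 512x512 frame at patch size 16).
x_prev, x_curr = torch.randn(2, 2, 1024, 768).unbind(0)
z = model.encode(x_prev, x_curr)
loss = nn.functional.mse_loss(model.decode(x_prev, z), x_curr)
```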

If this is right

  • Diverse future predictions become possible in a single forward pass through parallel multi-hypothesis generation.
  • The model requires over 35x fewer parameters than existing generative world models.
  • Computation drops by 2,000x in FLOPs compared with prior generative approaches.
  • Forecasts align more closely with real-world video outcomes on dense forecasting benchmarks.
  • Multi-hypothesis training becomes tractable because only the best-matching prediction needs supervision (a best-of-many objective is sketched after this list).
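The last point is concrete enough to sketch. A hedged, minimal best-of-many objective: generate K candidate futures in parallel, score each against the ground-truth target, and let gradients flow only through the closest one. The shapes, the squared-error distance, and the function name are illustrative assumptions, not the paper's exact training code.

```python
# Minimal best-of-many loss sketch; distance and shapes are assumptions.
import torch

def best_of_many_loss(candidates: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # candidates: (B, K, D) -- K delta-token hypotheses per step
    # target:     (B, D)    -- ground-truth delta token from the tokenizer
    per_hyp = ((candidates - target.unsqueeze(1)) ** 2).mean(dim=-1)  # (B, K)
    best = per_hyp.min(dim=1).values                                  # (B,)
    return best.mean()  # gradient reaches only the best hypothesis per sample
```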

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Chaining successive delta tokens could extend the method to longer video horizons while keeping the representation compact.
  • The efficiency gains might allow integration of generative world models into real-time systems on limited hardware.
  • Advances in the underlying vision foundation model would likely transfer directly to improved delta-token performance.
  • The same difference-based tokenization could be explored for other sequential modalities such as audio or time-series sensor data.

Load-bearing premise

The single delta token derived from VFM feature differences between consecutive frames retains enough spatio-temporal information to support accurate generation of diverse future states.

What would settle it

A controlled test in which two future video sequences that differ in spatial detail map to nearly identical delta tokens, and the model can neither generate both accurately nor distinguish between them at inference.
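One minimal form of that probe, assuming access to the tokenizer's encoding step (the `encode` call mirrors the hypothetical sketch earlier on this page, not a released API): encode two deliberately different futures against the same context and check whether their delta tokens collapse together.

```python
# Hypothetical collapse probe for the falsification test described above.
import torch.nn.functional as F

def delta_collapse_score(deltatok, x_prev, x_future_a, x_future_b) -> float:
    # x_*: (B, N, dim) VFM patch tokens; the two futures differ in fine detail.
    za = deltatok.encode(x_prev, x_future_a)
    zb = deltatok.encode(x_prev, x_future_b)
    # Near 1.0 means the single token cannot tell the two futures apart.
    return F.cosine_similarity(za, zb, dim=-1).mean().item()
```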

Figures

Figures reproduced from arXiv: 2604.04913 by Daan de Geus, Gabriele Berton, Gijs Dubbelman, Ju He, Liang-Chieh Chen, Qihang Yu, Tommie Kerssies, Wufei Ma.

Figure 1
Figure 1: Outline of DeltaWorld. Unlike large existing generative world models that require many forward passes and represent each frame with many spatial tokens, our small DeltaWorld generates multiple futures in a single forward pass by using a single delta token to encode the difference between consecutive frames. [inset chart: Cosmos-4B (60,000 TFLOPs), Cosmos-12B (64,000 TFLOPs), DeltaWorld-0.3B (Ours, 31 TFLOPs)]
Figure 2
Figure 2: Performance comparison. Compared to the generative world model Cosmos [1], our DeltaWorld forecasts futures that better align with real-world outcomes while having over 35× fewer parameters and using 2,000× fewer FLOPs.
Figure 3
Figure 3: Overview of DeltaTok. Given two frames encoded by a frozen vision foundation model (VFM) into grids of patch tokens x_{t−1} and x_t, the DeltaTok encoder takes both as input and compresses them into a single delta token z_t. The decoder reconstructs x̂_t from x_{t−1} and z_t. Both encoder and decoder are Vision Transformers (ViT) [18] trained with a Mean Squared Error (MSE) loss.
Figure 4
Figure 4: Overview of DeltaWorld. The predictor operates entirely on delta tokens…
Figure 5
Figure 5: Best-of-Many sample scaling. Effect of the number of training and evaluation queries on Cityscapes mid-horizon (∼0.6 s) mIoU. Using 256×256 crops.
Figure 6
Figure 6: Diverse sampled futures. Top row: four context frames and the future frame. Bottom row: four sampled DeltaWorld predictions and the oracle. In this VSPW [47] example, the pedestrian's position and ego-camera motion lead to multiple plausible futures.
Original abstract

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
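The 1,024× figure can be sanity-checked with back-of-envelope arithmetic. Assuming a ViT-style VFM with patch size 16 (typical of the DINO family; the paper's exact setting is not stated here), a 512×512 frame yields 32×32 patch tokens, and DeltaTok replaces them with one delta token per frame step:

```python
# Back-of-envelope check of the abstract's 1,024x claim.
# Patch size 16 is an assumption about the underlying VFM.
h = w = 512
patch = 16
tokens_per_frame = (h // patch) * (w // patch)  # 32 * 32 = 1,024 patch tokens
delta_tokens_per_frame = 1                      # one delta token per frame step
print(tokens_per_frame // delta_tokens_per_frame)  # -> 1024x reduction
```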

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DeltaTok, a tokenizer that encodes the difference between consecutive VFM feature maps into a single continuous delta token, thereby collapsing video from a 3D spatio-temporal grid to a 1D temporal sequence of tokens (e.g., a 1,024× reduction for 512×512 frames). DeltaWorld is an autoregressive generative world model trained on these delta tokens with a multi-hypothesis objective that generates many candidate futures in parallel and supervises only the best one. The central empirical claim is that the resulting forecasts align more closely with real-world outcomes on dense forecasting tasks while using >35× fewer parameters and 2,000× fewer FLOPs than prior generative world models.

Significance. If the efficiency and accuracy claims are substantiated, the work would represent a meaningful advance in tractable generative video world modeling. Operating in VFM feature space with extreme token compression addresses the prohibitive cost of existing generative approaches and enables practical multi-hypothesis sampling. The public release of code and weights further strengthens the contribution by supporting reproducibility.

major comments (2)
  1. [DeltaTok and DeltaWorld architecture sections] The central claim that a single delta token suffices for accurate multi-step dense forecasting rests on the unexamined assumption that the DeltaTok compressor (pooling or attention over the VFM difference map) preserves localized spatio-temporal structure. No ablation or information-retention analysis is provided to show that error accumulation remains controlled over multiple autoregressive steps; if the compression is lossy with respect to motion or appearance detail, the reported alignment gains would be undermined.
  2. [Experiments section] The abstract states quantitative gains in alignment with real outcomes together with 35× parameter and 2,000× FLOP reductions, yet supplies no description of the experimental protocol, exact baselines, evaluation metrics, or statistical significance tests. Without these details it is impossible to determine whether the efficiency advantage is measured on comparable tasks or whether the accuracy improvement is robust.
minor comments (2)
  1. [Abstract and DeltaTok description] The 1,024× token-reduction example for 512×512 frames should be accompanied by the precise VFM feature resolution and the exact compression ratio formula to allow independent verification.
  2. [Training procedure] Clarify the precise mechanism for selecting the 'best' hypothesis during multi-hypothesis training and whether this selection is performed with ground-truth supervision or a learned critic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity and empirical validation that we will address in the revision. We provide point-by-point responses below.

point-by-point responses
  1. Referee: [DeltaTok and DeltaWorld architecture sections] The central claim that a single delta token suffices for accurate multi-step dense forecasting rests on the unexamined assumption that the DeltaTok compressor (pooling or attention over the VFM difference map) preserves localized spatio-temporal structure. No ablation or information-retention analysis is provided to show that error accumulation remains controlled over multiple autoregressive steps; if the compression is lossy with respect to motion or appearance detail, the reported alignment gains would be undermined.

    Authors: We appreciate this point on the need for explicit validation of the compression. DeltaTok is designed around the gradual nature of VFM feature changes between frames, but we agree that an information-retention analysis would strengthen the claims. In the revised manuscript, we will add an ablation comparing compression variants (mean pooling versus attention-based) and quantify retention via reconstruction error when decoding the delta token back to the original feature difference map. We will also include a plot of cumulative reconstruction error over increasing autoregressive steps on validation sequences to demonstrate that error accumulation remains controlled for the forecasting horizon used in our experiments (one possible form of this rollout probe is sketched after these responses). revision: yes

  2. Referee: [Experiments section] The abstract states quantitative gains in alignment with real outcomes together with 35× parameter and 2,000× FLOP reductions, yet supplies no description of the experimental protocol, exact baselines, evaluation metrics, or statistical significance tests. Without these details it is impossible to determine whether the efficiency advantage is measured on comparable tasks or whether the accuracy improvement is robust.

    Authors: The Experiments section (Section 4) already details the full protocol, including the dense forecasting datasets, exact baselines (prior generative world models), alignment metrics, and results with means and standard deviations over multiple runs. However, we agree the abstract is too concise on these elements. We will revise the abstract to briefly reference the evaluation setup and primary baselines while keeping it within length limits. This will make the quantitative claims more immediately verifiable without altering the existing detailed descriptions in the main text. revision: partial
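To accompany the first response, here is a rough sketch of what the promised error-accumulation measurement could look like: roll the predictor forward autoregressively on a validation clip and record feature-space reconstruction error at each step. `world_model.step` and `deltatok.decode` are hypothetical interfaces standing in for the released code, not its actual API.

```python
# Hypothetical rollout-error probe for autoregressive error accumulation.
import torch

@torch.no_grad()
def rollout_error(world_model, deltatok, vfm_feats):
    # vfm_feats: (T, N, D) frozen-VFM patch tokens for a held-out clip
    errs, x_prev = [], vfm_feats[0:1]
    for t in range(1, vfm_feats.shape[0]):
        z_hat = world_model.step(x_prev)            # predicted delta token
        x_hat = deltatok.decode(x_prev, z_hat)      # decoded next-frame features
        errs.append(torch.mean((x_hat - vfm_feats[t:t+1]) ** 2).item())
        x_prev = x_hat                              # feed prediction back in
    return errs  # a monotone blow-up here would flag uncontrolled accumulation
```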

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical comparisons

full rationale

The paper presents DeltaTok as a tokenizer compressing VFM feature differences into single delta tokens and DeltaWorld as an autoregressive generative model on the resulting 1D sequence. Efficiency and accuracy claims are demonstrated via direct experimental comparisons (parameter counts, FLOPs, forecasting alignment) to prior generative world models, without any equations or derivations that reduce a claimed output to an input quantity by construction. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The architecture is introduced as a design choice whose value is validated externally rather than forced internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach assumes a pre-trained vision foundation model already encodes the necessary spatio-temporal information; the delta token is an invented compression step whose sufficiency is validated only empirically.

axioms (1)
  • domain assumption A pre-trained vision foundation model produces features whose differences between consecutive frames are sufficient to reconstruct plausible future video states.
    Invoked implicitly when the tokenizer is defined and when decoding is assumed to recover accurate futures.
invented entities (1)
  • Delta token (no independent evidence)
    purpose: Compact 1D representation of VFM feature difference between consecutive frames
    New compression step introduced to reduce video to a temporal sequence of single tokens.

pith-pipeline@v0.9.0 · 5567 in / 1395 out tokens · 27277 ms · 2026-05-10T19:15:03.615182+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

90 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos World Foundation Model Platform for Physical AI. arXiv preprint arXiv:2501.03575.
  2. [2]
  3. [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923, 2025.
  4. [4] Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, and Piotr Bojanowski. Back to the Features: DINO as a Foundation for Video World Models. In ICML, 2025.
  5. [5] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting Feature Prediction for Learning Visual Representations from Video. TMLR, 2024.
  6. [6] Apratim Bhattacharyya, Bernt Schiele, and Mario Fritz. Accurate and Diverse Sampling of Sequences Based on a "Best of Many" Sample Objective. In CVPR, 2018.
  7. [7] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.
  8. [8] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv preprint arXiv:2311.15127, 2023.
  9. [9] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video Generation Models as World Simulators. OpenAI Blog, 1(8):1, 2024.
  10. [10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners. NeurIPS, 2020.
  11. [11] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative Interactive Environments. In ICML, 2024.
  12. [12] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers. In ICCV, 2021.
  13. [13] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion. NeurIPS, 2024.
  14. [14] Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer. In CVPR, 2025.
  15. [15] Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models. In ICLR, 2025.
  16. [16] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In CVPR, 2016.
  17. [17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  18. [18] Prafulla Dhariwal and Alexander Nichol. Diffusion Models Beat GANs on Image Synthesis. NeurIPS, 2021.
  19. [19] Alexey Dosovitskiy. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
  20. [20] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
  21. [21] Patrick Emami, Pan He, Anand Rangarajan, and Sanjay Ranka. A Symmetric and Object-Centric World Model for Stochastic Environments. In NeurIPS Workshop on Object Representations for Learning and Reasoning, 2020.
  22. [22] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming Transformers for High-Resolution Image Synthesis. In CVPR.
  23. [23] Xiang Fan, Xiaohang Sun, Kushan Thakkar, Zhu Liu, Vimal Bhat, Ranjay Krishna, and Xiang Hao. RefTok: Reference-Based Tokenization for Video Generation. arXiv preprint arXiv:2507.02862, 2025.
  24. [24] Ravi Garg, Vijay Kumar Bg, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
  25. [25] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets Robotics: The KITTI Dataset. The International Journal of Robotics Research, 32(11):1231–1237.
  26. [26]
  27. [27] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. NeurIPS, 2014.
  28. [28] Google DeepMind. Veo 3. https://deepmind.google/veo, 2025.
  29. [29] David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. NeurIPS, 2018.
  30. [30] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. In ICLR, 2020.
  31. [31] Ju He, Qihang Yu, Qihao Liu, and Liang-Chieh Chen. FlowTok: Flowing Seamlessly Across Text and Image Tokens. In ICCV, 2025.
  32. [32] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR, 2017.
  33. [33] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.
  34. [34] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.
  35. [35] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023.
  36. [36] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A Generative World Model for Autonomous Driving. arXiv preprint arXiv:2309.17080, 2023.
  37. [37] Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. DINO-Foresight: Looking into the Future with DINO. In NeurIPS, 2025.
  38. [38] Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens. In ICCV, 2025.
  39. [39] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
  40. [40] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy Visual Task Transfer. TMLR, 2025.
  41. [41] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive Image Generation Without Vector Quantization. NeurIPS, 2024.
  42. [42] Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, and Xue Yang. Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space. arXiv preprint arXiv:2505.17011.
  43. [43] Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Bofeng Fu, Jindong Jiang, and Sungjin Ahn. Improving Generative Imagination in Object-Centric World Models. In ICML, 2020.
  44. [44] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In ICLR, 2023.
  45. [45] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. NeurIPS, 2023.
  46. [46] Qihao Liu, Zhanpeng Zeng, Ju He, Qihang Yu, Xiaohui Shen, and Liang-Chieh Chen. Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization. NeurIPS, 2024.
  47. [47] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In ICLR, 2019.
  48. [48] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite Scalar Quantization: VQ-VAE Made Simple. In ICLR, 2024.
  49. [49] Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, and Yi Yang. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild. In CVPR, 2021.
  50. [50] Vincent Micheli, Eloi Alonso, and François Fleuret. Efficient World Models with Context-Aware Tokenization. In ICML.
  51. [51] OpenAI. Sora. https://openai.com/sora, 2024.
  52. [52] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR.
  53. [53] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In ICCV, 2023.
  54. [54] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In ICLR, 2024.
  55. [55] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021.
  56. [56] Jordan Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In KDD, 2020.
  57. [57] Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching. In ICML, 2025.
  58. [58] Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation. In ICCV, 2025.
  59. [59] Sucheng Ren, Qihang Yu, Ju He, Alan Yuille, and Liang-Chieh Chen. Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers. arXiv preprint arXiv:2505.14687, 2025.
  60. [60] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.
  61. [61] Inkyu Shin, Chenglin Yang, and Liang-Chieh Chen. Deeply Supervised Flow-Based Generative Models. In ICCV, 2025.
  62. [62] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
  63. [63] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568:127063.
  64. [64] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. arXiv preprint arXiv:2406.06525, 2024.
  65. [65] Zachary Teed and Jia Deng. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In ECCV, 2020.
  66. [66] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. NeurIPS, 2024.
  67. [67] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going Deeper with Image Transformers. In ICCV, 2021.
  68. [68] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv preprint arXiv:2502.14786.
  69. [69] Aaron Van Den Oord, Oriol Vinyals, et al. Neural Discrete Representation Learning. NeurIPS, 2017.
  70. [70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. NeurIPS, 2017.
  71. [71] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In ICML, 2008.
  72. [72] Jacob Walker, Carl Doersch, and Abhinav Gupta. An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders. In ECCV, 2016.
  73. [73] Jacob C. Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, and Shiry Ginosar. Generalist Forecasting with Frozen Video Models via Latent Diffusion. arXiv preprint arXiv:2507.13942, 2025.
  74. [74] Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. MaskBit: Embedding-free Image Generation via Bit Tokens. TMLR.
  75. [75] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
  76. [76] Ronald J Williams and David Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1(2):270–280, 1989.
  77. [77] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-t…
  78. [78] Jingwei Xu, Huazhe Xu, Bingbing Ni, Xiaokang Yang, and Trevor Darrell. Video Prediction via Example Guidance. In ICML, 2020.
  79. [79] Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, and Liang-Chieh Chen. 1.58-bit FLUX. arXiv preprint arXiv:2412.18653, 2024.
  80. [80] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. In CVPR, 2025.

Showing first 80 references.