pith. machine review for the scientific record.

arxiv: 2604.04913 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 3 Lean theorem links

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords delta tokens · generative world models · video forecasting · vision foundation models · efficient tokenization · multi-hypothesis prediction · dense forecasting · token reduction

The pith

Delta tokens from VFM feature differences enable generative world models to forecast diverse futures using 35× fewer parameters and 2,000× fewer FLOPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a tokenizer converting the feature difference between consecutive video frames in a vision foundation model into one continuous delta token can power a generative model for diverse future prediction. This collapses three-dimensional spatio-temporal video data into a one-dimensional temporal sequence, a 1,024× token reduction for 512×512 frames. The compact form makes parallel multi-hypothesis training feasible: many candidate futures are generated together and only the best match is used for supervision, yielding varied outputs from a single forward pass. Readers should care because anticipating multiple plausible futures supports planning in uncertain environments such as robotics or driving, yet prior generative world models have been too expensive for broad use. If correct, the approach makes high-fidelity generative forecasting computationally practical.

Core claim

DeltaTok encodes the VFM feature difference between consecutive frames into a single continuous delta token, allowing DeltaWorld to model video as a 1D temporal sequence of these tokens and generate multiple plausible future states through efficient multi-hypothesis training and inference.

What carries the argument

The DeltaTok tokenizer, which reduces each pair of consecutive VFM feature maps to one continuous delta token and thereby converts video into a 1D temporal token sequence.
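The interface is concrete enough to sketch. Below is a minimal, hypothetical rendering of what Figure 3 describes: an encoder maps the frozen-VFM patch tokens of frames t−1 and t to one delta token z_t; a decoder reconstructs frame t's features from frame t−1 and that token; training uses MSE. All dimensions, depths, and the plain PyTorch Transformer used as a stand-in for the paper's ViTs are assumptions, not the released implementation.

```python
# Hypothetical sketch of the DeltaTok interface, NOT the authors' code.
import torch
import torch.nn as nn

class DeltaTokSketch(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), depth)
        self.decoder = nn.TransformerEncoder(layer(), depth)
        # Learned query whose output slot becomes the single delta token.
        self.delta_query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)

    def encode(self, x_prev, x_curr):
        # x_prev, x_curr: (B, N, dim) frozen-VFM patch tokens for frames t-1, t.
        q = self.delta_query.expand(x_prev.shape[0], -1, -1)
        h = self.encoder(torch.cat([q, x_prev, x_curr], dim=1))
        return h[:, 0]                       # (B, dim): the delta token z_t

    def decode(self, x_prev, z):
        h = self.decoder(torch.cat([z.unsqueeze(1), x_prev], dim=1))
        return h[:, 1:]                      # (B, N, dim): reconstruction of x_t

model = DeltaTokSketch()
# e.g. 32x32 = 1,024 patch tokens per frame (a 512x512 frame at patch size 16).
x_prev, x_curr = torch.randn(2, 2, 1024, 768).unbind(0)
z = model.encode(x_prev, x_curr)
loss = nn.functional.mse_loss(model.decode(x_prev, z), x_curr)
```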

If this is right

  • Diverse future predictions become possible in a single forward pass through parallel multi-hypothesis generation.
  • The model requires over 35x fewer parameters than existing generative world models.
  • Computation drops by 2,000x in FLOPs compared with prior generative approaches.
  • Forecasts align more closely with real-world video outcomes on dense forecasting benchmarks.
  • Multi-hypothesis training becomes tractable because only the best-matching prediction needs supervision (a best-of-many objective is sketched after this list).
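The last point is concrete enough to sketch. A hedged, minimal best-of-many objective: generate K candidate futures in parallel, score each against the ground-truth target, and let gradients flow only through the closest one. The shapes, the squared-error distance, and the function name are illustrative assumptions, not the paper's exact training code.

```python
# Minimal best-of-many loss sketch; distance and shapes are assumptions.
import torch

def best_of_many_loss(candidates: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # candidates: (B, K, D) -- K delta-token hypotheses per step
    # target:     (B, D)    -- ground-truth delta token from the tokenizer
    per_hyp = ((candidates - target.unsqueeze(1)) ** 2).mean(dim=-1)  # (B, K)
    best = per_hyp.min(dim=1).values                                  # (B,)
    return best.mean()  # gradient reaches only the best hypothesis per sample
```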

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Chaining successive delta tokens could extend the method to longer video horizons while keeping the representation compact.
  • The efficiency gains might allow integration of generative world models into real-time systems on limited hardware.
  • Advances in the underlying vision foundation model would likely transfer directly to improved delta-token performance.
  • The same difference-based tokenization could be explored for other sequential modalities such as audio or time-series sensor data.

Load-bearing premise

The single delta token derived from VFM feature differences between consecutive frames retains enough spatio-temporal information to support accurate generation of diverse future states.

What would settle it

A controlled test in which two future video sequences that differ in spatial detail map to nearly identical delta tokens, and the model can neither generate both accurately nor distinguish between them at inference.
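One minimal form of that probe, assuming access to the tokenizer's encoding step (the `encode` call mirrors the hypothetical sketch earlier on this page, not a released API): encode two deliberately different futures against the same context and check whether their delta tokens collapse together.

```python
# Hypothetical collapse probe for the falsification test described above.
import torch.nn.functional as F

def delta_collapse_score(deltatok, x_prev, x_future_a, x_future_b) -> float:
    # x_*: (B, N, dim) VFM patch tokens; the two futures differ in fine detail.
    za = deltatok.encode(x_prev, x_future_a)
    zb = deltatok.encode(x_prev, x_future_b)
    # Near 1.0 means the single token cannot tell the two futures apart.
    return F.cosine_similarity(za, zb, dim=-1).mean().item()
```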

Figures

Figures reproduced from arXiv: 2604.04913 by Daan de Geus, Gabriele Berton, Gijs Dubbelman, Ju He, Liang-Chieh Chen, Qihang Yu, Tommie Kerssies, Wufei Ma.

Figure 1
Figure 1: Outline of DeltaWorld. Unlike large existing generative world models that require many forward passes and represent each frame with many spatial tokens, our small DeltaWorld generates multiple futures in a single forward pass by using a single delta token to encode the difference between consecutive frames. [inset chart: Cosmos-4B (60,000 TFLOPs), Cosmos-12B (64,000 TFLOPs), DeltaWorld-0.3B (Ours, 31 TFLOPs)]
Figure 2
Figure 2: Performance comparison. Compared to the generative world model Cosmos [1], our DeltaWorld forecasts futures that better align with real-world outcomes while having over 35× fewer parameters and using 2,000× fewer FLOPs.
Figure 3
Figure 3: Overview of DeltaTok. Given two frames encoded by a frozen vision foundation model (VFM) into grids of patch tokens x_{t−1} and x_t, the DeltaTok encoder takes both as input and compresses them into a single delta token z_t. The decoder reconstructs x̂_t from x_{t−1} and z_t. Both encoder and decoder are Vision Transformers (ViT) [18] trained with a Mean Squared Error (MSE) loss.
Figure 4
Figure 4: Overview of DeltaWorld. The predictor operates entirely on delta tokens…
Figure 5
Figure 5: Best-of-Many sample scaling. Effect of the number of training and evaluation queries on Cityscapes mid-horizon (∼0.6 s) mIoU. Using 256×256 crops.
Figure 6
Figure 6: Diverse sampled futures. Top row: four context frames and the future frame. Bottom row: four sampled DeltaWorld predictions and the oracle. In this VSPW [47] example, the pedestrian's position and ego-camera motion lead to multiple plausible futures.
Original abstract

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
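The 1,024× figure can be sanity-checked with back-of-envelope arithmetic. Assuming a ViT-style VFM with patch size 16 (typical of the DINO family; the paper's exact setting is not stated here), a 512×512 frame yields 32×32 patch tokens, and DeltaTok replaces them with one delta token per frame step:

```python
# Back-of-envelope check of the abstract's 1,024x claim.
# Patch size 16 is an assumption about the underlying VFM.
h = w = 512
patch = 16
tokens_per_frame = (h // patch) * (w // patch)  # 32 * 32 = 1,024 patch tokens
delta_tokens_per_frame = 1                      # one delta token per frame step
print(tokens_per_frame // delta_tokens_per_frame)  # -> 1024x reduction
```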

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DeltaTok, a tokenizer that encodes the difference between consecutive VFM feature maps into a single continuous delta token, thereby collapsing video from a 3D spatio-temporal grid to a 1D temporal sequence of tokens (e.g., a 1,024× reduction for 512×512 frames). DeltaWorld is an autoregressive generative world model trained on these delta tokens with a multi-hypothesis objective that generates many candidate futures in parallel and supervises only the best one. The central empirical claim is that the resulting forecasts align more closely with real-world outcomes on dense forecasting tasks while using >35× fewer parameters and 2,000× fewer FLOPs than prior generative world models.

Significance. If the efficiency and accuracy claims are substantiated, the work would represent a meaningful advance in tractable generative video world modeling. Operating in VFM feature space with extreme token compression addresses the prohibitive cost of existing generative approaches and enables practical multi-hypothesis sampling. The public release of code and weights further strengthens the contribution by supporting reproducibility.

major comments (2)
  1. [DeltaTok and DeltaWorld architecture sections] The central claim that a single delta token suffices for accurate multi-step dense forecasting rests on the unexamined assumption that the DeltaTok compressor (pooling or attention over the VFM difference map) preserves localized spatio-temporal structure. No ablation or information-retention analysis is provided to show that error accumulation remains controlled over multiple autoregressive steps; if the compression is lossy with respect to motion or appearance detail, the reported alignment gains would be undermined.
  2. [Experiments section] The abstract states quantitative gains in alignment with real outcomes together with 35× parameter and 2,000× FLOP reductions, yet supplies no description of the experimental protocol, exact baselines, evaluation metrics, or statistical significance tests. Without these details it is impossible to determine whether the efficiency advantage is measured on comparable tasks or whether the accuracy improvement is robust.
minor comments (2)
  1. [Abstract and DeltaTok description] The 1,024× token-reduction example for 512×512 frames should be accompanied by the precise VFM feature resolution and the exact compression ratio formula to allow independent verification.
  2. [Training procedure] Clarify the precise mechanism for selecting the 'best' hypothesis during multi-hypothesis training and whether this selection is performed with ground-truth supervision or a learned critic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity and empirical validation that we will address in the revision. We provide point-by-point responses below.

point-by-point responses
  1. Referee: [DeltaTok and DeltaWorld architecture sections] The central claim that a single delta token suffices for accurate multi-step dense forecasting rests on the unexamined assumption that the DeltaTok compressor (pooling or attention over the VFM difference map) preserves localized spatio-temporal structure. No ablation or information-retention analysis is provided to show that error accumulation remains controlled over multiple autoregressive steps; if the compression is lossy with respect to motion or appearance detail, the reported alignment gains would be undermined.

    Authors: We appreciate this point on the need for explicit validation of the compression. DeltaTok is designed around the gradual nature of VFM feature changes between frames, but we agree that an information-retention analysis would strengthen the claims. In the revised manuscript, we will add an ablation comparing compression variants (mean pooling versus attention-based) and quantify retention via reconstruction error when decoding the delta token back to the original feature difference map. We will also include a plot of cumulative reconstruction error over increasing autoregressive steps on validation sequences to demonstrate that error accumulation remains controlled for the forecasting horizon used in our experiments (one possible form of this rollout probe is sketched after these responses). revision: yes

  2. Referee: [Experiments section] The abstract states quantitative gains in alignment with real outcomes together with 35× parameter and 2,000× FLOP reductions, yet supplies no description of the experimental protocol, exact baselines, evaluation metrics, or statistical significance tests. Without these details it is impossible to determine whether the efficiency advantage is measured on comparable tasks or whether the accuracy improvement is robust.

    Authors: The Experiments section (Section 4) already details the full protocol, including the dense forecasting datasets, exact baselines (prior generative world models), alignment metrics, and results with means and standard deviations over multiple runs. However, we agree the abstract is too concise on these elements. We will revise the abstract to briefly reference the evaluation setup and primary baselines while keeping it within length limits. This will make the quantitative claims more immediately verifiable without altering the existing detailed descriptions in the main text. revision: partial
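To accompany the first response, here is a rough sketch of what the promised error-accumulation measurement could look like: roll the predictor forward autoregressively on a validation clip and record feature-space reconstruction error at each step. `world_model.step` and `deltatok.decode` are hypothetical interfaces standing in for the released code, not its actual API.

```python
# Hypothetical rollout-error probe for autoregressive error accumulation.
import torch

@torch.no_grad()
def rollout_error(world_model, deltatok, vfm_feats):
    # vfm_feats: (T, N, D) frozen-VFM patch tokens for a held-out clip
    errs, x_prev = [], vfm_feats[0:1]
    for t in range(1, vfm_feats.shape[0]):
        z_hat = world_model.step(x_prev)            # predicted delta token
        x_hat = deltatok.decode(x_prev, z_hat)      # decoded next-frame features
        errs.append(torch.mean((x_hat - vfm_feats[t:t+1]) ** 2).item())
        x_prev = x_hat                              # feed prediction back in
    return errs  # a monotone blow-up here would flag uncontrolled accumulation
```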

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical comparisons

full rationale

The paper presents DeltaTok as a tokenizer compressing VFM feature differences into single delta tokens and DeltaWorld as an autoregressive generative model on the resulting 1D sequence. Efficiency and accuracy claims are demonstrated via direct experimental comparisons (parameter counts, FLOPs, forecasting alignment) to prior generative world models, without any equations or derivations that reduce a claimed output to an input quantity by construction. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The architecture is introduced as a design choice whose value is validated externally rather than forced internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach assumes a pre-trained vision foundation model already encodes the necessary spatio-temporal information; the delta token is an invented compression step whose sufficiency is validated only empirically.

axioms (1)
  • domain assumption A pre-trained vision foundation model produces features whose differences between consecutive frames are sufficient to reconstruct plausible future video states.
    Invoked implicitly when the tokenizer is defined and when decoding is assumed to recover accurate futures.
invented entities (1)
  • Delta token (no independent evidence)
    purpose: Compact 1D representation of VFM feature difference between consecutive frames
    New compression step introduced to reduce video to a temporal sequence of single tokens.

pith-pipeline@v0.9.0 · 5567 in / 1395 out tokens · 27277 ms · 2026-05-10T19:15:03.615182+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

90 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos World Foundation Model Platform for Physical AI. arXiv preprint arXiv:2501.03575.
  2. [2]
  3. [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923, 2025.
  4. [4] Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, and Piotr Bojanowski. Back to the Features: DINO as a Foundation for Video World Models. In ICML, 2025.
  5. [5] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting Feature Prediction for Learning Visual Representations from Video. TMLR, 2024.
  6. [6] Apratim Bhattacharyya, Bernt Schiele, and Mario Fritz. Accurate and Diverse Sampling of Sequences Based on a "Best of Many" Sample Objective. In CVPR, 2018.
  7. [7] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.
  8. [8] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv preprint arXiv:2311.15127, 2023.
  9. [9] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video Generation Models as World Simulators. OpenAI Blog, 1(8):1, 2024.
  10. [10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners. NeurIPS, 2020.
  11. [11] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative Interactive Environments. In ICML, 2024.
  12. [12] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers. In ICCV, 2021.
  13. [13] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion. NeurIPS, 2024.
  14. [14] Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer. In CVPR, 2025.
  15. [15] Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models. In ICLR, 2025.
  16. [16] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In CVPR, 2016.
  17. [17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  18. [18] Prafulla Dhariwal and Alexander Nichol. Diffusion Models Beat GANs on Image Synthesis. NeurIPS, 2021.
  19. [19] Alexey Dosovitskiy. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
  20. [20] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
  21. [21] Patrick Emami, Pan He, Anand Rangarajan, and Sanjay Ranka. A Symmetric and Object-Centric World Model for Stochastic Environments. In NeurIPS Workshop on Object Representations for Learning and Reasoning, 2020.
  22. [22] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming Transformers for High-Resolution Image Synthesis. In CVPR.
  23. [23] Xiang Fan, Xiaohang Sun, Kushan Thakkar, Zhu Liu, Vimal Bhat, Ranjay Krishna, and Xiang Hao. RefTok: Reference-Based Tokenization for Video Generation. arXiv preprint arXiv:2507.02862, 2025.
  24. [24] Ravi Garg, Vijay Kumar Bg, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
  25. [25] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets Robotics: The KITTI Dataset. The International Journal of Robotics Research, 32(11):1231–1237.
  26. [26]
  27. [27] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. NeurIPS, 2014.
  28. [28] Google DeepMind. Veo 3. https://deepmind.google/veo, 2025.
  29. [29] David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. NeurIPS, 2018.
  30. [30] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. In ICLR, 2020.
  31. [31] Ju He, Qihang Yu, Qihao Liu, and Liang-Chieh Chen. FlowTok: Flowing Seamlessly Across Text and Image Tokens. In ICCV, 2025.
  32. [32] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR, 2017.
  33. [33] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.
  34. [34] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.
  35. [35] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023.
  36. [36] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A Generative World Model for Autonomous Driving. arXiv preprint arXiv:2309.17080, 2023.
  37. [37] Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. DINO-Foresight: Looking into the Future with DINO. In NeurIPS, 2025.
  38. [38] Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens. In ICCV, 2025.
  39. [39] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
  40. [40] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy Visual Task Transfer. TMLR, 2025.
  41. [41] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive Image Generation Without Vector Quantization. NeurIPS, 2024.
  42. [42] Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, and Xue Yang. Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space. arXiv preprint arXiv:2505.17011.
  43. [43] Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Bofeng Fu, Jindong Jiang, and Sungjin Ahn. Improving Generative Imagination in Object-Centric World Models. In ICML, 2020.
  44. [44] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In ICLR, 2023.
  45. [45] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. NeurIPS, 2023.
  46. [46] Qihao Liu, Zhanpeng Zeng, Ju He, Qihang Yu, Xiaohui Shen, and Liang-Chieh Chen. Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization. NeurIPS, 2024.
  47. [47] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In ICLR, 2019.
  48. [48] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite Scalar Quantization: VQ-VAE Made Simple. In ICLR, 2024.
  49. [49] Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, and Yi Yang. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild. In CVPR, 2021.
  50. [50] Vincent Micheli, Eloi Alonso, and François Fleuret. Efficient World Models with Context-Aware Tokenization. In ICML.
  51. [51] OpenAI. Sora. https://openai.com/sora, 2024.
  52. [52] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR.
  53. [53] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In ICCV, 2023.
  54. [54] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In ICLR, 2024.
  55. [55] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021.
  56. [56] Jordan Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In KDD, 2020.
  57. [57] Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching. In ICML, 2025.
  58. [58] Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation. In ICCV, 2025.
  59. [59] Sucheng Ren, Qihang Yu, Ju He, Alan Yuille, and Liang-Chieh Chen. Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers. arXiv preprint arXiv:2505.14687, 2025.
  60. [60] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.
  61. [61] Inkyu Shin, Chenglin Yang, and Liang-Chieh Chen. Deeply Supervised Flow-Based Generative Models. In ICCV, 2025.
  62. [62] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
  63. [63] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568:127063.
  64. [64] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. arXiv preprint arXiv:2406.06525, 2024.
  65. [65] Zachary Teed and Jia Deng. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In ECCV, 2020.
  66. [66] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. NeurIPS, 2024.
  67. [67] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going Deeper with Image Transformers. In ICCV, 2021.
  68. [68] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv preprint arXiv:2502.14786.
  69. [69] Aaron Van Den Oord, Oriol Vinyals, et al. Neural Discrete Representation Learning. NeurIPS, 2017.
  70. [70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. NeurIPS, 2017.
  71. [71] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In ICML, 2008.
  72. [72] Jacob Walker, Carl Doersch, and Abhinav Gupta. An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders. In ECCV, 2016.
  73. [73] Jacob C. Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, and Shiry Ginosar. Generalist Forecasting with Frozen Video Models via Latent Diffusion. arXiv preprint arXiv:2507.13942, 2025.
  74. [74] Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. MaskBit: Embedding-free Image Generation via Bit Tokens. TMLR.
  75. [75] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
  76. [76] Ronald J Williams and David Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1(2):270–280, 1989.
  77. [77] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-t…
  78. [78] Jingwei Xu, Huazhe Xu, Bingbing Ni, Xiaokang Yang, and Trevor Darrell. Video Prediction via Example Guidance. In ICML, 2020.
  79. [79] Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, and Liang-Chieh Chen. 1.58-bit FLUX. arXiv preprint arXiv:2412.18653, 2024.
  80. [80] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. In CVPR, 2025.

Showing first 80 references.