Recognition: 3 theorem links
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Pith reviewed 2026-05-10 19:15 UTC · model grok-4.3
The pith
Delta tokens built from VFM feature differences enable generative world models to forecast diverse futures using over 35x fewer parameters and 2,000x fewer FLOPs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeltaTok encodes the VFM feature difference between consecutive frames into a single continuous delta token, allowing DeltaWorld to model video as a 1D temporal sequence of these tokens and generate multiple plausible future states through efficient multi-hypothesis training and inference.
What carries the argument
The DeltaTok tokenizer, which reduces each pair of consecutive VFM feature maps to one continuous delta token and thereby converts video into a 1D temporal token sequence.
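The tokenization step can be sketched as follows. The function name `delta_token`, the mean-pooling choice, and the feature shapes are illustrative assumptions; the paper's tokenizer is learned, so this is a stand-in for the idea, not the actual DeltaTok implementation.

```python
import numpy as np

def delta_token(feat_prev: np.ndarray, feat_next: np.ndarray) -> np.ndarray:
    """Hypothetical DeltaTok-style compressor: collapse the VFM feature
    difference map (H, W, C) between consecutive frames into a single
    continuous token (C,) via mean pooling. The real tokenizer is learned;
    mean pooling stands in for it here."""
    diff = feat_next - feat_prev                          # (H, W, C)
    return diff.reshape(-1, diff.shape[-1]).mean(axis=0)  # (C,)

# Toy VFM features: a 32x32 patch grid with 768-dim features per patch.
rng = np.random.default_rng(0)
f0 = rng.normal(size=(32, 32, 768))
f1 = f0 + 0.01 * rng.normal(size=(32, 32, 768))

# One token per frame transition turns the video into a 1D sequence.
tokens = [delta_token(f0, f1)]
print(tokens[0].shape)
```

Stacking one such token per transition is what reduces the video from a 3D spatio-temporal grid to a 1D temporal sequence.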
If this is right
- Diverse future predictions become possible in a single forward pass through parallel multi-hypothesis generation.
- The model requires over 35x fewer parameters than existing generative world models.
- Computation drops by 2,000x in FLOPs compared with prior generative approaches.
- Forecasts align more closely with real-world video outcomes on dense forecasting benchmarks.
- Multi-hypothesis training becomes tractable because only the best-matching prediction needs supervision.
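The last point, supervising only the best-matching prediction, can be sketched as below. The function name, shapes, and squared-distance loss are assumptions for illustration; the paper's Best-of-Many objective may differ in its distance measure.

```python
import numpy as np

def best_of_many_loss(hypotheses: np.ndarray, target: np.ndarray):
    """Best-of-Many sketch: given K candidate future tokens (K, C) generated
    in parallel, compute the loss only for the hypothesis closest to the
    ground-truth token; the other hypotheses receive no gradient."""
    dists = np.linalg.norm(hypotheses - target, axis=1)  # (K,)
    best = int(np.argmin(dists))
    return float(dists[best] ** 2), best

hyps = np.array([[1.0, 0.0],    # hypothesis 0
                 [0.2, 0.1],    # hypothesis 1 (closest to target)
                 [5.0, 5.0]])   # hypothesis 2
loss, idx = best_of_many_loss(hyps, np.zeros(2))
print(idx)  # 1
```

Because only the nearest hypothesis is penalized, the remaining candidates are free to cover other plausible futures, which is what makes diverse single-pass prediction tractable.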
Where Pith is reading between the lines
- Chaining successive delta tokens could extend the method to longer video horizons while keeping the representation compact.
- The efficiency gains might allow integration of generative world models into real-time systems on limited hardware.
- Advances in the underlying vision foundation model would likely transfer directly to improved delta-token performance.
- The same difference-based tokenization could be explored for other sequential modalities such as audio or time-series sensor data.
Load-bearing premise
The single delta token derived from VFM feature differences between consecutive frames retains enough spatio-temporal information to support accurate generation of diverse future states.
What would settle it
A controlled test in which two future video sequences that differ only in spatial detail map to nearly identical delta tokens; if the model can neither generate both futures accurately nor distinguish them at inference, the load-bearing premise fails.
Original abstract
Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeltaTok, a tokenizer that encodes the difference between consecutive VFM feature maps into a single continuous delta token, thereby collapsing video from a 3D spatio-temporal grid to a 1D temporal sequence of tokens (e.g., a 1,024× reduction for 512×512 frames). DeltaWorld is an autoregressive generative world model trained on these delta tokens with a multi-hypothesis objective that generates many candidate futures in parallel and supervises only the best one. The central empirical claim is that the resulting forecasts align more closely with real-world outcomes on dense forecasting tasks while using >35× fewer parameters and 2,000× fewer FLOPs than prior generative world models.
Significance. If the efficiency and accuracy claims are substantiated, the work would represent a meaningful advance in tractable generative video world modeling. Operating in VFM feature space with extreme token compression addresses the prohibitive cost of existing generative approaches and enables practical multi-hypothesis sampling. The public release of code and weights further strengthens the contribution by supporting reproducibility.
major comments (2)
- [DeltaTok and DeltaWorld architecture sections] The central claim that a single delta token suffices for accurate multi-step dense forecasting rests on the unexamined assumption that the DeltaTok compressor (pooling or attention over the VFM difference map) preserves localized spatio-temporal structure. No ablation or information-retention analysis is provided to show that error accumulation remains controlled over multiple autoregressive steps; if the compression is lossy with respect to motion or appearance detail, the reported alignment gains would be undermined.
- [Experiments section] The abstract states quantitative gains in alignment with real outcomes together with 35× parameter and 2,000× FLOP reductions, yet supplies no description of the experimental protocol, exact baselines, evaluation metrics, or statistical significance tests. Without these details it is impossible to determine whether the efficiency advantage is measured on comparable tasks or whether the accuracy improvement is robust.
minor comments (2)
- [Abstract and DeltaTok description] The 1,024× token-reduction example for 512×512 frames should be accompanied by the precise VFM feature resolution and the exact compression ratio formula to allow independent verification.
- [Training procedure] Clarify the precise mechanism for selecting the 'best' hypothesis during multi-hypothesis training and whether this selection is performed with ground-truth supervision or a learned critic.
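One compression-ratio formula consistent with the quoted 1,024x example, assuming a ViT-style patch size of 16 (an assumption; neither the abstract nor this review states the VFM patch size):

```python
# Token-reduction arithmetic consistent with the abstract's 512x512 example,
# under the assumed patch size of 16.
frame = 512                       # frame side length in pixels
patch = 16                        # assumed VFM patch size (not stated here)
grid = frame // patch             # 512 / 16 = 32 patches per side
patch_tokens = grid * grid        # 32 * 32 = 1024 spatial tokens per frame
delta_tokens = 1                  # one delta token per frame transition
reduction = patch_tokens // delta_tokens
print(reduction)  # 1024
```

If the actual feature resolution differs, the ratio is simply (frame // patch)**2 per frame, which is why the referee asks for the precise resolution.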
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity and empirical validation that we will address in the revision. We provide point-by-point responses below.
Point-by-point responses
-
Referee: [DeltaTok and DeltaWorld architecture sections] The central claim that a single delta token suffices for accurate multi-step dense forecasting rests on the unexamined assumption that the DeltaTok compressor (pooling or attention over the VFM difference map) preserves localized spatio-temporal structure. No ablation or information-retention analysis is provided to show that error accumulation remains controlled over multiple autoregressive steps; if the compression is lossy with respect to motion or appearance detail, the reported alignment gains would be undermined.
Authors: We appreciate this point on the need for explicit validation of the compression. DeltaTok is designed around the gradual nature of VFM feature changes between frames, but we agree that an information-retention analysis would strengthen the claims. In the revised manuscript, we will add an ablation comparing compression variants (mean pooling versus attention-based) and quantify retention via reconstruction error when decoding the delta token back to the original feature difference map. We will also include a plot of cumulative reconstruction error over increasing autoregressive steps on validation sequences to demonstrate that error accumulation remains controlled for the forecasting horizon used in our experiments. revision: yes
-
Referee: [Experiments section] The abstract states quantitative gains in alignment with real outcomes together with 35× parameter and 2,000× FLOP reductions, yet supplies no description of the experimental protocol, exact baselines, evaluation metrics, or statistical significance tests. Without these details it is impossible to determine whether the efficiency advantage is measured on comparable tasks or whether the accuracy improvement is robust.
Authors: The Experiments section (Section 4) already details the full protocol, including the dense forecasting datasets, exact baselines (prior generative world models), alignment metrics, and results with means and standard deviations over multiple runs. However, we agree the abstract is too concise on these elements. We will revise the abstract to briefly reference the evaluation setup and primary baselines while keeping it within length limits. This will make the quantitative claims more immediately verifiable without altering the existing detailed descriptions in the main text. revision: partial
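The promised error-accumulation analysis amounts to tracking how far an autoregressive feature rollout drifts from the ground-truth trajectory when each predicted delta carries some reconstruction error. A toy version, with shapes, the error scale `eps`, and the Gaussian error model all assumed for illustration:

```python
import numpy as np

# Toy check of error accumulation over autoregressive steps: roll VFM-style
# features forward by adding predicted deltas, where each delta carries a
# small reconstruction error, and measure drift from the ground-truth track.
rng = np.random.default_rng(1)
C, steps, eps = 16, 8, 0.05       # feature dim, rollout length, error scale

f_true = rng.normal(size=(steps + 1, C))   # ground-truth feature trajectory
f_pred = f_true[0].copy()
drift = []
for t in range(steps):
    true_delta = f_true[t + 1] - f_true[t]
    noisy_delta = true_delta + eps * rng.normal(size=C)  # lossy delta token
    f_pred = f_pred + noisy_delta
    drift.append(float(np.linalg.norm(f_pred - f_true[t + 1])))

# With independent per-step errors, drift grows roughly like sqrt(t);
# the rebuttal's plot would show whether the trained model does better.
print(len(drift))
```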
Circularity Check
No significant circularity; claims rest on empirical comparisons
full rationale
The paper presents DeltaTok as a tokenizer compressing VFM feature differences into single delta tokens and DeltaWorld as an autoregressive generative model on the resulting 1D sequence. Efficiency and accuracy claims are demonstrated via direct experimental comparisons (parameter counts, FLOPs, forecasting alignment) to prior generative world models, without any equations or derivations that reduce a claimed output to an input quantity by construction. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The architecture is introduced as a design choice whose value is validated externally rather than forced internally.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A pre-trained vision foundation model produces features whose differences between consecutive frames are sufficient to reconstruct plausible future video states.
invented entities (1)
- Delta token: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
DeltaTok ... encodes the VFM feature difference between consecutive frames into a single continuous 'delta' token ... reducing video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence
-
IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
DeltaWorld ... generates multiple futures in a single forward pass by using a single delta token per future
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Best-of-Many (BoM) training ... only the prediction closest to the ground truth is supervised
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.