Pith · machine review for the scientific record

arXiv: 2605.07079 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI · cs.LG · cs.RO


Learning Visual Feature-Based World Models via Residual Latent Action

Abdeslam Boularias, Xinyu Zhang, Yeping Wang, Yu She, Yutian Tao, Zhengtong Xu

Pith reviewed 2026-05-11 01:31 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords world models · residual latent action · visual features · flow matching · robot learning · offline reinforcement learning · DINO features · feature-based prediction

The pith

Residual latent actions from visual feature differences let world models predict future states accurately and quickly via flow matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a latent action representation called residual latent action that is extracted directly from differences between consecutive visual features produced by a pre-trained extractor. These residuals turn out to be predictive of future dynamics and can be modeled with flow matching to generate next-step features without regressing to blurry averages or diffusing full images. The resulting world model matches or exceeds both prior feature-based predictors and much slower video-diffusion models on simulated and real robot datasets. The same model further supports two downstream robot techniques that learn policies or action models from videos alone, without action labels, online interaction, or hand-designed rewards.

Core claim

Residual latent actions derived from DINO feature residuals are predictive, generalizable across tasks, and encode temporal progression; modeling them with flow matching produces an accurate and efficient visual-feature world model that outperforms existing feature-based and video-diffusion alternatives while enabling entirely offline visual reinforcement learning.

What carries the argument

Residual Latent Action (RLA): a latent vector obtained from the residual between successive DINO visual features that is then predicted by flow matching to synthesize the next feature vector.
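As a concrete sketch of that extraction (the shapes, the mean-pooling, and the linear projection are our illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: N DINO patch tokens of dimension D per frame,
# compressed to a latent of dimension d_z. All values are assumptions.
N, D, d_z = 196, 384, 16

def encode_rla(s_t, s_th, W):
    """Toy RLA encoder: mean-pool the token residual s_{t+h} - s_t,
    then project it to a compact latent z. A stand-in for the learned
    encoder f_enc described in the paper, not the authors' network."""
    residual = s_th - s_t           # (N, D) token-wise feature residual
    pooled = residual.mean(axis=0)  # (D,) global motion summary
    return W @ pooled               # (d_z,) residual latent action

s_t = rng.standard_normal((N, D))   # features of frame t
s_th = rng.standard_normal((N, D))  # features of frame t + h
W = rng.standard_normal((d_z, D)) / np.sqrt(D)

z = encode_rla(s_t, s_th, W)
assert z.shape == (d_z,)
# Identical frames carry no action signal: the latent is exactly zero.
assert np.allclose(encode_rla(s_t, s_t, W), 0.0)
```

The point of the residual is visible even in this toy: whatever is static between frames cancels out, so z can only encode what changed.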

If this is right

  • The world model runs orders of magnitude faster than video-diffusion baselines while matching or exceeding their accuracy on both simulation and real-robot data.
  • A minimalist world-action model trained with RLA can extract implicit actions from unlabeled demonstration videos.
  • Visual reinforcement-learning policies can be trained entirely inside the learned world model using only offline videos and a video-aligned reward, without any online interaction or handcrafted reward functions.
  • Feature-space prediction avoids the hallucination problems typical of pixel-level generation while remaining more efficient than direct regression.
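The third bullet, policies trained entirely inside the learned model, reduces to a loop like the one below. The policy, dynamics, and video-aligned reward here are all toy stand-ins we invented for illustration; only the loop structure reflects the claim.

```python
import numpy as np

def rollout_in_world_model(policy, world_model, reward_fn, s0, steps=5):
    """Fully offline rollout: every transition comes from the learned
    world model, never from a real environment, and the reward is
    computed from predicted visual features rather than handcrafted."""
    s, ret = s0, 0.0
    for _ in range(steps):
        a = policy(s)            # action (chunk) from current feature state
        s = world_model(s, a)    # predicted next visual feature
        ret += reward_fn(s)      # video-aligned reward on predicted state
    return ret

d = 8
demo = np.ones(d)                        # feature target from a demo video
policy = lambda s: 0.5 * (demo - s)      # toy proportional controller
world_model = lambda s, a: s + a         # toy linear feature dynamics
reward_fn = lambda s: -float(np.linalg.norm(s - demo))

ret = rollout_in_world_model(policy, world_model, reward_fn, np.zeros(d))
# The controller halves its distance to the demo each step, so the
# per-step penalties shrink geometrically toward zero.
assert -3.0 < ret < -2.5
```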

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach may scale to large unlabeled video corpora, allowing world models to be pre-trained on internet-scale data before fine-tuning on robot tasks.
  • Because RLA is compact and predictive, it could support real-time model-based planning loops inside physical robots.
  • The success of residual features suggests that many existing self-supervised vision backbones already embed implicit action signals that could be reused for dynamics modeling.

Load-bearing premise

That the residuals between consecutive visual features already contain enough information about actions and dynamics to support accurate future prediction without extra supervision or task-specific tuning.

What would settle it

Run the trained RLA world model on a held-out real-robot dataset with new objects or camera angles and measure whether its predicted next-step features match the actual extracted features at least as closely as the best baseline method; a clear gap would falsify the claim.
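A minimal version of that test (the cosine-distance metric and the shapes are our choices; the paper may report different error measures):

```python
import numpy as np

def feature_prediction_error(pred, target):
    """Mean cosine distance between predicted and actually extracted
    next-step DINO features, averaged over patch tokens. Lower is
    better; comparing this score against the best baseline on the
    same held-out clips yields the gap the verdict above asks for."""
    num = (pred * target).sum(axis=-1)
    den = (np.linalg.norm(pred, axis=-1)
           * np.linalg.norm(target, axis=-1) + 1e-8)
    return float(1.0 - (num / den).mean())

rng = np.random.default_rng(1)
target = rng.standard_normal((196, 384))  # ground-truth features (toy)

# A perfect predictor scores ~0; an uninformed one scores ~1.
assert feature_prediction_error(target, target) < 1e-6
assert 0.9 < feature_prediction_error(
    rng.standard_normal((196, 384)), target) < 1.1
```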

Figures

Figures reproduced from arXiv: 2605.07079 by Abdeslam Boularias, Xinyu Zhang, Yeping Wang, Yu She, Yutian Tao, Zhengtong Xu.

Figure 1: Overview of our framework. We introduce the Residual Latent Action (RLA), which compresses the DINO token residual s_{t+h} − s_t into a compact latent z. We discover that RLA is predictive, generalizable, and encodes temporal progression. Next, we propose the RLA World Model (RLA-WM), which learns from offline videos and predicts RLA z instead of s_{t+h} directly. RLA-WM achieves accurate future prediction while …

Figure 2: RLA World Model. RLA-WM predicts future states by generating the residual latent action z. We first embed the robot actions a_{t:t+h} (padded to a maximum horizon) via an MLP. This embedding is concatenated with the DINO tokens s_t and learnable queries, then processed through self-attention layers to produce condition tokens. During flow matching, this condition is fixed and concatenated with a noisy latent z …

Figure 3: Qualitative Comparison for RLA-WM. Given an input frame at t = 0, our RLA-WM predicts future frames with high visual quality and physical fidelity, closely matching ground-truth states. In contrast, DINO-WM produces increasingly blurry predictions for Push-T over longer horizons, and inconsistent rope states. Applying diffusion or flow matching directly in DINO token space yields inferior results (RAE, F…

Figure 4: Learning from Actionless Videos using RLA. We extend a BC ResNet with a linear layer to predict the RLA ẑ. The RLA targets are extracted from (s_t, s_{t+h}) using an RLA encoder f_enc learned from task-agnostic videos. This turns the BC policy into a minimalist world action model that can learn from videos whose proprioceptive states and robot actions are not available, without forcing the policy to couple wit…

Figure 5: Visual Reinforcement Learning within RLA World Models. We adapt a pretrained ResNet BC policy for RL using LoRA adapters and a residual action head predicting delta actions and a Gaussian log deviation. The policy outputs action chunks a_{t:t+h} to our RLA-WM, which predicts future tokens ŝ_{t+h}. A pretrained UNet decodes ŝ_{t+h} into RGB observations for the next step. RLA-WM resets its state ŝ_0 from the initial …

Figure 6: World Model RL Performance Distribution. We select the best-performing BC models (■ BC∗) and apply WMRL (…
Original abstract

World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
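The abstract's central mechanism, predicting RLA via flow matching, comes down to the standard conditional flow-matching regression, sketched below. The conditioning on DINO tokens and robot actions that RLA-WM uses is omitted, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def flow_matching_loss(v_theta, z1, d=16, n_samples=64):
    """Flow-matching objective on an RLA latent z1: sample noise z0,
    mix z_t = (1 - t) z0 + t z1, and regress the model's velocity
    onto the straight-line target z1 - z0 (Lipman et al.'s
    formulation, without the world-model conditioning)."""
    losses = []
    for _ in range(n_samples):
        z0 = rng.standard_normal(d)
        t = rng.uniform()
        zt = (1.0 - t) * z0 + t * z1
        losses.append(np.mean((v_theta(zt, t) - (z1 - z0)) ** 2))
    return float(np.mean(losses))

z1 = rng.standard_normal(16)  # a "ground-truth" latent action (toy)

# An oracle that recovers z0 from (zt, t) drives the loss to zero,
# while an uninformed zero-velocity model pays roughly E||z1 - z0||^2.
oracle = lambda zt, t: z1 - (zt - t * z1) / (1.0 - t)
assert flow_matching_loss(oracle, z1) < 1e-8
assert flow_matching_loss(lambda zt, t: np.zeros(16), z1) > 1.0
```

At sampling time the learned velocity field is integrated from pure noise to a latent z, which the world model then decodes into the next-step features, avoiding the mode-averaged (blurry) output that direct regression produces.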

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Residual Latent Action (RLA), a latent representation derived from residuals between consecutive DINO visual features, and builds the RLA World Model (RLA-WM) that predicts future RLA via flow matching in feature space. It claims RLA is predictive, generalizable across tasks/environments, and encodes temporal progression, enabling RLA-WM to outperform state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets while being orders of magnitude faster. The paper further presents two robot learning applications: a minimalist world action model trained on actionless demonstration videos and the first visual RL framework trained entirely inside an offline-video-learned world model using video-aligned rewards without online interactions or handcrafted rewards.

Significance. If the empirical results and generalization claims hold, the work would offer a computationally efficient alternative to pixel-level world models for robotics, reducing hallucination risks through feature-space prediction and flow matching. The offline robot learning techniques, especially the fully model-based visual RL without online data or engineered rewards, could enable more scalable policy learning from passive video demonstrations.

major comments (3)
  1. Abstract: The central claims that RLA is 'predictive, generalizable, and encodes temporal progression' and that RLA-WM 'outperforms' SOTA methods are asserted without any quantitative results, baselines, ablation studies, error analysis, or dataset details in the provided text. These assertions are load-bearing for the introduction of RLA and the downstream robot learning techniques.
  2. Abstract: The assumption that residuals of DINO features yield meaningful latent actions (rather than photometric/viewpoint noise) is not justified or validated; DINO embeddings are optimized for static semantic discrimination, so it is unclear why their consecutive differences would reliably support causal, controllable future predictions under flow matching without additional supervision.
  3. Abstract: No derivation, training objective, or architectural details are supplied for how RLA is extracted or why flow matching on RLA succeeds where direct regression on features fails, leaving the claimed advantages over existing feature-based world models unsubstantiated.
minor comments (1)
  1. The abstract references a project page but provides no information on code, model weights, or reproducibility of the claimed results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with references to the full manuscript and indicate revisions where we strengthen the presentation.

Point-by-point responses
  1. Referee: Abstract: The central claims that RLA is 'predictive, generalizable, and encodes temporal progression' and that RLA-WM 'outperforms' SOTA methods are asserted without any quantitative results, baselines, ablation studies, error analysis, or dataset details in the provided text. These assertions are load-bearing for the introduction of RLA and the downstream robot learning techniques.

    Authors: The full manuscript presents these results in Sections 4 and 5, with quantitative comparisons to feature-based and diffusion baselines across simulation and real-world datasets, including ablations, error metrics, and dataset specifications. The abstract summarizes the key findings from these sections. We have revised the abstract to include a concise reference to the empirical evaluations and main datasets. revision: partial

  2. Referee: Abstract: The assumption that residuals of DINO features yield meaningful latent actions (rather than photometric/viewpoint noise) is not justified or validated; DINO embeddings are optimized for static semantic discrimination, so it is unclear why their consecutive differences would reliably support causal, controllable future predictions under flow matching without additional supervision.

    Authors: Section 3.1 derives RLA explicitly as the residual between consecutive DINO features and validates its properties through visualizations and experiments demonstrating that it captures temporal dynamics rather than noise. We show its predictive and generalizable nature via downstream tasks in Sections 4 and 5, where it enables effective flow-matching predictions without extra supervision. This is further supported by the success of the robot learning applications. revision: no

  3. Referee: Abstract: No derivation, training objective, or architectural details are supplied for how RLA is extracted or why flow matching on RLA succeeds where direct regression on features fails, leaving the claimed advantages over existing feature-based world models unsubstantiated.

    Authors: Section 3.1 details the RLA extraction as the per-frame DINO residual. Section 3.3 presents the flow-matching training objective and architecture, while Section 4.3 provides ablations showing that flow matching avoids the blurry predictions of direct regression in complex interactions. We have updated the abstract to briefly note the flow-matching formulation and its advantages. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper defines RLA explicitly as residuals between consecutive DINO features (an external pretrained model), then trains a separate flow-matching network to predict those RLA values for future states. This separation means the predictiveness claim is an empirical outcome of the trained model rather than a definitional tautology or a fitted parameter renamed as a prediction. No self-citation is invoked as a uniqueness theorem or load-bearing premise; DINO and flow matching are cited as independent prior work. The central claims rest on experimental validation across datasets rather than reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review prevents identification of concrete free parameters or background axioms; the primary invented element is the RLA representation itself.

invented entities (1)
  • Residual Latent Action (RLA) no independent evidence
    purpose: Compact latent representation of actions extracted from visual feature residuals for efficient future prediction in world models
    Described as discovered from DINO residuals and used as the core predictive target; no external validation or independent evidence supplied in abstract



Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 14 internal anchors

  1. [1]

    Learning to model the world: A survey of world models in artificial intelligence

    Jiahua Dong, Qi Lyu, Baichen Liu, Xudong Wang, Wenqi Liang, Duzhen Zhang, Jiahang Tu, Hongliu Li, Hanbin Zhao, Henghui Ding, et al. Learning to model the world: A survey of world models in artificial intelligence. 2026

  2. [2]

    Learning interactive real-world simulators

    Mengjiao Y ang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In International Conference on Learning Representations (ICLR), 2024

  3. [3]

    Rlvr-world: Training world models with reinforcement learning.arXiv preprint arXiv:2505.13934,

    Jialong Wu, Shaofeng Yin, Ningya Feng, and Mingsheng Long. Rlvr-world: Training world models with reinforcement learning. arXiv preprint arXiv:2505.13934, 2025

  4. [4]

    Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

    Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Y ang Y e, Y ueqi Duan, and Jiang Bian. Ge- ometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982, 2025

  5. [5]

    Precise action-to-video generation through visual action prompts

    Y uang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, and Ruizhen Hu. Precise action-to-video generation through visual action prompts. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 12713–12724, 2025

  6. [6]

    Huang, J

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025

  7. [7]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Y u, Hangjie Y uan, Y uming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  8. [8]

    Video-bench: Human-aligned video generation benchmark

    Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Y uan, Y uling Wu, Y ufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Y ouhua Li, et al. Video-bench: Human-aligned video generation benchmark. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18858–18868, 2025

  9. [9]

    Open-sora 2.0: Training a commercial-level video generation model in $200k,

    Zangwei Zheng, Xiangyu Peng, Y uxuan Lou, Chenhui Shen, Tom Y oung, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, et al. Open-sora 2.0: Training a commercial- level video generation model in $200 k. arXiv preprint arXiv:2503.09642, 2025

  10. [10]

    Interactive world simulator for robot policy training and evaluation, 2026

    Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, and Y unzhu Li. Interactive world simulator for robot policy training and evaluation. arXiv preprint arXiv:2603.08546, 2026

  11. [11]

    Gigabrain-0: A world model-powered vision-language- action model.arXiv preprint arXiv:2510.19430, 2025

    GigaBrain Team, Angen Y e, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision- language-action model. arXiv preprint arXiv:2510.19430, 2025

  12. [12]

    Structured world models from human videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In Robotics: Science and Systems (RSS) , 2023

  13. [13]

    Gwm: Towards scalable gaussian world models for robotic manipulation

    Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Y ansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation. In Pro- ceedings of the IEEE/CVF International Conference on Computer Vision , pages 9263–9274, 2025

  14. [14]

    Video language planning

    Yilun Du, Mengjiao Y ang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Ser- manet, Tianhe Y u, Pieter Abbeel, Joshua B Tenenbaum, et al. Video language planning. In International Conference on Learning Representations (ICLR) , 2024

  15. [15]

    Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025

    Delong Chen, Theo Moutakanni, Willy Chung, Y ejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model. arXiv preprint arXiv:2509.02722, 2025

  16. [16]

    Dino-wm: World models on pre- trained visual features enable zero-shot planning

    Gaoyue Zhou, Hengkai Pan, Y ann LeCun, and Lerrel Pinto. Dino-wm: World models on pre- trained visual features enable zero-shot planning. In International Conference on Machine Learning (ICML), 2025. 10

  17. [17]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V -jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  18. [18]

    Using goal-driven deep learning models to understand sensory cortex

    Daniel LK Y amins and James J DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356–365, 2016

  19. [19]

    Simulation as an en- gine of physical scene understanding

    Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an en- gine of physical scene understanding. Proceedings of the national academy of sciences , 110(45):18327–18332, 2013

  20. [20]

    Sparse imagination for efficient visual world model planning

    Junha Chun, Y oungjoon Jeong, and Taesup Kim. Sparse imagination for efficient visual world model planning. arXiv preprint arXiv:2506.01392, 2025

  21. [21]

    Back to the features: Dino as a foundation for video world models.arXiv preprint arXiv:2507.19468,

    Federico Baldassarre, Marc Szafraniec, Basile Terver, V asil Khalidov, Francisco Massa, Y ann LeCun, Patrick Labatut, Maximilian Seitzer, and Piotr Bojanowski. Back to the features: Dino as a foundation for video world models. arXiv preprint arXiv:2507.19468, 2025

  22. [22]

    Rethinking diffusion model in high dimension

    Zhenxin Zheng and Zhenjie Zheng. Rethinking diffusion model in high dimension. arXiv preprint arXiv:2503.08643, 2025

  23. [23]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025

  24. [24]

    Adaworld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions. In International Conference on Machine Learning (ICML), 2025

  25. [25]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Y ao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

  26. [26]

    3d-vla: A 3d vision-language-action generative world model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Y ang, Xin Y an, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. In Interna- tional Conference on Machine Learning (ICML) , 2024

  27. [27]

    Manigaussian++: General robotic bi- manual manipulation with hierarchical gaussian world model

    Tengbo Y u, Guanxing Lu, Zaijia Y ang, Haoyuan Deng, Season Si Chen, Jiwen Lu, Wenbo Ding, Guoqiang Hu, Y ansong Tang, and Ziwei Wang. Manigaussian++: General robotic bi- manual manipulation with hierarchical gaussian world model. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 12232–12239. IEEE, 2025

  28. [28]

    ://arxiv.org/abs/2601.03782

    Wenlong Huang, Y u-Wei Chao, Arsalan Mousavian, Ming-Y u Liu, Dieter Fox, Kaichun Mo, and Li Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782, 2026

  29. [29]

    Huang, Q

    Suning Huang, Qianzhong Chen, Xiaohan Zhang, Jiankai Sun, and Mac Schwager. Particle- former: A 3d point cloud world model for multi-object, multi-material robotic manipulation. arXiv preprint arXiv:2506.23126, 2025

  30. [30]

    Gaf: Gaussian action field as a 4d representation for dynamic world modeling in robotic manipulation,

    Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, and Y ebin Liu. Gaf: Gaussian action field as a 4d representation for dynamic world modeling in robotic manipulation. arXiv preprint arXiv:2506.14135, 2025

  31. [31]

    Robotic world model: A neural network simulator for robust policy optimization in robotics.arXiv preprint arXiv:2501.10100, 2025

    Chenhao Li, Andreas Krause, and Marco Hutter. Robotic world model: A neural network simulator for robust policy optimization in robotics. arXiv preprint arXiv:2501.10100, 2025

  32. [32]

    Seo, Y ., Sferrazza, C., Chen, J., Shi, G., Duan, R., and Abbeel, P

    SV Jyothir, Siddhartha Jalagam, Y ann LeCun, and Vlad Sobal. Gradient-based planning with world models. arXiv preprint arXiv:2312.17227, pages 703–708, 2023

  33. [33]

    Td-mpc2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In International Conference on Learning Representations (ICLR) , 2024

  34. [34]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse do- mains through world models. arXiv preprint arXiv:2301.04104, 2023. 11

  35. [35]

    Training agents inside of scalable world models,

    Danijar Hafner, Wilson Y an, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025

  36. [36]

    Pwm: Policy learn- ing with multi-task world models

    Ignat Georgiev, V arun Giridhar, Nicklas Hansen, and Animesh Garg. Pwm: Policy learn- ing with multi-task world models. In International Conference on Learning Representations (ICLR), 2025

  37. [37]

    arXiv preprint arXiv:2603.14482 (2026)

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rab- bat, Y ann LeCun, Nicolas Ballas, and Adrien Bardes. V -jepa 2.1: Unlocking dense features in video self-supervised learning. arXiv preprint arXiv:2603.14482, 2026

  38. [38]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, V asil Khali- dov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  39. [39]

    Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, V asil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  40. [40]

    What do latent action models actually learn?arXiv preprint arXiv:2506.15691, 2025

    Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, and Jiang Bian. What do latent action models actually learn? arXiv preprint arXiv:2506.15691, 2025

  41. [41]

    Flare: Robot learning with implicit world modeling, 2025

    Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Y u Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. Flare: Robot learning with implicit world modeling. arXiv preprint arXiv:2505.15659, 2025

  42. [42]

    Latent action pretraining from videos

    Seonghyeon Y e, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Y ang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Y u-Wei Chao, Bill Y uchen Lin, et al. Latent action pretraining from videos. In International Conference on Learning Representations (ICLR) , 2025

  43. [43]

    Latent action pretraining through world modeling.arXiv preprint arXiv:2509.18428, 2025

    Bahey Tharwat, Y ara Nasser, Ali Abouzeid, and Ian Reid. Latent action pretraining through world modeling. arXiv preprint arXiv:2509.18428, 2025

  44. [44]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Y anting Y ang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Y ao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025

  45. [45]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Y am Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. In F orty-first international conference on machine learning, 2024

  46. [46]

    Diffusion schrödinger bridge matching

    Y uyang Shi, V alentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion schrödinger bridge matching. Advances in neural information processing systems , 36:62183– 62223, 2023

  47. [47]

    Schrodinger bridge flow for unpaired data translation

    V alentin De Bortoli, Iryna Korshunova, Andriy Mnih, and Arnaud Doucet. Schrodinger bridge flow for unpaired data translation. Advances in Neural Information Processing Systems, 37:103384–103441, 2024

  48. [48]

    Crom: Continuous reduced-order modeling of pdes using implicit neural representations,

    Peter Yichen Chen, Jinxu Xiang, Dong Heon Cho, Y ue Chang, GA Pershing, Henrique Teles Maia, Maurizio M Chiaramonte, Kevin Carlberg, and Eitan Grinspun. Crom: Continu- ous reduced-order modeling of pdes using implicit neural representations. arXiv preprint arXiv:2206.02607, 2022

  49. [49]

    Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations

    Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Cathera Y ang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. In Thirty-fifth Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track (Round 2)

  50. [50]

    Flow matching for generative modeling

    Y aron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations. 12

  51. [51]

    The unrea- sonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 586–595, 2018

  52. [52]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600– 612, 2004

  53. [53]

    Dynamicrafter: Animating open-domain images with video diffusion priors

    Jinbo Xing, Menghan Xia, Y ong Zhang, Haoxin Chen, Wangbo Y u, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: Animating open-domain images with video diffusion priors. In European Conference on Computer Vision , pages 399–

  54. [54]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  55. [55]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Y uan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026

  56. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  57. Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.

  58. David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018.

  59. Jacob Berg, Chuning Zhu, Yanda Bao, Ishan Durugkar, and Abhishek Gupta. Semantic world models. arXiv preprint arXiv:2510.19818, 2025.

  60. Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022.

  61. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.

  62. Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  63. Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, and Gerhard Neumann. TOP-ERL: Transformer-based off-policy episodic reinforcement learning.

  64. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  65. Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

  66. Dylan J Foster, Adam Block, and Dipendra Misra. Is behavior cloning all you need? Understanding horizon in imitation learning. Advances in Neural Information Processing Systems, 37:120602–120666, 2024.

  67. Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

  68. Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

  69. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations.

  70. Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21469–21480, 2025.

  71. Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

  72. John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

A Appendix

A.1 Code

The source code of our work is included in the supplementary folder Code. Please see Code/README.md for instructions on installation, dat...

Background and random motion. Our RLA is learned from residuals between pairs of DINO tokens (st, st+h). However, task-irrelevant background motion or workspace randomness (e.g., with a humanoid robot or an eye-in-hand camera) can also cause visual changes between st and st+h. Learning to encode those randomness-driven motions could waste representation capacity...
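The limitation above turns on how RLA is formed: a latent action encoded from the residual between a pair of visual features, so any feature change, task-relevant or not, feeds the encoder. A minimal sketch of that residual-encoding step (the feature size, latent size, and fixed linear-plus-tanh encoder are illustrative assumptions, not the paper's learned architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: D is the visual feature dimension, K the latent action size.
D, K = 384, 16

def residual_latent_action(f_t, f_th, W):
    """Toy RLA-style encoder: the latent action is an encoding of the
    residual between consecutive features. The paper learns this encoder;
    a fixed random projection is used here purely for illustration."""
    residual = f_th - f_t         # feature-space change between frames t and t+h
    return np.tanh(W @ residual)  # bounded latent action z

f_t = rng.standard_normal(D)                 # stand-in for a pooled DINO feature
f_th = f_t + 0.1 * rng.standard_normal(D)    # slightly perturbed next feature
W = rng.standard_normal((K, D)) / np.sqrt(D)

z = residual_latent_action(f_t, f_th, W)
print(z.shape)  # (16,)

# Identical frames give a zero residual, hence a zero latent action --
# which is why background motion between frames also leaks into z.
assert np.allclose(residual_latent_action(f_t, f_t, W), 0.0)
```

Because the residual is taken over the whole feature map, background or camera motion changes the residual just as object motion does, which is exactly the capacity concern the paragraph raises.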

Memory and partial observability. Our RLA-WM predicts st+h from st and at:t+h, yet changes may depend on s<t due to occlusion (e.g., an object disappears and reappears). Because the RLA z is learned from a single frame pair, it must memorize the object in the latent space rather than encoding true movement and occlusion events. Extending RLA to condition on m...

Proprioceptive world model. Our RLA-WM predicts only visual state evolution via RLA, not future proprioceptive states. Proprioceptive input has been shown to be useful for policy learning; extending the world model to predict both would broaden its applicability.

Scaling to larger datasets. We deliberately evaluated on the small-scale ManiSkill and IWS datasets to isolate method-driven gains from mere data scaling; many prior works scale first, leaving it unclear whether improvements come from data volume or from the method itself. Our clear, reproducible results on small data demonstrate the core properties of RLA and RL...