Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

Haoxiang You; Mian Wu; Mingqi Yuan; Qi Wang; Wenjun Zeng; Wenyao Zhang; Xiaokang Yang; Xin Jin; Yunbo Wang; Yuyang Zhang

arxiv: 2512.00961 · v2 · submitted 2025-11-30 · 💻 cs.LG

Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

Qi Wang , Mian Wu , Yuyang Zhang , Mingqi Yuan , Wenyao Zhang , Haoxiang You , Yunbo Wang , Xin Jin

show 2 more authors

Xiaokang Yang Wenjun Zeng

This is my paper

Pith reviewed 2026-05-17 02:48 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learningvideo diffusion modelsgoal-conditioned rewardsreward designMeta-Worldforward-backward representationslatent alignment

0 comments

The pith

Pretrained video diffusion models can generate goal-driven rewards for reinforcement learning agents without hand-crafted functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that video diffusion models pretrained on large-scale data already encode enough world knowledge to act as reward providers in RL. Instead of designing programmatic rewards for each task, the approach finetunes a diffusion model on domain data and uses its latent space to score how well an agent's trajectory aligns with generated goal videos. At a finer scale, it extracts a single target frame via CLIP and trains a forward-backward model to estimate the probability of reaching that frame from any state-action pair. Experiments on Meta-World and Distracting Control Suite show agents can learn coherent behaviors using only these signals. A sympathetic reader would care because this removes a major bottleneck in applying RL to new domains where reward engineering is difficult or impossible.

Core claim

By finetuning an off-the-shelf video diffusion model on domain-specific data and then measuring alignment between agent trajectories and the model's generated goal videos in latent space, the method supplies a video-level reward. For frame-level guidance it identifies the most relevant frame from the generated video with CLIP, treats that frame as the goal state, and uses a learned forward-backward representation to compute the probability that a given state-action pair will lead to the goal; this probability serves as the immediate reward. The resulting signals drive RL agents to produce goal-directed behavior on Meta-World and Distracting Control Suite tasks without any task-specific hand-

What carries the argument

Video-level alignment score from the finetuned diffusion model's encoder together with the frame-level forward-backward probability of reaching a CLIP-selected goal frame.

If this is right

RL agents can acquire skills in new visual domains using only example goal videos rather than reward code.
The same pretrained diffusion model can supply rewards across multiple related tasks after one domain-specific finetuning step.
Frame-level forward-backward rewards encourage temporally coherent trajectories that match the structure of real video sequences.
Reward design effort shifts from writing scalar functions to curating small sets of goal videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested in robotics settings where only a short video of the desired outcome is available instead of a simulator reward.
If the diffusion model is kept frozen after finetuning, the approach might scale to very large numbers of tasks without retraining the reward model each time.
Combining the video reward with a small amount of human preference data could further reduce any residual misalignment between the diffusion prior and the actual task.

Load-bearing premise

The alignment scores and forward-backward probabilities truly measure progress toward the intended goal and do not contain hidden biases that could mislead the agent.

What would settle it

Train an agent with these rewards on a held-out Meta-World task and measure whether success rate remains near zero even after many episodes while a hand-designed reward succeeds.

Figures

Figures reproduced from arXiv: 2512.00961 by Haoxiang You, Mian Wu, Mingqi Yuan, Qi Wang, Wenjun Zeng, Wenyao Zhang, Xiaokang Yang, Xin Jin, Yunbo Wang, Yuyang Zhang.

**Figure 2.** Figure 2: Pipeline of GenReward, which computes goal-driven rewards for behavior learning of the agent using generative prior. During [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Goal-driven action selection. Learned representation [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Illustration of experimental setups in our experiments [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Performance on Meta-World complex manipulation tasks in terms of episode return under dense reward setting. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Policy evaluation on the Meta-World Bin Picking task. TADPoLe fails to contact the puck, while Diffusion Reward moves the grasped puck away from the target position. In contrast, GenReward enables the policy to complete the grasp in fewer steps and outperforms both Dense Reward and RoboCLIP. 4.2. Main Comparison We evaluate the task performance in terms of episode return [PITH_FULL_IMAGE:figures/full_fig… view at source ↗

**Figure 7.** Figure 7: Performance on Meta-World Bin Picking under sparse reward setting [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 8.** Figure 8: Showcase of selecting the goal image from the video generated with the prompt [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

**Figure 9.** Figure 9: These figures display the ablation studies and sensitivity analyses of GenReward on Meta-World [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗

**Figure 10.** Figure 10: Performance of GenReward on Meta-World Pick Place with different generated videos. a decline in performance. FB reward weight β controls the frame-level goal scale. Intuitively, setting β too low may result in the agent not getting enough world knowledge from the video diffusion models. Conversely, an excessively high β may cause the agent to overfit to the generated frame-level goals, struggling to expl… view at source ↗

read the original abstract

Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging and may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc design of reward. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of agent's trajectories and the generated goal videos. To enable more fine-grained goal-achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward-backward representation that represents the probability of visiting the goal state from a given state-action pair as frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on Meta-World and Distracting Control Suite demonstrate the effectiveness of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper combines finetuned video diffusion alignment with CLIP frame selection and a forward-backward model to generate RL rewards from video, but the abstract supplies no numbers so it is still unclear whether the latents track actual goal progress.

read the letter

The main takeaway is that this work tries to automate goal-driven rewards in RL by finetuning a video diffusion model on domain data, scoring trajectory alignment in its latent space, picking a goal frame with CLIP, and using a learned forward-backward probability for frame-level signals. That specific stack has not shown up in the RL papers they reference, so the integration itself counts as new. The approach is sensible in principle because large video models already encode a lot about physical outcomes, and pulling rewards from them could reduce the usual manual shaping work on tasks like robotic manipulation. The authors also keep the method modular, which makes it easier to test or swap pieces. The soft spot is the missing evidence. The abstract claims good results on Meta-World and the Distracting Control Suite yet reports none of the numbers, ablations on the finetuning step, or direct comparisons to hand-designed rewards. Without those, it is hard to tell whether the alignment scores reflect functional success or just visual resemblance. The stress-test concern lands here: diffusion models are trained to denoise and generate, not to rank goal achievement, so distractors or partial views could easily produce misleading high scores. If the full paper includes controls that separate appearance from outcome, that would address it; otherwise the central claim stays fragile. This is for people working on reward design or video-based RL who want to try pretrained generative models instead of coding rewards from scratch. A reader already running Meta-World experiments would find the pipeline concrete enough to reproduce and check. It deserves peer review because the components are clearly described and the idea is testable, even if the current write-up needs the quantitative sections filled in before anyone can judge the strength of the results.

Referee Report

2 major / 2 minor

Summary. The paper proposes using pretrained video diffusion models to generate goal-driven reward signals for RL agents without manual reward design. It finetunes a video diffusion model on domain-specific data and uses the video encoder to compute alignment scores between agent trajectories and generated goal videos for video-level rewards. For frame-level rewards, CLIP selects the most relevant frame from the goal video as the target state, after which a separately learned forward-backward model supplies the probability of reaching that state from a given state-action pair. The method is evaluated on the Meta-World and Distracting Control Suite benchmarks.

Significance. If the empirical results hold, the work offers a promising route to automate reward specification by transferring world knowledge from large-scale generative video models, which could reduce engineering effort and improve generalization in visual RL tasks.

major comments (2)

[§4.1] §4.1 (video-level reward): the claim that latent alignment between agent trajectories and generated goal videos supplies a reliable goal-progress signal is not supported by any reported correlation analysis or ablation that isolates alignment from superficial visual similarity; in environments with distractors this risks rewarding appearance rather than functional success.
[§3.2] §3.2 (frame-level reward): the forward-backward probability is introduced as an accurate estimator of goal-reaching likelihood, yet the manuscript provides no verification that the learned representation remains well-calibrated after domain-specific finetuning of the diffusion model, leaving open the possibility that the frame-level term introduces task-specific bias rather than pure goal progress.

minor comments (2)

[Abstract] The abstract states that experiments demonstrate effectiveness but supplies no numerical results, baseline comparisons, or statistical significance; adding a concise results table in the abstract or introduction would strengthen the presentation.
[§3] Notation for the alignment score and the forward-backward probability should be introduced with explicit equations rather than prose descriptions to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and outline revisions that will be incorporated to strengthen the empirical support for our claims.

read point-by-point responses

Referee: [§4.1] §4.1 (video-level reward): the claim that latent alignment between agent trajectories and generated goal videos supplies a reliable goal-progress signal is not supported by any reported correlation analysis or ablation that isolates alignment from superficial visual similarity; in environments with distractors this risks rewarding appearance rather than functional success.

Authors: We appreciate the referee's concern regarding the interpretability of the video-level reward. Our experiments on the Distracting Control Suite already evaluate performance under visual distractors, providing indirect evidence that the method does not rely solely on superficial appearance. However, we agree that a direct correlation analysis between alignment scores and functional success, together with an ablation isolating latent alignment from pixel-level similarity, is currently absent. We will add these analyses and corresponding figures in the revised manuscript to better substantiate the claim. revision: yes
Referee: [§3.2] §3.2 (frame-level reward): the forward-backward probability is introduced as an accurate estimator of goal-reaching likelihood, yet the manuscript provides no verification that the learned representation remains well-calibrated after domain-specific finetuning of the diffusion model, leaving open the possibility that the frame-level term introduces task-specific bias rather than pure goal progress.

Authors: We thank the referee for pointing out this potential issue. The forward-backward model is trained independently on domain data to estimate reachability probabilities, while the diffusion model is finetuned primarily to improve goal video generation for frame selection via CLIP. We acknowledge that no explicit calibration diagnostics (e.g., reliability diagrams or bias checks) are reported for the frame-level term after finetuning. In the revision we will include such verification experiments to confirm that the frame-level rewards reflect goal progress rather than introducing unintended bias. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method uses external pretrained models with independent finetuning and representation learning steps

full rationale

The paper's core construction finetunes a video diffusion model on domain-specific data and learns a separate forward-backward representation to produce alignment-based rewards. These steps are presented as preprocessing that extracts signals from the pretrained model's world knowledge rather than defining the reward directly in terms of itself or fitting parameters to the exact target quantity being predicted. No equations reduce the final reward to a tautological fit or self-citation chain; the alignment and probability computations remain distinct from the RL policy optimization they support. The approach is therefore self-contained against external benchmarks such as Meta-World performance, with any biases arising from empirical assumptions rather than definitional equivalence.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that pretrained video diffusion models encode transferable world knowledge and on two learned components whose parameters are fitted to domain data.

free parameters (2)

Finetuned diffusion model weights
The pretrained video diffusion model is finetuned on domain-specific datasets, introducing task-dependent parameters.
Forward-backward representation parameters
A learned model that estimates goal-reaching probability is trained on the task, adding fitted parameters.

axioms (1)

domain assumption Pretrained video diffusion models contain rich world knowledge that can be repurposed as goal-alignment metrics.
Stated directly in the key idea of the abstract.

pith-pipeline@v0.9.0 · 5540 in / 1370 out tokens · 61808 ms · 2026-05-17T02:48:39.521068+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we employ its video encoder to evaluate the alignment between the latent representations of agent's trajectories and the generated goal videos... learned forward-backward representation that represents the probability of visiting the goal state
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Experiments on Meta-World and Distracting Control Suite

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kan- ervisto, Amos Storkey, Tim Pearce, and Franc ¸ois Fleuret. Diffusion for world modeling: Visual details matter in atari. InNeurIPS, 2024. 8

work page 2024
[2]

Video pretraining (vpt): Learning to act by watching unlabeled online videos

Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampe- dro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. InNeurIPS, 2022. 8

work page 2022
[3]

Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manip- ulation.CoRR, 2024

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manip- ulation.CoRR, 2024

work page 2024
[4]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image- editing diffusion models.arXiv preprint arXiv:2310.10639,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakr- ishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Univla: Learning to act anywhere with task-centric latent ac- tions

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent ac- tions. InRSS, 2025. 8

work page 2025
[7]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InNeurIPS, 2023. 8

work page 2023
[8]

Video prediction models as rewards for reinforcement learning

Alejandro Escontrela, Ademi Adeniji, Wilson Yan, Ajay Jain, Xue Bin Peng, Ken Goldberg, Youngwoon Lee, Dani- jar Hafner, and Pieter Abbeel. Video prediction models as rewards for reinforcement learning. InNeurIPS, 2023. 2, 8

work page 2023
[9]

Furl: Visual-language models as fuzzy rewards for reinforcement learning

Yuwei Fu, Haichao Zhang, Di Wu, Wei Xu, and Benoit Boulet. Furl: Visual-language models as fuzzy rewards for reinforcement learning. InICLR, 2024. 8

work page 2024
[10]

Adaworld: Learning adaptable world models with latent actions

Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions. InICML, 2025. 8

work page 2025
[11]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Dif- fusion reward: Learning rewards via conditional video diffu- sion

Tao Huang, Guangqi Jiang, Yanjie Ze, and Huazhe Xu. Dif- fusion reward: Learning rewards via conditional video diffu- sion. InECCV, pages 478–495, 2024. 1, 2, 5, 8

work page 2024
[13]

Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Let- ters, 5(2):3019–3026, 2020

Stephen James, Zicong Ma, David Rovick Arrojo, and An- drew J Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Let- ters, 5(2):3019–3026, 2020. 5

work page 2020
[14]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InNeurIPS, 2022. 8

work page 2022
[15]

Text- aware diffusion for policy learning

Calvin Luo, Mandy He, Zilai Zeng, and Chen Sun. Text- aware diffusion for policy learning. InNeurIPS, 2024. 1, 2, 5, 8

work page 2024
[16]

Grounding video models to ac- tions through goal conditioned exploration

Yunhao Luo and Yilun Du. Grounding video models to ac- tions through goal conditioned exploration. InICLR, 2025. 8

work page 2025
[17]

Liv: Language-image represen- tations and rewards for robotic control

Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bas- tani, and Dinesh Jayaraman. Liv: Language-image represen- tations and rewards for robotic control. InICML, 2023. 2, 8

work page 2023
[18]

R3M: A Universal Visual Representation for Robot Manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022. 8

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763. PMLR, 2021. 4

work page 2021
[20]

World models via policy-guided trajectory diffusion.TMLR, 2024

Marc Rigter, Jun Yamada, and Ingmar Posner. World models via policy-guided trajectory diffusion.TMLR, 2024. 8

work page 2024
[21]

Vision-language models are zero- shot reward models for reinforcement learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. Vision-language models are zero- shot reward models for reinforcement learning. InICLR,

work page
[22]

Reinforcement learning with action-free pre- training from videos

Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre- training from videos. InICML, 2022. 8

work page 2022
[23]

Masked world models for visual control

Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. InCoRL, 2023. 5

work page 2023
[24]

Multi-view masked world models for visual robotic manipulation

Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jin- woo Shin, and Pieter Abbeel. Multi-view masked world models for visual robotic manipulation. InICML, 2023. 8

work page 2023
[25]

DINOv3

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Roboclip: One demonstration is enough to learn robot poli- cies.NeurIPS, 36:55681–55693, 2023

Sumedh Sontakke, Jesse Zhang, S ´eb Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. Roboclip: One demonstration is enough to learn robot poli- cies.NeurIPS, 36:55681–55693, 2023. 1, 2, 5, 8

work page 2023
[27]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In CoRL, 2023. 5

work page 2023
[28]

This&that: Language-gesture controlled video generation for robot planning

Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, and Jeong Joon Park. This&that: Language-gesture controlled video generation for robot planning. InICRA, 2025. 8

work page 2025
[29]

Disentangled world models: Learning to transfer se- mantic knowledge from distracting videos for reinforcement learning

Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, and Wenjun Zeng. Disentangled world models: Learning to transfer se- mantic knowledge from distracting videos for reinforcement learning. InICCV, 2025

work page 2025
[30]

Pre-training contextualized world models with in-the-wild videos for reinforcement learning.NeurIPS, 2023

Jialong Wu, Haoyu Ma, Chaoyi Deng, and Mingsheng Long. Pre-training contextualized world models with in-the-wild videos for reinforcement learning.NeurIPS, 2023. 8

work page 2023
[31]

Video as the new language for real-world decision making.arXiv preprint arXiv:2402.17139, 2024

Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, and Dale Schu- urmans. Video as the new language for real-world decision making.arXiv preprint arXiv:2402.17139, 2024. 8

work page arXiv 2024
[32]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. InICLR, 2025. 2

work page 2025
[33]

Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning. InCoRL, 2019. 2, 5

work page 2019
[34]

Tesseract: learning 4d embodied world models

Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: learning 4d embodied world models. InICCV, 2025. 5 Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning Supplementary Material Figure A. Failure Case of CLIP-based frame selection in a gen- erated RT-1Pick Applevideo. The most relevant fram...

work page 2025

[1] [1]

Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kan- ervisto, Amos Storkey, Tim Pearce, and Franc ¸ois Fleuret. Diffusion for world modeling: Visual details matter in atari. InNeurIPS, 2024. 8

work page 2024

[2] [2]

Video pretraining (vpt): Learning to act by watching unlabeled online videos

Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampe- dro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. InNeurIPS, 2022. 8

work page 2022

[3] [3]

Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manip- ulation.CoRR, 2024

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manip- ulation.CoRR, 2024

work page 2024

[4] [4]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image- editing diffusion models.arXiv preprint arXiv:2310.10639,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakr- ishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. 5

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

Univla: Learning to act anywhere with task-centric latent ac- tions

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent ac- tions. InRSS, 2025. 8

work page 2025

[7] [7]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InNeurIPS, 2023. 8

work page 2023

[8] [8]

Video prediction models as rewards for reinforcement learning

Alejandro Escontrela, Ademi Adeniji, Wilson Yan, Ajay Jain, Xue Bin Peng, Ken Goldberg, Youngwoon Lee, Dani- jar Hafner, and Pieter Abbeel. Video prediction models as rewards for reinforcement learning. InNeurIPS, 2023. 2, 8

work page 2023

[9] [9]

Furl: Visual-language models as fuzzy rewards for reinforcement learning

Yuwei Fu, Haichao Zhang, Di Wu, Wei Xu, and Benoit Boulet. Furl: Visual-language models as fuzzy rewards for reinforcement learning. InICLR, 2024. 8

work page 2024

[10] [10]

Adaworld: Learning adaptable world models with latent actions

Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions. InICML, 2025. 8

work page 2025

[11] [11]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Dif- fusion reward: Learning rewards via conditional video diffu- sion

Tao Huang, Guangqi Jiang, Yanjie Ze, and Huazhe Xu. Dif- fusion reward: Learning rewards via conditional video diffu- sion. InECCV, pages 478–495, 2024. 1, 2, 5, 8

work page 2024

[13] [13]

Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Let- ters, 5(2):3019–3026, 2020

Stephen James, Zicong Ma, David Rovick Arrojo, and An- drew J Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Let- ters, 5(2):3019–3026, 2020. 5

work page 2020

[14] [14]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InNeurIPS, 2022. 8

work page 2022

[15] [15]

Text- aware diffusion for policy learning

Calvin Luo, Mandy He, Zilai Zeng, and Chen Sun. Text- aware diffusion for policy learning. InNeurIPS, 2024. 1, 2, 5, 8

work page 2024

[16] [16]

Grounding video models to ac- tions through goal conditioned exploration

Yunhao Luo and Yilun Du. Grounding video models to ac- tions through goal conditioned exploration. InICLR, 2025. 8

work page 2025

[17] [17]

Liv: Language-image represen- tations and rewards for robotic control

Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bas- tani, and Dinesh Jayaraman. Liv: Language-image represen- tations and rewards for robotic control. InICML, 2023. 2, 8

work page 2023

[18] [18]

R3M: A Universal Visual Representation for Robot Manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022. 8

work page internal anchor Pith review Pith/arXiv arXiv 2022

[19] [19]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763. PMLR, 2021. 4

work page 2021

[20] [20]

World models via policy-guided trajectory diffusion.TMLR, 2024

Marc Rigter, Jun Yamada, and Ingmar Posner. World models via policy-guided trajectory diffusion.TMLR, 2024. 8

work page 2024

[21] [21]

Vision-language models are zero- shot reward models for reinforcement learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. Vision-language models are zero- shot reward models for reinforcement learning. InICLR,

work page

[22] [22]

Reinforcement learning with action-free pre- training from videos

Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre- training from videos. InICML, 2022. 8

work page 2022

[23] [23]

Masked world models for visual control

Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. InCoRL, 2023. 5

work page 2023

[24] [24]

Multi-view masked world models for visual robotic manipulation

Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jin- woo Shin, and Pieter Abbeel. Multi-view masked world models for visual robotic manipulation. InICML, 2023. 8

work page 2023

[25] [25]

DINOv3

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Roboclip: One demonstration is enough to learn robot poli- cies.NeurIPS, 36:55681–55693, 2023

Sumedh Sontakke, Jesse Zhang, S ´eb Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. Roboclip: One demonstration is enough to learn robot poli- cies.NeurIPS, 36:55681–55693, 2023. 1, 2, 5, 8

work page 2023

[27] [27]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In CoRL, 2023. 5

work page 2023

[28] [28]

This&that: Language-gesture controlled video generation for robot planning

Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, and Jeong Joon Park. This&that: Language-gesture controlled video generation for robot planning. InICRA, 2025. 8

work page 2025

[29] [29]

Disentangled world models: Learning to transfer se- mantic knowledge from distracting videos for reinforcement learning

Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, and Wenjun Zeng. Disentangled world models: Learning to transfer se- mantic knowledge from distracting videos for reinforcement learning. InICCV, 2025

work page 2025

[30] [30]

Pre-training contextualized world models with in-the-wild videos for reinforcement learning.NeurIPS, 2023

Jialong Wu, Haoyu Ma, Chaoyi Deng, and Mingsheng Long. Pre-training contextualized world models with in-the-wild videos for reinforcement learning.NeurIPS, 2023. 8

work page 2023

[31] [31]

Video as the new language for real-world decision making.arXiv preprint arXiv:2402.17139, 2024

Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, and Dale Schu- urmans. Video as the new language for real-world decision making.arXiv preprint arXiv:2402.17139, 2024. 8

work page arXiv 2024

[32] [32]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. InICLR, 2025. 2

work page 2025

[33] [33]

Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning. InCoRL, 2019. 2, 5

work page 2019

[34] [34]

Tesseract: learning 4d embodied world models

Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: learning 4d embodied world models. InICCV, 2025. 5 Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning Supplementary Material Figure A. Failure Case of CLIP-based frame selection in a gen- erated RT-1Pick Applevideo. The most relevant fram...

work page 2025