arxiv: 2404.12377 · v1 · submitted 2024-04-18 · 💻 cs.RO

Recognition: 2 theorem links

· Lean Theorem

RoboDreamer: Learning Compositional World Models for Robot Imagination

Siyuan Zhou , Yilun Du , Jiaben Chen , Yandong Li , Dit-Yan Yeung , Chuang Gan

Authors on Pith no claims yet

Pith reviewed 2026-05-15 20:43 UTC · model grok-4.3

classification 💻 cs.RO

keywords compositional world modelstext-to-video generationrobot planninggeneralizationlanguage parsingmultimodal conditioningvideo synthesis

0 comments

The pith

RoboDreamer factorizes video generation using language primitives to create plans for unseen robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoboDreamer to overcome the generalization limits of text-to-video models in robotics. These models struggle with instructions that combine objects and actions in new ways not seen during training. By parsing instructions into lower-level primitives and using separate conditioned models to generate video components, the approach allows composing videos for novel combinations. This factorization supports both language-only and multimodal goal specifications. A sympathetic reader would care because it promises more flexible imagination for solving new tasks in new environments without retraining on every possibility.

Core claim

RoboDreamer learns a compositional world model by factorizing the video generation process. It leverages the natural compositionality of language to parse instructions into a set of lower-level primitives, which condition a set of models to generate videos. This enables compositional generalization to new natural language instructions formed as combinations of previously seen components, as well as the incorporation of additional multimodal goals such as goal images.

What carries the argument

Factorization of video generation into models each conditioned on a lower-level language primitive parsed from the full instruction.

If this is right

Successfully synthesizes video plans on unseen goals in the RT-X dataset.
Enables successful robot execution of the plans in simulation.
Substantially outperforms monolithic baseline approaches to video generation.
Allows specification of desired videos using both natural language and goal images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could scale to real physical robots if the simulated executions transfer well.
Similar factorization might apply to other generative tasks like action sequence prediction.
It suggests that explicit decomposition of language can mitigate the combinatorial explosion in training data needs for world models.

Load-bearing premise

Natural language instructions can be reliably parsed into lower-level primitives whose separate models compose into coherent, realistic videos without introducing artifacts or losing task-relevant details.

What would settle it

Running the generated videos for an unseen combination of primitives through a robot simulator and checking whether the robot completes the intended task without errors or incoherent motions.

read the original abstract

Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate environment simulation. However, one major issue in such models is generalization -- models are limited to synthesizing videos subject to language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a powerful world model to synthesize plans of unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this issue, we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing the video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, which we condition a set of models on to generate videos. We illustrate how this factorization naturally enables compositional generalization, by allowing us to formulate a new natural language instruction as a combination of previously seen components. We further show how such a factorization enables us to add additional multimodal goals, allowing us to specify a video we wish to generate given both natural language instructions and a goal image. Our approach can successfully synthesize video plans on unseen goals in the RT-X, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces RoboDreamer, a compositional world model for robotic imagination that factorizes text-to-video generation by parsing natural language instructions into lower-level primitives and conditioning separate models on them. This factorization is claimed to enable generalization to unseen object-action combinations in the RT-X dataset, support successful robot execution in simulation, and allow multimodal conditioning via language plus goal images, while substantially outperforming monolithic video generation baselines.

Significance. If the compositional factorization demonstrably yields artifact-free videos and reliable execution on novel compositions, the work could advance world models for robotics by providing a scalable path to generalization without exhaustive retraining on all task variants. The multimodal extension further supports flexible goal specification in planning pipelines.

major comments (2)

[Abstract] Abstract: The central claims of successful synthesis on unseen RT-X goals, successful simulation execution, and substantial outperformance over monolithic baselines are stated without any quantitative metrics (e.g., FVD, success rates, artifact counts), baseline names, error bars, or ablation results. This absence directly limits assessment of whether the primitive factorization satisfies the conditions for coherent composition.
[Abstract] The description of language parsing into primitives and subsequent model composition does not specify mechanisms for handling parser ambiguity, cross-primitive alignment on shared scene elements, or quantitative checks for lost task details on held-out combinations. These omissions are load-bearing for the generalization argument, as any misalignment would falsify the outperformance and execution claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The central claims of successful synthesis on unseen RT-X goals, successful simulation execution, and substantial outperformance over monolithic baselines are stated without any quantitative metrics (e.g., FVD, success rates, artifact counts), baseline names, error bars, or ablation results. This absence directly limits assessment of whether the primitive factorization satisfies the conditions for coherent composition.

Authors: We agree that the abstract would benefit from key quantitative indicators. The full manuscript reports FVD scores, execution success rates with error bars, specific baseline names, and ablation results in Sections 4 and 5. In revision we will incorporate the most salient metrics (FVD improvement and success rate on unseen compositions) directly into the abstract while preserving its length. revision: yes
Referee: [Abstract] The description of language parsing into primitives and subsequent model composition does not specify mechanisms for handling parser ambiguity, cross-primitive alignment on shared scene elements, or quantitative checks for lost task details on held-out combinations. These omissions are load-bearing for the generalization argument, as any misalignment would falsify the outperformance and execution claims.

Authors: The abstract is intentionally concise. The manuscript details the LLM-based parser with ambiguity handling via confidence thresholding and fallback sampling (Section 3.1), cross-primitive alignment through shared visual latents and consistency losses (Section 3.2), and quantitative verification via held-out ablations and task-fidelity metrics (Section 4.3). We will add a single sentence to the abstract summarizing these mechanisms and referencing the supporting results. revision: yes

Circularity Check

0 steps flagged

No significant circularity in RoboDreamer's empirical compositional factorization

full rationale

The paper presents RoboDreamer as an empirical method that parses language instructions into lower-level primitives and trains separate conditional models whose outputs are composed for video generation. No equations, fitted parameters, or self-citations are shown to reduce the claimed compositional generalization or video synthesis results to inputs by construction. Claims of success on unseen RT-X goals and outperformance over monolithic baselines rest on experimental outcomes rather than tautological redefinitions or load-bearing self-references. The approach is self-contained against external benchmarks with no detected self-definitional, prediction-renaming, or uniqueness-imported circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on the assumption that language has natural compositionality that can be parsed into reusable primitives whose models compose effectively; no free parameters or invented entities are specified in the abstract.

axioms (1)

domain assumption Language instructions can be parsed into a set of lower-level primitives that capture compositionality.
Invoked to justify factorization and generalization to unseen combinations.

pith-pipeline@v0.9.0 · 5534 in / 993 out tokens · 22100 ms · 2026-05-15T20:43:49.453064+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LogicAsFunctionalEquation RCL_is_unique_functional_form_of_logic echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing the video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, which we condition a set of models on to generate videos.
IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

This factorization naturally enables compositional generalization, by allowing us to formulate a new natural language instruction as a combination of previously seen components.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
cs.RO 2026-05 unverdicted novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
cs.RO 2026-04 unverdicted novelty 7.0

ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
cs.RO 2026-04 unverdicted novelty 7.0

ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
cs.RO 2026-04 unverdicted novelty 7.0

ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
cs.LG 2026-02 unverdicted novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
cs.RO 2025-05 unverdicted novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
cs.RO 2026-05 unverdicted novelty 6.0

HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
cs.RO 2026-04 unverdicted novelty 6.0

CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...
Grounded World Model for Semantically Generalizable Planning
cs.RO 2026-04 conditional novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
cs.CV 2026-03 unverdicted novelty 6.0

Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
cs.RO 2026-04 unverdicted novelty 5.0

StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
cs.RO 2026-04 unverdicted novelty 5.0

Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
Causal World Modeling for Robot Control
cs.CV 2026-01 unverdicted novelty 5.0

LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 20 Pith papers · 11 internal anchors

[7]

Unsupervised learning of compositional energy concepts

Du, Y., Li, S., Sharma, Y., Tenenbaum, J., and Mordatch, I. Unsupervised learning of compositional energy concepts. Advances in Neural Information Processing Systems, 34: 0 15608--15620, 2021

work page 2021
[8]

B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W

Du, Y., Durkan, C., Strudel, R., Tenenbaum, J. B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W. S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International Conference on Machine Learning, pp.\ 8489--8510. PMLR, 2023 a

work page 2023
[12]

G., Tapaswi, M., Laptev, I., and Schmid, C

Guhur, P.-L., Chen, S., Pinel, R. G., Tapaswi, M., Laptev, I., and Schmid, C. Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, pp.\ 175--187. PMLR, 2023

work page 2023
[15]

Diffusion-based generation, optimization, and planning in 3d scenes

Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W., and Zhu, S.-C. Diffusion-based generation, optimization, and planning in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16750--16761, 2023

work page 2023
[17]

R., and Davison, A

James, S., Ma, Z., Arrojo, D. R., and Davison, A. J. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5 0 (2): 0 3019--3026, 2020

work page 2020
[20]

Ko, P.-C., Mao, J., Du, Y., Sun, S.-H., and Tenenbaum, J. B. Learning to act from actionless videos through dense correspondences, 2023

work page 2023
[21]

Adaptdiffuser: Diffusion models as adaptive self-evolving planners

Liang, Z., Mu, Y., Ding, M., Ni, F., Tomizuka, M., and Luo, P. Adaptdiffuser: Diffusion models as adaptive self-evolving planners. In International Conference on Machine Learning, 2023

work page 2023
[23]

Learning to compose visual relations

Liu, N., Li, S., Du, Y., Tenenbaum, J., and Torralba, A. Learning to compose visual relations. Advances in Neural Information Processing Systems, 34: 0 23166--23178, 2021

work page 2021
[24]

Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp.\ 423--439. Springer, 2022

work page 2022
[25]

Structdiffusion: Language-guided creation of physically-valid structures using unseen objects

Liu, W., Du, Y., Hermans, T., Chernova, S., and Paxton, C. Structdiffusion: Language-guided creation of physically-valid structures using unseen objects. In RSS 2023, 2023 b

work page 2023
[26]

Controllable and compositional generation with latent-space energy-based models

Nie, W., Vahdat, A., and Anandkumar, A. Controllable and compositional generation with latent-space energy-based models. Advances in Neural Information Processing Systems, 34: 0 13497--13510, 2021

work page 2021
[28]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21 0 (1): 0 5485--5551, 2020

work page 2020
[29]

High-resolution image synthesis with latent diffusion models

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022

work page 2022
[30]

Shi, C., Ni, H., Li, K., Han, S., Liang, M., and Min, M. R. Exploring compositional visual generation with latent classifier guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 853--862, 2023

work page 2023
[34]

Modular action concept grounding in semantic video prediction

Yu, W., Chen, W., Yin, S., Easterbrook, S., and Garg, A. Modular action concept grounding in semantic video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3605--3614, 2022

work page 2022
[35]

Y., and Zhang, A

Zhang, E., Lu, Y., Wang, W. Y., and Zhang, A. Lad: Language augmented diffusion for reinforcement learning. In Second Workshop on Language and Reinforcement Learning, 2022

work page 2022
[36]

Adding conditional control to text-to-image diffusion models

Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 3836--3847, 2023

work page 2023
[38]

arXiv preprint arXiv:2302.00111 , year=

Learning universal policies via text-guided video generation , author=. arXiv preprint arXiv:2302.00111 , year=

work page arXiv
[39]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[40]

2023 , booktitle =

StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects , author =. 2023 , booktitle =

work page 2023
[41]

International Conference on Machine Learning , year=

AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners , author=. International Conference on Machine Learning , year=

work page
[42]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Diffusion-based generation, optimization, and planning in 3d scenes , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[43]

Second Workshop on Language and Reinforcement Learning , year=

Lad: Language augmented diffusion for reinforcement learning , author=. Second Workshop on Language and Reinforcement Learning , year=

work page
[44]

Imitating human behaviour with diffusion models

Imitating human behaviour with diffusion models , author=. arXiv preprint arXiv:2301.10677 , year=

work page arXiv
[45]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Diffusion policy: Visuomotor policy learning via action diffusion , author=. arXiv preprint arXiv:2303.04137 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Advances in Neural Information Processing Systems , volume=

Controllable and compositional generation with latent-space energy-based models , author=. Advances in Neural Information Processing Systems , volume=

work page
[47]

Advances in Neural Information Processing Systems , volume=

Unsupervised learning of compositional energy concepts , author=. Advances in Neural Information Processing Systems , volume=

work page
[48]

Advances in Neural Information Processing Systems , volume=

Learning to compose visual relations , author=. Advances in Neural Information Processing Systems , volume=

work page
[49]

arXiv preprint arXiv:2304.14391 , year=

Energy-based models as zero-shot planners for compositional scene rearrangement , author=. arXiv preprint arXiv:2304.14391 , year=

work page arXiv
[50]

International Conference on Machine Learning , pages=

Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[51]

Composable Text Controls in Latent Space with ODE s

Liu, Guangyi and Feng, Zeyu and Gao, Yuan and Yang, Zichao and Liang, Xiaodan and Bao, Junwei and He, Xiaodong and Cui, Shuguang and Li, Zhen and Hu, Zhiting. Composable Text Controls in Latent Space with ODE s. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.1030

work page doi:10.18653/v1/2023.emnlp-main.1030 2023
[52]

Planning with Diffusion for Flexible Behavior Synthesis

Planning with diffusion for flexible behavior synthesis , author=. arXiv preprint arXiv:2205.09991 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

Is Conditional Generative Modeling all you need for Decision-Making?

Is conditional generative modeling all you need for decision-making? , author=. arXiv preprint arXiv:2211.15657 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[54]

European Conference on Computer Vision , pages=

Compositional visual generation with composable diffusion models , author=. European Conference on Computer Vision , pages=. 2022 , organization=

work page 2022
[55]

The Journal of Machine Learning Research , volume=

Exploring the limits of transfer learning with a unified text-to-text transformer , author=. The Journal of Machine Learning Research , volume=. 2020 , publisher=

work page 2020
[56]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[57]

RT-1: Robotics Transformer for Real-World Control at Scale

Rt-1: Robotics transformer for real-world control at scale , author=. arXiv preprint arXiv:2212.06817 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

Bridge data: Boosting generalization of robotic skills with cross-domain datasets , author=. arXiv preprint arXiv:2109.13396 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

IEEE Robotics and Automation Letters , year=

Interactive language: Talking to robots in real time , author=. IEEE Robotics and Automation Letters , year=

work page
[60]

, title =

Dass, Shivin and Yapeter, Jullian and Zhang, Jesse and Zhang, Jiahui and Pertsch, Karl and Nikolaidis, Stefanos and Lim, Joseph J. , title =

work page
[61]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open x-embodiment: Robotic learning datasets and rt-x models , author=. arXiv preprint arXiv:2310.08864 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

Berkeley

Lawrence Yunliang Chen and Simeon Adebola and Ken Goldberg , howpublished =. Berkeley

work page
[63]

Conference on Robot Learning , pages=

Bc-z: Zero-shot task generalization with robotic imitation learning , author=. Conference on Robot Learning , pages=. 2022 , organization=

work page 2022
[64]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[65]

arXiv preprint arXiv:2309.08587 , year=

Compositional Foundation Models for Hierarchical Planning , author=. arXiv preprint arXiv:2309.08587 , year=

work page arXiv
[66]

arXiv preprint arXiv:2310.06114 , year=

Learning Interactive Real-World Simulators , author=. arXiv preprint arXiv:2310.06114 , year=

work page arXiv
[67]

European Conference on Computer Vision (ECCV) , year=

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , author=. European Conference on Computer Vision (ECCV) , year=

work page
[68]

Nature , volume=

The Principles of Psychology , author=. Nature , volume=. 1890 , url=

work page
[69]

Multilingual Constituency Parsing with Self-Attention and Pre-Training

Multilingual constituency parsing with self-attention and pre-training , author=. arXiv preprint arXiv:1812.11760 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[70]

IEEE Robotics and Automation Letters , volume=

Rlbench: The robot learning benchmark & learning environment , author=. IEEE Robotics and Automation Letters , volume=. 2020 , publisher=

work page 2020
[71]

Imagen Video: High Definition Video Generation with Diffusion Models

Imagen video: High definition video generation with diffusion models , author=. arXiv preprint arXiv:2210.02303 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[72]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Make-a-video: Text-to-video generation without text-video data , author=. arXiv preprint arXiv:2209.14792 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[73]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Ego4d: Around the world in 3,000 hours of egocentric video , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[74]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Grounding dino: Marrying dino with grounded pre-training for open-set object detection , author=. arXiv preprint arXiv:2303.05499 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[75]

Perceiver IO: A General Architecture for Structured Inputs & Outputs

Perceiver io: A general architecture for structured inputs & outputs , author=. arXiv preprint arXiv:2107.14795 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[76]

Conference on Robot Learning , pages=

Instruction-driven history-aware policies for robotic manipulations , author=. Conference on Robot Learning , pages=. 2023 , organization=

work page 2023
[77]

2023 , eprint=

Learning to Act from Actionless Videos through Dense Correspondences , author=. 2023 , eprint=

work page 2023
[78]

arXiv preprint arXiv:2310.09629 , year=

Adaptive Online Replanning with Diffusion Models , author=. arXiv preprint arXiv:2310.09629 , year=

work page arXiv
[79]

arXiv preprint arXiv:2306.01872 , year=

Probabilistic Adaptation of Text-to-Video Models , author=. arXiv preprint arXiv:2306.01872 , year=

work page arXiv
[80]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Exploring Compositional Visual Generation with Latent Classifier Guidance , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[81]

arXiv preprint arXiv:2004.11714 , year=

Residual energy-based models for text generation , author=. arXiv preprint arXiv:2004.11714 , year=

work page arXiv 2004
[82]

arXiv preprint arXiv:2401.01952 , year=

Instruct-Imagen: Image Generation with Multi-modal Instruction , author=. arXiv preprint arXiv:2401.01952 , year=

work page arXiv
[83]

arXiv preprint arXiv:2311.10709 , year=

Emu video: Factorizing text-to-video generation by explicit image conditioning , author=. arXiv preprint arXiv:2311.10709 , year=

work page arXiv
[84]

arXiv preprint arXiv:2006.15327 , year=

Compositional video synthesis with action graphs , author=. arXiv preprint arXiv:2006.15327 , year=

work page arXiv 2006
[85]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Modular action concept grounding in semantic video prediction , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page