pith. machine review for the scientific record.

arxiv: 2312.13139 · v2 · submitted 2023-12-20 · 💻 cs.RO · cs.CV

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Chilam Cheang, Guangzeng Chen, Hang Li, Hongtao Wu, Jiafeng Xu, Minghuan Liu, Tao Kong, Xinghang Li, Ya Jing

Pith reviewed 2026-05-13 16:25 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords visual robot manipulation · video generative pre-training · GPT-style transformer · multi-task control · language-conditioned robotics · CALVIN benchmark · zero-shot generalization · action prediction

The pith

A GPT-style transformer pre-trained on large-scale videos generalizes to multi-task language-conditioned robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that pre-training a simple GPT-style model on vast video datasets produces visual representations that transfer to controlling robots across multiple tasks given language instructions. The model processes a language command along with past images and robot states to output both actions and predicted future images. After fine-tuning on robot data, it raises success rates on the CALVIN benchmark and improves zero-shot performance in unseen scenes. A reader would care because the work points to a way of bootstrapping robot skills from abundant everyday video rather than relying solely on expensive robot-specific recordings.

Core claim

GR-1 is a unified GPT-style transformer that accepts a language instruction, a sequence of observation images, and robot states, then predicts both future images and robot actions in an end-to-end manner. When pre-trained generatively on a large-scale non-robot video dataset and subsequently fine-tuned on robot trajectories, the model outperforms prior methods on the CALVIN benchmark, lifting overall success from 88.9 percent to 94.9 percent and zero-shot unseen-scene success from 53.3 percent to 85.4 percent. Real-robot experiments likewise show improved generalization to novel scenes and objects.

What carries the argument

GR-1, the GPT-style transformer that jointly predicts robot actions and future images from language and visual state sequences.
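
To make the "language plus visual state sequence in, actions plus future images out" interface concrete, here is a minimal sketch of how such a sequence might be laid out for a GPT-style decoder. The token names (`LANG`, `STATE_t`, `IMAGE_t`, `PRED_IMAGE_t`, `PRED_ACTION_t`), their ordering, and the plain lower-triangular mask are all illustrative assumptions, not details taken from the paper, which may tokenize and mask differently.

```python
from typing import List


def build_token_layout(num_steps: int) -> List[str]:
    """Return a hypothetical token ordering for one trajectory window."""
    layout = ["LANG"]  # language instruction, given once up front (assumed)
    for t in range(num_steps):
        layout += [
            f"STATE_{t}",        # robot state at step t
            f"IMAGE_{t}",        # observation image at step t
            f"PRED_IMAGE_{t}",   # hypothetical slot whose output is a future image
            f"PRED_ACTION_{t}",  # hypothetical slot whose output is an action
        ]
    return layout


def causal_mask(n: int) -> List[List[bool]]:
    """Lower-triangular mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


layout = build_token_layout(num_steps=2)
mask = causal_mask(len(layout))
for name, row in zip(layout, mask):
    print(f"{name:>14}", "".join("x" if allowed else "." for allowed in row))
```

The point of the sketch is only that one sequence carries both kinds of prediction slot, so the same network can be trained on video alone (image slots only) and later on robot data (both slots).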

If this is right

  • The approach raises success rates on the CALVIN benchmark from 88.9 percent to 94.9 percent across multi-task settings.
  • Zero-shot generalization to unseen scenes improves from 53.3 percent to 85.4 percent success.
  • Real-robot trials show stronger performance on novel scenes and objects than baselines without video pre-training.
  • The flexible architecture permits direct fine-tuning from video pre-training to robot action prediction without architectural changes.
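
The headline numbers above can be read either as absolute gains in percentage points or as reductions in the failure rate. The short sketch below does that arithmetic on the four reported figures; the function name `gain_summary` and the choice of "failure-rate reduction" as a second metric are illustrative assumptions, not quantities reported by the paper.

```python
def gain_summary(before_pct: float, after_pct: float) -> dict:
    """Summarize a success-rate change given two percentages."""
    absolute_points = after_pct - before_pct
    failure_before = 100.0 - before_pct
    failure_after = 100.0 - after_pct
    # Fraction of the original failures that no longer occur.
    failure_reduction = (failure_before - failure_after) / failure_before
    return {
        "absolute_points": absolute_points,
        "failure_reduction": failure_reduction,
    }


calvin = gain_summary(88.9, 94.9)      # multi-task CALVIN figures from the abstract
zero_shot = gain_summary(53.3, 85.4)   # zero-shot unseen-scene figures from the abstract

for label, summary in [("CALVIN", calvin), ("zero-shot", zero_shot)]:
    print(
        f"{label}: +{summary['absolute_points']:.1f} points, "
        f"failure rate cut by {summary['failure_reduction']:.1%}"
    )
```

On these figures the gains are 6.0 and 32.1 percentage points, which correspond to removing roughly 54 percent and 69 percent of the baseline failures respectively.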

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Abundant internet video could become a primary data source for acquiring robot manipulation skills, reducing dependence on costly robot-collected trajectories.
  • The same pre-training strategy may transfer to other embodied tasks such as navigation or tool use once suitable action heads are added.
  • Larger video corpora or longer training could further close the remaining gap between seen and unseen environments.
  • Combining the model with existing large language models might enable more open-ended natural-language instruction following in physical settings.

Load-bearing premise

Representations learned from general video data transfer to robot manipulation tasks without a prohibitive domain gap or loss of capability during fine-tuning.

What would settle it

A controlled comparison in which a model trained from scratch on the identical robot dataset matches or exceeds GR-1's success rates on CALVIN and real-robot tests would undermine the claimed benefit of the video pre-training stage.

Original abstract

Generative pre-trained models have demonstrated remarkable effectiveness in language and vision domains by learning useful representations. In this paper, we extend the scope of this effectiveness by showing that visual robot manipulation can significantly benefit from large-scale video generative pre-training. We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly finetuned on robot data after pre-trained on a large-scale video dataset. We perform extensive experiments on the challenging CALVIN benchmark and a real robot. On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%. In the setting of zero-shot unseen scene generalization, GR-1 improves the success rate from 53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline methods and shows strong potentials in generalization to unseen scenes and objects. We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation. Project page: https://GR1-Manipulation.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GR-1, a GPT-style transformer for multi-task language-conditioned visual robot manipulation that takes language instructions, observation images, and robot states as input and jointly predicts actions and future images. After large-scale video generative pre-training on non-robot data, the model is fine-tuned end-to-end on robot datasets. It reports success-rate gains on the CALVIN benchmark (88.9% to 94.9%) and zero-shot unseen-scene generalization (53.3% to 85.4%), plus real-robot results showing improved generalization to unseen scenes and objects.

Significance. If the empirical gains are reproducible, the work supplies inaugural evidence that video-scale generative pre-training transfers to robot manipulation without catastrophic forgetting, supporting a unified GPT-style architecture for joint action-image prediction. This could influence future robot learning pipelines by demonstrating that non-robot video data can close part of the domain gap in visual control.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments: the central claim attributes the CALVIN and zero-shot gains specifically to large-scale video pre-training, yet the manuscript provides no ablation that isolates the pre-training stage from the GPT architecture or the joint image-action objective; without this comparison the transfer benefit remains correlational rather than causal.
  2. [Experiments] Experiments: success rates are reported as single point estimates (88.9% → 94.9%, 53.3% → 85.4%) with no mention of variance across random seeds, number of evaluation episodes, or statistical significance tests; this weakens confidence that the observed deltas are robust rather than run-specific.
minor comments (2)
  1. [Method] The video pre-training dataset (size, diversity, filtering) is described only at a high level; adding a table or paragraph with exact statistics would clarify the scale and domain distance to robot data.
  2. [Method] Notation for the joint prediction loss (action + image) is introduced without an explicit equation; including the combined objective as Eq. (X) would improve reproducibility.
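
For illustration only, a combined objective of the kind requested in minor comment 2 could be written as below. The additive form, the weighting coefficient \lambda, the prediction horizon \Delta, and the symbol names are assumptions made for this sketch; the manuscript's actual loss terms and weights are not given in the material reviewed here.

```latex
\mathcal{L}_{\text{total}}
  \;=\; \mathcal{L}_{\text{action}}\bigl(\hat{a}_{t},\, a_{t}\bigr)
  \;+\; \lambda\, \mathcal{L}_{\text{image}}\bigl(\hat{o}_{t+\Delta},\, o_{t+\Delta}\bigr)
```

Here \hat{a}_{t} and a_{t} denote the predicted and demonstrated actions at step t, and \hat{o}_{t+\Delta} and o_{t+\Delta} the predicted and observed future images. Under this reading, video-only pre-training would correspond to dropping the action term, since non-robot video carries no action labels.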

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the manuscript's claims and reporting.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments: the central claim attributes the CALVIN and zero-shot gains specifically to large-scale video pre-training, yet the manuscript provides no ablation that isolates the pre-training stage from the GPT architecture or the joint image-action objective; without this comparison the transfer benefit remains correlational rather than causal.

    Authors: We agree that an explicit ablation isolating the effect of large-scale video pre-training on the identical GR-1 architecture would provide stronger causal evidence. The current comparisons are against prior state-of-the-art methods that lack both the GPT-style joint image-action prediction and video pre-training. To address this point directly, we will add a controlled ablation in the revised manuscript: training the same GR-1 model from scratch on the robot datasets without the video pre-training stage and reporting the resulting performance drop on CALVIN and zero-shot generalization. This will clarify the incremental benefit attributable to pre-training.

    Revision: yes

  2. Referee: [Experiments] Experiments: success rates are reported as single point estimates (88.9% → 94.9%, 53.3% → 85.4%) with no mention of variance across random seeds, number of evaluation episodes, or statistical significance tests; this weakens confidence that the observed deltas are robust rather than run-specific.

    Authors: We acknowledge that single-point estimates limit assessment of robustness. The CALVIN benchmark protocol uses 1000 evaluation episodes per task setting; we will explicitly state this number, report results averaged over at least three random seeds with standard deviations, and include these details in both the main text and tables. If the performance deltas remain statistically significant under a paired t-test or similar, we will note this as well.

    Revision: yes
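
As a rough illustration of the reporting discussed in this exchange, the sketch below computes a Wilson 95% confidence interval for a success rate and an unpaired two-proportion z statistic. It assumes 1000 independent episodes for both the baseline and GR-1; the baseline's episode count and the independence of episodes are assumptions made here, and this unpaired test is a stand-in, not the paired t-test across seeds that the authors mention.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return center - half_width, center + half_width


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Unpaired z statistic for the difference of two proportions (pooled SE)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se


N = 1000  # episodes per setting; assumed identical for both methods
lo, hi = wilson_interval(round(0.949 * N), N)
z_stat = two_proportion_z(0.889, 0.949, N, N)

print(f"GR-1 success rate 94.9%: 95% Wilson interval [{lo:.3f}, {hi:.3f}]")
print(f"88.9% vs 94.9% with N={N} each: z = {z_stat:.2f}")
```

An interval of this kind captures only evaluation sampling noise for a single trained model; it says nothing about variation across training seeds, which is the part the referee's comment is chiefly about.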

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents empirical results from pre-training a GPT-style transformer on large-scale video data followed by fine-tuning on robot manipulation tasks, with performance gains reported on CALVIN (88.9% to 94.9%) and zero-shot settings (53.3% to 85.4%). No equations or first-principles derivations are invoked that reduce any claimed prediction to fitted parameters or self-referential definitions by construction. The architecture is described as flexible for pre-train then fine-tune, and results are positioned as direct benchmark evidence of transfer rather than a closed theoretical system. No load-bearing self-citations or ansatzes are used to justify core claims; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard transformer training assumptions.

pith-pipeline@v0.9.0 · 5574 in / 1064 out tokens · 37322 ms · 2026-05-13T16:25:46.743426+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

    We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation.

  • Foundation.DimensionForcing dimension_forced unclear

    On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%.

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  2. Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.

  3. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  4. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  5. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  6. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  7. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  8. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  9. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  10. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  11. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  12. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  13. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  14. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  15. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  16. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  17. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  18. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  19. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  20. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  21. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  22. M100: An Orchestrated Dataflow Architecture Powering General AI Computing

    cs.LG 2026-04 unverdicted novelty 5.0

    M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.

  23. R3D: Revisiting 3D Policy Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

  24. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  25. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  26. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

  27. WorldVLA: Towards Autoregressive Action World Model

    cs.RO 2025-06 unverdicted novelty 5.0

    WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.

Reference graph

Works this paper leans on

134 extracted references · 134 canonical work pages · cited by 25 Pith papers · 12 internal anchors

  1. [1]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do As I Can, Not As I Say : Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  2. [2]

    Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos . Advances in Neural Information Processing Systems, 35: 0 24639--24654, 2022

  3. [3]

    Robotic Offline RL from Internet Videos via Value-Function Pre-Training

    Chethan Bhateja, Derek Guo, Dibya Ghosh, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, and Aviral Kumar. Robotic Offline RL from Internet Videos via Value-Function Pre-Training . arXiv preprint arXiv:2309.13041, 2023

  4. [4]

    RoboCat : A self-improving foundation agent for robotic manipulation

    Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. RoboCat : A self-improving foundation agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023

  5. [5]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1 : Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  6. [6]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2 : Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, pp.\ 1877--1901, 2020

  8. [8]

    Decision Transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement learning via sequence modeling . Advances in Neural Information Processing Systems, 2021

  9. [9]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pp.\ 1691--1703. PMLR, 2020

  10. [10]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy : Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023

  11. [11]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E : An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  12. [14]

    Deep visual foresight for planning robot motion

    Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 2786--2793. IEEE, 2017

  13. [15]

    Ego4D : Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D : Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18995--19012, 2022

  14. [16]

    Instruction-driven history-aware policies for robotic manipulations

    Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia Pinel, Makarand Tapaswi, Ivan Laptev, and Cordelia Schmid. Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, pp.\ 175--187. PMLR, 2023

  15. [20]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll \'a r, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16000--16009, 2022

  16. [21]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner Monologue : Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022

  17. [22]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pp.\ 4651--4664. PMLR, 2021

  18. [23]

    BC-Z : Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z : Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, 2022

  19. [24]

    VIMA : General robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA : General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094, 2022

  20. [25]

    Exploring visual pre-training for robot manipulation: Datasets, models and methods

    Ya Jing, Xuelin Zhu, Xingbin Liu, Qie Sima, Taozheng Yang, Yunhai Feng, and Tao Kong. Exploring visual pre-training for robot manipulation: Datasets, models and methods. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2023

  21. [27]

    Pre-training for robots: Offline RL enables learning new tasks from a handful of trials

    Aviral Kumar, Anikait Singh, Frederik Ebert, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-training for robots: Offline RL enables learning new tasks from a handful of trials. arXiv preprint arXiv:2210.05178, 2022

  22. [28]

    CURL : Contrastive unsupervised representations for reinforcement learning

    Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL : Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp.\ 5639--5650. PMLR, 2020

  23. [34]

    Interactive Language : Talking to robots in real time

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive Language : Talking to robots in real time. arXiv preprint arXiv:2210.06407, 2022

  24. [37]

    What matters in language conditioned robotic imitation learning over unstructured data

    Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7 0 (4): 0 11205--11212, 2022 b

  25. [38]

    CALVIN : A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN : A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7 0 (3): 0 7327--7334, 2022 c

  26. [40]

    R3M : A universal visual representation for robot manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M : A universal visual representation for robot manipulation. In 6th Annual Conference on Robot Learning, 2022

  27. [41]

    GPT-4 technical report

    R OpenAI. GPT-4 technical report. arXiv, pp.\ 2303--08774, 2023

  28. [42]

    The unsurprising effectiveness of pre-trained vision models for control

    Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effectiveness of pre-trained vision models for control. In International Conference on Machine Learning, pp.\ 17359--17371. PMLR, 2022

  29. [43]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

  30. [44]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp.\ 8748--8763, 2021

  31. [45]

    Real-world robot learning with masked visual pre-training

    Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, 2022

  32. [48]

    Pretraining representations for data-efficient reinforcement learning

    Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, Ankesh Anand, Laurent Charlin, R Devon Hjelm, Philip Bachman, and Aaron C Courville. Pretraining representations for data-efficient reinforcement learning. Advances in Neural Information Processing Systems, pp.\ 12686--12699, 2021

  33. [49]

    Masked world models for visual control

    Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pp.\ 1332--1344. PMLR, 2023

  34. [50]

    Time-contrastive networks: Self-supervised learning from video

    Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141. IEEE, 2018

  35. [51]

    Behavior Transformers: Cloning k modes with one stone

    Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior Transformers: Cloning k modes with one stone. Advances in Neural Information Processing Systems, pp. 22955–22968, 2022

  36. [52]

    LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action

    Dhruv Shah, Błażej Osiński, Sergey Levine, et al. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning, pp. 492–504. PMLR, 2023

  37. [53]

    CLIPort: What and where pathways for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. In Conference on Robot Learning, pp. 894–906, 2022

  38. [54]

    Perceiver-Actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pp. 785–799, 2023

  39. [55]

    SMART: Self-supervised multi-task pretraining with control transformers

    Yanchao Sun, Shuang Ma, Ratnesh Madaan, Rogerio Bonatti, Furong Huang, and Ashish Kapoor. SMART: Self-supervised multi-task pretraining with control transformers. arXiv preprint arXiv:2301.09816, 2023

  40. [56]

    PLEX: Making the most of the available data for robotic manipulation pretraining

    Garrett Thomas, Ching-An Cheng, Ricky Loynd, Vibhav Vineet, Mihai Jalobeanu, and Andrey Kolobov. PLEX: Making the most of the available data for robotic manipulation pretraining. arXiv preprint arXiv:2303.08789, 2023

  41. [58]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017

  42. [60]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021

  43. [61]

    Temporally consistent transformers for video generation

    Wilson Yan, Danijar Hafner, Stephen James, and Pieter Abbeel. Temporally consistent transformers for video generation, 2023

  44. [62]

    Learning to see before learning to act: Visual pre-training for manipulation

    Lin Yen-Chen, Andy Zeng, Shuran Song, Phillip Isola, and Tsung-Yi Lin. Learning to see before learning to act: Visual pre-training for manipulation. In IEEE International Conference on Robotics and Automation (ICRA), pp. 7286–7293. IEEE, 2020

  45. [63]

    Dean A Pomerleau.

  46. [64]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, and Kaiser. Advances in Neural Information Processing Systems

  47. [65]

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch.

  48. [66]

    Learning transferable visual models from natural language supervision

    Learning transferable visual models from natural language supervision. In International Conference on Machine Learning

  49. [67]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al.

  50. [68]

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. IEEE Robotics and Automation Letters (RA-L)

  51. [69]

    Decoupled Weight Decay Regularization

    Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  52. [70]

    Keyframe-based Learning from Demonstration: Method and Evaluation

    Baris Akgun, Maya Cakmak, Karl Jiang, and Andrea Thomaz. Keyframe-based Learning from Demonstration: Method and Evaluation. International Journal of Social Robotics

  53. [71]

    Robot See, Robot Do: An Overview of Robot Imitation

    Paul Bakker and Yasuo Kuniyoshi. Robot See, Robot Do: An Overview of Robot Imitation

  54. [72]

    Behavioral cloning from observation

    Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence

  55. [73]

    Real-World Robot Learning with Masked Visual Pre-training

    Real-World Robot Learning with Masked Visual Pre-training. In Conference on Robot Learning

  56. [74]

    Visual imitation made easy

    Visual imitation made easy. In Conference on Robot Learning

  57. [75]

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn.

  58. [76]

    Yanchao Sun, Shuang Ma, Ratnesh Madaan, Rogerio Bonatti, Furong Huang, and Ashish Kapoor.

  59. [77]

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta.

  60. [78]

    Generalized decision transformer for offline hindsight information matching

    Generalized decision transformer for offline hindsight information matching. arXiv preprint arXiv:2111.10364

  61. [79]

    Online decision transformer

    Online decision transformer. In International Conference on Machine Learning

  62. [80]

    A Generalist Agent

    A generalist agent. arXiv preprint arXiv:2205.06175

  63. [81]

    Masked autoencoders are scalable vision learners

    Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  64. [82]

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox.

  65. [83]

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al.

  66. [84]

    Improving language understanding by generative pre-training

    Improving language understanding by generative pre-training. 2018

  67. [85]

    Chain-of-Thought Predictive Control

    Chain-of-Thought Predictive Control. arXiv preprint arXiv:2304.00776

  68. [86]

    Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto.

  69. [87]

    Language models are few-shot learners

    Language models are few-shot learners. Advances in Neural Information Processing Systems

  70. [88]

    Language models are unsupervised multitask learners

    Language models are unsupervised multitask learners. OpenAI blog

  71. [89]

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al.

  72. [90]

    Language conditioned imitation learning over unstructured data

    Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648

  73. [91]

    What matters in language conditioned robotic imitation learning over unstructured data

    What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 2022

  74. [92]

    Grounding language with visual affordances over unstructured data

    Grounding language with visual affordances over unstructured data. arXiv preprint arXiv:2210.01911, 2022

  75. [93]

    Edwin Zhang, Yujie Lu, William Wang, and Amy Zhang.

  76. [94]

    Instruction-driven history-aware policies for robotic manipulations

    Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, 2023

  77. [95]

    Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. 2021

  78. [96]

    Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani, et al. 2021

  79. [97]

    Language-conditioned imitation learning for robot manipulation tasks

    Language-conditioned imitation learning for robot manipulation tasks. Advances in Neural Information Processing Systems

  80. [98]

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al.

Showing first 80 references.