arxiv: 2405.12213 · v2 · submitted 2024-05-20 · 💻 cs.RO · cs.LG

Recognition: no theorem link

Octo: An Open-Source Generalist Robot Policy

Charles Xu, Chelsea Finn, Dibya Ghosh, Dorsa Sadigh, Homer Walke, Jianlan Luo, Joey Hejna, Karl Pertsch, Kevin Black, Lawrence Yunliang Chen, Octo Model Team, Oier Mees, Pannag Sanketi, Quan Vuong, Sergey Levine, Sudeep Dasari, Ted Xiao, Tobias Kreiman, You Liang Tan

Pith reviewed 2026-05-11 00:21 UTC · model grok-4.3

keywords: generalist robot policy, pretrained policies, finetuning, transformer model, robot manipulation, policy adaptation, open source model, diverse datasets

The pith

A large policy pretrained on hundreds of thousands of robot trajectories can be finetuned in hours to work with new sensors and actions on different robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that pretraining a large policy on diverse robot manipulation data creates a reusable base model for many tasks and platforms. This base can then be adapted quickly to new robots that have different sensors and ways of moving, using only modest amounts of new data and standard hardware. A reader would care because this shifts robot learning away from training every application from scratch toward reusing and customizing a shared starting point. If correct, the result would make capable robot behaviors easier to develop and deploy across varied setups without large new data collections or long training runs.

Core claim

The paper establishes that a transformer-based policy trained on a large collection of robot manipulation trajectories can accept instructions through language or goal images and can be finetuned to new observation spaces and action spaces across nine different robotic platforms, all within a few hours on standard consumer graphics processing units, providing a versatile initialization for generalist policies.

What carries the argument

A transformer architecture that processes sequences of observations and instructions to predict actions, serving as the reusable initialization for adaptation to new robot configurations.
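
One way to picture "sequences of observations and instructions" is a flat token sequence with instruction tokens first and a group of observation tokens per timestep. The sketch below is illustrative only: the token counts, the `build_sequence` and `attention_mask` helpers, and the masking rule (every token sees the instruction; observations see the current and earlier timesteps) are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple

# (kind, timestep); instruction ("task") tokens use timestep -1.
Token = Tuple[str, int]


def build_sequence(n_task: int, n_obs_per_step: int, horizon: int) -> List[Token]:
    """Lay out instruction tokens first, then observation tokens per timestep."""
    seq: List[Token] = [("task", -1)] * n_task
    for t in range(horizon):
        seq += [("obs", t)] * n_obs_per_step
    return seq


def attention_mask(seq: List[Token]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j (assumed rule)."""
    mask = []
    for kind_i, t_i in seq:
        row = []
        for kind_j, t_j in seq:
            if kind_j == "task":
                row.append(True)        # every token sees the instruction
            elif kind_i == "task":
                row.append(False)       # instruction tokens ignore observations
            else:
                row.append(t_j <= t_i)  # observations see current and past steps
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Hypothetical sizes: 2 instruction tokens, 3 observation tokens, 2 timesteps.
    sequence = build_sequence(n_task=2, n_obs_per_step=3, horizon=2)
    for row in attention_mask(sequence):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumed layout, adapting to a new sensor would amount to adding or swapping a group of observation tokens while the rest of the sequence keeps its meaning.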

Load-bearing premise

That pretraining on varied robot data creates features general enough for effective transfer to new hardware through limited finetuning without needing lots of extra data or careful tuning.

What would settle it

An experiment on a new robot setup where finetuning the pretrained policy yields performance no better than training an identical model from random initialization on the same limited data would disprove the central result.
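
The comparison described above can be made concrete as a test on success counts. The sketch below assumes each arm (finetuned versus from-scratch) is evaluated over a fixed number of independent rollouts scored as success or failure, and applies a one-sided Fisher exact test. The function names, the 0.05 threshold, and the counts in the usage line are hypothetical; none come from the paper.

```python
from math import comb


def fisher_one_sided(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """One-sided Fisher exact p-value for 'arm A succeeds more often than arm B'."""
    total = n_a + n_b
    total_succ = succ_a + succ_b
    denom = comb(total, n_a)
    p_value = 0.0
    # Sum the hypergeometric probability of every table in which arm A has
    # at least as many successes as observed, holding the margins fixed.
    for k in range(succ_a, min(n_a, total_succ) + 1):
        p_value += comb(total_succ, k) * comb(total - total_succ, n_a - k) / denom
    return p_value


def pretraining_helps(succ_ft: int, n_ft: int, succ_scratch: int,
                      n_scratch: int, alpha: float = 0.05) -> bool:
    """True when the finetuned arm beats the from-scratch arm at level alpha."""
    return fisher_one_sided(succ_ft, n_ft, succ_scratch, n_scratch) < alpha


if __name__ == "__main__":
    # Made-up counts for illustration: 16/20 finetuned vs 7/20 from scratch.
    print(fisher_one_sided(16, 20, 7, 20))
```

A result where `pretraining_helps` returns False on matched data and compute would be the outcome that undercuts the central claim; with small rollout counts the test has little power, so a null result on a handful of trials would be weak evidence either way.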

Original abstract

Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Octo gives a practical open-source transformer policy pretrained on 800k trajectories that finetunes across nine platforms, but the experiments do not isolate whether pretraining adds value beyond model scale.

Octo is a transformer policy trained on 800k trajectories from Open X-Embodiment. It accepts language or goal-image instructions and can be finetuned to new observation and action spaces in a few hours on consumer GPUs. The paper reports results across nine robotic platforms plus ablations on architecture and data choices. That combination of scale, conditioning options, and broad finetuning tests is the main new piece. Releasing the model and code openly is also useful; it gives people a concrete initialization they can adapt instead of starting from random weights every time. The ablations are straightforward and should help others decide on similar design choices later.

The central claim is that this pretraining produces transferable features. The experiments show that finetuning succeeds on held-out setups, which is evidence the model is versatile. However, the paper does not report a from-scratch control using the identical architecture and the same target data volume. Without that comparison it remains possible that any reasonably sized transformer would reach similar performance under the reported finetuning protocol. The abstract and stress-test note both flag this gap, and it directly touches the weakest assumption about why modest finetuning works.

The work is empirical rather than theoretical, so the numbers stand or fall on the experimental details. Full methods, exact metrics, statistical tests, and data splits are needed to judge robustness, but the abstract alone already shows a solid engineering effort.

This paper is aimed at researchers in robot learning who want a ready-made generalist starting point for manipulation tasks on new hardware. Anyone running finetuning experiments or studying scaling in robotics will find the results and ablations worth reading. It is coherent on its own terms and engages the relevant literature, so it deserves a serious referee even if revisions are needed to tighten the controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces Octo, a large transformer-based policy pretrained on 800k trajectories from the Open X-Embodiment dataset. It claims that this model serves as a versatile initialization that can be effectively finetuned (via language or goal-image instructions) to new robotic platforms with different observation and action spaces, with experiments across 9 platforms and ablations on architecture and data choices to support future generalist robot model development.

Significance. If the central claim holds, Octo would provide a valuable open-source starting point for robot learning that reduces per-task data collection needs. The release of the model, code, and detailed ablations on design decisions (architecture, training data) are concrete strengths that can accelerate community progress on generalist policies.

major comments (2)

[Experiments (across 9 platforms)] The experiments across 9 robotic platforms demonstrate successful finetuning, but the manuscript does not report results for an identical transformer architecture trained from random initialization on the same target data volume and compute budget. This comparison is needed to isolate whether the 800k-trajectory pretraining supplies transferable features or whether any sufficiently large model would succeed under the reported finetuning protocol.
[Experiments and ablations] The weakest assumption—that modest finetuning on new sensors/actions yields reliable performance because of pretraining rather than hyperparameter choices or model scale—is not directly tested. Adding from-scratch baselines (or at minimum reporting the finetuning hyperparameter search effort and data volume per platform) would make the versatility claim load-bearing rather than suggestive.

minor comments (1)

[Abstract] The abstract and introduction use 'a few hours on standard consumer GPUs' without specifying exact GPU type, batch size, or number of steps; adding these details would improve reproducibility.
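
The details the referee asks for could be collected in a small per-platform record. The sketch below is one hypothetical shape for such a record: the `FinetuneReport` class, its field names, and the placeholder values are assumptions for illustration, chosen to mirror the items named in the comments (GPU type, batch size, number of steps, finetuning data volume, hyperparameter search effort).

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FinetuneReport:
    """Hypothetical per-platform record of finetuning details."""
    platform: str
    gpu_type: Optional[str] = None
    batch_size: Optional[int] = None
    train_steps: Optional[int] = None
    num_demonstrations: Optional[int] = None    # finetuning data volume
    hparam_configs_tried: Optional[int] = None  # hyperparameter search effort


def missing_fields(report: FinetuneReport) -> List[str]:
    """Names of the fields that have not been filled in yet."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


if __name__ == "__main__":
    # Placeholder values only; they are not figures from the paper.
    report = FinetuneReport(platform="example-arm", batch_size=64)
    print(missing_fields(report))
```

Running `missing_fields` over one record per platform gives a quick checklist of what a revised manuscript would still need to state.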

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback and for recognizing the potential impact of Octo as an open-source generalist robot policy. We address the major comments below, focusing on strengthening the experimental validation.

Referee: [Experiments (across 9 platforms)] The experiments across 9 robotic platforms demonstrate successful finetuning, but the manuscript does not report results for an identical transformer architecture trained from random initialization on the same target data volume and compute budget. This comparison is needed to isolate whether the 800k-trajectory pretraining supplies transferable features or whether any sufficiently large model would succeed under the reported finetuning protocol.

Authors: We concur that a from-scratch baseline with the same architecture, data volume, and compute would help isolate the contribution of pretraining. However, the scale of the model and the number of platforms make running these additional experiments infeasible with our current computational budget. Our ablations on architecture variants and subsets of the pretraining data provide indirect evidence for the benefits of large-scale pretraining. In the revised manuscript, we will include a dedicated limitations paragraph discussing this and report the exact finetuning data amounts and hyperparameter optimization details for each of the 9 platforms. revision: partial
Referee: [Experiments and ablations] The weakest assumption—that modest finetuning on new sensors/actions yields reliable performance because of pretraining rather than hyperparameter choices or model scale—is not directly tested. Adding from-scratch baselines (or at minimum reporting the finetuning hyperparameter search effort and data volume per platform) would make the versatility claim load-bearing rather than suggestive.

Authors: We agree that to make the claims more robust, additional controls are valuable. We will update the manuscript to explicitly report the data volume used for finetuning on each platform and describe the extent of hyperparameter tuning performed. While we cannot add full from-scratch runs, we believe the combination of cross-platform results and ablations supports the utility of the pretrained model as a starting point. We will revise the text to emphasize these points more clearly. revision: partial

standing simulated objections not resolved

Full from-scratch baselines across all nine platforms under matched compute constraints

Circularity Check

0 steps flagged

No significant circularity; empirical results are independent of model inputs

The paper's central claim is that a transformer policy pretrained on 800k Open X-Embodiment trajectories serves as a versatile initialization for finetuning to new observation and action spaces. This is supported by new experimental runs across 9 robotic platforms with held-out data, not by any derivation that reduces to the pretraining data or fitted parameters by construction. No equations, uniqueness theorems, or self-citations are used to derive performance claims; the work contains no mathematical derivation chain and reports direct empirical outcomes from finetuning protocols. The results are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer assumptions and large-scale supervised pretraining; many architectural and optimization choices are free parameters tuned to the robot data.

free parameters (2)

transformer architecture hyperparameters
Number of layers, attention heads, embedding dimension, and context length chosen to fit the 800k-trajectory dataset.
pretraining and finetuning optimization settings
Learning rate schedule, batch size, and number of finetuning steps selected for the reported results.

axioms (1)

domain assumption: A transformer can learn general robot control features from heterogeneous trajectory data collected across many embodiments.
Invoked by the decision to pretrain a single model on the full Open X-Embodiment collection.

pith-pipeline@v0.9.0 · 5596 in / 1352 out tokens · 37072 ms · 2026-05-11T00:21:26.603830+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
cs.RO 2026-04 conditional novelty 8.0

Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.
Test-time Sparsity for Extreme Fast Action Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
cs.AI 2026-05 unverdicted novelty 7.0

MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
cs.LG 2026-05 unverdicted novelty 7.0

TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without full RL solves per iteration, outperforming prior imitation methods by 2.4x aggregate IQM and r...
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
cs.AI 2026-05 unverdicted novelty 7.0

LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
cs.RO 2026-05 unverdicted novelty 7.0

ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
cs.AI 2026-05 unverdicted novelty 7.0

A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
cs.RO 2026-04 unverdicted novelty 7.0

VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
cs.RO 2026-04 unverdicted novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations
cs.RO 2026-04 unverdicted novelty 7.0

STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
cs.RO 2026-04 unverdicted novelty 7.0

ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
cs.RO 2026-04 conditional novelty 7.0

BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking
cs.RO 2026-04 unverdicted novelty 7.0

A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
cs.RO 2026-03 unverdicted novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
cs.RO 2026-05 unverdicted novelty 6.0

FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
cs.RO 2026-05 conditional novelty 6.0

GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 6.0

Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
cs.RO 2026-05 unverdicted novelty 6.0

HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
cs.RO 2026-05 unverdicted novelty 6.0

Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
cs.RO 2026-05 unverdicted novelty 6.0

A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
cs.LG 2026-05 unverdicted novelty 6.0

Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
cs.AI 2026-05 unverdicted novelty 6.0

LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
cs.RO 2026-05 unverdicted novelty 6.0

VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
cs.CV 2026-05 unverdicted novelty 6.0

TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
cs.RO 2026-05 unverdicted novelty 6.0

Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
cs.RO 2026-05 unverdicted novelty 6.0

ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
An Efficient Metric for Data Quality Measurement in Imitation Learning
cs.RO 2026-05 unverdicted novelty 6.0

Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
cs.RO 2026-05 unverdicted novelty 6.0

VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
cs.RO 2026-05 unverdicted novelty 6.0

Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
cs.RO 2026-04 unverdicted novelty 6.0

M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation
cs.RO 2026-04 unverdicted novelty 6.0

AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
GazeVLA: Learning Human Intention for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
cs.RO 2026-04 unverdicted novelty 6.0

CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
Exploring High-Order Self-Similarity for Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
cs.RO 2026-04 unverdicted novelty 6.0

UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation
cs.LG 2026-04 unverdicted novelty 6.0

HELM raises long-horizon VLA success from 58.4% to 81.5% on LIBERO-LONG by combining episodic memory retrieval, learned failure prediction, and replanning, outperforming context extension or adaptation alone.
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics
cs.RO 2026-04 unverdicted novelty 6.0

A two-level hierarchical vector quantization tokenizer that clusters actions spatially and temporally achieves new state-of-the-art results in in-context imitation learning for robotics.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
cs.RO 2026-04 unverdicted novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
cs.CV 2026-04 unverdicted novelty 6.0

CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 6.0

A single transformer model trained offline on expert trajectories from three distinct MARL environments achieves competitive performance against specialized baselines without per-task tuning.
SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation
cs.CV 2026-04 unverdicted novelty 6.0

SnapFlow compresses multi-step denoising in flow-matching VLAs into one step via progressive self-distillation using two-step Euler shortcuts from marginal velocities, matching 10-step teacher success rates with 9.6x ...
Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming
cs.RO 2026-04 unverdicted novelty 6.0

DAERT generates diverse adversarial instructions via a uniform policy in RL to drop VLA task success rates from 93.33% to 5.85% on benchmarks with models like π0 and OpenVLA.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 81 Pith papers · 9 internal anchors

[1]

Introducing scale’s automotive foundation model, 2023

Scale AI. Introducing scale’s automotive foundation model, 2023. URL https://scale.com/blog/afm1

work page 2023
[2]

Hindsight experience replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NeurIPS, 2017

work page 2017
[3]

Human-to-robot imitation in the wild

Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. arXiv preprint arXiv:2207.09450, 2022

work page arXiv 2022
[4]

Affordances from human videos as a versatile representation for robotics

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In CVPR, 2023

work page 2023
[5]

Hydra: Hybrid robot actions for imitation learning

Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. arxiv, 2023

work page 2023
[6]

Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking, 2023

Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918, 2023

work page arXiv 2023
[7]

Zero-shot robotic manipulation with pretrained image-editing diffusion models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

work page arXiv 2023
[8]

End to End Learning for Self-Driving Cars

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016

work page internal anchor Pith review arXiv 2016
[9]

RoboCat: A self-improving foundation agent for robotic manipulation

Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. Robocat: A self-improving foundation agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023

work page arXiv 2023
[10]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Do as i can, not as i say: Grounding language in robotic affordances

Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning, pages 287–318. PMLR, 2023

work page 2023
[12]

Scaling data-driven robotics with reward sketching and batch reinforcement learning

Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv preprint arXiv:19...

work page arXiv 2019
[13]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

work page 2020
[14]

Berkeley UR5 demonstration dataset

Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home

work page
[15]

Vision-language models provide promptable representations for reinforcement learning

William Chen, Oier Mees, Aviral Kumar, and Sergey Levine. Vision-language models provide promptable representations for reinforcement learning. arXiv preprint arXiv:2402.02651, 2024

work page arXiv 2024
[16]

Genaug: Retargeting behaviors to unseen situations via generative augmentation, 2023

Zoey Chen, Sho Kiami, Abhishek Gupta, and Vikash Kumar. Genaug: Retargeting behaviors to unseen situations via generative augmentation. arXiv preprint arXiv:2302.06671, 2023

work page arXiv 2023
[17]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

work page 2023
[18]

From play to policy: Conditional behavior generation from uncurated robot data

Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. In The Eleventh International Conference on Learning Representations, 2022

work page 2022
[19]

Robonet: Large-scale multi-robot learning

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, pages 885–897. PMLR, 2020

work page 2020
[20]

Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. CLVR jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset

work page 2023
[21]

Causal confusion in imitation learning

Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. NeurIPS, 2019

work page 2019
[22]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[23]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm- e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Behavior retrieval: Few-shot imitation learning by querying unlabeled datasets, 2023

Maximilian Du, Suraj Nair, Dorsa Sadigh, and Chelsea Finn. Behavior retrieval: Few-shot imitation learning by querying unlabeled datasets. ArXiv, abs/2304.08742,

work page arXiv
[25]

URL https://api.semanticscholar.org/CorpusID:258186973

work page
[26]

Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021

work page internal anchor Pith review arXiv 2021
[27]

Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@ CoRL2023, 3:5, 2023

work page 2023
[28]

Deep visual foresight for planning robot motion

Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017

work page 2017
[29]

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

work page internal anchor Pith review arXiv 2024
[30]

Are we ready for autonomous driving? the kitti vision benchmark suite

Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012

work page 2012
[31]

Robot learning in homes: Improving generalization and reducing dataset bias

Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in neural information processing systems, 31, 2018

work page 2018
[32]

Scaling up and distilling down: Language-guided robot skill acquisition

Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning, pages 3766–3777. PMLR, 2023

work page 2023
[33]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[34]

Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J. Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. In Robotics: Science and Systems, 2023

work page 2023
[35]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[36]

Gaia-1: A generative world model for autonomous driving, 2023

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving, 2023

work page 2023
[37]

Visual language maps for robot navigation

Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023

work page 2023
[38]

Audio visual language maps for robot navigation

Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Audio visual language maps for robot navigation. In Proceedings of the International Symposium on Experimental Robotics (ISER), Chiang Mai, Thailand, 2023

work page 2023
[39]

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023

work page internal anchor Pith review arXiv 2023
[40]

Bc-z: Zero-shot task generalization with robotic imitation learning

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022

work page 2022
[41]

VIMA: Robot manipulation with multimodal prompts

Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA: Robot manipulation with multimodal prompts. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on M...

work page 2023
[42]

Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018

work page arXiv 2018
[43]

Scaling up multi-task robotic reinforcement learning

Dmitry Kalashnikov, Jake Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Scaling up multi-task robotic reinforcement learning. In 5th Annual Conference on Robot Learning, 2021

work page 2021
[44]

Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation

Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022

work page 2022
[45]

Segment Anything, April 2023

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything, April 2023

work page 2023
[46]

Robohive: A unified framework for robot learning

Vikash Kumar, Rutav Shah, Gaoyue Zhou, Vincent Moens, Vittorio Caggiano, Abhishek Gupta, and Aravind Rajeswaran. Robohive: A unified framework for robot learning. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=0H5fRQcpQ7

work page 2023
[47]

Language models as zero-shot trajectory generators

Teyun Kwon, Norman Di Palo, and Edward Johns. Language models as zero-shot trajectory generators. arXiv preprint arXiv:2310.11604, 2023

work page arXiv 2023
[48]

End-to-end training of deep visuomotor policies

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016

work page 2016
[49]

Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection

Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International journal of robotics research, 37(4-5):421–436, 2018

work page 2018
[50]

Polymetis

Yixin Lin, Austin S. Wang, Giovanni Sutanto, Akshara Rai, and Franziska Meier. Polymetis. https://facebookresearch.github.io/fairo/polymetis/, 2021

work page 2021
[51]

Robot learning on the job: Human-in-the-loop autonomy and learning during deployment

Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In Robotics: Science and Systems (RSS), 2023

work page 2023
[52]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018

work page 2018
[53]

Multi-stage cable routing through hierarchical imitation learning, 2024

Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. arXiv preprint arXiv:2307.08927, 2023

work page arXiv 2023
[54]

Fmb: a functional manipulation benchmark for generalizable robotic learning

Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. arXiv preprint arXiv:2401.08553, 2024

work page arXiv 2024
[55]

Language conditioned imitation learning over unstructured data

Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. In RSS, 2021

work page 2021
[56]

Interactive language: Talking to robots in real time

Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

work page 2023
[57]

Where are we in the search for an artificial visual cortex for embodied intelligence? 2023

Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, Dhruv Batra, Yixin Lin, Oleksandr Maksymets, Aravind Rajeswaran, and Franziska Meier. Where are we in the search for an artificial visual cortex for embodied intelligence? 2023

work page 2023
[58]

Where are we in the search for an artificial visual cortex for embodied intelligence?

Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, et al. Where are we in the search for an artificial visual cortex for embodied intelligence? arXiv preprint arXiv:2303.18240, 2023

work page arXiv 2023
[59]

RoboTurk: A crowdsourcing platform for robotic skill learning through imitation

Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. CoRR, abs/1811.02790, 2018. URL http://arxiv.org/abs/1811.02790

work page arXiv 2018
[60]

Roboturk: A crowdsourcing platform for robotic skill learning through imitation

Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–

work page
[61]

Mimicgen: A data generation system for scalable robot learning using human demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In 7th Annual Conference on Robot Learning, 2023

work page 2023
[62]

What matters in language conditioned robotic imitation learning over unstructured data

Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022

work page 2022
[63]

Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

work page 2022
[64]

Grounding language with visual affordances over unstructured data

Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023

work page 2023
[65]

Structured world models from human videos

Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. CoRL, 2023

work page 2023
[66]

Learning and retrieval from prior data for skill-based imitation learning

Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Conference on Robot Learning (CoRL), 2022

work page 2022
[67]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021

work page 2021
[68]

Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang Huang...

work page internal anchor Pith review arXiv 2023
[69]

GPT-4 Technical Report, March 2023

OpenAI. GPT-4 Technical Report, March 2023

work page 2023
[70]

The surprising effectiveness of representation learning for visual imitation, 2021

Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, and Lerrel Pinto. The surprising effectiveness of representation learning for visual imitation, 2021

work page 2021
[71]

Film: Visual reasoning with a general conditioning layer

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

work page 2018
[72]

Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours

Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE, 2016

work page 2016
[73]

Shared Control Templates for Assistive Robotics

Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Joern Vogel. Shared Control Templates for Assistive Robotics. In 2020 IEEE International Conference on Robotics and Automation (ICRA), page 7, Paris, France, 2020

work page 2020
[74]

Robot learning with sensorimotor pre-training

Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. Robot learning with sensorimotor pre-training. Conference on Robot Learning, 2023

work page 2023
[75]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html

work page 2020
[76]

A generalist agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. Transactions on Machine Learning Research, 2022

work page 2022
[77]

High-Resolution Image Synthesis with Latent Diffusion Models, April 2022

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models, April 2022

work page 2022
[78]

Latent plans for task agnostic offline reinforcement learning

Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task agnostic offline reinforcement learning. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022

work page 2022
[79]

Multi-resolution sensing for real-time control with vision-language models

Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=WuBv9-IGDUA

work page 2023
[80]

On bringing robots home, 2023

Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home, 2023

work page 2023

Showing first 80 references.