pith. machine review for the scientific record.

arxiv: 2304.13705 · v1 · submitted 2023-04-23 · 💻 cs.RO · cs.LG

Recognition: 2 theorem links · Lean Theorem

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z. Zhao, Vikash Kumar, Sergey Levine, Chelsea Finn

Pith reviewed 2026-05-11 04:11 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords imitation learning · bimanual manipulation · low-cost hardware · action chunking · transformers · fine manipulation · robot learning

The pith

Action Chunking with Transformers lets low-cost robots learn precise bimanual tasks from ten minutes of demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that imitation learning can enable low-cost and imprecise robots to perform fine manipulation tasks that normally require expensive hardware, accurate sensors, or careful calibration. It introduces Action Chunking with Transformers (ACT) as a method to learn a generative model over sequences of actions, which helps prevent errors from compounding during execution and handles non-stationary human demonstrations. Using a custom teleoperation interface to collect data, the approach trains a bimanual robot to complete six real-world tasks at 80-90% success rates. This includes opening a translucent condiment cup and slotting a battery, all from roughly ten minutes of demonstrations.

Core claim

The central claim is that a low-cost bimanual robot performing end-to-end imitation learning with the ACT algorithm, which learns generative models over action sequences from visual observations, can execute difficult fine-grained tasks such as opening a translucent condiment cup and slotting a battery. In real-world trials it reaches 80-90% success rates after training on only ten minutes of demonstrations collected via a custom teleoperation interface.

What carries the argument

Action Chunking with Transformers (ACT), a transformer model that predicts chunks of future actions to enable stable closed-loop control and reduce compounding errors in high-precision imitation learning.
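
To make the mechanism concrete, here is a minimal sketch of chunked inference with the temporal ensembling the paper describes: the policy is queried at every timestep, each query predicts the next chunk_size actions, and the overlapping predictions for the current step are fused with exponential weights w_i = exp(-m*i), with w_0 on the oldest prediction. The policy and env interfaces are hypothetical stand-ins, not the authors' released code.

```python
import numpy as np

def act_rollout(policy, env, horizon, chunk_size=100, m=0.01):
    """Sketch of ACT-style chunked inference with temporal ensembling.

    Hypothetical interfaces: policy(obs) returns a (chunk_size, action_dim)
    array of predicted future actions; env.reset() and env.step(action)
    return the next observation.
    """
    # predictions[t] accumulates every chunk entry that targets timestep t
    predictions = [[] for _ in range(horizon + chunk_size)]
    obs = env.reset()
    for t in range(horizon):
        chunk = policy(obs)                    # query the policy every step
        for i, action_i in enumerate(chunk):
            predictions[t + i].append(action_i)
        preds = np.stack(predictions[t])       # oldest prediction first
        # exponential weights, w_0 on the oldest prediction;
        # m trades smoothness against reactivity
        w = np.exp(-m * np.arange(len(preds)))
        obs = env.step((w[:, None] * preds).sum(axis=0) / w.sum())
    return obs
```

Averaging many overlapping chunks is what softens the open-loop commitment that executing a full chunk blindly would impose.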

If this is right

  • Precise bimanual manipulation becomes feasible on inexpensive hardware without specialized force sensors or calibration procedures.
  • Imitation learning policies can succeed on long-horizon tasks despite non-stationary human demonstrations when action sequences are modeled generatively.
  • Visual feedback alone suffices for closed-loop control on tasks requiring careful contact forces.
  • Data collection effort drops to short sessions of roughly ten minutes while still yielding high success rates across multiple tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The chunking approach may extend to other robotic control problems that involve predicting extended action sequences.
  • Lowering hardware costs could broaden access to fine manipulation capabilities for non-industrial settings.
  • Combining ACT with additional sensing modalities might further improve reliability on even harder variants of the tasks.

Load-bearing premise

The custom teleoperation interface produces high-quality, consistent demonstrations that capture the necessary precision and force coordination without introducing human-induced biases or noise that the learning algorithm cannot overcome.

What would settle it

Retraining and testing the same tasks with demonstrations collected from a lower-quality or noisier teleoperation interface, then measuring whether success rates fall below 80%, would directly test whether the claim holds.
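
A minimal sketch of that settling experiment, under an assumed corruption model for "noisier teleoperation" (jittered joint targets plus dropped frames); train_act and run_trial are hypothetical placeholders, not functions from the paper.

```python
import numpy as np

def degrade_demonstrations(demos, sigma=0.01, drop_prob=0.02, seed=0):
    """Hypothetical corruption model standing in for a lower-quality
    teleoperation interface. `demos` is a list of (T, action_dim) arrays."""
    rng = np.random.default_rng(seed)
    degraded = []
    for demo in demos:
        keep = rng.random(len(demo)) > drop_prob           # dropped frames
        degraded.append(demo[keep] + rng.normal(0.0, sigma, demo[keep].shape))
    return degraded

# Outline of the experiment (placeholders: train_act trains an ACT policy,
# run_trial returns 1 on task success, 0 otherwise):
#
#   for sigma in (0.0, 0.005, 0.01, 0.02):
#       policy = train_act(degrade_demonstrations(demos, sigma))
#       rate = np.mean([run_trial(policy) for _ in range(50)])
#       # the claim weakens if rate drops below 0.8 at modest sigma
```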

read the original abstract

Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations. Project website: https://tonyzhaozh.github.io/aloha/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that a low-cost bimanual robot equipped with a custom teleoperation interface for collecting real-world demonstrations, combined with the novel Action Chunking with Transformers (ACT) algorithm, enables end-to-end imitation learning of fine-grained manipulation tasks. ACT models generative distributions over action chunks to mitigate compounding errors and non-stationary demonstrations, allowing 80-90% success rates on six contact-rich tasks (e.g., opening a translucent condiment cup, slotting a battery) using only 10 minutes of data on imprecise hardware.

Significance. If the empirical results hold after verification of demonstration quality and controls, the work would demonstrate that imitation learning with chunked generative policies can achieve high-precision bimanual performance on inexpensive platforms without specialized sensors or calibration. This has clear implications for accessibility in robotics, providing concrete real-world evidence on tasks that typically demand high-end setups.

major comments (2)
  1. [Abstract] The headline result that ACT enables 80-90% success with 10 minutes of demonstrations rests on the unverified assumption that the custom teleoperation interface supplies high-quality, low-bias demonstrations encoding precise contact forces and closed-loop coordination. No independent metrics (trajectory variance, force profiles, inter-demonstrator consistency) or ablations separating interface quality from policy performance are reported, leaving open the possibility that the interface itself, rather than the learning algorithm, supplies the critical precision.
  2. [Experiments] In the experiments (inferred from the reported success rates), the six tasks are presented without baselines, ablations, or statistical tests. This makes it impossible to assess whether the central claim (that ACT on low-cost hardware is responsible for the performance) holds, or whether post-hoc tuning or task selection inflates the numbers.
minor comments (1)
  1. [Abstract] The project website link is provided but no supplementary video or code repository is referenced in the abstract; adding these would aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful review and the opportunity to clarify our contributions. We address the two major comments below, committing to revisions where they strengthen the manuscript without misrepresenting our results.

read point-by-point responses
  1. Referee: [Abstract] The headline result that ACT enables 80-90% success with 10 minutes of demonstrations rests on the unverified assumption that the custom teleoperation interface supplies high-quality, low-bias demonstrations encoding precise contact forces and closed-loop coordination. No independent metrics (trajectory variance, force profiles, inter-demonstrator consistency) or ablations separating interface quality from policy performance are reported, leaving open the possibility that the interface itself, rather than the learning algorithm, supplies the critical precision.

    Authors: The teleoperation interface is an integral component of the proposed low-cost system, as it enables collection of usable demonstrations on imprecise hardware without requiring high-end sensors. We acknowledge that the initial submission lacks explicit quantitative metrics on demonstration quality. We will add analysis of trajectory variance and inter-demonstrator consistency in the revised manuscript; a sketch of such metrics appears after this point-by-point list. Force profiles cannot be reported because the hardware lacks force sensors; the system relies on visual feedback instead. Full ablations isolating the interface from ACT would require new hardware setups, which we will discuss as a limitation rather than perform within this revision. revision: partial

  2. Referee: [Experiments] In the experiments (inferred from the reported success rates), the six tasks are presented without baselines, ablations, or statistical tests. This makes it impossible to assess whether the central claim (that ACT on low-cost hardware is responsible for the performance) holds, or whether post-hoc tuning or task selection inflates the numbers.

    Authors: We agree that the experiments section requires stronger validation. The manuscript already includes comparisons to standard behavior cloning, but we will expand it with additional baselines (e.g., non-chunked policies), architecture ablations, and statistical analysis including the number of evaluation trials, success-rate confidence intervals (see the interval sketch after this list), and significance tests. These additions will clarify that the reported performance stems from the combination of the interface and ACT rather than task selection or tuning. revision: yes
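
For response 1, one plausible instantiation of the promised demonstration-quality metrics, assuming demonstrations of a task are resampled to a common length; these definitions are editorial sketches, not the authors' forthcoming analysis.

```python
import numpy as np

def trajectory_variance(demos):
    """Mean per-timestep variance across demonstrations of one task.
    Assumes demos resampled to a common length: a list of (T, D) arrays."""
    stacked = np.stack(demos)                 # (N, T, D)
    return stacked.var(axis=0).mean(axis=-1)  # (T,): spread at each step

def inter_demonstrator_consistency(demos_by_person):
    """Distance of each demonstrator's mean trajectory from the pooled
    mean; smaller values indicate more consistent demonstrations."""
    means = {p: np.stack(ds).mean(axis=0) for p, ds in demos_by_person.items()}
    pooled = np.mean(list(means.values()), axis=0)
    return {p: float(np.linalg.norm(m - pooled)) for p, m in means.items()}
```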
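
For response 2, a minimal sketch of the per-task confidence intervals the authors commit to reporting. The Wilson score interval is a standard choice for small-sample binomial rates; the paper does not specify which interval will be used.

```python
import math

def wilson_ci(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.
    Example: wilson_ci(45, 50) -> roughly (0.79, 0.96)."""
    p = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half
```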

standing simulated objections not resolved
  • Direct force profiles cannot be provided because the low-cost hardware does not include force sensors.

Circularity Check

0 steps flagged

No circularity: empirical results from hardware experiments are independent of any fitted inputs or self-referential definitions.

full rationale

The paper introduces the ACT algorithm as a novel generative model over action chunks to mitigate compounding errors in imitation learning, then validates it through real-world bimanual tasks on low-cost hardware using custom teleoperation demonstrations. Success rates (80-90%) are measured outcomes from physical rollouts, not quantities derived by construction from the training data or prior self-citations. No equations, uniqueness theorems, or ansatzes are presented that reduce the central claims to tautological inputs; the derivation chain consists of standard imitation learning setup plus a transformer-based policy whose performance is externally falsifiable via hardware metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that teleoperated demonstrations are sufficiently high-quality and that chunked action prediction mitigates compounding errors in contact-rich tasks; no new physical entities are postulated.

free parameters (1)
  • ACT model hyperparameters
    Standard neural network training parameters fitted during learning; not enumerated in the abstract.
axioms (1)
  • domain assumption: Imitation learning from a small number of human demonstrations can generalize to new task instances on physical hardware
    Invoked to explain the reported 80-90% success rates across tasks.
invented entities (1)
  • Action Chunking with Transformers (ACT) · no independent evidence
    purpose: Generative model over action sequences to address error compounding and non-stationarity in imitation learning
    New method introduced by the paper; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.0 · 5505 in / 1358 out tokens · 43208 ms · 2026-05-11T04:11:19.220360+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  2. Morphologically Equivariant Flow Matching for Bimanual Mobile Manipulation

    cs.RO 2026-05 conditional novelty 7.0

    A morphologically equivariant flow matching policy for bimanual robots enforces reflective symmetry to improve sample efficiency and enable zero-shot generalization to mirrored task configurations.

  3. Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.

  4. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  5. When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...

  6. PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN

    cs.RO 2026-05 unverdicted novelty 7.0

    PhySPRING uses differentiable GNNs to learn hierarchical coarsened spring-mass topologies and parameters from observations, delivering up to 2.3x speedup on PhysTwin benchmarks and comparable robot policy success rate...

  7. BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly

    cs.RO 2026-05 unverdicted novelty 7.0

    BrickCraft composes reusable visuomotor skills via relative anchoring to partial structures and situated visual manuals to achieve long-horizon interlocking brick assembly from limited demonstrations with generalizati...

  8. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  9. Shared Autonomy Assisted by Impedance-Driven Anisotropic Guidance Field

    cs.RO 2026-05 unverdicted novelty 7.0

    IAGF-SA adds a physically-grounded channel to shared autonomy by modulating robot impedance to convey intent, improving task performance, agreement, and user experience in three scenarios per user studies.

  10. OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

    cs.RO 2026-04 unverdicted novelty 7.0

    A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.

  11. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 7.0

    A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...

  12. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 accept novelty 7.0

    3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.

  13. DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

    cs.RO 2026-04 unverdicted novelty 7.0

    Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...

  14. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  15. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  16. FingerEye: Continuous and Unified Vision-Tactile Sensing for Dexterous Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    FingerEye delivers continuous vision-tactile sensing via binocular RGB cameras and marker-tracked compliant ring deformation, supporting imitation learning policies that generalize across object variations for tasks l...

  17. BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

    cs.RO 2026-04 conditional novelty 7.0

    BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

  18. Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving hi...

  19. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  20. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  21. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...

  22. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.

  23. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  24. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  25. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

  26. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

  27. DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

    cs.RO 2026-05 unverdicted novelty 6.0

    DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...

  28. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

  29. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.

  30. Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.

  31. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  32. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  33. Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

    cs.RO 2026-05 unverdicted novelty 6.0

    A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...

  34. BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.

  35. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  36. Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

  37. An Efficient Metric for Data Quality Measurement in Imitation Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.

  38. TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    TAIL-Safe learns a Lipschitz Q-function from digital-twin failure data to identify an empirical control-invariant safe set for imitation learning policies and applies gradient-based recovery to keep actions inside it.

  39. Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    Lucid-XR uses XR-headset physics simulation and physics-guided video generation to create synthetic data that trains robot policies transferring zero-shot to unseen real-world manipulation tasks.

  40. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.

  41. AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.

  42. RL Token: Bootstrapping Online RL with Vision-Language-Action Models

    cs.LG 2026-04 unverdicted novelty 6.0

    RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.

  43. Learning from the Best: Smoothness-Driven Metrics for Data Quality in Imitation Learning

    cs.RO 2026-04 unverdicted novelty 6.0

    RINSE scores robot demonstration trajectories for smoothness via SAL and TED metrics to curate higher-quality data for behavioral cloning, improving success rates with less data on benchmarks and real robots.

  44. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  45. LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios

    cs.RO 2026-04 unverdicted novelty 6.0

    LeHome is a simulation platform offering high-fidelity dynamics for robotic manipulation of varied deformable objects in household settings, with support for multiple robot embodiments including low-cost hardware.

  46. Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    Wiggle and Go! uses system identification from rope motion observations to predict parameters that enable zero-shot goal-conditioned dynamic manipulation, achieving 3.55 cm accuracy on 3D target striking versus 15.34 ...

  47. FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception

    cs.RO 2026-04 conditional novelty 6.0

    FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.

  48. SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces

    cs.RO 2026-04 unverdicted novelty 6.0

    SpaceDex achieves 63% success grasping unseen objects in tiered workspaces via VLM spatial planning and arm-hand feature separation, beating a 39% tabletop baseline in 100 real trials.

  49. AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...

  50. From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

    cs.RO 2026-04 unverdicted novelty 6.0

    Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.

  51. Long-Term Memory for VLA-based Agents in Open-World Task Execution

    cs.RO 2026-04 unverdicted novelty 6.0

    ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.

  52. UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

    cs.RO 2026-04 unverdicted novelty 6.0

    UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.

  53. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  54. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  55. AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

    cs.RO 2026-04 unverdicted novelty 6.0

    AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.

  56. MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

    cs.RO 2026-04 unverdicted novelty 6.0

    MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

  57. ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration

    cs.RO 2026-04 unverdicted novelty 6.0

    ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.

  58. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  59. Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA

    cs.RO 2026-04 unverdicted novelty 6.0

    SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.

  60. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
