Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Pith reviewed 2026-05-11 04:11 UTC · model grok-4.3
The pith
Action Chunking with Transformers lets low-cost robots learn precise bimanual tasks from ten minutes of demonstrations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a low-cost bimanual robot system performing end-to-end imitation learning with the ACT algorithm, which learns generative models over action sequences from visual observations, can successfully execute difficult fine-grained tasks such as opening a translucent condiment cup and slotting a battery, reaching 80-90% success rates in the real world after training on only ten minutes of demonstrations collected via a custom teleoperation interface.
What carries the argument
Action Chunking with Transformers (ACT), a transformer model that predicts chunks of future actions to enable stable closed-loop control and reduce compounding errors in high-precision imitation learning.
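To make the chunking idea concrete, here is a minimal sketch of chunked execution, assuming a hypothetical `policy(observation, chunk_size)` callable that returns a list of future actions. The names `run_chunked`, `get_obs`, and `apply_action` are illustrative and are not taken from the paper or its code. The sketch leaves out the transformer itself and any smoothing across chunk boundaries. It only shows that the policy is queried once per chunk rather than once per step, so a rollout of a given length involves fewer separate predictions.

```python
from typing import Callable, List, Sequence

Observation = Sequence[float]
Action = Sequence[float]


def run_chunked(
    policy: Callable[[Observation, int], List[Action]],
    get_obs: Callable[[], Observation],
    apply_action: Callable[[Action], None],
    horizon: int,
    chunk_size: int,
) -> int:
    """Roll out `horizon` steps, querying the policy once per action chunk.

    Returns the number of policy queries that were needed.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    queries = 0
    t = 0
    while t < horizon:
        # Re-observe at each chunk boundary, then predict a block of actions.
        chunk = policy(get_obs(), chunk_size)
        if not chunk:
            raise ValueError("policy returned an empty chunk")
        queries += 1
        # Execute the predicted actions in order, truncating at the horizon.
        for action in chunk[: horizon - t]:
            apply_action(action)
            t += 1
    return queries
```

With `chunk_size=1` this reduces to ordinary single-step control, which makes it a convenient baseline when comparing chunked and non-chunked policies.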
If this is right
- Precise bimanual manipulation becomes feasible on inexpensive hardware without specialized force sensors or calibration procedures.
- Imitation learning policies can succeed on long-horizon tasks despite non-stationary human demonstrations when action sequences are modeled generatively.
- Visual feedback alone suffices for closed-loop control on tasks requiring careful contact forces.
- Data collection effort drops to short sessions of roughly ten minutes while still yielding high success rates across multiple tasks.
Where Pith is reading between the lines
- The chunking approach may extend to other robotic control problems that involve predicting extended action sequences.
- Lowering hardware costs could broaden access to fine manipulation capabilities for non-industrial settings.
- Combining ACT with additional sensing modalities might further improve reliability on even harder variants of the tasks.
Load-bearing premise
The custom teleoperation interface produces high-quality, consistent demonstrations that capture the necessary precision and force coordination without introducing human-induced biases or noise that the learning algorithm cannot overcome.
What would settle it
Retraining and testing the same tasks with demonstrations collected from a lower-quality or noisier teleoperation interface, then measuring whether success rates fall below 80%, would directly test whether the claim holds.
Original abstract
Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations. Project website: https://tonyzhaozh.github.io/aloha/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a low-cost bimanual robot equipped with a custom teleoperation interface for collecting real-world demonstrations, combined with the novel Action Chunking with Transformers (ACT) algorithm, enables end-to-end imitation learning of fine-grained manipulation tasks. ACT models generative distributions over action chunks to mitigate compounding errors and non-stationary demonstrations, allowing 80-90% success rates on six contact-rich tasks (e.g., opening a translucent condiment cup, slotting a battery) using only 10 minutes of data on imprecise hardware.
Significance. If the empirical results hold after verification of demonstration quality and controls, the work would demonstrate that imitation learning with chunked generative policies can achieve high-precision bimanual performance on inexpensive platforms without specialized sensors or calibration. This has clear implications for accessibility in robotics, providing concrete real-world evidence on tasks that typically demand high-end setups.
major comments (2)
- [Abstract] Abstract: The headline result that ACT enables 80-90% success with 10 min of demonstrations rests on the unverified assumption that the custom teleoperation interface supplies high-quality, low-bias demonstrations encoding precise contact forces and closed-loop coordination. No independent metrics (trajectory variance, force profiles, inter-demonstrator consistency) or ablations separating interface quality from policy performance are reported, leaving open the possibility that the interface itself supplies the critical precision rather than the learning algorithm.
- [Experiments] Experiments section (inferred from reported success rates): Success rates on the six tasks are presented without baselines, ablations, or statistical tests, as highlighted in the review. This makes it impossible to assess whether the central claim—that ACT on low-cost hardware is responsible for the performance—holds or whether post-hoc tuning or task selection inflates the numbers.
minor comments (1)
- [Abstract] The project website link is provided but no supplementary video or code repository is referenced in the abstract; adding these would aid reproducibility.
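One of the demonstration-quality metrics named in the first major comment, trajectory variance, can be sketched as follows. This is an illustrative assumption about how such a metric might be computed, not a procedure from the paper: each demonstration is treated as a one-dimensional joint trajectory, resampled to a common length by linear interpolation, and compared across demonstrations step by step. The function names and the default of 50 resampled points are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence


def resample(traj: Sequence[float], n_points: int) -> List[float]:
    """Linearly resample a 1-D trajectory to n_points samples (n_points >= 2)."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    if len(traj) == 1:
        return [float(traj[0])] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(traj) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(traj) - 1)
        frac = pos - lo
        out.append(traj[lo] * (1 - frac) + traj[hi] * frac)
    return out


def mean_cross_demo_variance(
    demos: Sequence[Sequence[float]], n_points: int = 50
) -> float:
    """Average, over time steps, of the variance across demonstrations."""
    aligned = [resample(d, n_points) for d in demos]
    per_step = [pvariance([a[t] for a in aligned]) for t in range(n_points)]
    return sum(per_step) / n_points
```

A low value indicates that demonstrations follow similar paths after time normalization. Computing it separately per operator would give a rough view of inter-demonstrator consistency.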
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and the opportunity to clarify our contributions. We address the two major comments below, committing to revisions where they strengthen the manuscript without misrepresenting our results.
Point-by-point responses
- Referee: [Abstract] Abstract: The headline result that ACT enables 80-90% success with 10 min of demonstrations rests on the unverified assumption that the custom teleoperation interface supplies high-quality, low-bias demonstrations encoding precise contact forces and closed-loop coordination. No independent metrics (trajectory variance, force profiles, inter-demonstrator consistency) or ablations separating interface quality from policy performance are reported, leaving open the possibility that the interface itself supplies the critical precision rather than the learning algorithm.
  Authors: The teleoperation interface is an integral component of the proposed low-cost system, as it enables collection of usable demonstrations on imprecise hardware without requiring high-end sensors. We acknowledge that the initial submission lacks explicit quantitative metrics on demonstration quality. We will add analysis of trajectory variance and inter-demonstrator consistency in the revised manuscript. Force profiles cannot be reported because the hardware lacks force sensors; the system relies on visual feedback instead. Full ablations isolating the interface from ACT would require new hardware setups, which we will discuss as a limitation rather than perform within this revision.
  revision: partial
- Referee: [Experiments] Experiments section (inferred from reported success rates): Success rates on the six tasks are presented without baselines, ablations, or statistical tests, as highlighted in the review. This makes it impossible to assess whether the central claim (that ACT on low-cost hardware is responsible for the performance) holds or whether post-hoc tuning or task selection inflates the numbers.
  Authors: We agree that the experiments section requires stronger validation. The manuscript already includes comparisons to standard behavior cloning, but we will expand it with additional baselines (e.g., non-chunked policies), architecture ablations, and statistical analysis including the number of evaluation trials, success-rate confidence intervals, and significance tests. These additions will clarify that the reported performance stems from the combination of the interface and ACT rather than task selection or tuning.
  revision: yes
- Direct force profiles cannot be provided because the low-cost hardware does not include force sensors.
Circularity Check
No circularity: empirical results from hardware experiments are independent of any fitted inputs or self-referential definitions.
Full rationale
The paper introduces the ACT algorithm as a novel generative model over action chunks to mitigate compounding errors in imitation learning, then validates it through real-world bimanual tasks on low-cost hardware using custom teleoperation demonstrations. Success rates (80-90%) are measured outcomes from physical rollouts, not quantities derived by construction from the training data or prior self-citations. No equations, uniqueness theorems, or ansatzes are presented that reduce the central claims to tautological inputs; the derivation chain consists of standard imitation learning setup plus a transformer-based policy whose performance is externally falsifiable via hardware metrics.
Axiom & Free-Parameter Ledger
free parameters (1)
- ACT model hyperparameters
axioms (1)
- domain assumption: Imitation learning from a small number of human demonstrations can generalize to new task instances on physical hardware
invented entities (1)
- Action Chunking with Transformers (ACT): no independent evidence
Forward citations
Cited by 60 Pith papers
-
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.
-
Point Tracking Improves World Action Models
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
-
Understanding Multimodal Failure in Action-Chunking Behavioral Cloning
The paper identifies distinct failure mechanisms: excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.
-
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
-
DSSP: Diffusion State Space Policy with Full-History Encoding
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size...
-
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
-
Morphologically Equivariant Flow Matching for Bimanual Mobile Manipulation
A morphologically equivariant flow matching policy for bimanual robots enforces reflective symmetry to improve sample efficiency and enable zero-shot generalization to mirrored task configurations.
-
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
A vision-language policy learns state-conditioned commitment depth to Pareto-dominate fixed-depth baselines on long-horizon puzzles, achieving up to 12.5 pp higher solve rate with 25% fewer actions.
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
-
PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN
PhySPRING uses differentiable GNNs to learn hierarchical coarsened spring-mass topologies and parameters from observations, delivering up to 2.3x speedup on PhysTwin benchmarks and comparable robot policy success rate...
-
BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly
BrickCraft composes reusable visuomotor skills via relative anchoring to partial structures and situated visual manuals to achieve long-horizon interlocking brick assembly from limited demonstrations with generalizati...
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
Shared Autonomy Assisted by Impedance-Driven Anisotropic Guidance Field
IAGF-SA adds a physically-grounded channel to shared autonomy by modulating robot impedance to convey intent, improving task performance, agreement, and user experience in three scenarios per user studies.
-
OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction
A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.
-
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...
-
3D Generation for Embodied AI and Robotic Simulation: A Survey
3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.
-
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...
-
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
-
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
-
FingerEye: Learning Dexterous Manipulation with Continuous Vision-Tactile Sensing
FingerEye delivers continuous vision-tactile sensing via binocular RGB cameras and marker-tracked compliant ring deformation, supporting imitation learning policies that generalize across object variations for tasks l...
-
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
-
Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation
ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving hi...
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.
-
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...
-
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance
TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.
-
RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation
RoboCOIN is a large multi-embodiment bimanual manipulation dataset with hierarchical annotations and an open processing pipeline that improves model performance across robotic platforms.
-
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
-
Rodrigues Network for Learning Robot Actions
Proposes Rodrigues Network using a learnable Neural Rodrigues Operator to add kinematic inductive biases for improved robot action learning and prediction.
-
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
-
TacO: Benchmarking Tactile Sensors for Object Manipulation
The paper provides a task-driven benchmark comparing visual, acoustic, magnetic, and resistive tactile sensors on three manipulation tasks and concludes that sensor utility depends on modality, material friction, and ...
-
COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones
COBALT enables scalable crowdsourced teleoperation of robots using smartphones, supporting concurrent users with low latency and yielding a 7500+ demonstration dataset validated on imitation learning tasks.
-
COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones
COBALT provides scalable cloud infrastructure for crowdsourced robot teleoperation via smartphones, supporting concurrent users with low latency and enabling collection of a 7500+ demonstration dataset validated throu...
-
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
-
HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds
HCLM presents a hierarchical architecture that uses an SE(3)-invariant diffusion policy for coordination and a hybrid whole-body controller with MPC and admittance control for safe closed-chain loco-manipulation on du...
-
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
DexJoCo is a benchmark and toolkit with 11 functionally grounded tasks, 1.1K trajectories, and empirical benchmarks for task-oriented dexterous manipulation on MuJoCo.
-
Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation
VLA-AD distills 7B VLA teachers into 158M students using offline VLM semantic guidance on task phases and directions, matching teacher performance on LIBERO with 44x size reduction and 3.28x speedup.
-
Learning Sim-Grounded Policies for Bimanual Rope Manipulation from Human Teleoperation Data
A simulation-grounded state policy using 3D particle dynamics outperforms an egocentric vision policy by 30.8% in L1 error on unseen rope configurations for bimanual manipulation from limited human data.
-
FLASH: Efficient Visuomotor Policy via Sparse Sampling
FLASH Policy uses sparse Legendre polynomial trajectory fitting and history-anchored flow matching to enable single-step inference for visuomotor control, reporting 31.4 ms per-episode latency and >=92% success on fiv...
-
SID: Sliding into Distribution for Robust Few-Demonstration Manipulation
SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while pr...
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
-
DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions
DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
-
Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning
Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.
Reference graph
Works this paper leans on
-
[1]
Viperx 300 robot arm 6dof. URL https://www. trossenrobotics.com/viperx-300-robot-arm-6dof.aspx
-
[2]
Widowx 250 robot arm 6dof. URL https://www. trossenrobotics.com/widowx-250-robot-arm-6dof.aspx
-
[3]
URL https://www.youtube.com/watch?v= TearcKVj0iY
Highly dexterous manipulation system - capabilities - part 1, Nov 2014. URL https://www.youtube.com/watch?v= TearcKVj0iY
work page 2014
-
[4]
Assembly performance metrics and test methods, Apr 2022. URL https://www. nist.gov/el/intelligent-systems-division-73500/ robotic-grasping-and-manipulation-assembly/assembly
work page 2022
-
[5]
Teleoperated robots - shadow teleoperation system, Nov
-
[6]
URL https://www.shadowrobot.com/teleoperation/
-
[7]
Holo-dex: Teaching dexterity with immersive mixed reality,
Sridhar Pandian Arunachalam, Irmak Güzey, Soumith Chintala, and Lerrel Pinto. Holo-dex: Teaching dex- terity with immersive mixed reality. arXiv preprint arXiv:2210.06463, 2022
-
[8]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan C. Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang- Huei Lee, Sergey Levine, Yao Lu, U...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
End- to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nico- las Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. ArXiv, abs/2005.12872, 2020
-
[10]
Towards human-level bimanual dexterous manipulation with rein- forcement learning
Yuanpei Chen, Yaodong Yang, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Stephen McAleer, Hao Dong, Zongqing Lu, and Song-Chun Zhu. Towards human-level bimanual dexterous manipulation with rein- forcement learning. ArXiv, abs/2206.08686, 2022
-
[11]
Efficient bimanual manipulation using learned task schemas
Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, and Abhinav Kumar Gupta. Efficient bimanual manipulation using learned task schemas. 2020 IEEE International Conference on Robotics and Automation (ICRA) , pages 1149–1155, 2019
work page 2020
-
[12]
Transformers for one-shot visual imitation
Sudeep Dasari and Abhinav Kumar Gupta. Transformers for one-shot visual imitation. In Conference on Robot Learning, 2020
work page 2020
-
[13]
Causal confusion in imitation learning
Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. In Neural Information Processing Systems , 2019
work page 2019
-
[14]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. ArXiv, abs/1810.04805, 2019
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[15]
Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, P. Abbeel, and Wojciech Zaremba. One-shot imitation learning. ArXiv, abs/1703.07326, 2017
work page Pith review arXiv 2017
-
[16]
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. ArXiv, abs/2109.13396, 2021
work page internal anchor Pith review arXiv 2021
-
[17]
Florence, Lucas Manuelli, and Russ Tedrake
Peter R. Florence, Lucas Manuelli, and Russ Tedrake. Self- supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters , 5:492–499, 2019
work page 2019
- [18]
-
[19]
Learning dense visual correspondences in simulation to smooth and fold real fabrics
Aditya Ganapathi, Priya Sundaresan, Brijen Thananjeyan, Ashwin Balakrishna, Daniel Seita, Jennifer Grannen, Minho Hwang, Ryan Hoque, Joseph Gonzalez, Nawid Jamali, Katsu Yamane, Soshi Iba, and Ken Goldberg. Learning dense visual correspondences in simulation to smooth and fold real fabrics. 2021 IEEE International Conference on Robotics and Automation (IC...
work page 2021
-
[20]
Untangling dense knots by learning task-relevant keypoints
Jennifer Grannen, Priya Sundaresan, Brijen Thananjeyan, Jeffrey Ichnowski, Ashwin Balakrishna, Minho Hwang, Vainavi Viswanath, Michael Laskey, Joseph Gonzalez, and Ken Goldberg. Untangling dense knots by learning task-relevant keypoints. In Conference on Robot Learning, 2020
work page 2020
-
[21]
Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfold- ing, 2021
Huy Ha and Shuran Song. Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding. ArXiv, abs/2105.03655, 2021
-
[22]
Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan D. Ratliff, and Dieter Fox. Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system. 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170, 2019
-
[23]
Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015
-
[24]
Irina Higgins, Loïc Matthey, Arka Pal, Christopher P. Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016
-
[25]
Ryan Hoque, Ashwin Balakrishna, Ellen R. Novoseller, Albert Wilcox, Daniel S. Brown, and Ken Goldberg. Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning. In Conference on Robot Learning, 2021
-
[26]
Stephen James, Michael Bloesch, and Andrew J. Davison. Task-embedded control networks for few-shot imitation learning. ArXiv, abs/1810.03237, 2018
-
[27]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, 2022
-
[28]
Master–slave manipulators and remote maintenance at the Oak Ridge National Laboratory, Jan 1975
R G Jenness and C D Wicker. Master–slave manipulators and remote maintenance at the Oak Ridge National Laboratory, Jan 1975. URL https://www.osti.gov/biblio/4179544
-
[29]
Coarse-to-fine imitation learning: Robot manipulation from a single demonstration
Edward Johns. Coarse-to-fine imitation learning: Robot manipulation from a single demonstration. 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4613–4619, 2021
-
[30]
Liyiming Ke, Jingqiang Wang, Tapomayukh Bhattacharjee, Byron Boots, and Siddhartha Srinivasa. Grasping with chopsticks: Combating covariate shift in model-free imitation learning for fine manipulation. In International Conference on Robotics and Automation (ICRA), 2021
-
[31]
Michael Kelly, Chelsea Sidrane, K. Driggs-Campbell, and Mykel J. Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083, 2018
-
[32]
Gaze-based dual resolution deep imitation learning for high-precision dexterous robot manipulation
Heecheol Kim, Yoshiyuki Ohmura, and Yasuo Kuniyoshi. Gaze-based dual resolution deep imitation learning for high-precision dexterous robot manipulation. IEEE Robotics and Automation Letters, 6:1630–1637, 2021
-
[33]
Robot peels banana with goal-conditioned dual-action deep imitation learning
Heecheol Kim, Yoshiyuki Ohmura, and Yasuo Kuniyoshi. Robot peels banana with goal-conditioned dual-action deep imitation learning. ArXiv, abs/2203.09749, 2022
-
[34]
Auto-Encoding Variational Bayes
Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013
-
[35]
Towards learning hierarchical skills for multi-phase manipulation tasks
Oliver Kroemer, Christian Daniel, Gerhard Neumann, Herke van Hoof, and Jan Peters. Towards learning hierarchical skills for multi-phase manipulation tasks. 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1503–1510, 2015
-
[36]
Action chunking as policy compression, Sep 2022
Lucy Lai, Ann Z Huang, and Samuel J Gershman. Action chunking as policy compression, Sep 2022. URL psyarxiv.com/z8yrv
-
[37]
Michael Laskey, Jonathan Lee, Roy Fox, Anca D. Dragan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. In Conference on Robot Learning, 2017
-
[38]
Alex X. Lee, Henry Lu, Abhishek Gupta, Sergey Levine, and P. Abbeel. Learning force-based manipulation of deformable objects from multiple demonstrations. 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 177–184, 2015
-
[39]
Optimal control for biological movement systems
Weiwei Li. Optimal control for biological movement systems. 2006
-
[40]
Ajay Mandlekar, Danfei Xu, J. Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning, 2021
-
[41]
Kunal Menda, K. Driggs-Campbell, and Mykel J. Kochenderfer. Ensembledagger: A bayesian approach to safe imitation learning. 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5041–5048, 2018
-
[42]
Samuel Paradis, Minho Hwang, Brijen Thananjeyan, Jeffrey Ichnowski, Daniel Seita, Danyal Fer, Thomas Low, Joseph Gonzalez, and Ken Goldberg. Intermittent visual servoing: Efficiently learning policies robust to instrument changes for high-precision surgical manipulation. 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 7166–7173, 2020
-
[43]
The surprising effectiveness of representation learning for visual imitation
Jyothish Pari, Nur Muhammad, Sridhar Pandian Arunachalam, and Lerrel Pinto. The surprising effectiveness of representation learning for visual imitation. arXiv preprint arXiv:2112.01511, 2021
-
[44]
Learning and generalization of motor skills by learning from demonstration
Peter Pastor, Heiko Hoffmann, Tamim Asfour, and Stefan Schaal. Learning and generalization of motor skills by learning from demonstration. 2009 IEEE International Conference on Robotics and Automation, pages 763–768, 2009
- [45]
-
[46]
Yuzhe Qin, Hao Su, and Xiaolong Wang. From one hand to multiple hands: Imitation learning for dexterous manipulation from single-camera teleoperation. IEEE Robotics and Automation Letters, 7:10873–10881, 2022
-
[47]
Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3758–3765, 2017
-
[48]
Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2010
-
[49]
A unified framework for coordinated multi-arm motion planning
Seyed Sina Mirrazavi Salehian, Nadia Figueroa, and Aude Billard. A unified framework for coordinated multi-arm motion planning. The International Journal of Robotics Research, 37:1205–1232, 2018
-
[50]
Behavior Transformers: Cloning $k$ modes with one stone, October 2022
Nur Muhammad (Mahi) Shafiullah, Zichen Jeff Cui, Ariuntuya Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone. ArXiv, abs/2206.11251, 2022
-
[51]
Sgtm 2.0: Autonomously untangling long cables using interactive perception
Kaushik Shivakumar, Vainavi Viswanath, Anrui Gu, Yahav Avigal, Justin Kerr, Jeffrey Ichnowski, Richard Cheng, Thomas Kollar, and Ken Goldberg. Sgtm 2.0: Autonomously untangling long cables using interactive perception. ArXiv, abs/2209.13706, 2022
-
[52]
Cliport: What and where pathways for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. ArXiv, abs/2109.12098, 2021
-
[53]
Perceiver-actor: A multi-task transformer for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. ArXiv, abs/2209.05451, 2022
-
[54]
Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube
Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube. RSS, 2022
-
[55]
Christian Smith, Yiannis Karayiannidis, Lazaros Nalpantidis, Xavi Gratal, Peng Qi, Dimos V. Dimarogonas, and Danica Kragic. Dual arm manipulation - a survey. Robotics Auton. Syst., 60:1340–1353, 2012
-
[56]
Learning structured output representation using deep conditional generative models
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In NIPS, 2015
-
[57]
Shadow teleoperation system plays jenga, Mar 2021
srcteam. Shadow teleoperation system plays jenga, Mar 2021. URL https://www.youtube.com/watch?v=7K9brH27jvM
-
[58]
How researchers are using shadow robot’s technology, Jun 2022
srcteam. How researchers are using shadow robot’s technology, Jun 2022. URL https://www.youtube.com/watch?v=p36fYIoTD8M
-
[59]
Shadow teleoperation system, Jun 2022
srcteam. Shadow teleoperation system, Jun 2022. URL https://www.youtube.com/watch?v=cx8eznfDUJA
-
[60]
A system for imitation learning of contact-rich bimanual manipulation policies
Simon Stepputtis, Maryam Bandari, Stefan Schaal, and Heni Ben Amor. A system for imitation learning of contact-rich bimanual manipulation policies. 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11810–11817, 2022
-
[61]
Priya Sundaresan, Jennifer Grannen, Brijen Thananjeyan, Ashwin Balakrishna, Jeffrey Ichnowski, Ellen R. Novoseller, Minho Hwang, Michael Laskey, Joseph Gonzalez, and Ken Goldberg. Untangling dense non-planar knots by learning manipulation features and recovery policies. ArXiv, abs/2107.08942, 2021
-
[62]
Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Zhiwei Steven Wu. Causal imitation learning under temporally correlated noise. In International Conference on Machine Learning, 2022
-
[63]
Deep learning and the information bottleneck principle
Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. 2015 IEEE Information Theory Workshop (ITW), pages 1–5, 2015
-
[64]
Mujoco: A physics engine for model-based control
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012
-
[65]
Stephen Tu, Alexander Robey, Tingnan Zhang, and N. Matni. On the sample complexity of stability constrained imitation learning. In Conference on Learning for Dynamics & Control, 2021
-
[66]
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017
-
[67]
Solomon Wiznitzer, Luke Schmitt, and Matt Trossen. interbotix_ros_manipulators. URL https://github.com/Interbotix/interbotix_ros_manipulators
- [68]
-
[69]
Andy Zeng, Peter R. Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani, and Johnny Lee. Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, 2020
-
[70]
Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Ken Goldberg, and P. Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8, 2017
-
[71]
Allan Zhou, Moo Jin Kim, Lirui Wang, Peter R. Florence, and Chelsea Finn. Nerf in the palm of your hand: Corrective augmentation for robotics via novel-view synthesis. ArXiv, abs/2301.08556, 2023
-
[72]
The measurement of proprioceptive accuracy: A systematic literature review
Áron Horváth, Eszter Ferentzi, Kristóf Schwartz, Nina Jacobs, Pieter Meyns, and Ferenc Köteles. The measurement of proprioceptive accuracy: A systematic literature review. Journal of Sport and Health Science, 2022. ISSN 2095-2546. doi: https://doi.org/10.1016/j.jshs.2022.04
URL https://www.sciencedirect.com/science/article/pii/S2095254622000473
APPENDIX
A. Comparing ALOHA with Prior Teleoperation Setups
In Figure 9, we include more teleoperated tasks that ALOHA is capable of. We stress that all objects are taken directly from the real world without any modification, to demonstrate ALOHA’s generality in real life settings. A...