
arxiv: 2606.27375 · v1 · pith:ZZ73GVFKnew · submitted 2026-06-25 · 💻 cs.RO

Scalable Behavior Cloning with Open Data, Training, and Evaluation

Pith reviewed 2026-06-26 04:12 UTC · model grok-4.3

classification 💻 cs.RO
keywords: behavior cloning, teleoperation dataset, robotic manipulation, open-source stack, diffusion transformer, vision language action model, imitation learning, simulation to real transfer

The pith

An open-source stack called ABC supplies the largest public teleoperation dataset of 130K episodes to scale behavior cloning for robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ABC as a complete open-source collection of data, hardware instructions, training code, and simulation tools for teaching robots by imitation. Its central release is ABC-130K, which records 3,500 hours of human-teleoperated robot motion across 130K episodes and 195 tasks. The authors also supply 400 hours of simulated teleoperation data and a co-training procedure intended to make simulation results track real-robot performance. They test several model designs, including Diffusion Transformers and Vision-Language-Action networks, and show that the trained policies can fold boxes and pull credit cards from wallets. The explicit aim is to give every research group the same starting materials so that progress on imitation learning can be compared directly.

Core claim

We introduce ABC, a fully open-source stack for manipulation with behavior cloning. At its core is ABC-130K: the largest open-source teleoperation dataset to date, featuring 3,500 hours of data spanning over 130K episodes across 195 diverse tasks. Furthermore, we open-source our accessible hardware setup, training infrastructure, and simulation pipeline. We also release 400 hours of sim-teleop data and provide a co-training recipe that produces correlated simulation and real-world evaluation, offering a reliable proxy for ablating model-design and training decisions before costly real-world evaluation. We explore various training recipes and compare common architectural choices for Diffusion …

What carries the argument

ABC-130K, the teleoperation dataset of 130K episodes that supplies the training examples, paired with a co-training procedure that makes simulation results track real-robot performance.

If this is right

  • Policies trained on the released data can perform concrete dexterous actions such as folding boxes and removing credit cards from wallets.
  • The open hardware, training code, and simulation pipeline allow any lab to reproduce the same experimental conditions.
  • Model comparisons between Diffusion Transformers and Vision-Language-Action networks rest on measured real-world success rates rather than simulation alone.
  • The 400 hours of additional sim-teleop data can be mixed with real data during training without requiring new robot hardware.
  • The 195-task coverage supplies a broad base for testing generalization across manipulation skills.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Labs without access to proprietary robot fleets could now run the same ablation studies that previously required closed data.
  • The correlation between simulated and real performance could be tested on entirely new task families to see whether the proxy holds outside the original 195 tasks.
  • Releasing both the dataset and the exact training scripts creates a fixed benchmark that later papers can cite when claiming improvements.

Load-bearing premise

The co-training recipe produces simulation results that reliably predict which models and training choices will work best on real robots.

What would settle it

Run the same set of models on real robots after selecting the best ones by the co-training proxy; if the ranking of models by real success rate differs from the ranking produced by the proxy, the claim that the proxy is reliable is falsified.
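The ranking comparison described above can be made concrete with a rank correlation. The following is a minimal sketch, assuming Spearman's coefficient as the agreement measure (the paper's own statistic is not stated here); the function names and the example scores are hypothetical and are not results from the paper.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks (largest value gets rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(proxy_scores: Sequence[float], real_scores: Sequence[float]) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    if len(proxy_scores) != len(real_scores) or len(proxy_scores) < 2:
        raise ValueError("need two equally sized sequences with at least two entries")
    ra, rb = ranks(proxy_scores), ranks(real_scores)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("rank correlation is undefined when all scores are tied")
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    # Hypothetical per-model scores, for illustration only.
    sim_progress = [0.42, 0.55, 0.31, 0.60]
    real_progress = [0.35, 0.50, 0.38, 0.58]
    print(f"Spearman rho = {spearman(sim_progress, real_progress):.2f}")
```

A coefficient near 1 would mean the proxy orders models the same way real-robot evaluation does; a low or negative value would count against the proxy's reliability.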

Figures

Figures reproduced from arXiv: 2606.27375 by Adam Rashid, Angjoo Kanazawa, Arthur Allshire, David McAllister, Fred Shentu, Guanya Shi, Himanshu Gaurav Singh, Hongsuk Choi, Huang Huang, Jitendra Malik, Justin Yu, Philipp Wu, Phillip Isola, Pieter Abbeel, Ritvik Singh, Rocky Duan, Xi Chen, Yiyuan Chen.

Figure 2
An overview of the ABC stack. ABC-130K provides large-scale real-world teleoperation data. ABC-Models instantiates DiT and VLA policies and studies architecture and compute-scaling choices through real-world ablations. ABC-Sim provides simulation environments and data for studying sim-to-real correlation. ABC-Eval provides a large-scale real-world evaluation suite with rollouts and rubrics. The full stack …
Figure 3
Figure 3 shows random samples of the top camera frames from episodes in the data. We also provide teleoperation metadata, including anonymized teleoperator IDs and collection timestamps. In addition, a 1,552-hour subset of ABC-130K includes subgoal annotations: contiguous subtrajectories labeled with subtask descriptions. In Appendix H, we show how these metadata and annotations can be used for policy conditioni…
Figure 4
Per-task hours by primitive category. Hours-per-task on a log scale in descending order within each category. Top row: Pick-and-Place and Fine Pick-and-Place (the two largest categories in terms of number of tasks). Bottom row: the five remaining categories. Descriptions of each task category along with example images from the data can be found in Appendix A.
Figure 5
More training steps and larger batch size give better performance. Real-world task success (left) and task progress (right) with training compute for DiT and VLA policies; each connected pair uses the same effective batch size (approximately 1.5K, 4.6K, or 9K) evaluated at different checkpoints. We consistently find that the DiT is more FLOP-efficient.
Figure 6
Eight diffusion draws reduce VLA train loss at fixed accelerator time. Training loss versus GPU-hours for one- and eight-draw VLA diffusion training.
[Plot residue from an adjacent figure: train loss, validation loss, and real progress (%) versus optimization step.]
Figure 8
Offline metrics versus real-world policy performance. We analyze correlation between checkpoint training diagnostics and real-world success. Training loss and validation action error are both negatively correlated with real-world performance. Each point is one checkpoint, with real-world success averaged across the three evaluation tasks.
Figure 9
Sim-to-real performance correlation. Task-level simulation progress correlation with real-world progress across DiT and VLA checkpoints. Offline metrics allow us to assess many decisions cheaply but cannot report task success or reveal failure modes. Simulation closes this gap as it yields actual success rates and watchable rollouts, and lets researchers without hardware iterate. Our release includes ABC-S…
Figure 10
ABC-Sim tasks. Policy rollouts for our simulated tasks built in MuJoCo for sim-to-real evaluation. The multi-step turning mugs right-side up task, shown across three stages (grasp, flip, place). Rollouts on more simulated tasks can be found in the Appendix. By default we render the rollouts in MuJoCo but we also provide a Blender pipeline for generating high-quality renders. create higher-fidelity images …
Figure 11
ABC-DiT/ABC-VLA real-world performance. Task progress across the three real-world tasks.
[Plot residue from an adjacent figure: sim progress (%) versus checkpoint step (k) for Bottles, Dishrack, and Mug flip; series DiT 9.2kBS and VLA 9.2kBS.]
Figure 15
The robot autonomously folds a cardboard box and closes the lid. DAgger intervention data is crucial for achieving good performance on this long-horizon, precise manipulation task. To demonstrate the ability of our stack to learn very dexterous tasks, we perform box folding with our policies. We start with the pretrained ABC-DiT model which cannot achieve any real-world success on this dexterous task. No…
Figure 14
Box folding: effect of DAgger. Average task progress for a finetuned policy, before and after DAgger.
Figure 16
Representative top-camera frames for each task category. Pick-and-Place (row 1, all 5): Packing luggage, Place fruits in bag, Organize toys on shelf, Place snacks in bag, Place glasses in tray. Fine Pick-and-Place (row 2, all 5): Build block tower, Load dishrack, Place coffee filter, Insert credit cards, Set up chess pieces. Folding (row 3, cells 1–4): Fold paper box, Fold skirt pile, Fold t-shirt pile, R…
Figure 17
Efficient frame access for fast data loading. Encoding keyframes more frequently and in a manner that allows for an analytically reconstructed frame index makes random frame access nearly free. To read one frame from a video, torchcodec’s default scans the entire file to build its frame index (top). Correctly encoding the file allows us to compute the index analytically, meaning we only need to read the f…
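The idea of an analytically computed frame index can be illustrated with a small sketch. It assumes a constant frame rate and a keyframe exactly every `gop` frames, so that the keyframe preceding any target frame follows from integer arithmetic alone. The function and field names are hypothetical, no torchcodec API is used, and the paper's actual encoding settings and loader may differ.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class FrameLocation:
    keyframe_index: int      # nearest keyframe at or before the target frame
    frames_to_decode: int    # frames decoded from that keyframe through the target
    seek_time_s: Fraction    # presentation time of the keyframe
    target_time_s: Fraction  # presentation time of the target frame


def locate_frame(frame_index: int, fps: Fraction, gop: int) -> FrameLocation:
    """Locate a frame without scanning the file.

    Assumes (hypothetically) constant frame rate and a fixed keyframe
    interval of `gop` frames starting at frame 0.
    """
    if frame_index < 0 or gop <= 0 or fps <= 0:
        raise ValueError("frame_index must be >= 0; gop and fps must be positive")
    keyframe_index = (frame_index // gop) * gop
    return FrameLocation(
        keyframe_index=keyframe_index,
        frames_to_decode=frame_index - keyframe_index + 1,
        seek_time_s=Fraction(keyframe_index) / fps,
        target_time_s=Fraction(frame_index) / fps,
    )


if __name__ == "__main__":
    # Same target frame under a short and a long keyframe interval.
    print(locate_frame(95, Fraction(30), gop=30))
    print(locate_frame(95, Fraction(30), gop=250))
```

Under these assumptions a shorter keyframe interval bounds the work per random read: at most `gop` frames are decoded, regardless of where the target sits in the file.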
Figure 18
Random-frame decode throughput. Decodes per second versus dataloader workers on a local filesystem. Fixing the encoding options and dataloader args improves throughput.
[Plot residue from an adjacent figure: MB per decode (log scale) for naive (GOP 250, no CFR) versus non-naive (GOP 30 + CFR) encoding; labeled values 9.75 and 0.14.]
Figure 20
ABC-VLA pooled adaLN architecture. ABC-VLA uses a Gemma 3 VLM backbone, attention-pools the final VLM hidden states into a compact feature vector, and projects the result to an adaLN conditioning vector for the lightweight DiT action head.

Model     Parameters                Training TFLOPs / sample
          Backbone   Action head    Backbone   Action head   Total
ABC-DiT   85.7M      1.93B          0.329      0.349         0.678
ABC-VLA   4.3B       44.7M          6.957      0.063         7.020
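The per-sample compute figures in the table accompanying this caption can be turned into a rough training-budget estimate. The sketch below uses only the TFLOPs-per-sample values listed there; the helper names are hypothetical, and the step count in the usage example is an arbitrary illustration rather than a setting reported by the paper (the batch size of 9,216 is the global batch size mentioned in the Figure 21 caption).

```python
# Training TFLOPs per sample, copied from the table in the Figure 20 caption.
TFLOPS_PER_SAMPLE = {
    "ABC-DiT": {"backbone": 0.329, "action_head": 0.349},
    "ABC-VLA": {"backbone": 6.957, "action_head": 0.063},
}


def total_tflops_per_sample(model: str) -> float:
    """Sum backbone and action-head compute for one training sample."""
    parts = TFLOPS_PER_SAMPLE[model]
    return parts["backbone"] + parts["action_head"]


def training_compute_tflops(model: str, steps: int, batch_size: int) -> float:
    """Estimate cumulative training compute as steps x batch size x TFLOPs per sample."""
    if steps < 0 or batch_size <= 0:
        raise ValueError("steps must be >= 0 and batch_size must be positive")
    return steps * batch_size * total_tflops_per_sample(model)


if __name__ == "__main__":
    for name in TFLOPS_PER_SAMPLE:
        print(f"{name}: {total_tflops_per_sample(name):.3f} TFLOPs per sample")
    ratio = total_tflops_per_sample("ABC-VLA") / total_tflops_per_sample("ABC-DiT")
    print(f"ABC-VLA uses about {ratio:.1f}x the per-sample compute of ABC-DiT")
    # Hypothetical budget: 100,000 steps at global batch size 9,216.
    print(f"{training_compute_tflops('ABC-DiT', 100_000, 9_216):.3e} TFLOPs")
```

The roughly tenfold gap in per-sample compute is one way to read the caption's split between a large VLM backbone with a lightweight action head (ABC-VLA) and a small backbone with a large action head (ABC-DiT).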
Figure 21
ABC-DiT model-size scaling. Train loss versus optimizer steps (left) and versus cumulative training compute (right) for four ABC-DiT sizes trained with identical hyperparameters and global batch size 9,216. At a fixed compute or step budget the larger DiTs reach lower train loss. The total number of parameters in S/B/L/xL are 153M, 290M, 746M, and 1.93B respectively.
Figure 22
Hardware setup. (a) Bimanual YAM workstation with two 6-DoF arms, three RealSense D405 cameras, and white enclosure walls. (b) The FlexPoint gripper used in a subset of our dataset. (c) Policies using our dataset can be deployed in in-the-wild settings. and (ii) isolates fine-manipulation learning from the confound of background generalization. We note, however, that despite using only data collected in t…
Figure 23
Inference optimizations for ABC-DiT. Profiling traces for ABC-DiT inference with 10 diffusion denoising steps. All timings are measured on an NVIDIA GeForce RTX 5090.
[Trace residue, labeled "Inference Trace for ABC-VLA (10 diffusion steps)": Eager 47.8 ms (20.9 Hz, 59% GPU active); Separate compile 22.6 ms (44.2 Hz, 88% GPU active); Fullgraph compile 17.5 ms (57.2 Hz, 94% GPU active). Legend: GPU kernels, GPU downtime, CPU calls. Horizontal axis: time.]
Figure 24
Inference optimizations for ABC-VLA. Profiling traces for ABC-VLA inference with 10 diffusion denoising steps. All timings are measured on an NVIDIA GeForce RTX 5090. to avoid recomputation every diffusion step, and leveraging torch.compile. While the first two are fairly ubiquitous, torch.compile performance can vary depending on how it is used. When wrapping the model in a compile block, we pass in full…
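The latencies and rates printed in the trace labeled "Inference Trace for ABC-VLA" (listed under Figure 23 above) are related by a simple reciprocal. The sketch below recomputes the rates from the rounded millisecond values; the dictionary keys and helper names are hypothetical. Because the latencies are rounded to 0.1 ms, the recomputed rates can differ from the printed ones by up to about 0.1 Hz.

```python
# Per-inference latencies in milliseconds, copied from the trace residue.
LATENCY_MS = {
    "eager": 47.8,
    "separate_compile": 22.6,
    "fullgraph_compile": 17.5,
}


def control_rate_hz(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to calls per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


if __name__ == "__main__":
    for mode, ms in LATENCY_MS.items():
        factor = speedup(LATENCY_MS["eager"], ms)
        print(f"{mode}: {ms} ms, {control_rate_hz(ms):.1f} Hz, {factor:.2f}x vs eager")
```

Read this way, the fullgraph-compiled trace runs at a bit under three times the eager rate, which matches the caption's remark that how torch.compile is applied affects the result.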
Figure 25
Impact of Operator conditioning. Top row: folds from training demonstrations contributed by Op-0 (more proficient operator) left and Op-1 (less proficient operator) right. Bottom row: representative folds from the same trained policy under Op-0 (left) and Op-1 (right) conditioning. Although the policy is the same, varying only the operator prompt produces qualitatively distinct fold outcomes matching the…
Figure 26
Prefix conditioning can suppress visual responsiveness. Prefixing chunk generation on recent actions can bias the generated motion toward the recently executed trajectory, while the unconditioned generation remains more responsive to the current visual observation. Left: In the first chunk, the robot fails to pick the bottle. Right: In the next chunk, we compare RTC prefix-conditioned and unconditional g…

Original abstract

We introduce ABC, a fully open-source stack for manipulation with behavior cloning. At its core is ABC-130K: the largest open-source teleoperation dataset to date, featuring 3,500 hours of data spanning over 130K episodes across 195 diverse tasks. Furthermore, we open-source our accessible hardware setup, training infrastructure, and simulation pipeline. We also release 400 hours of sim-teleop data and provide a co-training recipe that produces correlated simulation and real-world evaluation, offering a reliable proxy for ablating model-design and training decisions before costly real-world evaluation. We explore various training recipes and compare common architectural choices for Diffusion Transformers (DiT) and Vision-Language-Action (VLA) models, grounding our findings in real-world evaluations. The resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets. By providing a reproducible toolkit, we aim to place researchers on an equal footing, establishing the necessary foundation to learn the ABCs of Behavior Cloning together as a community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces ABC, a fully open-source stack for robotic manipulation via behavior cloning. Its primary contribution is the ABC-130K dataset—the largest open teleoperation dataset to date—with 3,500 hours of data across >130K episodes and 195 tasks. The authors also release their hardware setup, training infrastructure, simulation pipeline, and 400 hours of sim-teleop data. They describe a co-training recipe asserted to yield correlated simulation and real-world evaluations (serving as a proxy for model/training ablations), compare DiT and VLA architectures, and report that resulting policies execute dexterous tasks such as box folding and credit-card extraction from wallets.

Significance. If the scale, diversity, and openness of the released dataset, hardware, code, and simulation resources are as described, the work could meaningfully lower barriers to reproducible research in scalable behavior cloning for manipulation. The explicit data and tooling release is a concrete strength that directly addresses community needs for shared benchmarks and training pipelines.

major comments (2)
  1. [Abstract] Abstract: the claim that the co-training recipe 'produces correlated simulation and real-world evaluation, offering a reliable proxy for ablating model-design and training decisions' is unsupported by any reported quantitative evidence (correlation coefficients, scatter plots, per-task success tables, or coverage statistics across the 195 tasks). This assertion is load-bearing for the recipe's advertised utility.
  2. [Abstract] Abstract: the statement that 'the resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets' and that findings are 'ground[ed] in real-world evaluations' supplies no success rates, trial counts, baselines, error bars, or exclusion criteria, leaving the architectural and training-recipe comparisons without visible empirical grounding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment on the abstract below and will revise accordingly to ensure claims are properly supported.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the co-training recipe 'produces correlated simulation and real-world evaluation, offering a reliable proxy for ablating model-design and training decisions' is unsupported by any reported quantitative evidence (correlation coefficients, scatter plots, per-task success tables, or coverage statistics across the 195 tasks). This assertion is load-bearing for the recipe's advertised utility.

    Authors: We agree the abstract statement lacks the requested quantitative metrics. The manuscript reports comparative simulation and real-world results but does not include explicit correlation coefficients or scatter plots. In revision we will either add these metrics drawn from the existing evaluations or revise the abstract language to avoid overclaiming correlation as a reliable proxy. revision: yes

  2. Referee: [Abstract] Abstract: the statement that 'the resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets' and that findings are 'ground[ed] in real-world evaluations' supplies no success rates, trial counts, baselines, error bars, or exclusion criteria, leaving the architectural and training-recipe comparisons without visible empirical grounding.

    Authors: We agree the abstract is insufficiently specific. The full manuscript contains real-world evaluation results for the cited tasks, but the abstract does not report success rates or trial details. We will revise the abstract to reference the quantitative results from the evaluations section or remove the unsupported phrasing. revision: yes

Circularity Check

0 steps flagged

No circularity; paper is a data/infrastructure release with no derivation chain

full rationale

The paper introduces an open dataset (ABC-130K), hardware setup, simulation pipeline, and a co-training recipe for behavior cloning. Its central statements concern the scale of released data and the empirical outcome of the recipe ('produces correlated simulation and real-world evaluation'). No equations, fitted parameters, or predictions are presented that reduce by construction to the paper's own inputs. No self-citations are invoked to justify uniqueness theorems or ansatzes. The contribution is the explicit release of artifacts rather than a result obtained by fitting or deriving from itself, rendering the work self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data-release and infrastructure paper; the abstract introduces no mathematical derivations, fitted constants, or new physical entities.

pith-pipeline@v0.9.1-grok · 5773 in / 1146 out tokens · 73820 ms · 2026-06-26T04:12:04.974434+00:00 · methodology


Appendix excerpt: operator-conditioning modes compared at inference

  • Conditioning on Op-0, the highest-volume operator, whose training demonstrations are characterized by short, deliberate execution (mean episode duration 59 s).
  • Conditioning on the task prompt alone, which marginalizes over operators by reverting to the training-time dropout target.
  • Conditioning on Op-1, a long-duration operator with 226 episodes of training data whose demonstrations average 205 s per episode and exhibit lower fold quality.

Marginalized inference improves over the unconditioned baseline, indicating that operator-ID conditioning at training time is strictly beneficial even when the operator channel is not used at i...