Scalable Behavior Cloning with Open Data, Training, and Evaluation

Adam Rashid; Angjoo Kanazawa; Arthur Allshire; David McAllister; Fred Shentu; Guanya Shi; Himanshu Gaurav Singh; Hongsuk Choi; Huang Huang; Jitendra Malik

arxiv: 2606.27375 · v1 · pith:ZZ73GVFKnew · submitted 2026-06-25 · 💻 cs.RO

Scalable Behavior Cloning with Open Data, Training, and Evaluation

Arthur Allshire , Himanshu Gaurav Singh , Ritvik Singh , Adam Rashid , Hongsuk Choi , David McAllister , Justin Yu , Yiyuan Chen

show 10 more authors

Huang Huang Pieter Abbeel Xi Chen Rocky Duan Phillip Isola Jitendra Malik Fred Shentu Guanya Shi Philipp Wu Angjoo Kanazawa

This is my paper

Pith reviewed 2026-06-26 04:12 UTC · model grok-4.3

classification 💻 cs.RO

keywords behavior cloningteleoperation datasetrobotic manipulationopen-source stackdiffusion transformervision language action modelimitation learningsimulation to real transfer

0 comments

The pith

An open-source stack called ABC supplies the largest public teleoperation dataset of 130K episodes to scale behavior cloning for robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ABC as a complete open-source collection of data, hardware instructions, training code, and simulation tools for teaching robots by imitation. Its central release is ABC-130K, which records 3500 hours of human-controlled robot motion across 130K episodes and 195 different tasks. The authors also supply 400 hours of simulated teleoperation data and a co-training procedure that aligns simulation results with real-robot performance. They test several model designs, including Diffusion Transformers and Vision-Language-Action networks, and show that the trained policies can fold boxes and pull credit cards from wallets. The explicit aim is to give every research group the same starting materials so that progress on imitation learning can be compared directly.

Core claim

What carries the argument

ABC-130K, the teleoperation dataset of 130K episodes that supplies the training examples, paired with a co-training procedure that makes simulation results track real-robot performance.

If this is right

Policies trained on the released data can perform concrete dexterous actions such as folding boxes and removing credit cards from wallets.
The open hardware, training code, and simulation pipeline allow any lab to reproduce the same experimental conditions.
Model comparisons between Diffusion Transformers and Vision-Language-Action networks rest on measured real-world success rates rather than simulation alone.
The 400 hours of additional sim-teleop data can be mixed with real data during training without requiring new robot hardware.
The 195-task coverage supplies a broad base for testing generalization across manipulation skills.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Labs without access to proprietary robot fleets could now run the same ablation studies that previously required closed data.
The correlation between simulated and real performance could be tested on entirely new task families to see whether the proxy holds outside the original 195 tasks.
Releasing both the dataset and the exact training scripts creates a fixed benchmark that later papers can cite when claiming improvements.

Load-bearing premise

The co-training recipe produces simulation results that reliably predict which models and training choices will work best on real robots.

What would settle it

Run the same set of models on real robots after selecting the best ones by the co-training proxy; if the ranking of models by real success rate differs from the ranking produced by the proxy, the claim that the proxy is reliable is falsified.

Figures

Figures reproduced from arXiv: 2606.27375 by Adam Rashid, Angjoo Kanazawa, Arthur Allshire, David McAllister, Fred Shentu, Guanya Shi, Himanshu Gaurav Singh, Hongsuk Choi, Huang Huang, Jitendra Malik, Justin Yu, Philipp Wu, Phillip Isola, Pieter Abbeel, Ritvik Singh, Rocky Duan, Xi Chen, Yiyuan Chen.

**Figure 2.** Figure 2: An overview of the ABC stack. ABC-130K provides large-scale real-world teleoperation data. ABC-Models instantiates DiT and VLA policies and studies architecture and compute-scaling choices through real-world ablations. ABC-Sim provides simulation environments and data for studying sim-to-real correlation. ABC-Eval provides a large-scale real-world evaluation suite with rollouts and rubrics. The full stack … view at source ↗

**Figure 3.** Figure 3: shows random samples of the top camera frames from episodes in the data. We also provide teleoperation metadata, including anonymized teleoperator IDs and collection timestamps. In addition, a 1,552-hour subset of ABC-130K includes subgoal annotations: contiguous subtrajectories labeled with subtask descriptions. In Appendix H, we show how these metadata and annotations can be used for policy conditioni… view at source ↗

**Figure 4.** Figure 4: Per-task hours by primitive category. Hours-per-task on a log scale in descending order within each category. Top row: Pick-and-Place and Fine Pick-and-Place (the two largest categories in terms of number of tasks). Bottom row: the five remaining categories. Descriptions of each task category along with example images from the data can be found in Appendix A. VLA variants DiT variants Task Metric Pooled ad… view at source ↗

**Figure 5.** Figure 5: More training steps and larger batch size give better performance. Real-world task success (left) and task progress (right) with training compute for DiT and VLA policies; each connected pair uses the same effective batch size (approximately 1.5K, 4.6K, or 9K) evaluated at different checkpoints. We consistently find that the DiT is more flop-efficient. 0 500 1,000 1,500 2,000 2,500 3,000 0.05 0.06 0.08 0.1… view at source ↗

**Figure 6.** Figure 6: Eight diffusion draws reduce VLA train loss at fixed accelerator time. Training loss versus GPU-hours for one- and eight-draw VLA diffusion training. 0 100k 200k 0.04 0.05 0.06 0.08 0.1 Optimization step Loss (Train/Val.) 0% 20% 40% 60% Real progress (%) Train Loss Val Loss Real Progress [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 8.** Figure 8: Offline metrics versus real-world policy performance. We analyze correlation between checkpoint training diagnostics and real-world success. Training loss and validation action error are both negatively correlated with real-world performance. Each point is one checkpoint, with real-world success averaged across the three evaluation tasks. 4 ABC-Sim 10 20 30 40 50 60 10 20 30 40 40 50 60 70 80 50 60 70 n = … view at source ↗

**Figure 9.** Figure 9: Sim-to-real performance correlation. Task-level simulation progress correlation with real-world progress across DiT and VLA checkpoints. Offline metrics allow us to assess many decisions cheaply but cannot report task success or reveal failure modes. Simulation closes this gap as it yields actual success rates and watchable rollouts, and lets researchers without hardware iterate. Our release includes ABC-S… view at source ↗

**Figure 10.** Figure 10: ABC-Sim tasks. Policy rollouts for our simulated tasks built in MuJoCo for sim-to-real evaluation. The multi-step turning mugs right-side up task, shown across three stages (grasp, flip, place). Rollouts on more simulated tasks can be found in the Appendix. By default we render the rollouts in MuJoCo but we also provide a Blender pipeline for generating high-quality renders. create higher-fidelity images … view at source ↗

**Figure 11.** Figure 11: ABC-DiT/ABC-VLA real-world performance. Task progress across the three real-world tasks. 0 50 100 150 200 0 25 50 75 100 Checkpoint step (k) Sim progress (%) Bottles 0 50 100 150 200 Checkpoint step (k) Dishrack 0 50 100 150 200 Checkpoint step (k) Mug flip DiT 9.2kBS VLA 9.2kBS [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

**Figure 15.** Figure 15: The robot autonomously folds a cardboard box and closes the lid. DAgger intervention data is crucial for achieving good performance on this long-horizon, precise manipulation task. To demonstrate the ability of our stack to learn very dexterous tasks, we perform box folding with our policies. We start with the pretrained ABC-DiT model which cannot achieve any real-world success on this dexterous task. No… view at source ↗

**Figure 14.** Figure 14: Box folding: effect of DAgger. Average task progress for a finetuned policy, before and after DAgger. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_14.png] view at source ↗

**Figure 16.** Figure 16: Representative top-camera frames for each task category. Pick-and-Place (row 1, all 5): Packing luggage, Place fruits in bag, Organize toys on shelf, Place snacks in bag, Place glasses in tray. Fine Pick-and-Place (row 2, all 5): Build block tower, Load dishrack, Place coffee filter, Insert credit cards, Set up chess pieces. Folding (row 3, cells 1–4): Fold paper box, Fold skirt pile, Fold t-shirt pile, R… view at source ↗

**Figure 17.** Figure 17: Efficient frame access for fast data loading. Encoding keyframes more frequently and in a manner that allows for an analytically reconstructed frame index makes random frame access nearly free. To read one frame from a video, torchcodec’s default scans the entire file to build its frame index (top). Correctly encoding the file allows us to compute the index analytically, meaning we only need to read the f… view at source ↗

**Figure 18.** Figure 18: Random-frame decode throughput. Decodes per second versus dataloader workers on a local filesystem. Fixing the encoding options and dataloader args improves throughput. naive (GOP 250, no CFR) non-naive (GOP 30 + CFR) 0.1 1 10 9.75 MB per decode (log) 0.14 [PITH_FULL_IMAGE:figures/full_fig_p020_18.png] view at source ↗

**Figure 20.** Figure 20: ABC-VLA pooled adaLN architecture. ABC-VLA uses a Gemma 3 VLM backbone, attention-pools the final VLM hidden states into a compact feature vector, and projects the result to an adaLN conditioning vector for the lightweight DiT action head. Model Parameters Training TFLOPs / sample Backbone Action head Backbone Action head Total ABC-DiT 85.7M 1.93B 0.329 0.349 0.678 ABC-VLA 4.3B 44.7M 6.957 0.063 7.020 [P… view at source ↗

**Figure 21.** Figure 21: ABC-DiT model-size scaling. Train loss versus optimizer steps (left) and versus cumulative training compute (right) for four ABC-DiT sizes trained with identical hyperparameters and global batch size 9,216. At a fixed compute or step budget the larger DiTs reach lower train loss. The total number of parameters in S/B/L/xL are 153M, 290M, 746M, and 1.93B respectively. (a) (b) (c) [PITH_FULL_IMAGE:figures/… view at source ↗

**Figure 22.** Figure 22: Hardware setup. (a) Bimanual YAM workstation with two 6-DoF arms, three RealSense D405 cameras, and white enclosure walls. (b) The FlexPoint gripper used in a subset of our dataset. (c) Policies using our dataset can be deployed in in-the-wild settings. and (ii) isolates fine-manipulation learning from the confound of background generalization. We note, however, that despite using only data collected in t… view at source ↗

**Figure 23.** Figure 23: Inference optimizations for ABC-DiT. Profiling traces for ABC-DiT inference with 10 diffusion denoising steps. All timings are measured on an NVIDIA GeForce RTX 5090. Inference Trace for ABC-VLA (10 diffusion steps) GPU kernels GPU downtime CPU calls Eager 47.8 ms 20.9 Hz, 59% GPU active Separate compile 22.6 ms 44.2 Hz, 88% GPU active Fullgraph compile 17.5 ms 57.2 Hz, 94% GPU active 0 10 20 30 40 50 Tim… view at source ↗

**Figure 24.** Figure 24: Inference optimizations for ABC-VLA. Profiling traces for ABC-VLA inference with 10 diffusion denoising steps. All timings are measured on an NVIDIA GeForce RTX 5090. to avoid recomputation every diffusion step, and leveraging torch.compile. While the first two are fairly ubiquitous, torch.compile performance can vary depending on how it is used. When wrapping the model in a compile block, we pass in full… view at source ↗

**Figure 25.** Figure 25: Impact of Operator conditioning. Top row: folds from training demonstrations contributed by Op-0 (more proficient operator) left and Op-1 (less proficient operator) right. Bottom row: representative folds from the same trained policy under Op-0 (left) and Op-1 (right) conditioning. Although the policy is the same, varying only the operator prompt produces qualitatively distinct fold outcomes matching the… view at source ↗

**Figure 26.** Figure 26: Prefix conditioning can suppress visual responsiveness. Prefixing chunk generation on recent actions can bias the generated motion toward the recently executed trajectory, while the unconditioned generation remains more responsive to the current visual observation. Left: In the first chunk, the robot fails to pick the bottle. Right: In the next chunk, we compare RTC prefix-conditioned and unconditional g… view at source ↗

read the original abstract

We introduce ABC, a fully open-source stack for manipulation with behavior cloning. At its core is ABC-130K: the largest open-source teleoperation dataset to date, featuring 3,500 hours of data spanning over 130K episodes across 195 diverse tasks. Furthermore, we open-source our accessible hardware setup, training infrastructure, and simulation pipeline. We also release 400 hours of sim-teleop data and provide a co-training recipe that produces correlated simulation and real-world evaluation, offering a reliable proxy for ablating model-design and training decisions before costly real-world evaluation. We explore various training recipes and compare common architectural choices for Diffusion Transformers (DiT) and Vision-Language-Action (VLA) models, grounding our findings in real-world evaluations. The resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets. By providing a reproducible toolkit, we aim to place researchers on an equal footing, establishing the necessary foundation to learn the ABCs of Behavior Cloning together as a community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a large open dataset and infrastructure release for behavior cloning; the sim-real correlation claim is asserted without numbers.

read the letter

The main thing to know is that this paper centers on releasing ABC-130K, a 3500-hour teleoperation dataset with 130k episodes over 195 tasks, plus the full open hardware setup, simulation pipeline, and training code. They also include 400 hours of matched sim-teleop data and a co-training recipe.

The release itself is the real contribution. Prior open teleop sets were smaller, so this scale plus the matched sim data and accessible stack lowers the barrier for people who want to run systematic experiments in dexterous manipulation. Showing that DiT and VLA models can be trained on it and produce policies that handle tasks like box folding and credit-card extraction is useful context.

The soft spot is the lack of evidence for the key claim. The abstract states that the co-training recipe produces correlated sim and real evaluations that serve as a reliable proxy, yet it supplies no success rates, correlation coefficients, scatter plots, or even basic task coverage numbers. Without those, the proxy story rests on an assertion rather than data. The same holds for the policy results: they are described as successful but without quantitative metrics or baselines.

This is for researchers in robot learning who need large open datasets to test scaling ideas or run ablations they could not afford before. A reader looking for a ready-to-use stack will find it practical.

It deserves a serious referee because the artifact is new at this size and the open release has clear community value. I would send it to peer review, with the expectation that the authors add the missing quantitative results on correlation and performance.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces ABC, a fully open-source stack for robotic manipulation via behavior cloning. Its primary contribution is the ABC-130K dataset—the largest open teleoperation dataset to date—with 3,500 hours of data across >130K episodes and 195 tasks. The authors also release their hardware setup, training infrastructure, simulation pipeline, and 400 hours of sim-teleop data. They describe a co-training recipe asserted to yield correlated simulation and real-world evaluations (serving as a proxy for model/training ablations), compare DiT and VLA architectures, and report that resulting policies execute dexterous tasks such as box folding and credit-card extraction from wallets.

Significance. If the scale, diversity, and openness of the released dataset, hardware, code, and simulation resources are as described, the work could meaningfully lower barriers to reproducible research in scalable behavior cloning for manipulation. The explicit data and tooling release is a concrete strength that directly addresses community needs for shared benchmarks and training pipelines.

major comments (2)

[Abstract] Abstract: the claim that the co-training recipe 'produces correlated simulation and real-world evaluation, offering a reliable proxy for ablating model-design and training decisions' is unsupported by any reported quantitative evidence (correlation coefficients, scatter plots, per-task success tables, or coverage statistics across the 195 tasks). This assertion is load-bearing for the recipe's advertised utility.
[Abstract] Abstract: the statement that 'the resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets' and that findings are 'ground[ed] in real-world evaluations' supplies no success rates, trial counts, baselines, error bars, or exclusion criteria, leaving the architectural and training-recipe comparisons without visible empirical grounding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment on the abstract below and will revise accordingly to ensure claims are properly supported.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the co-training recipe 'produces correlated simulation and real-world evaluation, offering a reliable proxy for ablating model-design and training decisions' is unsupported by any reported quantitative evidence (correlation coefficients, scatter plots, per-task success tables, or coverage statistics across the 195 tasks). This assertion is load-bearing for the recipe's advertised utility.

Authors: We agree the abstract statement lacks the requested quantitative metrics. The manuscript reports comparative simulation and real-world results but does not include explicit correlation coefficients or scatter plots. In revision we will either add these metrics drawn from the existing evaluations or revise the abstract language to avoid overclaiming correlation as a reliable proxy. revision: yes
Referee: [Abstract] Abstract: the statement that 'the resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets' and that findings are 'ground[ed] in real-world evaluations' supplies no success rates, trial counts, baselines, error bars, or exclusion criteria, leaving the architectural and training-recipe comparisons without visible empirical grounding.

Authors: We agree the abstract is insufficiently specific. The full manuscript contains real-world evaluation results for the cited tasks, but the abstract does not report success rates or trial details. We will revise the abstract to reference the quantitative results from the evaluations section or remove the unsupported phrasing. revision: yes

Circularity Check

0 steps flagged

No circularity; paper is a data/infrastructure release with no derivation chain

full rationale

The paper introduces an open dataset (ABC-130K), hardware setup, simulation pipeline, and a co-training recipe for behavior cloning. Its central statements concern the scale of released data and the empirical outcome of the recipe ('produces correlated simulation and real-world evaluation'). No equations, fitted parameters, or predictions are presented that reduce by construction to the paper's own inputs. No self-citations are invoked to justify uniqueness theorems or ansatzes. The contribution is the explicit release of artifacts rather than a result obtained by fitting or deriving from itself, rendering the work self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data-release and infrastructure paper; the abstract introduces no mathematical derivations, fitted constants, or new physical entities.

pith-pipeline@v0.9.1-grok · 5773 in / 1146 out tokens · 73820 ms · 2026-06-26T04:12:04.974434+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 2 canonical work pages

[1]

Gordon, and J

Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning, 2011. URLhttps://arxiv.org/abs/1011. 0686

2011
[2]

Scalable diffusion models with transformers, 2023

William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URLhttps: //arxiv.org/abs/2212.09748

Pith/arXiv arXiv 2023
[3]

Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexan- der Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalash-...

Pith/arXiv arXiv 2023
[4]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

Pith/arXiv arXiv 2025
[5]

Open X-embodiment: Robotic learning datasets and RT-X models

Open X-Embodiment Collaboration. Open X-embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation (ICRA), 2024

2024
[6]

BridgeData V2: A dataset for robot learning at scale

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023

2023
[7]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Xindong He, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025

2025
[8]

Molmoact2: Action reasoning models for real-world deployment,

Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei- Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, 14 Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli Van- derBilt, Rose Hendrix, Suveen Ellawela, Lucas Ngoo, Joyce Chai, Zhongzheng Re...
[9]

URLhttps://arxiv.org/abs/2605.02881

Pith/arXiv arXiv
[10]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[11]

A care- ful examination of large behavior models for multitask dexterous manipulation

Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching- Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A care- ful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025

Pith/arXiv arXiv 2025
[12]

GR00T N1: An open foundation model for generalist humanoid robots

Johan Bjorck, Abhinav Prasad, Aleksei Grigoriev, Feiyu Xia, Peng Ding, Zhengyi Luo, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[13]

URLhttps://arxiv.org/abs/ 2410.24164

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π 0: A visi...

Pith/arXiv arXiv 2026
[14]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

Pith/arXiv arXiv 2025
[15]

Rt-1: Robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

Pith/arXiv arXiv 2022
[16]

Ren, Michael Equi, and Sergey Levine

Kevin Black, Allen Z. Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking, 2025. URLhttps://arxiv.org/abs/2512.05964

arXiv 2025
[17]

Learning transferable visual models from natural language supervision, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URLhttps:// arxiv.org/abs/2103.00020

Pith/arXiv arXiv 2021
[18]

Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

Pith/arXiv arXiv 2025
[19]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URLhttps: //arxiv.org/abs/1711.05101. 15

Pith/arXiv arXiv 2019
[20]

Gemma 3 tech- nical report

Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 tech- nical report. arXiv preprint arXiv:2503.19786, 4, 2025

Pith/arXiv arXiv 2025
[21]

Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine

Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine. Knowl- edge insulating vision-language-action models: Train fast, run fast, generalize better, 2025. URL https://arxiv.org/abs/2505.23705

arXiv 2025
[22]

Fast: Efficient action tokenization for vision-language-action models, 2025

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025. URLhttps://arxiv.org/abs/2501.09747

Pith/arXiv arXiv 2025
[23]

Query-key normalization for transformers, 2020

Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers, 2020. URLhttps://arxiv.org/abs/2010.04245

arXiv 2020
[24]

Openvla: An open-source vision-language-action model, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Ben- jamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024. URLhttps://arxiv.org/abs/...

Pith/arXiv arXiv 2024
[25]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033,

2012
[26]

doi: 10.1109/IROS.2012.6386109

work page doi:10.1109/iros.2012.6386109 2012
[27]

Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators, 2024

Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators, 2024. URLhttps://arxiv.org/ abs/2309.13037

arXiv 2024
[28]

Dexhub and dart: Towards internet scale robot data collection

Younghyo Park, Jagdeep Singh Bhatia, Lars Lien Ankile, and Pulkit Agrawal. Dexhub and dart: Towards internet scale robot data collection. ArXiv, abs/2411.02214, 2024. URLhttps://api. semanticscholar.org/CorpusID:273821640

arXiv 2024
[29]

Lucid-xr: An extended-reality data engine for robotic manipulation, 2026

Yajvan Ravan, Adam Rashid, Alan Yu, Kai McClennen, Gio Huh, Kevin Yang, Zhutian Yang, Qinxi Yu, Xiaolong Wang, Phillip Isola, and Ge Yang. Lucid-xr: An extended-reality data engine for robotic manipulation, 2026. URLhttps://arxiv.org/abs/2605.00244

Pith/arXiv arXiv 2026
[30]

Iris: An immersive robot interac- tion system, 2025

Xinkai Jiang, Qihao Yuan, Enes Ulas Dincer, Hongyi Zhou, Ge Li, Xueyin Li, Julius Haag, Nicolas Schreiber, Kailai Li, Gerhard Neumann, and Rudolf Lioutikov. Iris: An immersive robot interac- tion system, 2025. URLhttps://arxiv.org/abs/2502.03297

arXiv 2025
[31]

RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot. IEEE Robotics and Automation Letters, 2024

2024
[33]

Zhao, and Chelsea Finn

Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manip- ulation with low-cost whole-body teleoperation, 2024. URLhttps://arxiv.org/abs/2401. 02117

2024
[34]

Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid

Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity, 2024. URL https://arxiv.org/abs/2410.13126. 16

arXiv 2024
[35]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[36]

Gr-2: A generative video-language-action model with web- scale knowledge for robot manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web- scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024
[37]

i2rt: Python API for I2RT Robots.https://github.com/i2rt-robotics/ i2rt, 2025

I2RT Robotics. i2rt: Python API for I2RT Robots.https://github.com/i2rt-robotics/ i2rt, 2025. Accessed: 2026-04-20

2025
[38]

ZeroMQ: An open-source universal messaging library.https://zeromq

The ZeroMQ authors. ZeroMQ: An open-source universal messaging library.https://zeromq. org, 2007. Accessed: 2026-05-31

2007
[39]

Robot oper- ating system 2: Design, architecture, and uses in the wild

Steven Macenski, Tully Foote, Brian Gerkey, Chris Lalancette, and William Woodall. Robot oper- ating system 2: Design, architecture, and uses in the wild. Science Robotics, 7(66):eabm6074,
[40]

Robot Operating System 2: Design, architecture, and uses in the wild,

doi: 10.1126/scirobotics.abm6074. URLhttps://www.science.org/doi/abs/10. 1126/scirobotics.abm6074

work page doi:10.1126/scirobotics.abm6074
[41]

Learning fine-grained bimanual manipulation with low-cost hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023
[42]

Factr: Force-attending curriculum training for contact-rich policy learning, 2025

Jason Jingzhou Liu, Yulong Li, Kenneth Shaw, Tony Tao, Ruslan Salakhutdinov, and Deepak Pathak. Factr: Force-attending curriculum training for contact-rich policy learning, 2025. URL https://arxiv.org/abs/2502.17432

arXiv 2025
[43]

Factr 2: Learning force sensing and force-aware policies for any robot arm

Steven Oh, Jason Jingzhou Liu, Tony Tao, Philip Han, Kenneth Shaw, Satoshi Funabashi, Ruslan Salakhutdinov, and Deepak Pathak. Factr 2: Learning force sensing and force-aware policies for any robot arm. 2026

2026
[44]

Mink: Python inverse kinematics based on MuJoCo, February 2026

Kevin Zakka. Mink: Python inverse kinematics based on MuJoCo, February 2026. URLhttps: //github.com/kevinzakka/mink

2026
[45]

SARM: Stage-aware reward modeling for long horizon robot manipulation

Qianzhong Chen, Justin Yu, Mac Schwager, Pieter Abbeel, Yide Shentu, and Philipp Wu. SARM: Stage-aware reward modeling for long horizon robot manipulation. InInternational Conference on Learning Representations (ICLR), 2026. 17 Appendices A ABC-130K Task Taxonomy 18 B Training & Model Details 19 B.1 Data Loading . . . . . . . . . . . . . . . . . . . . . ....

2026
[46]

Conditioning on Op-0, the highest-volume operator, whose training demonstrations are character- ized by short, deliberate execution (mean episode duration 59 s)
[47]

Conditioning on the task prompt alone, which marginalizes over operators by reverting to the training-time dropout target
[48]

Conditioning on Op-1, a long-duration operator with 226 episodes of training data whose demon- strations average 205 s per episode and exhibit lower fold quality Marginalized inference improves over the unconditioned baseline, indicating that operator-ID con- ditioning at training time is strictly beneficial even when the operator channel is not used at i...

[1] [1]

Gordon, and J

Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning, 2011. URLhttps://arxiv.org/abs/1011. 0686

2011

[2] [2]

Scalable diffusion models with transformers, 2023

William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URLhttps: //arxiv.org/abs/2212.09748

Pith/arXiv arXiv 2023

[3] [3]

Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexan- der Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalash-...

Pith/arXiv arXiv 2023

[4] [4]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

Pith/arXiv arXiv 2025

[5] [5]

Open X-embodiment: Robotic learning datasets and RT-X models

Open X-Embodiment Collaboration. Open X-embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation (ICRA), 2024

2024

[6] [6]

BridgeData V2: A dataset for robot learning at scale

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023

2023

[7] [7]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Xindong He, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025

2025

[8] [8]

Molmoact2: Action reasoning models for real-world deployment,

Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei- Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, 14 Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli Van- derBilt, Rose Hendrix, Suveen Ellawela, Lucas Ngoo, Joyce Chai, Zhongzheng Re...

[9] [9]

URLhttps://arxiv.org/abs/2605.02881

Pith/arXiv arXiv

[10] [10]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[11] [11]

A care- ful examination of large behavior models for multitask dexterous manipulation

Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching- Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A care- ful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025

Pith/arXiv arXiv 2025

[12] [12]

GR00T N1: An open foundation model for generalist humanoid robots

Johan Bjorck, Abhinav Prasad, Aleksei Grigoriev, Feiyu Xia, Peng Ding, Zhengyi Luo, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[13] [13]

URLhttps://arxiv.org/abs/ 2410.24164

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π 0: A visi...

Pith/arXiv arXiv 2026

[14] [14]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

Pith/arXiv arXiv 2025

[15] [15]

Rt-1: Robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

Pith/arXiv arXiv 2022

[16] [16]

Ren, Michael Equi, and Sergey Levine

Kevin Black, Allen Z. Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking, 2025. URLhttps://arxiv.org/abs/2512.05964

arXiv 2025

[17] [17]

Learning transferable visual models from natural language supervision, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URLhttps:// arxiv.org/abs/2103.00020

Pith/arXiv arXiv 2021

[18] [18]

Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

Pith/arXiv arXiv 2025

[19] [19]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URLhttps: //arxiv.org/abs/1711.05101. 15

Pith/arXiv arXiv 2019

[20] [20]

Gemma 3 tech- nical report

Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 tech- nical report. arXiv preprint arXiv:2503.19786, 4, 2025

Pith/arXiv arXiv 2025

[21] [21]

Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine

Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine. Knowl- edge insulating vision-language-action models: Train fast, run fast, generalize better, 2025. URL https://arxiv.org/abs/2505.23705

arXiv 2025

[22] [22]

Fast: Efficient action tokenization for vision-language-action models, 2025

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025. URLhttps://arxiv.org/abs/2501.09747

Pith/arXiv arXiv 2025

[23] [23]

Query-key normalization for transformers, 2020

Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers, 2020. URLhttps://arxiv.org/abs/2010.04245

arXiv 2020

[24] [24]

Openvla: An open-source vision-language-action model, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Ben- jamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024. URLhttps://arxiv.org/abs/...

Pith/arXiv arXiv 2024

[25] [25]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033,

2012

[26] [26]

doi: 10.1109/IROS.2012.6386109

work page doi:10.1109/iros.2012.6386109 2012

[27] [27]

Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators, 2024

Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators, 2024. URLhttps://arxiv.org/ abs/2309.13037

arXiv 2024

[28] [28]

Dexhub and dart: Towards internet scale robot data collection

Younghyo Park, Jagdeep Singh Bhatia, Lars Lien Ankile, and Pulkit Agrawal. Dexhub and dart: Towards internet scale robot data collection. ArXiv, abs/2411.02214, 2024. URLhttps://api. semanticscholar.org/CorpusID:273821640

arXiv 2024

[29] [29]

Lucid-xr: An extended-reality data engine for robotic manipulation, 2026

Yajvan Ravan, Adam Rashid, Alan Yu, Kai McClennen, Gio Huh, Kevin Yang, Zhutian Yang, Qinxi Yu, Xiaolong Wang, Phillip Isola, and Ge Yang. Lucid-xr: An extended-reality data engine for robotic manipulation, 2026. URLhttps://arxiv.org/abs/2605.00244

Pith/arXiv arXiv 2026

[30] [30]

Iris: An immersive robot interac- tion system, 2025

Xinkai Jiang, Qihao Yuan, Enes Ulas Dincer, Hongyi Zhou, Ge Li, Xueyin Li, Julius Haag, Nicolas Schreiber, Kailai Li, Gerhard Neumann, and Rudolf Lioutikov. Iris: An immersive robot interac- tion system, 2025. URLhttps://arxiv.org/abs/2502.03297

arXiv 2025

[31] [31]

RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot. IEEE Robotics and Automation Letters, 2024

2024

[32] [33]

Zhao, and Chelsea Finn

Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manip- ulation with low-cost whole-body teleoperation, 2024. URLhttps://arxiv.org/abs/2401. 02117

2024

[33] [34]

Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid

Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity, 2024. URL https://arxiv.org/abs/2410.13126. 16

arXiv 2024

[34] [35]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[35] [36]

Gr-2: A generative video-language-action model with web- scale knowledge for robot manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web- scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024

[36] [37]

i2rt: Python API for I2RT Robots.https://github.com/i2rt-robotics/ i2rt, 2025

I2RT Robotics. i2rt: Python API for I2RT Robots.https://github.com/i2rt-robotics/ i2rt, 2025. Accessed: 2026-04-20

2025

[37] [38]

ZeroMQ: An open-source universal messaging library.https://zeromq

The ZeroMQ authors. ZeroMQ: An open-source universal messaging library.https://zeromq. org, 2007. Accessed: 2026-05-31

2007

[38] [39]

Robot oper- ating system 2: Design, architecture, and uses in the wild

Steven Macenski, Tully Foote, Brian Gerkey, Chris Lalancette, and William Woodall. Robot oper- ating system 2: Design, architecture, and uses in the wild. Science Robotics, 7(66):eabm6074,

[39] [40]

Robot Operating System 2: Design, architecture, and uses in the wild,

doi: 10.1126/scirobotics.abm6074. URLhttps://www.science.org/doi/abs/10. 1126/scirobotics.abm6074

work page doi:10.1126/scirobotics.abm6074

[40] [41]

Learning fine-grained bimanual manipulation with low-cost hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023

[41] [42]

Factr: Force-attending curriculum training for contact-rich policy learning, 2025

Jason Jingzhou Liu, Yulong Li, Kenneth Shaw, Tony Tao, Ruslan Salakhutdinov, and Deepak Pathak. Factr: Force-attending curriculum training for contact-rich policy learning, 2025. URL https://arxiv.org/abs/2502.17432

arXiv 2025

[42] [43]

Factr 2: Learning force sensing and force-aware policies for any robot arm

Steven Oh, Jason Jingzhou Liu, Tony Tao, Philip Han, Kenneth Shaw, Satoshi Funabashi, Ruslan Salakhutdinov, and Deepak Pathak. Factr 2: Learning force sensing and force-aware policies for any robot arm. 2026

2026

[43] [44]

Mink: Python inverse kinematics based on MuJoCo, February 2026

Kevin Zakka. Mink: Python inverse kinematics based on MuJoCo, February 2026. URLhttps: //github.com/kevinzakka/mink

2026

[44] [45]

SARM: Stage-aware reward modeling for long horizon robot manipulation

Qianzhong Chen, Justin Yu, Mac Schwager, Pieter Abbeel, Yide Shentu, and Philipp Wu. SARM: Stage-aware reward modeling for long horizon robot manipulation. InInternational Conference on Learning Representations (ICLR), 2026. 17 Appendices A ABC-130K Task Taxonomy 18 B Training & Model Details 19 B.1 Data Loading . . . . . . . . . . . . . . . . . . . . . ....

2026

[45] [46]

Conditioning on Op-0, the highest-volume operator, whose training demonstrations are character- ized by short, deliberate execution (mean episode duration 59 s)

[46] [47]

Conditioning on the task prompt alone, which marginalizes over operators by reverting to the training-time dropout target

[47] [48]

Conditioning on Op-1, a long-duration operator with 226 episodes of training data whose demon- strations average 205 s per episode and exhibit lower fold quality Marginalized inference improves over the unconditioned baseline, indicating that operator-ID con- ditioning at training time is strictly beneficial even when the operator channel is not used at i...