COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

Ajay Mandlekar; Animesh Garg; Ansh Gandhi; Aryan Sarswat; Ayush Agarwal; Jeremy A. Collins; Masoud Moghani; Omar Rayyan; Ranjani Koushik

arXiv: 2605.19138 · v2 · pith:TKVYOF5Znew · submitted 2026-05-18 · 💻 cs.RO · cs.AI · cs.LG

Pith reviewed 2026-05-21 07:23 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords robot learning · teleoperation · imitation learning · crowdsourcing · demonstration data · smartphone interface · cloud infrastructure · data quality

The pith

A cloud teleoperation platform lets anyone with a smartphone contribute robot demonstration data, enabling a 50-hour crowdsourced dataset validated for imitation learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a smartphone-based teleoperation system can remove the data scarcity bottleneck in robot imitation learning by allowing scalable, low-cost collection of high-quality demonstrations from ordinary users worldwide. It shows this through infrastructure that handles many concurrent operators on shared GPUs with low latency, real-time metrics that filter poor demonstrations, and a user study confirming phones match or exceed specialized hardware. A structured training curriculum further boosts collection quality. Guided by these results, the authors crowdsourced over 7500 demonstrations totaling more than 50 hours across nine countries in five days and confirmed the data trains state-of-the-art imitation learning algorithms effectively.

Core claim

COBALT is a teleoperation platform that uses vectorized environments and load-balanced cloud infrastructure to support dozens of concurrent users on smartphones or other common devices. It sustains 20 Hz control, with sub-100 ms end-to-end latency reported for up to 8 concurrent users per GPU. The platform logs real-time metrics to automatically filter suboptimal demonstrations and uses a training curriculum to improve operator quality, which yielded a validated pilot dataset of 7500+ demonstrations collected over five days.

What carries the argument

The COBALT teleoperation platform, which combines vectorized simulation, in-memory data caching, efficient video streaming, and real-time metric logging to enable concurrent multi-user control and data quality filtering at low cost.
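To make the "many operators, one simulator" idea concrete, here is a minimal sketch of a batched control loop. It is not COBALT's implementation: `ToyVectorEnv`, `ActionCache`, and `run_ticks` are hypothetical names, the physics is a one-dimensional toy, and real-time pacing, networking, and video streaming are left out. Only the 20 Hz control rate comes from the paper.

```python
from dataclasses import dataclass, field

CONTROL_HZ = 20                    # control rate stated in the paper
TICK_SECONDS = 1.0 / CONTROL_HZ    # 0.05 s per control tick


@dataclass
class ToyVectorEnv:
    """Hypothetical stand-in for a vectorized simulator: one slot per operator."""
    num_slots: int
    positions: list = field(default_factory=list)

    def __post_init__(self):
        self.positions = [0.0] * self.num_slots

    def step(self, actions):
        # One batched call advances every slot; each action is a 1-D velocity.
        self.positions = [p + a * TICK_SECONDS
                          for p, a in zip(self.positions, actions)]
        return list(self.positions)


class ActionCache:
    """In-memory cache of the most recent action received from each operator."""

    def __init__(self, num_slots):
        self._latest = [0.0] * num_slots

    def put(self, slot, action):
        self._latest[slot] = action

    def snapshot(self):
        return list(self._latest)


def run_ticks(env, cache, num_ticks):
    """Advance all operators together, one batched step per control tick."""
    obs = list(env.positions)
    for _ in range(num_ticks):
        obs = env.step(cache.snapshot())
    return obs


env = ToyVectorEnv(num_slots=3)
cache = ActionCache(num_slots=3)
cache.put(0, 1.0)     # operator 0 commands +1.0 units/s
cache.put(2, -2.0)    # operator 2 commands -2.0 units/s; operator 1 is idle
final_positions = run_ticks(env, cache, num_ticks=CONTROL_HZ)  # one simulated second
```

The design point the sketch illustrates is that the simulator is stepped once per tick for all operators, so the per-operator cost is a slot in a batch rather than a separate simulator process.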

If this is right

Imitation learning for manipulation can scale using data collected from consumer smartphones rather than dedicated equipment.
Teleoperation costs fall sharply when many users share a single GPU through vectorized environments and efficient streaming.
Automatic filtering via logged metrics and short user training curricula can maintain dataset quality during large-scale crowdsourcing.
Global participation becomes practical, allowing data collection across many countries in days rather than weeks or months.
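Metric-based filtering of the kind listed above could look roughly like the following sketch. The metric names echo quantities the paper reports (completion time, path length, translational jitter, resets), but the `DemoMetrics` record, the threshold values, and the example demonstrations are all hypothetical; the paper's actual cutoffs are not given on this page.

```python
from dataclasses import dataclass


@dataclass
class DemoMetrics:
    demo_id: str
    completion_time_s: float
    path_length_m: float
    mean_jitter: float
    num_resets: int


# Hypothetical thresholds chosen only for illustration.
MAX_COMPLETION_TIME_S = 60.0
MAX_PATH_LENGTH_M = 3.0
MAX_MEAN_JITTER = 0.02
MAX_RESETS = 1


def keep_demo(m: DemoMetrics) -> bool:
    """A demonstration is kept only if every logged metric is within bounds."""
    return (m.completion_time_s <= MAX_COMPLETION_TIME_S
            and m.path_length_m <= MAX_PATH_LENGTH_M
            and m.mean_jitter <= MAX_MEAN_JITTER
            and m.num_resets <= MAX_RESETS)


def filter_demos(demos):
    """Split demonstrations into kept and rejected lists."""
    kept = [d for d in demos if keep_demo(d)]
    rejected = [d for d in demos if not keep_demo(d)]
    return kept, rejected


demos = [
    DemoMetrics("demo-a", 24.0, 1.1, 0.008, 0),   # within all bounds
    DemoMetrics("demo-b", 95.0, 3.9, 0.010, 0),   # too slow and too long a path
    DemoMetrics("demo-c", 30.0, 1.4, 0.050, 0),   # too jittery
    DemoMetrics("demo-d", 28.0, 1.3, 0.012, 3),   # too many resets
]
kept, rejected = filter_demos(demos)
kept_ids = [d.demo_id for d in kept]
rejected_ids = [d.demo_id for d in rejected]
```

Hard thresholds like these are the simplest option; they also show the risk named in the load-bearing premise, since a slow but otherwise informative demonstration is discarded along with the genuinely poor ones.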

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar platforms could extend crowdsourcing to real-robot control once latency and safety issues are addressed.
The geographic spread of contributors may naturally introduce more diverse environmental and task variations into the data.
The same infrastructure pattern could apply to other human-in-the-loop data collection tasks such as preference labeling or trajectory annotation.

Load-bearing premise

Phone-based teleoperation produces demonstration data of quality comparable to specialized hardware, allowing real-time metrics to filter suboptimal examples without discarding useful training signal.

What would settle it

Training state-of-the-art imitation learning algorithms on the crowdsourced dataset and measuring success rates on held-out robotic manipulation tasks; rates significantly below those achieved on datasets from expert hardware would falsify the quality claim.
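One way to decide whether a success-rate gap is "significantly below" is a two-proportion z-test on rollout outcomes. The sketch below uses only the standard library and a two-sided normal approximation; the rollout counts are hypothetical placeholders, not results from the paper, and the choice of test is an assumption rather than the authors' protocol.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Return (z, two-sided p-value) for the difference between two success rates."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        return 0.0, 1.0   # both rates are 0 or both are 1: no detectable difference
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    return z, p_value


# Hypothetical rollout counts: policy trained on expert-hardware data vs. crowdsourced data.
z, p_value = two_proportion_z_test(45, 50, 30, 50)
same_z, same_p = two_proportion_z_test(30, 50, 30, 50)
```

With small rollout counts the normal approximation is rough, so an exact test or more evaluation episodes would give a firmer answer.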

Figures

Figures reproduced from arXiv: 2605.19138 by Ajay Mandlekar, Animesh Garg, Ansh Gandhi, Aryan Sarswat, Ayush Agarwal, Jeremy A. Collins, Masoud Moghani, Omar Rayyan, Ranjani Koushik.

**Figure 1.** COBALT can be used to collect data across a variety of both simulated and real-world environments, including bimanual tasks.

**Figure 2.** COBALT System Architecture. a) Cloud provider hosts one group of virtual machines (VM) per task, with dynamic allocation of servers based on demand. b) A load balancer sits in front of the different groups of servers, functioning as a rate limiter and reverse proxy. c) Three main services are utilized: CS (Client Session Service) for client data ingestion, MS (Media Service) for video streaming, and TS (Te…

**Figure 3.** Subset of Calibration Tasks. Left: Position Task (translational motion only). Right: Pose Task (translation and rotational motion).

Calibration: Calibration tasks are designed to familiarize users with basic controls. Position calibration asks users to place the gripper at randomly spawned targets; rotation calibration aligns an attached beam to a target circle; and pose calibration combines both position…

**Figure 4.** (a) Reset Rate by Device and Curriculum. Across all devices, curriculum training yields a significant decrease in reset rate across tasks, leading to faster and more efficient data collection. (b) Execution Time by Device and Curriculum. Across all devices, curriculum training reduces the mean and standard deviation of execution time, leading to shorter and more consistent demonstrations.

**Figure 5.** (a) Visualization of Isaac Lab tasks in the pilot dataset. Arrangement of tasks left-to-right, top-to-bottom: Assembly, Lift, Cleanup, Kitchen, Stack, Pour. (b) COBALT can be used to control physical (single-arm and bimanual) robots. A real-world recreation of the pour task and a corn cooking task are shown.

Accompanying table (columns: Smartphone, VR Headset, 3D Mouse, Keyboard; truncated): Avg. Completion Time (s) (↓): Smartphone 30.00±16.97, VR Headset 25.60±13.91…

**Figure 6.** Visualization of the primary user study tasks. Top row (left to right): Three Piece Assembly, Lift. Bottom row (left to right): Mug Cleanup, Coffee.

**Figure 7.** Visualization of the tasks in the pilot dataset. Top row (left to right): Assembly, Lift, Cleanup. Bottom row (left to right): Kitchen, Stack, Pour.

Kitchen: Place the bread in the pot, place the pot on the stove, and turn on the switch. Then, place the pot in the green region and turn off the switch.

**Figure 8.** Mean Translational Jitter by device and curriculum condition during the Position Evaluation Task (lower is better). Error bars indicate standard deviation.

**Figure 9.** Path length, completion time, and translational jitter per device. These plots reveal performance differences across devices, with smartphones and VR headsets generally yielding shorter path lengths and faster completion times.

**Figure 10.** Pose evaluation task's average position and rotation error by device. The smartphone input modality was shown to have a significantly lower position and rotation error than the other input modalities.

**Figure 13.** User self-reported scores across different input devices via the NASA-TLX survey. Higher values indicate higher perceived workload (less favorable), except for Performance (Q4), where higher is better. Error bars show standard deviation.

**Figure 11.** Additional NASA-TLX results (mean scores per device). Lower scores generally indicate lower perceived workload (except for Performance, where higher is better). Error bars show standard deviation.

Original abstract

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.
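The headline numbers in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in the abstract; because "7500+" and "50+" are lower bounds, the derived average is approximate, and the variable names are illustrative.

```python
# Figures quoted in the abstract; both dataset figures are lower bounds ("7500+", "50+").
NUM_DEMOS = 7500
TOTAL_HOURS = 50
SIM_CLIENTS = 256
NUM_GPUS = 8
LOW_LATENCY_USERS_PER_GPU = 8   # sub-100 ms latency is claimed up to this load

avg_demo_seconds = TOTAL_HOURS * 3600 / NUM_DEMOS
clients_per_gpu = SIM_CLIENTS // NUM_GPUS
load_ratio = clients_per_gpu / LOW_LATENCY_USERS_PER_GPU

print(f"average demonstration length: {avg_demo_seconds:.0f} s")
print(f"simulated clients per GPU in the scaling test: {clients_per_gpu}")
print(f"scaling-test load relative to the low-latency regime: {load_ratio:.0f}x")
```

The dataset works out to roughly 24 seconds per demonstration. The scaling test runs 32 simulated clients per GPU, four times the 8-user load for which the sub-100 ms figure is stated, and the abstract does not give a latency figure at that higher load.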

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

COBALT delivers a practical cloud teleop system that scales phone-based crowdsourcing for robot demos, but the dataset quality claim would land harder with concrete policy success numbers.

The core contribution here is a working infrastructure for cheap, large-scale demonstration collection. They use vectorized environments to run multiple teleop sessions on one GPU, keep end-to-end latency under 100 ms for up to eight users, and let people connect with just a smartphone. That setup let them run a five-day campaign across nine countries and pull in 7500+ demos totaling over 50 hours. The user study showing phone control is comparable to or better than VR or 3D mice for ergonomics and speed is a useful data point, and the real-time metric filtering plus training curriculum are sensible quality controls.

Referee Report

2 major / 2 minor

Summary. The manuscript presents COBALT, a cloud-based teleoperation platform for scalable crowdsourcing of robot demonstration data using smartphones, VR headsets, and other consumer devices. It describes a load-balanced infrastructure supporting concurrent teleoperation on vectorized environments with low latency (sub-100 ms for up to 8 users per GPU, scaling to 256 clients across 8 GPUs), a user study comparing phone-based teleoperation to specialized hardware, real-time metrics for filtering suboptimal demonstrations, a structured user training curriculum, and the collection of a pilot dataset with 7500+ demonstrations (50+ hours) across nine countries. Dataset quality is validated by training state-of-the-art imitation learning algorithms.

Significance. If the central claims hold, the work could meaningfully lower barriers to large-scale imitation learning by enabling low-cost, geographically distributed data collection without specialized hardware. The demonstrated scalability, low-latency synchronous control, and user-study insights on ergonomics represent concrete engineering contributions. The crowdsourced dataset size and multi-country collection are notable for the field. Credit is due for the open infrastructure details and the empirical focus on real-world deployment metrics.

major comments (2)

[Experiments / dataset validation] The claim that the 7500+ smartphone demonstrations constitute a high-quality dataset rests on training SOTA imitation learning algorithms, yet no quantitative task success rates, success percentages on held-out manipulation tasks, error bars, or direct comparisons to policies trained on specialized-hardware data are reported. This is load-bearing for the central quality claim and leaves the downstream utility of the filtered dataset unanchored.
[User study] The assertion that phone-based teleoperation performs comparably or better than specialized hardware is used to justify the crowdsourcing approach, but the manuscript provides insufficient detail on the exact performance metrics (e.g., task completion rates, trajectory smoothness, user fatigue scores), statistical tests, or number of participants, making it difficult to assess whether the comparison is robust enough to support the platform's broader claims.

minor comments (2)

[Abstract] The latency claim ('sub-100 ms end-to-end latency for up to 8 concurrent users per GPU') would benefit from explicit clarification on whether the figure represents mean, median, or 95th-percentile values and under what network conditions.
[Dataset description] The manuscript would be strengthened by adding a brief table summarizing key dataset statistics (e.g., average demonstration length, success rate before/after filtering, distribution across countries) to improve readability.
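The distinction raised in the latency comment matters in practice, as a small sketch shows. The samples below are hypothetical, not measurements from the paper, and `summarize_latency` is an illustrative helper that uses a nearest-rank 95th percentile.

```python
import statistics


def summarize_latency(samples_ms):
    """Return mean, median, and nearest-rank 95th percentile of latency samples."""
    ordered = sorted(samples_ms)
    n = len(ordered)
    rank = (95 * n + 99) // 100          # ceil(0.95 * n) using integer arithmetic
    return {
        "mean_ms": statistics.mean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
    }


# Hypothetical end-to-end latency samples in milliseconds.
samples = [40, 42, 45, 47, 50, 52, 55, 60, 70, 180]
summary = summarize_latency(samples)
under_budget = {name: value < 100 for name, value in summary.items()}
```

In this made-up sample the mean (64.1 ms) and median (51 ms) both sit under 100 ms while the 95th percentile (180 ms) does not, so a "sub-100 ms" claim can be true or false for the same data depending on which statistic is meant.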

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the work's significance and for the constructive major comments. We address each point below and will revise the manuscript to incorporate additional quantitative details and clarifications as outlined.

Referee: [Experiments / dataset validation] The claim that the 7500+ smartphone demonstrations constitute a high-quality dataset rests on training SOTA imitation learning algorithms, yet no quantitative task success rates, success percentages on held-out manipulation tasks, error bars, or direct comparisons to policies trained on specialized-hardware data are reported. This is load-bearing for the central quality claim and leaves the downstream utility of the filtered dataset unanchored.

Authors: We agree that quantitative metrics are necessary to fully substantiate the dataset quality claim. In the revised manuscript, we will expand the relevant section to report task success rates and success percentages on held-out manipulation tasks, include error bars from multiple training runs, and add direct comparisons to policies trained on specialized-hardware data where such baselines are available from our experiments. These additions will better anchor the downstream utility of the filtered crowdsourced dataset.

revision: yes

Referee: [User study] The assertion that phone-based teleoperation performs comparably or better than specialized hardware is used to justify the crowdsourcing approach, but the manuscript provides insufficient detail on the exact performance metrics (e.g., task completion rates, trajectory smoothness, user fatigue scores), statistical tests, or number of participants, making it difficult to assess whether the comparison is robust enough to support the platform's broader claims.

Authors: We acknowledge that the current presentation of the user study lacks sufficient granularity. The revised manuscript will specify the number of participants, report exact performance metrics including task completion rates, trajectory smoothness measures, and user fatigue scores, and include the results of statistical tests (such as paired t-tests) to support the comparisons. These details will make the evidence for the comparability or superiority of phone-based teleoperation more transparent and robust.

revision: yes
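For readers unfamiliar with the paired t-test mentioned in the rebuttal, the statistic can be sketched with the standard library alone. The completion times below are hypothetical (one pair per participant, two devices each), and `paired_t_statistic` is an illustrative helper; converting the statistic to a p-value needs a t-distribution CDF, which a library routine such as SciPy's `ttest_rel` provides.

```python
import math
import statistics


def paired_t_statistic(condition_a, condition_b):
    """Return (t, degrees of freedom) for paired samples, testing mean(b - a) = 0."""
    if len(condition_a) != len(condition_b):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t_stat = mean_diff / (std_diff / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical per-participant completion times in seconds.
smartphone_s = [28.0, 31.0, 25.0, 30.0, 27.0]
other_device_s = [33.0, 35.0, 27.0, 36.0, 30.0]
t_stat, dof = paired_t_statistic(smartphone_s, other_device_s)
```

A paired design fits the study as described, since each participant used two devices, but with device pairs assigned at random not every participant contributes to every comparison, which limits the sample size per pair.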

Circularity Check

0 steps flagged

No significant circularity in empirical systems paper

The paper presents a teleoperation platform, user study, and crowdsourced dataset of 7500+ demonstrations validated through training of state-of-the-art imitation learning algorithms. No mathematical derivations, predictions, or first-principles results are claimed that reduce by construction to fitted parameters or self-citations. The validation step relies on external empirical outcomes from IL training rather than internal loops, and the work is self-contained against observable metrics like data scale and collection quality.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an applied systems paper; the central claims rest on domain assumptions about demonstration quality and teleoperation effectiveness rather than new axioms or free parameters.

axioms (2)

domain assumption: Smartphone teleoperation can produce demonstrations of sufficient quality for imitation learning when filtered by real-time metrics.
Invoked in the description of data collection and automatic filtering to ensure quality.

domain assumption: Concurrent multi-user teleoperation on vectorized environments maintains low latency and stability at scale.
Central to the infrastructure claims for supporting 20 Hz with dozens of users.

[36]

[Online]. Available: https://arxiv.org/abs/2108.03298 APPENDIXI INPUTDEVICEDETAILS COBALTis compatible with a diverse set of input devices: A.Smartphones Smartphones provide accurate 6-DoF pose tracking by utilizing existing AR frameworks (ARCore for Android and ARKit for iOS). Our cross-platform mobile application captures both translational and rotation...

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Recruitment: A total of 18 consenting participants were recruited

work page
[38]

The other half formed the control group (no prior training)

Training Condition: Half of the participants were ran- domly assigned to the training group and completed the curriculum (Section III-B) before the main tasks. The other half formed the control group (no prior training)

work page
[39]

The order of devices was also randomized

Input Device Assignment: Participants were randomly assigned two input devices from: smartphone, virtual reality (VR) headset, keyboard, and 3D mouse. The order of devices was also randomized. Participants prone to motion sickness were excluded from VR

work page
[40]

Task Performance: Each participant used their assigned devices to perform four distinct manipulation tasks (Lift, TPA, MC, Coffee - see Figure 6), providing five successful demonstrations per task per device

work page
[41]

Data Collection:Demonstration data (trajectory, timings, resets) and system metrics (latency, jitter) were collected during the tasks

work page
[42]

Survey Administration: After completing all tasks with one device, participants completed the Likert-scale ques- tionnaire and the NASA-TLX survey. They were allowed TABLE VII:Behavior Cloning (BC) Model Success Rates (User Study Tasks) Model Lift TPA MC Coffee BC-RNN1.00±0.00 0.00±0.01 0.61±0.01 0.49±0.03 BC-TF0.90±0.02 0.03±0.01 0.64±0.08 0.36±0.21 Note...

work page
[43]

Participants were randomly assigned to use either dual smartphones or a VR system first, then switched

Bimanual Control Assessment: A separate group of 6 participants evaluated bimanual control for the Two Arm Lift task. Participants were randomly assigned to use either dual smartphones or a VR system first, then switched. Participants prone to motion sickness were excluded from VR. B.User Study Dataset Statistics The user study conducted in this work invo...

work page
[44]

Mental Demand: Level of mental and perceptual activity required (thinking, deciding, calculating, remembering, looking, searching)

work page
[45]

Physical Demand: Amount of physical activity required (pushing, pulling, turning, controlling, activating)

work page
[46]

Temporal Demand: Amount of time pressure felt due to the rate or pace at which the tasks or task elements occurred

work page
[47]

Performance: How successful the participant felt they were in accomplishing the goals of the task set by the experimenter (their own performance)

Effort: How hard the participant had to work (mentally and physically) to accomplish their level of performance

Frustration: How insecure, discouraged, irritated, stressed, and annoyed versus secure, gratified, content, relaxed, and complacent the participant felt during the task.

2) Likert-Scale Questions

Participants responded to the following statements on a Likert scale from 1 (strongly disagree) to 5 (strongly agree):

I found it easy to control the robot with the device I used

The interface felt intuitive

I felt comfortable and confident throughout the task

I would be willing to use this system again in the future

The tasks seemed appropriate for this type of interface

My input device responded accurately to my actions.

3) Open-Ended Questions

Participants provided qualitative feedback on their experience by answering:

What did you like most about controlling the robot with this device?

What did you find most difficult or frustrating?

E. Additional User Study Figures

This section contains figures presenting results from the user study surveys and specific evaluation tasks.

Fig. 8: Mean Translational Jitter by device and curriculum condition during the Position Evaluation Task (lower is better). Error bars indicate standard deviation.

Fig. 9:...
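The survey instruments listed in this appendix (six NASA-TLX subscales and six Likert statements rated 1 to 5) could be summarized per input device along the following lines. This is a minimal sketch, not the authors' analysis code: it assumes the unweighted ("raw") NASA-TLX variant with each subscale rated 0 to 100 and higher meaning more workload (so Performance is recorded from perfect to failure), which the text here does not confirm. The record layout and the names `raw_tlx`, `likert_mean`, and `summarize_by_device` are hypothetical.

```python
from __future__ import annotations

from statistics import mean

# Subscale keys mirror the six NASA-TLX items listed in this appendix.
TLX_SUBSCALES = (
    "mental", "physical", "temporal", "performance", "effort", "frustration",
)


def raw_tlx(ratings: dict[str, float]) -> float:
    """Unweighted NASA-TLX score: the mean of the six subscale ratings.

    Assumption: every rating is on a 0-100 scale, higher = more workload.
    """
    missing = [name for name in TLX_SUBSCALES if name not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    return mean(ratings[name] for name in TLX_SUBSCALES)


def likert_mean(responses: list[int]) -> float:
    """Mean agreement over one participant's 1-5 Likert responses."""
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("Likert responses must be between 1 and 5")
    return mean(responses)


def summarize_by_device(records: list[dict]) -> dict[str, dict[str, float]]:
    """Average raw TLX and Likert scores across participants, per device."""
    grouped: dict[str, list[dict]] = {}
    for rec in records:
        grouped.setdefault(rec["device"], []).append(rec)
    return {
        device: {
            "raw_tlx": mean(raw_tlx(r["tlx"]) for r in recs),
            "likert": mean(likert_mean(r["likert"]) for r in recs),
        }
        for device, recs in grouped.items()
    }


# Hypothetical responses from two participants using the same device.
records = [
    {"device": "smartphone",
     "tlx": {name: 20 for name in TLX_SUBSCALES},
     "likert": [5, 4, 5, 4, 5, 4]},
    {"device": "smartphone",
     "tlx": {name: 40 for name in TLX_SUBSCALES},
     "likert": [3, 3, 3, 3, 3, 3]},
]
summary = summarize_by_device(records)
```

Averaging per participant first and then across participants keeps each person's weight equal, even if some answer fewer Likert items than others.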

[1] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” arXiv preprint arXiv:2303.04137, 2023.

[2] K. Fang, Y. Zhu, A. Garg, A. Kuryenkov, V. Mehta, L. Fei-Fei, and S. Savarese, “Learning task-oriented grasping for tool manipulation from simulated self-supervision,” Robotics: Science and Systems (RSS), 2018.

[3] A. Goyal, J. Xu, Y. Guo, V. Blukis, Y.-W. Chao, and D. Fox, “Rvt: Robotic view transformer for 3d object manipulation,” in Conference on Robot Learning. PMLR, 2023.

[4] S. Lee, Y. Wang, H. Etukuru, H. J. Kim, N. M. M. Shafiullah, and L. Pinto, “Behavior generation with latent actions,” arXiv preprint arXiv:2403.03181, 2024.

[5] A. Mete, H. Xue, A. Wilcox, Y. Chen, and A. Garg, “Quest: Self-supervised skill abstractions for learning continuous control,” arXiv preprint arXiv:2407.15840, 2024.

[6] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” 2023. [Online]. Available: https://arxiv.org/abs/2304.13705

[7] C. Crawl, “Common crawl dataset,” 2008. [Online]. Available: https://registry.opendata.aws/commoncrawl/

[8] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in Neural Information Processing Systems, vol. 35, pp. 25278–25294, 2022.

[9] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al., “Rt-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.

[10] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al., “Droid: A large-scale in-the-wild robot manipulation dataset,” arXiv preprint arXiv:2403.12945, 2024.

[11] Q. Vuong, S. Levine, H. R. Walke, K. Pertsch, A. Singh, R. Doshi, C. Xu, J. Luo, L. Tan, D. Shah, et al., “Open x-embodiment: Robotic learning datasets and rt-x models,” in Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL2023, 2023.

[12] H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V. Myers, M. J. Kim, M. Du, et al., “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning. PMLR, 2023.

[13] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026–5033.

[14] NVIDIA, “Nvidia isaac sim,” 2022.

[15] S. Dass, W. Ai, Y. Jiang, S. Singh, J. Hu, R. Zhang, P. Stone, B. Abbatematteo, and R. Martín-Martín, “Telemoma: A modular and versatile teleoperation system for mobile manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2403.07869

[16] A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, S. Savarese, and L. Fei-Fei, “Roboturk: A crowdsourcing platform for robotic skill learning through imitation,” 2018. [Online]. Available: https://arxiv.org/abs/1811.02790

[17] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox, “Mimicgen: A data generation system for scalable robot learning using human demonstrations,” 2023. [Online]. Available: https://arxiv.org/abs/2310.17596

[18] S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu, “Robocasa: Large-scale simulation of everyday tasks for generalist robots,” 2024. [Online]. Available: https://arxiv.org/abs/2406.02523

[19] M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y. Guo, H. Mazhar, A. Mandlekar, B. Babich, G. State, M. Hutter, and A. Garg, “Orbit: A unified simulation framework for interactive robot learning environments,” IEEE Robotics and Automation Letters, vol. 8, no. 6, pp. 3740–3747, June 2023. [Online]. Available: http://dx.doi.org/... (doi:10.1109/lra.2023.3270034)

[20] Y. Zhu, J. Wong, A. Mandlekar, R. Martín-Martín, A. Joshi, S. Nasiriany, and Y. Zhu, “robosuite: A modular simulation framework and benchmark for robot learning,” 2022. [Online]. Available: https://arxiv.org/abs/2009.12293

[21] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,” arXiv preprint arXiv:2306.03310, 2023.

[22] J. Wong, A. Tung, A. Kurenkov, A. Mandlekar, L. Fei-Fei, S. Savarese, and R. Martín-Martín, “Error-aware imitation learning from teleoperation data for mobile manipulation,” 2021. [Online]. Available: https://arxiv.org/abs/2112.05251

[23] A. Mandlekar, J. Booher, M. Spero, A. Tung, A. Gupta, Y. Zhu, A. Garg, S. Savarese, and L. Fei-Fei, “Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity,” 2019. [Online]. Available: https://arxiv.org/abs/1911.04052

[24] P. Wu, Y. Shentu, Z. Yi, X. Lin, and P. Abbeel, “Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators,” [Online]. Available: https://arxiv.org/abs/2309.13037

[26] A. Nicolle, “Threats, bans, and competition: Ripple effects in the global smartphone market,” Nov. 2024. [Online]. Available: https://ssrn.com/abstract=5038275

[27] S. Chen, C. Wang, K. Nguyen, L. Fei-Fei, and C. K. Liu, “Arcap: Collecting high-quality human demonstrations for robot learning with augmented reality feedback,” arXiv preprint arXiv:2410.08464, 2024.

[28] Y. Park, J. S. Bhatia, L. Ankile, and P. Agrawal, “Dexhub and dart: Towards internet scale robot data collection,” arXiv preprint arXiv:2411.02214, 2024.

[29] V. Dhat, N. Walker, and M. Cakmak, “Using 3d mice to control robot manipulators,” in 2024 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2024, pp. 896–900.

[30] J. Kim, M. Song, Y. Lee, M. Jung, and P. Kim, “An empirical evaluation of four off-the-shelf proprietary visual-inertial odometry systems,” 2022. [Online]. Available: https://arxiv.org/abs/2207.06780

[31] N. Walker, X. Yang, A. Garg, M. Cakmak, D. Fox, and C. Pérez-D’Arpino, “Fast explicit-input assistance for teleoperation in clutter,” [Online]. Available: https://arxiv.org/abs/2402.02612

[33] T. Wu, M. Wu, J. Zhang, Y. Gan, and H. Dong, “Graspgf: Learning score-based grasping primitive for human-assisting dexterous grasping,” [Online]. Available: https://arxiv.org/abs/2309.06038

[35] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín, “What matters in learning from offline human demonstrations for robot manipulation,” [Online]. Available: https://arxiv.org/abs/2108.03298

APPENDIX I
INPUT DEVICE DETAILS

COBALT is compatible with a diverse set of input devices:

A. Smartphones

Smartphones provide accurate 6-DoF pose tracking by utilizing existing AR frameworks (ARCore for Android and ARKit for iOS). Our cross-platform mobile application captures both translational and rotation...

Recruitment: A total of 18 consenting participants were recruited.

Training Condition: Half of the participants were randomly assigned to the training group and completed the curriculum (Section III-B) before the main tasks. The other half formed the control group (no prior training).
