COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones
Pith reviewed 2026-05-21 07:23 UTC · model grok-4.3
The pith
A cloud teleoperation platform lets anyone with a smartphone contribute robot demonstration data, enabling a 50-hour crowdsourced dataset validated for imitation learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COBALT is a teleoperation platform that uses vectorized environments and load-balanced cloud infrastructure to support dozens of concurrent users on smartphones or other common devices at 20 Hz control, with sub-100 ms end-to-end latency reported for up to 8 concurrent users per GPU. It logs real-time metrics to automatically filter suboptimal demonstrations and uses a user training curriculum to improve quality. The result is a pilot dataset of 7500+ demonstrations, collected over five days and validated by training imitation learning algorithms.
What carries the argument
The COBALT teleoperation platform, which combines vectorized simulation, in-memory data caching, efficient video streaming, and real-time metric logging to enable concurrent multi-user control and data quality filtering at low cost.
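One way to picture the metric-based filtering step is as a set of thresholds applied to per-demonstration summaries. The sketch below is illustrative only: the field names (`mean_latency_ms`, `translational_jitter`, `resets`) and every threshold value are assumptions chosen for the example, not COBALT's actual schema or cutoffs.

```python
from dataclasses import dataclass


@dataclass
class DemoMetrics:
    """Per-demonstration summary; all field names are hypothetical."""
    demo_id: str
    success: bool
    mean_latency_ms: float
    translational_jitter: float
    resets: int


def keep_demo(m, max_latency_ms=100.0, max_jitter=0.05, max_resets=2):
    """Return True when a demonstration passes every (assumed) threshold."""
    return (
        m.success
        and m.mean_latency_ms <= max_latency_ms
        and m.translational_jitter <= max_jitter
        and m.resets <= max_resets
    )


demos = [
    DemoMetrics("a", True, 62.0, 0.01, 0),
    DemoMetrics("b", True, 140.0, 0.01, 0),   # rejected: latency too high
    DemoMetrics("c", False, 55.0, 0.02, 1),   # rejected: task not completed
    DemoMetrics("d", True, 70.0, 0.20, 0),    # rejected: jitter too high
]
kept_ids = [m.demo_id for m in demos if keep_demo(m)]
print(kept_ids)
```

Thresholds this strict trade dataset size for quality, which is the tension the load-bearing premise below points at.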
If this is right
- Imitation learning for manipulation can scale using data collected from consumer smartphones rather than dedicated equipment.
- Teleoperation costs fall sharply when many users share a single GPU through vectorized environments and efficient streaming.
- Automatic filtering via logged metrics and short user training curricula can maintain dataset quality during large-scale crowdsourcing.
- Global participation becomes practical, allowing data collection across many countries in days rather than weeks or months.
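The cost argument rests on many users sharing one batched simulation step. As a rough illustration (a toy stand-in, not COBALT's simulator or API), the sketch below gathers the latest action per user slot and advances all slots in a single call at the 20 Hz control rate; `ToyVectorEnv` and `batch_actions` are hypothetical names, and idle slots are assumed to receive a zero action.

```python
CONTROL_HZ = 20            # control rate stated in the abstract
DT = 1.0 / CONTROL_HZ      # seconds per control tick


class ToyVectorEnv:
    """Toy batched environment: one 3D end-effector position per slot."""

    def __init__(self, num_slots):
        self.positions = [[0.0, 0.0, 0.0] for _ in range(num_slots)]

    def step(self, velocities):
        """Advance every slot by one tick using its commanded velocity."""
        for pos, vel in zip(self.positions, velocities):
            for axis in range(3):
                pos[axis] += vel[axis] * DT
        return [tuple(p) for p in self.positions]


def batch_actions(latest_by_slot, num_slots):
    """Build one action per slot; slots without input get a zero action."""
    zero = (0.0, 0.0, 0.0)
    return [latest_by_slot.get(slot, zero) for slot in range(num_slots)]


env = ToyVectorEnv(num_slots=4)
latest = {0: (1.0, 0.0, 0.0), 2: (0.0, -2.0, 0.0)}  # two active users
obs = env.step(batch_actions(latest, num_slots=4))
print(obs)
```

In a real system the batched step would run on the GPU, so the per-user cost shrinks as more slots are filled.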
Where Pith is reading between the lines
- Similar platforms could extend crowdsourcing to real-robot control once latency and safety issues are addressed.
- The geographic spread of contributors may naturally introduce more diverse environmental and task variations into the data.
- The same infrastructure pattern could apply to other human-in-the-loop data collection tasks such as preference labeling or trajectory annotation.
Load-bearing premise
Phone-based teleoperation produces demonstration data of quality comparable to specialized hardware, allowing real-time metrics to filter suboptimal examples without discarding useful training signal.
What would settle it
Training state-of-the-art imitation learning algorithms on the crowdsourced dataset and measuring success rates on held-out robotic manipulation tasks; rates significantly below those achieved on datasets from expert hardware would falsify the quality claim.
Original abstract
The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.
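To make the scaling figure concrete: 256 clients spread evenly over 8 GPUs is 32 clients per GPU. The abstract does not describe the load-balancing policy, so the sketch below assumes a simple least-loaded assignment purely for illustration; `assign_clients` is a hypothetical helper.

```python
def assign_clients(num_clients, num_gpus):
    """Assign each arriving client to the currently least-loaded GPU."""
    load = [0] * num_gpus
    assignment = []
    for _ in range(num_clients):
        gpu = min(range(num_gpus), key=lambda g: load[g])  # ties go to lowest index
        load[gpu] += 1
        assignment.append(gpu)
    return assignment, load


assignment, load = assign_clients(num_clients=256, num_gpus=8)
print(load)  # → [32, 32, 32, 32, 32, 32, 32, 32]
```

Note that the abstract states the sub-100 ms latency figure for up to 8 concurrent users per GPU, whereas the 256-client result is described as stable operation with simulated clients.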
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents COBALT, a cloud-based teleoperation platform for scalable crowdsourcing of robot demonstration data using smartphones, VR headsets, and other consumer devices. It describes a load-balanced infrastructure supporting concurrent teleoperation on vectorized environments with low latency (sub-100 ms for up to 8 users per GPU, scaling to 256 clients across 8 GPUs), a user study comparing phone-based teleoperation to specialized hardware, real-time metrics for filtering suboptimal demonstrations, a structured user training curriculum, and the collection of a pilot dataset with 7500+ demonstrations (50+ hours) across nine countries. Dataset quality is validated by training state-of-the-art imitation learning algorithms.
Significance. If the central claims hold, the work could meaningfully lower barriers to large-scale imitation learning by enabling low-cost, geographically distributed data collection without specialized hardware. The demonstrated scalability, low-latency synchronous control, and user-study insights on ergonomics represent concrete engineering contributions. The crowdsourced dataset size and multi-country collection are notable for the field. Credit is due for the open infrastructure details and the empirical focus on real-world deployment metrics.
major comments (2)
- [Experiments / dataset validation] The claim that the 7500+ smartphone demonstrations constitute a high-quality dataset rests on training SOTA imitation learning algorithms, yet no quantitative task success rates, success percentages on held-out manipulation tasks, error bars, or direct comparisons to policies trained on specialized-hardware data are reported. This is load-bearing for the central quality claim and leaves the downstream utility of the filtered dataset unanchored.
- [User study] The assertion that phone-based teleoperation performs comparably to or better than specialized hardware is used to justify the crowdsourcing approach, but the manuscript provides insufficient detail on the exact performance metrics (e.g., task completion rates, trajectory smoothness, user fatigue scores), statistical tests, or number of participants, making it difficult to assess whether the comparison is robust enough to support the platform's broader claims.
minor comments (2)
- [Abstract] The latency claim ('sub-100 ms end-to-end latency for up to 8 concurrent users per GPU') would benefit from explicit clarification on whether the figure represents mean, median, or 95th-percentile values and under what network conditions.
- [Dataset description] The manuscript would be strengthened by adding a brief table summarizing key dataset statistics (e.g., average demonstration length, success rate before/after filtering, distribution across countries) to improve readability.
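The distinction raised in the latency comment matters because the mean, median, and 95th percentile can differ widely on the same measurements. The sketch below uses made-up latency samples (not data from the paper) and a nearest-rank percentile, which is one of several common percentile definitions.

```python
import math
import statistics


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in milliseconds, with one slow outlier.
latencies_ms = [40, 42, 45, 50, 55, 60, 65, 70, 80, 180]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
print(mean_ms, median_ms, p95_ms)
```

Here the mean and median both sit below 100 ms while the 95th percentile does not, so a "sub-100 ms" claim would hold or fail depending on which statistic is meant.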
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the work's significance and for the constructive major comments. We address each point below and will revise the manuscript to incorporate additional quantitative details and clarifications as outlined.
- Referee: [Experiments / dataset validation] The claim that the 7500+ smartphone demonstrations constitute a high-quality dataset rests on training SOTA imitation learning algorithms, yet no quantitative task success rates, success percentages on held-out manipulation tasks, error bars, or direct comparisons to policies trained on specialized-hardware data are reported. This is load-bearing for the central quality claim and leaves the downstream utility of the filtered dataset unanchored.
Authors: We agree that quantitative metrics are necessary to fully substantiate the dataset quality claim. In the revised manuscript, we will expand the relevant section to report task success rates and success percentages on held-out manipulation tasks, include error bars from multiple training runs, and add direct comparisons to policies trained on specialized-hardware data where such baselines are available from our experiments. These additions will better anchor the downstream utility of the filtered crowdsourced dataset.
Revision: yes
- Referee: [User study] The assertion that phone-based teleoperation performs comparably to or better than specialized hardware is used to justify the crowdsourcing approach, but the manuscript provides insufficient detail on the exact performance metrics (e.g., task completion rates, trajectory smoothness, user fatigue scores), statistical tests, or number of participants, making it difficult to assess whether the comparison is robust enough to support the platform's broader claims.
Authors: We acknowledge that the current presentation of the user study lacks sufficient granularity. The revised manuscript will specify the number of participants, report exact performance metrics including task completion rates, trajectory smoothness measures, and user fatigue scores, and include the results of statistical tests (such as paired t-tests) to support the comparisons. These details will make the evidence for the comparability or superiority of phone-based teleoperation more transparent and robust.
Revision: yes
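For readers unfamiliar with the paired t-test mentioned in the rebuttal, the sketch below computes the test statistic by hand for two devices used by the same participants. The completion times are invented for illustration and are not results from the paper; `paired_t_statistic` is a hypothetical helper.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)            # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-participant completion times in seconds.
device_a_s = [10.0, 12.0, 14.0, 16.0]
device_b_s = [9.0, 10.0, 13.0, 13.0]

t_stat, dof = paired_t_statistic(device_a_s, device_b_s)
print(round(t_stat, 3), dof)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom; pairing is appropriate here only because each participant used both devices.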
Circularity Check
No significant circularity in empirical systems paper
The paper presents a teleoperation platform, user study, and crowdsourced dataset of 7500+ demonstrations validated through training of state-of-the-art imitation learning algorithms. No mathematical derivations, predictions, or first-principles results are claimed that reduce by construction to fitted parameters or self-citations. The validation step relies on external empirical outcomes from IL training rather than internal loops, and the work is self-contained against observable metrics like data scale and collection quality.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Smartphone teleoperation can produce demonstrations of sufficient quality for imitation learning when filtered by real-time metrics.
- Domain assumption: Concurrent multi-user teleoperation on vectorized environments maintains low latency and stability at scale.
Appendix I: Input Device Details
COBALT is compatible with a diverse set of input devices:
A. Smartphones
Smartphones provide accurate 6-DoF pose tracking by utilizing existing AR frameworks (ARCore for Android and ARKit for iOS). Our cross-platform mobile application captures both translational and rotation...
- Recruitment: A total of 18 consenting participants were recruited.
- Training Condition: Half of the participants were randomly assigned to the training group and completed the curriculum (Section III-B) before the main tasks. The other half formed the control group (no prior training).
- Input Device Assignment: Participants were randomly assigned two input devices from: smartphone, virtual reality (VR) headset, keyboard, and 3D mouse. The order of devices was also randomized. Participants prone to motion sickness were excluded from VR.
- Task Performance: Each participant used their assigned devices to perform four distinct manipulation tasks (Lift, TPA, MC, Coffee; see Figure 6), providing five successful demonstrations per task per device.
- Data Collection: Demonstration data (trajectory, timings, resets) and system metrics (latency, jitter) were collected during the tasks.
- Survey Administration: After completing all tasks with one device, participants completed the Likert-scale questionnaire and the NASA-TLX survey. They were allowed ...
- Bimanual Control Assessment: A separate group of 6 participants evaluated bimanual control for the Two Arm Lift task. Participants were randomly assigned to use either dual smartphones or a VR system first, then switched. Participants prone to motion sickness were excluded from VR.
TABLE VII: Behavior Cloning (BC) Model Success Rates (User Study Tasks)
Model     Lift         TPA          MC           Coffee
BC-RNN    1.00±0.00    0.00±0.01    0.61±0.01    0.49±0.03
BC-TF     0.90±0.02    0.03±0.01    0.64±0.08    0.36±0.21
Note...
B. User Study Dataset Statistics
The user study conducted in this work invo...
- Mental Demand: Level of mental and perceptual activity required (thinking, deciding, calculating, remembering, looking, searching).
- Physical Demand: Amount of physical activity required (pushing, pulling, turning, controlling, activating).
- Temporal Demand: Amount of time pressure felt due to the rate or pace at which the tasks or task elements occurred.
- Performance: How successful the participant felt they were in accomplishing the goals of the task set by the experimenter (their own performance).
- Effort: How hard the participant had to work (mentally and physically) to accomplish their level of performance.
- Frustration: How insecure, discouraged, irritated, stressed, and annoyed versus secure, gratified, content, relaxed, and complacent the participant felt during the task.
2) Likert-Scale Questions
Participants responded to the following statements on a Likert scale from 1 (strongly disagree) to 5 (strongly agree):
- I found it easy to control the robot with the device I used.
- The interface felt intuitive.
- I felt comfortable and confident throughout the task.
- I would be willing to use this system again in the future.
- The tasks seemed appropriate for this type of interface.
- My input device responded accurately to my actions.
3) Open-Ended Questions
Participants provided qualitative feedback on their experience by answering:
- What did you like most about controlling the robot with this device?
- What did you find most difficult or frustrating?
E. Additional User Study Figures
This section contains figures presenting results from the user study surveys and specific evaluation tasks.
Fig. 8: Mean Translational Jitter by device and curriculum condition during the Position Evaluation Task (lower is better). Error bars indicate standard deviation.
Fig. 9:...