pith. sign in

arxiv: 2410.18647 · v4 · pith:EE7YO3I4new · submitted 2024-10-24 · 💻 cs.RO

Data Scaling Laws in Imitation Learning for Robotic Manipulation

classification 💻 cs.RO
keywords datademonstrationsenvironmentsobjectsscalingnumbergeneralizationcollect
0
0 comments X
read the original abstract

Data scaling has revolutionized fields like natural language processing and computer vision, providing models with remarkable generalization capabilities. In this paper, we investigate whether similar data scaling laws exist in robotics, particularly in robotic manipulation, and whether appropriate data scaling can yield single-task robot policies that can be deployed zero-shot for any object within the same category in any environment. To this end, we conduct a comprehensive empirical study on data scaling in imitation learning. By collecting data across numerous environments and objects, we study how a policy's generalization performance changes with the number of training environments, objects, and demonstrations. Throughout our research, we collect over 40,000 demonstrations and execute more than 15,000 real-world robot rollouts under a rigorous evaluation protocol. Our findings reveal several intriguing results: the generalization performance of the policy follows a roughly power-law relationship with the number of environments and objects. The diversity of environments and objects is far more important than the absolute number of demonstrations; once the number of demonstrations per environment or object reaches a certain threshold, additional demonstrations have minimal effect. Based on these insights, we propose an efficient data collection strategy. With four data collectors working for one afternoon, we collect sufficient data to enable the policies for two tasks to achieve approximately 90% success rates in novel environments with unseen objects.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.

  2. QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    QDTraj uses Quality-Diversity algorithms with sparse rewards to produce at least five times more diverse high-performing trajectories for articulated object manipulation than compared methods, validated across 30 obje...

  3. UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

    cs.RO 2026-04 unverdicted novelty 6.0

    UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.

  4. AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

    cs.RO 2026-04 unverdicted novelty 6.0

    AffordGen generates affordance-aware manipulation demonstrations from 3D mesh correspondences to train policies with zero-shot generalization to novel objects.

  5. EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

    cs.RO 2026-04 unverdicted novelty 6.0

    EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

  6. SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

    cs.RO 2026-03 conditional novelty 6.0

    SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.

  7. R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation

    cs.RO 2025-10 unverdicted novelty 6.0

    R2RGen introduces a simulator-free three-stage pipeline that parses, augments, and post-processes real pointcloud observation-action pairs to improve spatial generalization in robotic manipulation policies.

  8. Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

    cs.RO 2025-08 conditional novelty 6.0

    Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks ...

  9. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    cs.CV 2025-07 unverdicted novelty 6.0

    DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...

  10. VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

    cs.RO 2025-05 conditional novelty 6.0

    VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

  11. GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

    cs.RO 2025-05 unverdicted novelty 6.0

    GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.

  12. $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    cs.LG 2025-04 unverdicted novelty 6.0

    π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.

  13. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    cs.CV 2025-03 unverdicted novelty 6.0

    CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

  14. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  15. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  16. RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields

    cs.RO 2024-12 unverdicted novelty 6.0

    A deep RL vulnerability-prediction policy trained in semantic embedding space finds up to 23% more unique robot manipulation failures than vision-language baselines and enables more efficient fine-tuning.

  17. OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

    cs.RO 2026-04 unverdicted novelty 5.0

    OmniUMI introduces a multimodal handheld interface that synchronously records RGB, depth, trajectory, tactile, internal grasp force, and external wrench data for training diffusion policies on contact-rich robot manipulation.

  18. GR-3 Technical Report

    cs.RO 2025-07 unverdicted novelty 5.0

    GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

  19. General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling

    cs.CV 2026-05 unverdicted novelty 4.0

    GAM framework uses arc-length parameterization for temporal invariance and schema-affine factorization for geometric invariance to build a covariant action manifold integrated into VLA models for improved generalizati...

  20. SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection

    cs.RO 2026-05 conditional novelty 4.0

    SEVO raises ACT and SmolVLA pick-and-place success from 30-35% to 75-85% in novel environments by using active illumination, semantic cues, and diversified teleoperation data.