Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours

Abhinav Gupta; Lerrel Pinto

arxiv: 1509.06825 · v1 · pith:HHGVKU3Dnew · submitted 2015-09-23 · 💻 cs.LG · cs.CV · cs.RO

classification 💻 cs.LG · cs.CV · cs.RO

keywords grasping · data · experiments · grasp · robot · task · training · attempts

Current learning-based robot grasping approaches exploit human-labeled datasets for training the models. However, there are two problems with such a methodology: (a) since each object can be grasped in multiple ways, manually labeling grasp locations is not a trivial task; (b) human labeling is biased by semantics. While there have been attempts to train robots using trial-and-error experiments, the amount of data used in such experiments remains substantially low and hence makes the learner prone to overfitting. In this paper, we take the leap of increasing the available training data to 40 times more than prior work, leading to a dataset size of 50K data points collected over 700 hours of robot grasping attempts. This allows us to train a Convolutional Neural Network (CNN) for the task of predicting grasp locations without severe overfitting. In our formulation, we recast the regression problem to an 18-way binary classification over image patches. We also present a multi-stage learning approach where a CNN trained in one stage is used to collect hard negatives in subsequent stages. Our experiments clearly show the benefit of using large-scale datasets (and multi-stage training) for the task of grasping. We also compare to several baselines and show state-of-the-art performance on generalization to unseen objects for grasping.
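
The abstract's recasting of grasp regression as an 18-way binary classification can be illustrated with a minimal sketch. This is not the authors' code. It assumes the 18 outputs correspond to evenly spaced gripper-angle bins covering 0 to 180 degrees, and that each grasp attempt supervises only the bin of the angle that was actually tried. The names `angle_to_bin` and `masked_binary_loss` are hypothetical.

```python
import math

NUM_BINS = 18                      # from the abstract: 18-way binary classification
BIN_WIDTH_DEG = 180.0 / NUM_BINS   # assumption: bins evenly cover 0 to 180 degrees


def angle_to_bin(theta_deg: float) -> int:
    """Map a gripper angle in degrees to a bin index in [0, NUM_BINS)."""
    # Assumption: a grasp at theta and at theta + 180 degrees is equivalent,
    # so angles are folded into [0, 180) before binning.
    return int((theta_deg % 180.0) // BIN_WIDTH_DEG)


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def masked_binary_loss(logits, tried_bin: int, success: bool) -> float:
    """Binary cross-entropy on the single angle bin that was tried.

    `logits` holds one raw score per angle bin for an image patch.
    The other bins are left out of the loss because the robot has no
    outcome for angles it did not attempt (an assumption of this sketch).
    """
    z = logits[tried_bin]
    # -log(sigmoid(z)) for a success, -log(1 - sigmoid(z)) for a failure.
    return _softplus(-z) if success else _softplus(z)


# Usage: one patch, a grasp attempted at 95 degrees that succeeded.
logits = [0.0] * NUM_BINS
loss = masked_binary_loss(logits, angle_to_bin(95.0), success=True)
```

Treating each bin as an independent binary output, rather than one softmax over angles, lets a patch be graspable at several angles at once, which matches the abstract's point that an object can be grasped in multiple ways.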

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Human Universal Grasping
cs.RO 2026-06 unverdicted novelty 7.0

HUG trains a flow-matching model on a new 1M-frame egocentric human grasp dataset to generate retargetable grasps from single RGB-D images, beating baselines by 23-34% on a new 90-object benchmark.

Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications
cs.RO 2026-06 unverdicted novelty 2.0

The paper reviews limits in AI vision for robotics and describes work-in-progress on bridging sim-to-real domain gaps by linking real and synthetic training data.