PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision

Adam Crespi; Alex Zook; Jonathan Hogins; Pete Parisi; Salehe Erfanian Ebadi; Saurav Dhakad; Steven Borkman; Sujoy Ganguly; You-Cyuan Jhang

arxiv: 2112.09290 · v2 · pith:B5TSQWBOnew · submitted 2021-12-17 · cs.CV · cs.AI · cs.DB · cs.GR · cs.LG

classification: cs.CV, cs.AI, cs.DB, cs.GR, cs.LG

keywords: data, synthetic, human, real, generator, human-centric, increase, keypoint

In recent years, person detection and human pose estimation have made great strides, helped by large-scale labeled datasets. However, these datasets came with no guarantees or analysis of human activities, poses, or context diversity. Additionally, privacy, legal, safety, and ethical concerns may limit the ability to collect more human data. An emerging alternative to real-world data that alleviates some of these issues is synthetic data. However, creating synthetic data generators is very challenging, which prevents researchers from exploring their usefulness. Therefore, we release a human-centric synthetic data generator, PeopleSansPeople, which contains simulation-ready 3D human assets and a parameterized lighting and camera system, and which generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels. Using PeopleSansPeople, we performed benchmark synthetic data training using a Detectron2 Keypoint R-CNN variant [1]. We found that pre-training a network on synthetic data and fine-tuning it on various sizes of real-world data resulted in a keypoint AP increase of $+38.03$ ($44.43 \pm 0.17$ vs. $6.40$) for few-shot transfer (limited subsets of COCO-person train [2]) and an increase of $+1.47$ ($63.47 \pm 0.19$ vs. $62.00$) for abundant real data regimes, outperforming models trained with the same real data alone. We also found that our models outperformed those pre-trained with ImageNet, with a keypoint AP increase of $+22.53$ ($44.43 \pm 0.17$ vs. $21.90$) for few-shot transfer and $+1.07$ ($63.47 \pm 0.19$ vs. $62.40$) for abundant real data regimes. This freely available data generator should enable a wide range of research into the emerging field of simulation-to-real transfer learning in the critical area of human-centric computer vision.
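
Each reported gain is the difference between the keypoint AP after synthetic pre-training and the AP of the corresponding baseline. The short sketch below recomputes the four gains using only the numbers quoted in the abstract. The dictionary keys and the helper name `ap_gain` are hypothetical labels chosen for illustration, not names from the paper, and the $\pm$ spreads are ignored.

```python
from decimal import Decimal

# (AP with synthetic pre-training, baseline AP), copied from the abstract.
# Keys are illustrative labels, not terminology from the paper.
reported = {
    "few-shot vs. real data alone": ("44.43", "6.40"),
    "abundant vs. real data alone": ("63.47", "62.00"),
    "few-shot vs. ImageNet pre-training": ("44.43", "21.90"),
    "abundant vs. ImageNet pre-training": ("63.47", "62.40"),
}


def ap_gain(ours: str, baseline: str) -> Decimal:
    """Return the keypoint AP difference, using exact decimal arithmetic."""
    return Decimal(ours) - Decimal(baseline)


gains = {name: ap_gain(ours, base) for name, (ours, base) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

`Decimal` is used instead of `float` so that the two-decimal values subtract exactly and can be compared directly against the figures in the abstract.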

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-timestamp 3D Human Pose Estimation
cs.CV 2023-12 unverdicted novelty 5.0

LiCamPose combines multi-view RGB and LiDAR inputs via volumetric fusion, pretrains on synthetic data, and applies unsupervised adaptation to achieve robust single-frame 3D human pose estimation on multiple datasets.