Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need
Large datasets have been crucial to the success of deep learning models in recent years, and these models keep performing better as they are trained with more labelled data. While there have been sustained efforts to make these models more data-efficient, the potential benefit of understanding the data itself is largely untapped. Specifically, focusing on object recognition tasks, we ask whether, for common benchmark datasets, we can do better than random subsets of the data and find a subset that, when used for training, generalizes on par with the full dataset. To our knowledge, this is the first result to find notable redundancies (at least 10%) in the CIFAR-10 and ImageNet datasets. Interestingly, we observe semantic correlations between required and redundant images. We hope that our findings can motivate further research into identifying additional redundancies and exploiting them for more efficient training or data collection.
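The abstract does not say how the redundant images are identified, so the following is only a rough sketch of what "doing better than a random subset" could look like in practice. It assumes each image has already been mapped to a feature vector by some pretrained model (the `embeddings` array is hypothetical), and it uses a simple greedy near-duplicate pruning heuristic chosen for illustration, not the paper's confirmed procedure. The function names `random_subset` and `prune_most_redundant` are likewise made up for this example.

```python
import numpy as np


def random_subset(n_items, keep_fraction, seed=0):
    """Baseline: keep a uniformly random fraction of the example indices."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(n_items * keep_fraction)))
    return np.sort(rng.choice(n_items, size=n_keep, replace=False))


def prune_most_redundant(embeddings, keep_fraction):
    """Illustrative heuristic (an assumption, not the paper's method):
    repeatedly find the most similar remaining pair of examples in
    embedding space and drop one member of that pair.

    Assumes every row of `embeddings` is a non-zero vector.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T                      # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)     # ignore self-similarity
    n = len(x)
    n_keep = max(1, int(round(n * keep_fraction)))
    alive = np.ones(n, dtype=bool)
    for _ in range(n - n_keep):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        drop = max(i, j)               # arbitrary tie-break: drop the later index
        alive[drop] = False
        sim[drop, :] = -np.inf         # the dropped example can no longer be paired
        sim[:, drop] = -np.inf
    return np.flatnonzero(alive)


# Toy data: rows 0/1 and rows 2/3 are near-duplicates, row 4 is distinct.
toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 2.0], [-1.0, 0.0]])
kept = prune_most_redundant(toy, keep_fraction=0.6)
baseline = random_subset(len(toy), keep_fraction=0.6)
```

To test the abstract's claim on a real dataset, one would train the same model on `kept`, on `baseline`, and on the full set, then compare held-out accuracy. The pairwise similarity matrix here is quadratic in the number of images, so at CIFAR-10 or ImageNet scale a clustering or approximate nearest-neighbor step would be needed instead.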
Forward citations
Cited by 3 Pith papers
- Interaction-Aware Influence Functions for Group Attribution
  Extends influence functions with a second-order pairwise interaction term that improves group attribution accuracy over simple summation on multiple model-dataset pairs and instruction-tuning selection tasks.
- TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models
  TinyUSFM distills a large ultrasound foundation model into a lightweight version using feature-gradient coreset selection and domain-separated masked image modeling, matching performance on a new 18-dataset benchmark ...
- Data Selection for training Semantic Segmentation CNNs with cross-dataset weak supervision
  Two data selection techniques (GMM visual similarity and bounding-box diversity) reduce required weakly labeled images by up to 100x on Open Images and 20x on Cityscapes while maintaining semantic segmentation performance.