Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond
We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
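To make the idea of combining $k$-means clustering with sensitivity sampling more concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the scoring rule (a point's share of the $k$-means cost plus the inverse of its cluster size) is a commonly used form of sensitivity score for clustering and is an assumption here, as are the function names `lloyd_kmeans` and `sensitivity_sample` and the parameter `m` (the sample size, which the abstract suggests is on the order of $k + 1/\varepsilon^2$).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def lloyd_kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's iterations; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    # Final assignment, consistent with the final centers.
    return centers, [nearest(p, centers) for p in points]


def sensitivity_sample(points, k, m, seed=0):
    """Draw m indices with probability proportional to an assumed
    sensitivity score; return the indices and importance weights."""
    rng = random.Random(seed)
    centers, assign = lloyd_kmeans(points, k, seed=seed)
    costs = [sq_dist(p, centers[a]) for p, a in zip(points, assign)]
    total_cost = sum(costs)  # k-means cost of the embeddings
    sizes = [assign.count(j) for j in range(k)]
    # Assumed score: share of the clustering cost + share of the cluster.
    scores = [
        (c / total_cost if total_cost > 0 else 0.0) + 1.0 / sizes[a]
        for c, a in zip(costs, assign)
    ]
    z = sum(scores)
    probs = [s / z for s in scores]
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    # Inverse-probability weights: sum(w_i * loss_i) over the sample is an
    # unbiased estimate of the total loss over all points.
    weights = [1.0 / (m * probs[i]) for i in idx]
    return idx, weights
```

Given per-example losses `loss[i]`, the average loss of the full dataset would be estimated as `sum(w * loss[i] for i, w in zip(idx, weights)) / len(points)`. Points far from their cluster center, and points in small clusters, are sampled more often and weighted down accordingly.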
Forward citations
Cited by 2 Pith papers
- Scenario Generation for Testing of Autonomous Driving Systems Using Real-World Failure Records
  LLM-based pipeline generates diverse scenarios from NHTSA crash records for ADS testing in Metadrive simulator, identifying failures in limited tests.
- ASSS: A Differentiable Adversarial Framework for Task-Aware Data Reduction
  ASSS uses an adversarial selector and Gumbel-Softmax relaxation to retain 98.9% task performance with only 30% of the data by preferentially keeping boundary samples.