Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

David Saulpic; David Woodruff; Kyriakos Axiotis; Michael Wunder; Monika Henzinger; Sammy Jerome; Vahab Mirrokni; Vincent Cohen-Addad

arxiv: 2402.17327 · v1 · pith:KX54OV3Cnew · submitted 2024-02-27 · 💻 cs.LG · cs.DS

Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

keywords data · sampling · approach · loss · varepsilon · average · foundation · lambda

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably selects a set of $k + 1/\varepsilon^2$ "typical" elements whose average loss matches the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$ term, where $\Phi_k$ is the $k$-means cost of the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
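
The abstract does not spell out the sampling distribution, so the following is only a minimal sketch of what clustering-based sensitivity sampling can look like, not the authors' implementation. It assumes a common sensitivity-style score: each point's share of the $k$-means cost $\Phi_k$ plus the inverse size of its cluster. The function names `kmeans` and `sensitivity_sample`, the sample size parameter `m`, and the use of plain Lloyd iterations are all illustrative assumptions.

```python
import numpy as np


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm (illustrative only). Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    # Final assignment against the last set of centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


def sensitivity_sample(X, k, m, seed=0):
    """Draw m indices (with replacement) and importance weights.

    ASSUMPTION: score_i = cost_i / Phi_k + 1 / |cluster(i)|, a standard
    sensitivity-style score; the paper's exact distribution may differ.
    The weights make sum(w * loss[idx]) an unbiased estimate of mean(loss).
    """
    rng = np.random.default_rng(seed)
    centers, labels = kmeans(X, k, seed=seed)
    cost = ((X - centers[labels]) ** 2).sum(axis=1)  # per-point k-means cost
    phi_k = cost.sum()                               # total k-means cost
    sizes = np.bincount(labels, minlength=k)
    score = 1.0 / sizes[labels]
    if phi_k > 0:
        score = score + cost / phi_k
    p = score / score.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    w = 1.0 / (len(X) * m * p[idx])
    return idx, w
```

Given per-example losses in a hypothetical array `loss`, the full-data average `loss.mean()` would be estimated by `(w * loss[idx]).sum()`, so only the `m` sampled examples need to be evaluated. Points far from their cluster center, or in small clusters, are sampled more often and down-weighted accordingly.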

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scenario Generation for Testing of Autonomous Driving Systems Using Real-World Failure Records
cs.AI 2026-06 unverdicted novelty 5.0

An LLM-based pipeline generates diverse scenarios from NHTSA crash records for ADS testing in the MetaDrive simulator, identifying failures in limited tests.

ASSS: A Differentiable Adversarial Framework for Task-Aware Data Reduction
cs.LG 2026-01 unverdicted novelty 5.0

ASSS uses an adversarial selector and Gumbel-Softmax relaxation to retain 98.9% task performance with only 30% of the data by preferentially keeping boundary samples.