Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?

Akshay Mehra; Joshua Kimball; Trisha Mittal

arxiv: 2606.18209 · v1 · pith:6NZQAQFXnew · submitted 2026-06-16 · 💻 cs.LG

Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?

Trisha Mittal , Akshay Mehra , Joshua Kimball This is my paper

Pith reviewed 2026-06-27 01:43 UTC · model grok-4.3

classification 💻 cs.LG

keywords dataset distillationcoreset selectiondata pruningImageNetclassificationdata efficiencymachine learning

0 comments

The pith

Dataset distillation methods match or fall short of coreset selection on large image classification tasks while costing more to build.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether creating synthetic training samples through dataset distillation delivers better results than simply choosing a small number of real examples with coreset selection. It runs this test on ImageNet-1K and related datasets using the same training protocols for both approaches. The findings indicate that top distillation techniques reach accuracy levels comparable to or below those of coresets, with some distillation sets performing no better than random selection. Coresets also show stronger coverage of the original data distribution and require far less computation to produce. This challenges the idea that synthetic samples hold an inherent advantage over selected real data for reducing training sets.

Core claim

Through standardized benchmarks of seven state-of-the-art dataset distillation methods against three coreset strategies on ImageNet-1K, ImageNet100, and ImageNette using three training protocols, the work establishes that leading distillation approaches achieve accuracy comparable to or lower than coresets on large-scale datasets, incur substantially higher construction costs, and deliver inferior representativeness, diversity, and quality compared to coresets, while some distillation methods fail to beat random subsets.

What carries the argument

Direct head-to-head comparison of seven SOTA dataset distillation methods versus three coreset strategies under three fixed training protocols, measured by accuracy, construction cost, and metrics for data distribution coverage.

If this is right

Some dataset distillation methods underperform even random subsets of real data.
Current leading distillation techniques remain comparable to or worse than coresets in final model accuracy on large datasets.
Dataset distillation requires substantially higher computational cost to construct the condensed set than coreset selection.
Coresets consistently provide better coverage, diversity, and quality relative to the original data distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern holds across more settings, practitioners may prefer coreset selection over distillation for practical data reduction in classification.
The results suggest testing whether the expressiveness advantage claimed for synthetic samples survives when total compute for construction is held fixed.
Extending the same standardized comparison to non-image tasks or other modalities could show whether the relative performance is domain-specific.

Load-bearing premise

The three chosen coreset strategies and three training protocols fairly represent typical use cases without favoring one method through implementation details or dataset effects.

What would settle it

A state-of-the-art distillation method achieving clearly higher accuracy than the best coreset at lower total construction cost on ImageNet-1K under identical training conditions would contradict the central result.

Figures

Figures reproduced from arXiv: 2606.18209 by Akshay Mehra, Joshua Kimball, Trisha Mittal.

**Figure 1.** Figure 1: Coresets consistently outperforms distilled sets in accuracy, data coverage, and construction efficiency. We report results on ImageNet-1K using condensed sets consisting of 50 images per class. (a) Relative accuracy gains of a ResNet-50 trained on coreset identified using AUM [5] in comparison to three SOTA DD methods (FADRM+ [6], ManifoldGD [7], VLCP [8]) for three evaluation protocols. (b) Representativ… view at source ↗

**Figure 2.** Figure 2: Experiment Design Overview: We start with a full dataset, and create coresets using the CS methods (green) and also distilled datasets using the various DD methods (orange). We then train various student models on these coresets or distilled datasets via three training protocols and benchmark the performance of the trained student models on the test set. We experiment with three training protocols (blue); … view at source ↗

**Figure 3.** Figure 3: FID comparison for 50 IPC condensed datasets for ImageNet-1K. original data distribution. Diversity is defined as 1 − maxi̸=j cos f T i , fT j , measuring redundancy within the condensed set. Higher representativeness indicates better coverage, while higher diversity reflects a greater sample spread. We report an average for all the metrics computed class-wise for 50 IPC sets for one representative metho… view at source ↗

**Figure 4.** Figure 4: Coreset created by AUM exhibits greater intra [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Representativeness–diversity trade-off at 50 IPC for ImageNet-100, and ImageNette. Results show that AUM achieves a more favorable balance, occupying the upper right corner, compared to SOTA DD methods. C.1 Additional results for diversity and representativeness [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Representativeness–diversity trade-off at 10 IPC on ImageNet-1K, ImageNet-100, and ImageNette. Similar to the 50 IPC results presented, AUM achieves a more favorable balance between representativeness and diversity, while DD methods exhibit reduced diversity. 80 90 100 110 120 130 140 150 160 FID RANDOM FADRM MANIFOLDGD VLCP AUM 4.0% (a) ImageNet-100 50 75 100 125 150 175 200 225 250 FID RANDOM FADRM MANIF… view at source ↗

**Figure 7.** Figure 7: Fréchet Inception Distance (FID) comparison for condensed datasets for [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: 10 images per class sampled from condensed datasets produced by VLCP, ManifoldGD, FADRM+, and AUM for 4 [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: 10 images per class sampled from condensed datasets produced by VLCP, ManifoldGD, FADRM+, and AUM for 4 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

read the original abstract

Dataset distillation (DD) has emerged as a prominent approach in data centric machine learning, aiming to synthesize compact training sets for efficient training by compressing the information in large datasets into a small number of synthetic samples. However, DD methods are often evaluated under inconsistent evaluation protocols, ranging from standard ERM to single/multi-teacher supervision, making it difficult to isolate the effectiveness of distilled data from evaluation. Moreover, many prior methods claim that DD outperforms data pruning approaches such as coreset selection (CS), based on the assumption that restricting condensed datasets to subsets of real samples fundamentally limits their expressiveness. In this work, we critically evaluate DD methods through large-scale experiments using standardized datasets and evaluation protocols to assess their intrinsic effectiveness. We benchmark seven state-of-the-art (SOTA) DD methods on ImageNet-1K, ImageNet100, and ImageNette, using three widely adopted training protocols against three CS strategies. Our results show that while some DD methods fail to outperform even simple random subsets, the SOTA DD approaches are comparable to or worse than coresets on large-scale datasets and incur a substantially higher cost for construction. Beyond accuracy, we also evaluate the representativeness, diversity, and quality of condensed sets, and find that coresets consistently achieve better coverage of the original data distribution. These findings highlight the limited practical advantages of current DD methods and show that coresets remain competitive and are often a more computationally efficient alternative for data-centric learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Standardized comparisons on ImageNet-1K show SOTA DD methods match or underperform coresets with higher build cost, but protocol choices may limit how much this isolates data quality.

read the letter

The main thing to know is that this paper runs a big head-to-head on ImageNet-1K and finds that seven SOTA dataset distillation methods end up comparable to or worse than three coreset strategies under three fixed training protocols, while costing more to produce. Some DD sets even lose to random subsets.

What stands out as new is the scale of the standardized comparison itself. Prior DD work used mismatched setups, so direct claims about beating coresets were hard to judge. This work fixes that by testing the same seven methods across ImageNet-1K, ImageNet-100, and ImageNette, plus extra checks on how well the sets cover the original distribution. That empirical scope is the useful part.

The soft spot is whether the three protocols give DD methods a fair test. Many DD papers originally rely on multi-teacher supervision or custom distillation losses. If the chosen protocols stick to plain ERM or single-teacher setups, any gap could come from that mismatch rather than the synthetic samples being less expressive. The abstract flags inconsistent prior protocols but does not spell out whether the selected ones include the supervision regimes where the DD methods were validated. That detail matters for the central claim.

The representativeness and diversity metrics are a reasonable addition, though their strength depends on the exact computation. Overall this is for people picking data-reduction tools for large-scale training. It deserves a serious referee because the scale and the direct comparison address a real practical question, even if the protocol section needs tightening.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that under standardized evaluation protocols, state-of-the-art dataset distillation (DD) methods do not outperform coreset selection (CS) methods on large-scale datasets including ImageNet-1K, ImageNet-100, and ImageNette. Benchmarking seven SOTA DD methods against three CS strategies with three training protocols shows that some DD methods fail to beat random subsets, SOTA DD is comparable to or worse than CS while incurring higher construction costs, and coresets achieve better coverage, representativeness, diversity, and quality of the condensed sets. The work emphasizes inconsistent prior protocols as motivation for the standardized comparison.

Significance. If the empirical results hold, the paper would provide a valuable large-scale re-evaluation of DD's practical advantages over simpler selection methods, potentially redirecting data-centric ML research toward computationally efficient alternatives. Strengths include the scale of experiments on ImageNet-1K, use of multiple protocols and additional metrics beyond accuracy, and direct comparison that addresses prior claims of DD superiority based on expressiveness assumptions.

major comments (1)

[Abstract] Abstract: The central claim that the three training protocols allow assessment of the 'intrinsic effectiveness' of distilled data (and thus that SOTA DD is comparable/worse than CS) requires that these protocols match the supervision regimes (e.g., multi-teacher or custom distillation losses) under which the seven DD methods originally demonstrated superiority. The manuscript acknowledges inconsistent prior protocols but provides no explicit verification or details that the chosen protocols include those regimes; without this, performance gaps may reflect protocol mismatch rather than the distilled sets themselves being less expressive than selected real samples.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we respond to the major comment.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the three training protocols allow assessment of the 'intrinsic effectiveness' of distilled data (and thus that SOTA DD is comparable/worse than CS) requires that these protocols match the supervision regimes (e.g., multi-teacher or custom distillation losses) under which the seven DD methods originally demonstrated superiority. The manuscript acknowledges inconsistent prior protocols but provides no explicit verification or details that the chosen protocols include those regimes; without this, performance gaps may reflect protocol mismatch rather than the distilled sets themselves being less expressive than selected real samples.

Authors: We appreciate the referee highlighting this point. Our three protocols (standard ERM, single-teacher, and multi-teacher supervision) were selected precisely because they represent the most common evaluation regimes appearing across the DD literature; they are detailed in Section 3.2. The goal of the work is to isolate the quality of the distilled sets themselves under a single, consistent evaluation framework rather than to reproduce each method's original (and mutually inconsistent) training procedure. This standardization is what enables the head-to-head comparison with coreset methods. That said, we agree that an explicit cross-reference would remove any ambiguity. We will add a table in the revised manuscript that maps each of the seven DD methods to the supervision regime(s) used in its original paper and indicates which of our three protocols align with it. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking on external public datasets

full rationale

The paper performs direct head-to-head accuracy, diversity, and cost comparisons of seven DD methods against three CS strategies on ImageNet-1K, ImageNet-100, and ImageNette under three fixed training protocols. No equations, parameter fits, or first-principles derivations are present; all reported outcomes are measured quantities from independent runs on public benchmarks. Self-citations, if any, are limited to prior method descriptions and do not bear the central claim. The work is therefore self-contained against external data and scores 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard supervised learning evaluation assumptions and the representativeness of the chosen coreset baselines; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Standard i.i.d. train/test splits and ERM training assumptions hold for the ImageNet variants used.
Invoked when applying consistent training protocols across DD and CS sets.

pith-pipeline@v0.9.1-grok · 5797 in / 1175 out tokens · 45607 ms · 2026-06-27T01:43:50.745257+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 16 canonical work pages · 5 internal anchors

[1]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

2022
[4]

Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024

2024
[5]

Identifying mislabeled data using the area under the margin ranking

Geoff Pleiss et al. Identifying mislabeled data using the area under the margin ranking. In NeurIPS, 2020

2020
[6]

Fadrm: Fast and accurate data residual matching for dataset distillation

Jiacheng Cui, Xinyue Bi, Yaxin Luo, Xiaohan Zhao, Jiacheng Liu, and Zhiqiang Shen. Fadrm: Fast and accurate data residual matching for dataset distillation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
[7]

Mani- foldgd: Training-free hierarchical manifold guidance for diffusion-based dataset distillation

Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty, and Vishnu Suresh Lokhande. Mani- foldgd: Training-free hierarchical manifold guidance for diffusion-based dataset distillation. arXiv preprint arXiv:2602.23295, 2026

work page arXiv 2026
[8]

Dataset distillation via vision-language category prototype

Yawen Zou, Guang Li, Duo Su, Zi Wang, Jun Yu, and Chao Zhang. Dataset distillation via vision-language category prototype. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[9]

The evolution of dataset distillation: Toward scalable and generalizable solutions.arXiv preprint arXiv:2502.05673, 2025

Ping Liu and Jiawei Du. The evolution of dataset distillation: Toward scalable and generalizable solutions.arXiv preprint arXiv:2502.05673, 2025

work page arXiv 2025
[10]

Dataset distillation in the era of large-scale data: Methods, analysis, and future directions.Authorea Preprints, 2025

Xinyi Shang, Peng Sun, Zhiqiang Shen, Tao Lin, and Jing-Hao Xue. Dataset distillation in the era of large-scale data: Methods, analysis, and future directions.Authorea Preprints, 2025

2025
[11]

Coresets for data-efficient training of machine learning models

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. InInternational Conference on Machine Learning, pages 6950–6960. PMLR, 2020. 10

2020
[12]

Deepcore: A comprehensive library for coreset selection in deep learning

Chengcheng Guo, Bo Zhao, and Yanbing Bai. Deepcore: A comprehensive library for coreset selection in deep learning. InInternational Conference on Database and Expert Systems Applications, pages 181–195. Springer, 2022

2022
[13]

Taming diffusion for dataset distillation with high representativeness

Lin Zhao, Yushu Wu, Xinru Jiang, Jianyang Gu, Yanzhi Wang, Xiaolin Xu, Pu Zhao, and Xue Lin. Taming diffusion for dataset distillation with high representativeness. InForty-second International Conference on Machine Learning
[14]

Chan Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, and Mubarak Shah

Jeffrey A. Chan Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, and Mubarak Shah. MGD 3: Mode-guided dataset distillation using diffusion models. InProceedings of the 42nd International Conference on Machine Learning (ICML), 2025

2025
[15]

Efficient dataset distillation via minimax diffusion

Jianyang Gu, Saeed Vahidian, Vyacheslav Kungurtsev, Haonan Wang, Wei Jiang, Yang You, and Yiran Chen. Efficient dataset distillation via minimax diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[16]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[17]

Coreset selection via llm-based concept bottlenecks.arXiv preprint arXiv:2502.16733, 2025

Akshay Mehra, Trisha Mittal, Subhadra Gopalakrishnan, and Joshua Kimball. Coreset selection via llm-based concept bottlenecks.arXiv preprint arXiv:2502.16733, 2025

work page arXiv 2025
[18]

On the diversity and realism of distilled dataset: An efficient dataset distillation paradigm, 2024

Peng Sun, Bei Shi, Daiwei Yu, and Tao Lin. On the diversity and realism of distilled dataset: An efficient dataset distillation paradigm, 2024

2024
[19]

Elucidating the design space of dataset condensation.arXiv preprint arXiv:2404.13733, 2024

Shitong Shao, Zikai Zhou, Huanran Chen, and Zhiqiang Shen. Elucidating the design space of dataset condensation.arXiv preprint arXiv:2404.13733, 2024

work page arXiv 2024
[20]

D^4: Dataset distillation via disentangled diffusion model

Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, and Bowen Tang. D^4: Dataset distillation via disentangled diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5809–5818, June 2024

2024
[21]

Task-specific generative dataset distillation with difficulty-guided sampling

Mingzhuo Li, Guang Li, Jiafeng Mao, Linfeng Ye, Takahiro Ogawa, and Miki Haseyama. Task-specific generative dataset distillation with difficulty-guided sampling. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025

2025
[22]

Fast-dit: Fast diffusion models with transformers

Chuanyang Jin and Saining Xie. Fast-dit: Fast diffusion models with transformers. https: //github.com/chuanyangjin/fast-DiT, 2024

2024
[23]

Generalized large-scale data condensation via various backbone and statistical matching, 2024

Shitong Shao, Zeyuan Yin, Muxin Zhou, Xindong Zhang, and Zhiqiang Shen. Generalized large-scale data condensation via various backbone and statistical matching, 2024

2024
[24]

Squeeze, recover and relabel: Dataset condensation at imagenet scale from a new perspective.Advances in Neural Information Processing Systems, 36:73582–73603, 2023

Zeyuan Yin, Eric Xing, and Zhiqiang Shen. Squeeze, recover and relabel: Dataset condensation at imagenet scale from a new perspective.Advances in Neural Information Processing Systems, 36:73582–73603, 2023

2023
[25]

Dataset condensation with gradient matching

Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. InInternational Conference on Learning Representations (ICLR), 2021

2021
[26]

Dataset distillation by matching training trajectories

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 4750–4759, 2022

2022
[27]

An empirical study of example forgetting during deep neural network learning.arXiv preprint arXiv:1812.05159, 2018

Mariya Toneva et al. An empirical study of example forgetting during deep neural network learning.arXiv preprint arXiv:1812.05159, 2018

work page arXiv 2018
[28]

Selection via proxy: Efficient data selection for deep learning.arXiv preprint arXiv:1906.11829, 2019

Cody Coleman et al. Selection via proxy: Efficient data selection for deep learning.arXiv preprint arXiv:1906.11829, 2019

work page arXiv 1906
[29]

A label is worth a thousand images in dataset distillation.Advances in Neural Information Processing Systems, 37:131946–131971, 2024

Tian Qin, Zhiwei Deng, and David Alvarez-Melis. A label is worth a thousand images in dataset distillation.Advances in Neural Information Processing Systems, 37:131946–131971, 2024

2024
[30]

Dd-ranking: Rethinking the evaluation of dataset distillation.arXiv preprint arXiv:2505.13300, 2025

Zekai Li, Xinhao Zhong, Samir Khaki, Zhiyuan Liang, Yuhao Zhou, Mingjia Shi, Ziqiao Wang, Xuanlei Zhao, Wangbo Zhao, Ziheng Qin, et al. Dd-ranking: Rethinking the evaluation of dataset distillation.arXiv preprint arXiv:2505.13300, 2025. 11

work page arXiv 2025
[31]

Rectified decoupled dataset distillation: A closer look for fair and comprehensive evaluation.arXiv preprint arXiv:2509.19743, 2025

Xinhao Zhong, Shuoyang Sun, Xulin Gu, Chenyang Zhu, Bin Chen, and Yaowei Wang. Rectified decoupled dataset distillation: A closer look for fair and comprehensive evaluation.arXiv preprint arXiv:2509.19743, 2025

work page arXiv 2025
[32]

Rethinking dataset distillation: Hard truths about soft labels

Priyam Dey, Aditya Sahdev, Sunny Bhati, Konda Reddy Mopuri, and Venkatesh Babu Radhakr- ishnan. Rethinking dataset distillation: Hard truths about soft labels. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 178–187, 2026

2026
[33]

Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression

Lingao Xiao, Songhua Liu, Yang He, and Xinchao Wang. Rethinking large-scale dataset compression: Shifting focus from labels to images.arXiv preprint arXiv:2502.06434, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Coverage-centric coreset selection for high pruning rates.arXiv preprint arXiv:2210.15809, 2022

Haizhong Zheng, Rui Liu, Fan Lai, and Atul Prakash. Coverage-centric coreset selection for high pruning rates.arXiv preprint arXiv:2210.15809, 2022

work page arXiv 2022
[35]

Elfs: Enhancing label-free coreset selection via clustering-based pseudo-labeling

Haizhong Zheng, Elisa Tsai, Yifu Lu, Jiachen Sun, Brian R Bartoldson, Bhavya Kailkhura, and Atul Prakash. Elfs: Enhancing label-free coreset selection via clustering-based pseudo-labeling. arXiv preprint arXiv:2406.04273, 2024

work page arXiv 2024
[36]

Dataset condensation with distribution matching

Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 6514–6523, 2023

2023
[37]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

2009
[38]

Contrastive multiview coding

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. InEuropean conference on computer vision, pages 776–794. Springer, 2020

2020
[39]

Imagenette: A smaller subset of 10 easily classified classes from imagenet

Jeremy Howard. Imagenette: A smaller subset of 10 easily classified classes from imagenet. https://github.com/fastai/imagenette, April 2019. Accessed: 2019

2019
[40]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[41]

Synthesizing informative training samples with gan, 2022

Bo Zhao and Hakan Bilen. Synthesizing informative training samples with gan, 2022

2022
[42]

Active Learning for Convolutional Neural Networks: A Core-Set Approach

Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach.arXiv preprint arXiv:1708.00489, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[43]

Turning big data into tiny data: Constant-size coresets for k-means.SIAM Journal on Computing, 2020

Dan Feldman et al. Turning big data into tiny data: Constant-size coresets for k-means.SIAM Journal on Computing, 2020

2020
[44]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InICML, 2017

2017
[45]

Gradmatch: Gradient matching based data subset selection

Krishnateja Killamsetty et al. Gradmatch: Gradient matching based data subset selection. In ICML, 2021

2021
[46]

Milo: Model-agnostic subset selection framework for efficient model training and tuning.arXiv preprint arXiv:2301.13287, 2023

Krishnateja Killamsetty, Alexandre V Evfimievski, Tejaswini Pedapati, Kiran Kate, Lucian Popa, and Rishabh Iyer. Milo: Model-agnostic subset selection framework for efficient model training and tuning.arXiv preprint arXiv:2301.13287, 2023

work page arXiv 2023
[47]

Provable data subset selection for efficient neural networks training

Murad Tukan, Samson Zhou, Alaa Maalouf, Daniela Rus, Vladimir Braverman, and Dan Feldman. Provable data subset selection for efficient neural networks training. InInternational Conference on Machine Learning, pages 34533–34555. PMLR, 2023

2023
[48]

Grad-match: Gradient matching based data subset selection for efficient deep model training

Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: Gradient matching based data subset selection for efficient deep model training. InInternational Conference on Machine Learning, pages 5464–5474. PMLR, 2021. 12 Appendix We present an additional details of DD and CS methods, followed by details of t...

2021

[1] [1]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

2022

[4] [4]

Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024

2024

[5] [5]

Identifying mislabeled data using the area under the margin ranking

Geoff Pleiss et al. Identifying mislabeled data using the area under the margin ranking. In NeurIPS, 2020

2020

[6] [6]

Fadrm: Fast and accurate data residual matching for dataset distillation

Jiacheng Cui, Xinyue Bi, Yaxin Luo, Xiaohan Zhao, Jiacheng Liu, and Zhiqiang Shen. Fadrm: Fast and accurate data residual matching for dataset distillation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

[7] [7]

Mani- foldgd: Training-free hierarchical manifold guidance for diffusion-based dataset distillation

Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty, and Vishnu Suresh Lokhande. Mani- foldgd: Training-free hierarchical manifold guidance for diffusion-based dataset distillation. arXiv preprint arXiv:2602.23295, 2026

work page arXiv 2026

[8] [8]

Dataset distillation via vision-language category prototype

Yawen Zou, Guang Li, Duo Su, Zi Wang, Jun Yu, and Chao Zhang. Dataset distillation via vision-language category prototype. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[9] [9]

The evolution of dataset distillation: Toward scalable and generalizable solutions.arXiv preprint arXiv:2502.05673, 2025

Ping Liu and Jiawei Du. The evolution of dataset distillation: Toward scalable and generalizable solutions.arXiv preprint arXiv:2502.05673, 2025

work page arXiv 2025

[10] [10]

Dataset distillation in the era of large-scale data: Methods, analysis, and future directions.Authorea Preprints, 2025

Xinyi Shang, Peng Sun, Zhiqiang Shen, Tao Lin, and Jing-Hao Xue. Dataset distillation in the era of large-scale data: Methods, analysis, and future directions.Authorea Preprints, 2025

2025

[11] [11]

Coresets for data-efficient training of machine learning models

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. InInternational Conference on Machine Learning, pages 6950–6960. PMLR, 2020. 10

2020

[12] [12]

Deepcore: A comprehensive library for coreset selection in deep learning

Chengcheng Guo, Bo Zhao, and Yanbing Bai. Deepcore: A comprehensive library for coreset selection in deep learning. InInternational Conference on Database and Expert Systems Applications, pages 181–195. Springer, 2022

2022

[13] [13]

Taming diffusion for dataset distillation with high representativeness

Lin Zhao, Yushu Wu, Xinru Jiang, Jianyang Gu, Yanzhi Wang, Xiaolin Xu, Pu Zhao, and Xue Lin. Taming diffusion for dataset distillation with high representativeness. InForty-second International Conference on Machine Learning

[14] [14]

Chan Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, and Mubarak Shah

Jeffrey A. Chan Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, and Mubarak Shah. MGD 3: Mode-guided dataset distillation using diffusion models. InProceedings of the 42nd International Conference on Machine Learning (ICML), 2025

2025

[15] [15]

Efficient dataset distillation via minimax diffusion

Jianyang Gu, Saeed Vahidian, Vyacheslav Kungurtsev, Haonan Wang, Wei Jiang, Yang You, and Yiran Chen. Efficient dataset distillation via minimax diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[16] [16]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[17] [17]

Coreset selection via llm-based concept bottlenecks.arXiv preprint arXiv:2502.16733, 2025

Akshay Mehra, Trisha Mittal, Subhadra Gopalakrishnan, and Joshua Kimball. Coreset selection via llm-based concept bottlenecks.arXiv preprint arXiv:2502.16733, 2025

work page arXiv 2025

[18] [18]

On the diversity and realism of distilled dataset: An efficient dataset distillation paradigm, 2024

Peng Sun, Bei Shi, Daiwei Yu, and Tao Lin. On the diversity and realism of distilled dataset: An efficient dataset distillation paradigm, 2024

2024

[19] [19]

Elucidating the design space of dataset condensation.arXiv preprint arXiv:2404.13733, 2024

Shitong Shao, Zikai Zhou, Huanran Chen, and Zhiqiang Shen. Elucidating the design space of dataset condensation.arXiv preprint arXiv:2404.13733, 2024

work page arXiv 2024

[20] [20]

D^4: Dataset distillation via disentangled diffusion model

Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, and Bowen Tang. D^4: Dataset distillation via disentangled diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5809–5818, June 2024

2024

[21] [21]

Task-specific generative dataset distillation with difficulty-guided sampling

Mingzhuo Li, Guang Li, Jiafeng Mao, Linfeng Ye, Takahiro Ogawa, and Miki Haseyama. Task-specific generative dataset distillation with difficulty-guided sampling. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025

2025

[22] [22]

Fast-dit: Fast diffusion models with transformers

Chuanyang Jin and Saining Xie. Fast-dit: Fast diffusion models with transformers. https: //github.com/chuanyangjin/fast-DiT, 2024

2024

[23] [23]

Generalized large-scale data condensation via various backbone and statistical matching, 2024

Shitong Shao, Zeyuan Yin, Muxin Zhou, Xindong Zhang, and Zhiqiang Shen. Generalized large-scale data condensation via various backbone and statistical matching, 2024

2024

[24] [24]

Squeeze, recover and relabel: Dataset condensation at imagenet scale from a new perspective.Advances in Neural Information Processing Systems, 36:73582–73603, 2023

Zeyuan Yin, Eric Xing, and Zhiqiang Shen. Squeeze, recover and relabel: Dataset condensation at imagenet scale from a new perspective.Advances in Neural Information Processing Systems, 36:73582–73603, 2023

2023

[25] [25]

Dataset condensation with gradient matching

Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. InInternational Conference on Learning Representations (ICLR), 2021

2021

[26] [26]

Dataset distillation by matching training trajectories

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 4750–4759, 2022

2022

[27] [27]

An empirical study of example forgetting during deep neural network learning.arXiv preprint arXiv:1812.05159, 2018

Mariya Toneva et al. An empirical study of example forgetting during deep neural network learning.arXiv preprint arXiv:1812.05159, 2018

work page arXiv 2018

[28] [28]

Selection via proxy: Efficient data selection for deep learning.arXiv preprint arXiv:1906.11829, 2019

Cody Coleman et al. Selection via proxy: Efficient data selection for deep learning.arXiv preprint arXiv:1906.11829, 2019

work page arXiv 1906

[29] [29]

A label is worth a thousand images in dataset distillation.Advances in Neural Information Processing Systems, 37:131946–131971, 2024

Tian Qin, Zhiwei Deng, and David Alvarez-Melis. A label is worth a thousand images in dataset distillation.Advances in Neural Information Processing Systems, 37:131946–131971, 2024

2024

[30] [30]

Dd-ranking: Rethinking the evaluation of dataset distillation.arXiv preprint arXiv:2505.13300, 2025

Zekai Li, Xinhao Zhong, Samir Khaki, Zhiyuan Liang, Yuhao Zhou, Mingjia Shi, Ziqiao Wang, Xuanlei Zhao, Wangbo Zhao, Ziheng Qin, et al. Dd-ranking: Rethinking the evaluation of dataset distillation.arXiv preprint arXiv:2505.13300, 2025. 11

work page arXiv 2025

[31] [31]

Rectified decoupled dataset distillation: A closer look for fair and comprehensive evaluation.arXiv preprint arXiv:2509.19743, 2025

Xinhao Zhong, Shuoyang Sun, Xulin Gu, Chenyang Zhu, Bin Chen, and Yaowei Wang. Rectified decoupled dataset distillation: A closer look for fair and comprehensive evaluation.arXiv preprint arXiv:2509.19743, 2025

work page arXiv 2025

[32] [32]

Rethinking dataset distillation: Hard truths about soft labels

Priyam Dey, Aditya Sahdev, Sunny Bhati, Konda Reddy Mopuri, and Venkatesh Babu Radhakr- ishnan. Rethinking dataset distillation: Hard truths about soft labels. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 178–187, 2026

2026

[33] [33]

Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression

Lingao Xiao, Songhua Liu, Yang He, and Xinchao Wang. Rethinking large-scale dataset compression: Shifting focus from labels to images.arXiv preprint arXiv:2502.06434, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Coverage-centric coreset selection for high pruning rates.arXiv preprint arXiv:2210.15809, 2022

Haizhong Zheng, Rui Liu, Fan Lai, and Atul Prakash. Coverage-centric coreset selection for high pruning rates.arXiv preprint arXiv:2210.15809, 2022

work page arXiv 2022

[35] [35]

Elfs: Enhancing label-free coreset selection via clustering-based pseudo-labeling

Haizhong Zheng, Elisa Tsai, Yifu Lu, Jiachen Sun, Brian R Bartoldson, Bhavya Kailkhura, and Atul Prakash. Elfs: Enhancing label-free coreset selection via clustering-based pseudo-labeling. arXiv preprint arXiv:2406.04273, 2024

work page arXiv 2024

[36] [36]

Dataset condensation with distribution matching

Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 6514–6523, 2023

2023

[37] [37]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

2009

[38] [38]

Contrastive multiview coding

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. InEuropean conference on computer vision, pages 776–794. Springer, 2020

2020

[39] [39]

Imagenette: A smaller subset of 10 easily classified classes from imagenet

Jeremy Howard. Imagenette: A smaller subset of 10 easily classified classes from imagenet. https://github.com/fastai/imagenette, April 2019. Accessed: 2019

2019

[40] [40]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[41] [41]

Synthesizing informative training samples with gan, 2022

Bo Zhao and Hakan Bilen. Synthesizing informative training samples with gan, 2022

2022

[42] [42]

Active Learning for Convolutional Neural Networks: A Core-Set Approach

Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach.arXiv preprint arXiv:1708.00489, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[43] [43]

Turning big data into tiny data: Constant-size coresets for k-means.SIAM Journal on Computing, 2020

Dan Feldman et al. Turning big data into tiny data: Constant-size coresets for k-means.SIAM Journal on Computing, 2020

2020

[44] [44]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InICML, 2017

2017

[45] [45]

Gradmatch: Gradient matching based data subset selection

Krishnateja Killamsetty et al. Gradmatch: Gradient matching based data subset selection. In ICML, 2021

2021

[46] [46]

Milo: Model-agnostic subset selection framework for efficient model training and tuning.arXiv preprint arXiv:2301.13287, 2023

Krishnateja Killamsetty, Alexandre V Evfimievski, Tejaswini Pedapati, Kiran Kate, Lucian Popa, and Rishabh Iyer. Milo: Model-agnostic subset selection framework for efficient model training and tuning.arXiv preprint arXiv:2301.13287, 2023

work page arXiv 2023

[47] [47]

Provable data subset selection for efficient neural networks training

Murad Tukan, Samson Zhou, Alaa Maalouf, Daniela Rus, Vladimir Braverman, and Dan Feldman. Provable data subset selection for efficient neural networks training. InInternational Conference on Machine Learning, pages 34533–34555. PMLR, 2023

2023

[48] [48]

Grad-match: Gradient matching based data subset selection for efficient deep model training

Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: Gradient matching based data subset selection for efficient deep model training. InInternational Conference on Machine Learning, pages 5464–5474. PMLR, 2021. 12 Appendix We present an additional details of DD and CS methods, followed by details of t...

2021