pith. machine review for the scientific record.

arxiv: 2604.11810 · v1 · submitted 2026-04-09 · 💻 cs.DB · cs.AI

Recognition: 2 theorem links · Lean Theorem

GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3

classification 💻 cs.DB cs.AI
keywords: coreset selection · large language models · dynamic selection · k-NN graph · gradient importance · training efficiency · representation diversity

The pith

GRACE dynamically selects and updates representative data subsets for large language models using a k-NN graph to combine diversity and gradient importance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Training large language models on full datasets requires enormous computational resources and time. Existing coreset selection techniques are static and cannot adjust as the model's focus shifts during training. GRACE builds and refreshes small representative subsets by blending representation diversity with gradient-based importance signals. A k-NN graph propagates these scores so that updates stay cheap, without recomputing everything from scratch. Experiments across three benchmarks show the approach shortens training while often improving downstream-task results for multiple models.
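To make the blend concrete, here is a minimal sketch, assuming (purely for illustration) that diversity is measured as distance to the nearest already-selected sample in embedding space and importance as a per-sample gradient norm; the balance weight lam mirrors the balance-control parameter λ shown in Figure 5, but the function names and the greedy loop are invented here, not taken from the paper.

```python
import numpy as np

def combined_scores(embeddings, grad_norms, selected_idx, lam=0.5):
    """Blend representation diversity with gradient importance.

    embeddings  : (N, d) hidden-state vectors for all candidate samples
    grad_norms  : (N,) per-sample gradient-magnitude importance
    selected_idx: indices already in the coreset
    lam         : balance between diversity (1.0) and importance (0.0)
    """
    # Diversity: distance to the nearest already-selected sample
    # (samples far from the current coreset add more coverage).
    if len(selected_idx) == 0:
        diversity = np.ones(len(embeddings))
    else:
        sel = embeddings[selected_idx]                          # (m, d)
        dists = np.linalg.norm(
            embeddings[:, None, :] - sel[None, :, :], axis=-1)  # (N, m)
        diversity = dists.min(axis=1)

    # Normalise both signals to [0, 1] before blending.
    def norm01(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    return lam * norm01(diversity) + (1.0 - lam) * norm01(grad_norms)

# Toy usage: pick a 10-sample coreset greedily from 100 candidates.
rng = np.random.default_rng(0)
emb, grads = rng.normal(size=(100, 16)), rng.random(100)
coreset = []
for _ in range(10):
    scores = combined_scores(emb, grads, coreset, lam=0.5)
    scores[coreset] = -np.inf          # never pick the same sample twice
    coreset.append(int(scores.argmax()))
print(coreset)
```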

Core claim

GRACE dynamically constructs and updates coresets by combining representation diversity with gradient-based importance metrics, ensuring both informativeness and efficiency. To mitigate the computational cost of frequent updates, GRACE leverages a k-NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics. Extensive experiments on three benchmarks demonstrate that GRACE significantly improves training efficiency and downstream performance across diverse LLMs and tasks.

What carries the argument

The k-NN graph-based propagation mechanism that selectively updates scores and embeddings to combine representation diversity with gradient-based importance metrics for dynamic coreset construction.
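A sketch of what k-NN score propagation can look like, under the assumption (ours, not the paper's) that fresh gradient scores are recomputed only for a small probe set and blended into the stale scores of each probe's graph neighbours; propagate_scores, alpha, and the probe/neighbour bookkeeping are hypothetical names, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def propagate_scores(embeddings, scores, probe_idx, fresh_scores, k=10, alpha=0.7):
    """Refresh importance scores without recomputing them for every sample.

    Only `probe_idx` receive freshly computed scores (`fresh_scores`);
    every other sample keeps its old score, blended with the mean fresh
    score of the probes among its k nearest neighbours.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(embeddings)
    _, neigh = nn.kneighbors(embeddings)            # (N, k) neighbour indices

    updated = scores.copy()
    updated[probe_idx] = fresh_scores

    probe_set = set(probe_idx.tolist())
    for i in range(len(embeddings)):
        if i in probe_set:
            continue
        probe_neigh = [j for j in neigh[i] if j in probe_set]
        if probe_neigh:
            # Pull the stale score toward the fresh signal seen nearby.
            updated[i] = alpha * scores[i] + (1 - alpha) * updated[probe_neigh].mean()
    return updated

# Toy usage: refresh 20 of 200 samples and propagate to the rest.
rng = np.random.default_rng(1)
emb = rng.normal(size=(200, 32))
old = rng.random(200)
probes = rng.choice(200, size=20, replace=False)
new = propagate_scores(emb, old, probes, rng.random(20))
```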

Load-bearing premise

That combining representation diversity with gradient-based importance via k-NN graph propagation accurately identifies informative data points that adapt to evolving training dynamics without bias or loss of critical examples.

What would settle it

A controlled experiment where an LLM trained on a GRACE-selected coreset shows lower downstream task performance than the same model trained on a random subset of identical size would disprove the claimed performance benefits.
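A toy version of that controlled comparison, with a scikit-learn classifier standing in for the LLM and greedy farthest-point sampling standing in for GRACE's scorer; only the experimental design (same model, same budget, scored subset versus random subset) comes from the text above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the controlled comparison: same model, same budget,
# only the subset-selection rule differs.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
budget = 100

def farthest_point_subset(X, budget, seed=0):
    """Greedy diversity-only selection (a crude proxy for a learned coreset)."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[chosen[0]], axis=1)
    while len(chosen) < budget:
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen)

def accuracy_on_subset(idx):
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return model.score(X_test, y_test)

rng = np.random.default_rng(0)
random_idx = rng.choice(len(X_train), size=budget, replace=False)
scored_idx = farthest_point_subset(X_train, budget)
print("random subset :", accuracy_on_subset(random_idx))
print("scored subset :", accuracy_on_subset(scored_idx))
```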

Figures

Figures reproduced from arXiv: 2604.11810 by Haoyang Li, Lei Chen, Tianhao Tang.

Figure 1. Overview of GRACE. Stage 1: Hidden states and importance scores of all training samples are extracted from a [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2. Overview of the graph update process. From left to right: we begin with a [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3. Time Comparison. As shown in [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4. Experiment results under different budgets on MathInstruct [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5. λ for Phi-2: accuracy (%) vs. balance control λ, in-domain and out-of-domain. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 7. δ for Phi-2: accuracy (%) vs. check threshold δ, in-domain and out-of-domain. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 9. tc/Te for Phi-2: accuracy (%) vs. sample fraction tc/Te, in-domain and out-of-domain. [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
read the original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their immense number of parameters and complex transformer-based architectures result in significant resource demands and computational complexity during training, making it challenging to optimize them efficiently on large datasets. To reduce training costs while preserving performance, researchers have investigated coreset selection techniques, which aim to identify small, representative subsets of the entire training dataset to accelerate LLM training. However, existing coreset selection methods fail to adapt to the dynamic nature of LLM training and often struggle with scalability for models of this size. To address these limitations, we propose a graph-guided adaptive and dynamic coreset selection framework for LLMs, namely GRACE. GRACE dynamically constructs and updates coresets by combining representation diversity with gradient-based importance metrics, ensuring both informativeness and efficiency. To mitigate the computational cost of frequent updates, GRACE leverages a $k$-NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics. Extensive experiments on three benchmarks demonstrate that GRACE significantly improves training efficiency and downstream performance across diverse LLMs and tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes GRACE, a graph-guided adaptive dynamic coreset selection framework for LLM training. It dynamically builds and refreshes coresets by fusing representation diversity with gradient-based importance scores, using k-NN graph propagation together with selective updates of embeddings and scores to keep computational cost manageable while adapting to evolving training dynamics. The authors state that experiments on three benchmarks show GRACE yields substantial gains in training efficiency and downstream task performance across multiple LLMs and tasks.

Significance. If the empirical claims are borne out by the full results, the work addresses a practically important bottleneck in LLM training by offering a scalable, dynamic alternative to static coreset methods. The combination of diversity and gradient signals via graph propagation is a reasonable engineering choice for balancing informativeness and cost. No machine-checked proofs, parameter-free derivations, or reproducible code artifacts are mentioned, so the contribution rests entirely on the experimental evidence.

major comments (1)
  1. [Method description of k-NN graph-based propagation and selective updates] The central claim that GRACE improves efficiency and performance rests on the k-NN graph propagation correctly identifying informative examples by fusing representation diversity with gradient importance while adapting to training dynamics. The method description states that scores and embeddings are only selectively updated to control cost; this implicitly assumes that local neighborhood structure remains sufficiently stable between updates. If embeddings or gradients change substantially (as is common in the first few epochs of LLM training), the propagated importance scores can become stale, causing the coreset to retain non-informative points or drop critical ones. No parameter-free derivation or worst-case bound is supplied to quantify how large a drift can be tolerated before selection quality degrades. (A simple neighborhood-drift check of the kind sketched after these comments would make this assumption directly measurable.)
minor comments (1)
  1. [Abstract] The claim of 'significant improvements' is stated without quantitative metrics, baseline comparisons, or error analysis; reporting these is standard practice for empirical papers in this area, so their absence makes immediate assessment difficult.
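The stability assumption flagged in the major comment can be probed empirically. Below is a minimal sketch that measures how much k-NN neighbourhoods drift between two embedding snapshots via mean Jaccard overlap; the function name and the perturbation-based toy check are ours, not the paper's.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_overlap(emb_before, emb_after, k=10):
    """Mean Jaccard overlap of each sample's k-NN set before/after an update.

    Values near 1.0 mean neighbourhoods are stable and propagated scores
    are unlikely to be stale; values near 0.0 signal heavy drift.
    """
    idx_a = NearestNeighbors(n_neighbors=k).fit(emb_before).kneighbors(emb_before)[1]
    idx_b = NearestNeighbors(n_neighbors=k).fit(emb_after).kneighbors(emb_after)[1]
    overlaps = []
    for a, b in zip(idx_a, idx_b):
        sa, sb = set(a.tolist()), set(b.tolist())
        overlaps.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(overlaps))

# Toy check: a small perturbation keeps neighbourhoods mostly intact,
# a large one (early-training-like drift) does not.
rng = np.random.default_rng(2)
emb = rng.normal(size=(500, 64))
print(knn_overlap(emb, emb + 0.01 * rng.normal(size=emb.shape)))  # high overlap
print(knn_overlap(emb, emb + 1.00 * rng.normal(size=emb.shape)))  # low overlap
```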

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comment below, providing clarification on the design choices in GRACE while remaining faithful to the manuscript's empirical focus.

read point-by-point responses
  1. Referee: The central claim that GRACE improves efficiency and performance rests on the k-NN graph propagation correctly identifying informative examples by fusing representation diversity with gradient importance while adapting to training dynamics. The method description states that scores and embeddings are only selectively updated to control cost; this implicitly assumes that local neighborhood structure remains sufficiently stable between updates. If embeddings or gradients change substantially (as is common in the first few epochs of LLM training), the propagated importance scores can become stale, causing the coreset to retain non-informative points or drop critical ones. No parameter-free derivation or worst-case bound is supplied to quantify how large a drift can be tolerated before selection quality degrades.

    Authors: We agree that the effectiveness of selective updates relies on the k-NN neighborhoods not drifting too rapidly. GRACE mitigates this through a gradient-magnitude-triggered refresh schedule that prioritizes updating points whose embeddings or importance scores have changed beyond a threshold, combined with periodic full-graph rebuilds at fixed intervals. This design is motivated by the observation that, after the initial epochs, representation shifts slow down in LLM training. While the manuscript does not include a parameter-free worst-case bound on tolerable drift (as the contribution is primarily algorithmic and empirical), the three-benchmark experiments demonstrate that the chosen update policy preserves downstream performance gains relative to static baselines and fully dynamic alternatives. We are prepared to expand the method section with additional ablation results on update frequency and to include a short discussion of observed neighborhood stability in the revised version. revision: partial
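A sketch of the refresh policy the rebuttal describes, hedged: the relative-change threshold (echoing the check threshold δ of Figure 7) and the periodic full rebuild are the only elements taken from the text; needs_refresh, the probe bookkeeping, and all default values are illustrative assumptions.

```python
import numpy as np

def needs_refresh(step, old_scores, new_probe_scores, probe_idx,
                  delta=0.1, full_rebuild_every=1000):
    """Decide which samples to re-embed/re-score at this step.

    Returns (full_rebuild, refresh_idx):
      * full_rebuild : rebuild the whole k-NN graph at fixed intervals
      * refresh_idx  : probes whose importance moved by more than `delta`
                       (relative change), which then seed propagation.
    """
    full_rebuild = (step % full_rebuild_every == 0)
    rel_change = np.abs(new_probe_scores - old_scores[probe_idx]) / (
        np.abs(old_scores[probe_idx]) + 1e-8)
    refresh_idx = probe_idx[rel_change > delta]
    return full_rebuild, refresh_idx

# Toy usage: of 10 probed samples, only those whose score drifted past
# the threshold are pushed back into the graph update.
rng = np.random.default_rng(3)
old = rng.random(100)
probes = np.arange(0, 100, 10)
fresh = old[probes] * rng.uniform(0.8, 1.4, size=len(probes))
print(needs_refresh(step=500, old_scores=old, new_probe_scores=fresh,
                    probe_idx=probes, delta=0.1))
```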

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper proposes an algorithmic framework (GRACE) for dynamic coreset selection using k-NN graph propagation on representations and gradients, with selective updates for efficiency. No equations, first-principles derivations, or predictions are presented that reduce to fitted parameters or self-referential definitions by construction. The central claims rest on empirical results across benchmarks rather than any load-bearing self-citation chain, ansatz smuggling, or renaming of known results as novel unification. The method description in the abstract and skeptic notes highlight practical assumptions about update stability, but these are not framed as mathematical derivations that collapse to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, background axioms, or invented entities beyond the GRACE framework itself are specified; the contribution is presented as an algorithmic combination rather than a derivation from first principles.

pith-pipeline@v0.9.0 · 5494 in / 1171 out tokens · 40357 ms · 2026-05-10T18:02:51.211155+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

90 extracted references · 56 canonical work pages · 7 internal anchors
