pith. machine review for the scientific record.

arxiv: 2604.11810 · v1 · submitted 2026-04-09 · 💻 cs.DB · cs.AI

Recognition: 2 theorem links · Lean Theorem

GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3

classification 💻 cs.DB cs.AI
keywords: coreset selection · large language models · dynamic selection · k-NN graph · gradient importance · training efficiency · representation diversity

The pith

GRACE dynamically selects and updates representative data subsets for large language models using a k-NN graph to combine diversity and gradient importance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Training large language models on full datasets requires enormous computational resources and time. Existing coreset selection techniques are static and cannot adjust as the model's focus shifts during training. GRACE builds and refreshes small representative subsets by blending representation diversity with gradient-based importance signals. A k-NN graph propagates these scores so that updates stay cheap, without recomputing everything from scratch. Experiments across three benchmarks show the approach shortens training while often improving downstream-task results for multiple models.
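To make the blend concrete, here is a minimal sketch, assuming (purely for illustration) that diversity is measured as distance to the nearest already-selected sample in embedding space and importance as a per-sample gradient norm; the balance weight lam mirrors the balance-control parameter λ shown in Figure 5, but the function names and the greedy loop are invented here, not taken from the paper.

```python
import numpy as np

def combined_scores(embeddings, grad_norms, selected_idx, lam=0.5):
    """Blend representation diversity with gradient importance.

    embeddings  : (N, d) hidden-state vectors for all candidate samples
    grad_norms  : (N,) per-sample gradient-magnitude importance
    selected_idx: indices already in the coreset
    lam         : balance between diversity (1.0) and importance (0.0)
    """
    # Diversity: distance to the nearest already-selected sample
    # (samples far from the current coreset add more coverage).
    if len(selected_idx) == 0:
        diversity = np.ones(len(embeddings))
    else:
        sel = embeddings[selected_idx]                          # (m, d)
        dists = np.linalg.norm(
            embeddings[:, None, :] - sel[None, :, :], axis=-1)  # (N, m)
        diversity = dists.min(axis=1)

    # Normalise both signals to [0, 1] before blending.
    def norm01(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    return lam * norm01(diversity) + (1.0 - lam) * norm01(grad_norms)

# Toy usage: pick a 10-sample coreset greedily from 100 candidates.
rng = np.random.default_rng(0)
emb, grads = rng.normal(size=(100, 16)), rng.random(100)
coreset = []
for _ in range(10):
    scores = combined_scores(emb, grads, coreset, lam=0.5)
    scores[coreset] = -np.inf          # never pick the same sample twice
    coreset.append(int(scores.argmax()))
print(coreset)
```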

Core claim

GRACE dynamically constructs and updates coresets by combining representation diversity with gradient-based importance metrics, ensuring both informativeness and efficiency. To mitigate the computational cost of frequent updates, GRACE leverages a k-NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics. Extensive experiments on three benchmarks demonstrate that GRACE significantly improves training efficiency and downstream performance across diverse LLMs and tasks.

What carries the argument

The k-NN graph-based propagation mechanism that selectively updates scores and embeddings to combine representation diversity with gradient-based importance metrics for dynamic coreset construction.
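A sketch of what k-NN score propagation can look like, under the assumption (ours, not the paper's) that fresh gradient scores are recomputed only for a small probe set and blended into the stale scores of each probe's graph neighbours; propagate_scores, alpha, and the probe/neighbour bookkeeping are hypothetical names, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def propagate_scores(embeddings, scores, probe_idx, fresh_scores, k=10, alpha=0.7):
    """Refresh importance scores without recomputing them for every sample.

    Only `probe_idx` receive freshly computed scores (`fresh_scores`);
    every other sample keeps its old score, blended with the mean fresh
    score of the probes among its k nearest neighbours.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(embeddings)
    _, neigh = nn.kneighbors(embeddings)            # (N, k) neighbour indices

    updated = scores.copy()
    updated[probe_idx] = fresh_scores

    probe_set = set(probe_idx.tolist())
    for i in range(len(embeddings)):
        if i in probe_set:
            continue
        probe_neigh = [j for j in neigh[i] if j in probe_set]
        if probe_neigh:
            # Pull the stale score toward the fresh signal seen nearby.
            updated[i] = alpha * scores[i] + (1 - alpha) * updated[probe_neigh].mean()
    return updated

# Toy usage: refresh 20 of 200 samples and propagate to the rest.
rng = np.random.default_rng(1)
emb = rng.normal(size=(200, 32))
old = rng.random(200)
probes = rng.choice(200, size=20, replace=False)
new = propagate_scores(emb, old, probes, rng.random(20))
```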

Load-bearing premise

That combining representation diversity with gradient-based importance via k-NN graph propagation accurately identifies informative data points that adapt to evolving training dynamics without bias or loss of critical examples.

What would settle it

A controlled experiment where an LLM trained on a GRACE-selected coreset shows lower downstream task performance than the same model trained on a random subset of identical size would disprove the claimed performance benefits.
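A toy version of that controlled comparison, with a scikit-learn classifier standing in for the LLM and greedy farthest-point sampling standing in for GRACE's scorer; only the experimental design (same model, same budget, scored subset versus random subset) comes from the text above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the controlled comparison: same model, same budget,
# only the subset-selection rule differs.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
budget = 100

def farthest_point_subset(X, budget, seed=0):
    """Greedy diversity-only selection (a crude proxy for a learned coreset)."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[chosen[0]], axis=1)
    while len(chosen) < budget:
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen)

def accuracy_on_subset(idx):
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return model.score(X_test, y_test)

rng = np.random.default_rng(0)
random_idx = rng.choice(len(X_train), size=budget, replace=False)
scored_idx = farthest_point_subset(X_train, budget)
print("random subset :", accuracy_on_subset(random_idx))
print("scored subset :", accuracy_on_subset(scored_idx))
```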

Figures

Figures reproduced from arXiv: 2604.11810 by Haoyang Li, Lei Chen, Tianhao Tang.

Figure 1. Overview of GRACE. Stage 1: Hidden states and importance scores of all training samples are extracted from a [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2. Overview of the graph update process. From left to right: we begin with a [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3. Time Comparison. As shown in [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4. Experiment results under different budgets on MathInstruct [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5. λ for Phi-2: accuracy (%) vs. balance control λ, in-domain and out-of-domain. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 7. δ for Phi-2: accuracy (%) vs. check threshold δ, in-domain and out-of-domain. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 9. tc/Te for Phi-2: accuracy (%) vs. sample fraction tc/Te, in-domain and out-of-domain. [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
read the original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their immense number of parameters and complex transformer-based architectures result in significant resource demands and computational complexity during training, making it challenging to optimize them efficiently on large datasets. To reduce training costs while preserving performance, researchers have investigated coreset selection techniques, which aim to identify small, representative subsets of the entire training dataset to accelerate LLM training. However, existing coreset selection methods fail to adapt to the dynamic nature of LLM training and often struggle with scalability for models of this size. To address these limitations, we propose a graph-guided adaptive and dynamic coreset selection framework for LLMs, namely GRACE. GRACE dynamically constructs and updates coresets by combining representation diversity with gradient-based importance metrics, ensuring both informativeness and efficiency. To mitigate the computational cost of frequent updates, GRACE leverages a $k$-NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics. Extensive experiments on three benchmarks demonstrate that GRACE significantly improves training efficiency and downstream performance across diverse LLMs and tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes GRACE, a graph-guided adaptive dynamic coreset selection framework for LLM training. It dynamically builds and refreshes coresets by fusing representation diversity with gradient-based importance scores, using k-NN graph propagation together with selective updates of embeddings and scores to keep computational cost manageable while adapting to evolving training dynamics. The authors state that experiments on three benchmarks show GRACE yields substantial gains in training efficiency and downstream task performance across multiple LLMs and tasks.

Significance. If the empirical claims are borne out by the full results, the work addresses a practically important bottleneck in LLM training by offering a scalable, dynamic alternative to static coreset methods. The combination of diversity and gradient signals via graph propagation is a reasonable engineering choice for balancing informativeness and cost. No machine-checked proofs, parameter-free derivations, or reproducible code artifacts are mentioned, so the contribution rests entirely on the experimental evidence.

major comments (1)
  1. [Method description of k-NN graph-based propagation and selective updates] The central claim that GRACE improves efficiency and performance rests on the k-NN graph propagation correctly identifying informative examples by fusing representation diversity with gradient importance while adapting to training dynamics. The method description states that scores and embeddings are only selectively updated to control cost; this implicitly assumes that local neighborhood structure remains sufficiently stable between updates. If embeddings or gradients change substantially (as is common in the first few epochs of LLM training), the propagated importance scores can become stale, causing the coreset to retain non-informative points or drop critical ones. No parameter-free derivation or worst-case bound is supplied to quantify how large a drift can be tolerated before selection quality degrades. (A simple neighborhood-drift check of the kind sketched after these comments would make this assumption directly measurable.)
minor comments (1)
  1. [Abstract] The claim of 'significant improvements' is stated without quantitative metrics, baseline comparisons, or error analysis; reporting these is standard practice for empirical papers in this area, so their absence makes immediate assessment difficult.
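The stability assumption flagged in the major comment can be probed empirically. Below is a minimal sketch that measures how much k-NN neighbourhoods drift between two embedding snapshots via mean Jaccard overlap; the function name and the perturbation-based toy check are ours, not the paper's.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_overlap(emb_before, emb_after, k=10):
    """Mean Jaccard overlap of each sample's k-NN set before/after an update.

    Values near 1.0 mean neighbourhoods are stable and propagated scores
    are unlikely to be stale; values near 0.0 signal heavy drift.
    """
    idx_a = NearestNeighbors(n_neighbors=k).fit(emb_before).kneighbors(emb_before)[1]
    idx_b = NearestNeighbors(n_neighbors=k).fit(emb_after).kneighbors(emb_after)[1]
    overlaps = []
    for a, b in zip(idx_a, idx_b):
        sa, sb = set(a.tolist()), set(b.tolist())
        overlaps.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(overlaps))

# Toy check: a small perturbation keeps neighbourhoods mostly intact,
# a large one (early-training-like drift) does not.
rng = np.random.default_rng(2)
emb = rng.normal(size=(500, 64))
print(knn_overlap(emb, emb + 0.01 * rng.normal(size=emb.shape)))  # high overlap
print(knn_overlap(emb, emb + 1.00 * rng.normal(size=emb.shape)))  # low overlap
```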

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comment below, providing clarification on the design choices in GRACE while remaining faithful to the manuscript's empirical focus.

read point-by-point responses
  1. Referee: The central claim that GRACE improves efficiency and performance rests on the k-NN graph propagation correctly identifying informative examples by fusing representation diversity with gradient importance while adapting to training dynamics. The method description states that scores and embeddings are only selectively updated to control cost; this implicitly assumes that local neighborhood structure remains sufficiently stable between updates. If embeddings or gradients change substantially (as is common in the first few epochs of LLM training), the propagated importance scores can become stale, causing the coreset to retain non-informative points or drop critical ones. No parameter-free derivation or worst-case bound is supplied to quantify how large a drift can be tolerated before selection quality degrades.

    Authors: We agree that the effectiveness of selective updates relies on the k-NN neighborhoods not drifting too rapidly. GRACE mitigates this through a gradient-magnitude-triggered refresh schedule that prioritizes updating points whose embeddings or importance scores have changed beyond a threshold, combined with periodic full-graph rebuilds at fixed intervals. This design is motivated by the observation that, after the initial epochs, representation shifts slow down in LLM training. While the manuscript does not include a parameter-free worst-case bound on tolerable drift (as the contribution is primarily algorithmic and empirical), the three-benchmark experiments demonstrate that the chosen update policy preserves downstream performance gains relative to static baselines and fully dynamic alternatives. We are prepared to expand the method section with additional ablation results on update frequency and to include a short discussion of observed neighborhood stability in the revised version. revision: partial
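A sketch of the refresh policy the rebuttal describes, hedged: the relative-change threshold (echoing the check threshold δ of Figure 7) and the periodic full rebuild are the only elements taken from the text; needs_refresh, the probe bookkeeping, and all default values are illustrative assumptions.

```python
import numpy as np

def needs_refresh(step, old_scores, new_probe_scores, probe_idx,
                  delta=0.1, full_rebuild_every=1000):
    """Decide which samples to re-embed/re-score at this step.

    Returns (full_rebuild, refresh_idx):
      * full_rebuild : rebuild the whole k-NN graph at fixed intervals
      * refresh_idx  : probes whose importance moved by more than `delta`
                       (relative change), which then seed propagation.
    """
    full_rebuild = (step % full_rebuild_every == 0)
    rel_change = np.abs(new_probe_scores - old_scores[probe_idx]) / (
        np.abs(old_scores[probe_idx]) + 1e-8)
    refresh_idx = probe_idx[rel_change > delta]
    return full_rebuild, refresh_idx

# Toy usage: of 10 probed samples, only those whose score drifted past
# the threshold are pushed back into the graph update.
rng = np.random.default_rng(3)
old = rng.random(100)
probes = np.arange(0, 100, 10)
fresh = old[probes] * rng.uniform(0.8, 1.4, size=len(probes))
print(needs_refresh(step=500, old_scores=old, new_probe_scores=fresh,
                    probe_idx=probes, delta=0.1))
```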

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper proposes an algorithmic framework (GRACE) for dynamic coreset selection using k-NN graph propagation on representations and gradients, with selective updates for efficiency. No equations, first-principles derivations, or predictions are presented that reduce to fitted parameters or self-referential definitions by construction. The central claims rest on empirical results across benchmarks rather than any load-bearing self-citation chain, ansatz smuggling, or renaming of known results as novel unification. The method description in the abstract and skeptic notes highlight practical assumptions about update stability, but these are not framed as mathematical derivations that collapse to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, background axioms, or invented entities beyond the GRACE framework itself are specified; the contribution is presented as an algorithmic combination rather than a derivation from first principles.

pith-pipeline@v0.9.0 · 5494 in / 1171 out tokens · 40357 ms · 2026-05-10T18:02:51.211155+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

90 extracted references · 56 canonical work pages · 7 internal anchors
