GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3
The pith
GRACE dynamically selects and updates representative data subsets for large language models using a k-NN graph to combine diversity and gradient importance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GRACE dynamically constructs and updates coresets by combining representation diversity with gradient-based importance metrics, ensuring both informativeness and efficiency. To mitigate the computational cost of frequent updates, GRACE leverages a k-NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics. Extensive experiments on three benchmarks demonstrate that GRACE significantly improves training efficiency and downstream performance across diverse LLMs and tasks.
What carries the argument
The k-NN graph-based propagation mechanism that selectively updates scores and embeddings to combine representation diversity with gradient-based importance metrics for dynamic coreset construction.
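The mechanism described above can be sketched in a few lines; this is a minimal illustration assuming cosine-similarity k-NN and gradient-norm importance scores, and all function names (`knn_graph`, `propagate_scores`, `select_coreset`) are hypothetical rather than taken from the paper.

```python
import numpy as np

def knn_graph(emb, k):
    """Brute-force k-NN over row-normalized embeddings (cosine similarity)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-edges
    return np.argsort(-sim, axis=1)[:, :k]  # neighbor indices, shape (n, k)

def propagate_scores(grad_norms, neighbors, alpha=0.5, iters=2):
    """Blend each point's gradient-importance score with its k-NN
    neighborhood mean, so scored points lend signal to their neighbors."""
    g = np.asarray(grad_norms, dtype=float)
    s = g.copy()
    for _ in range(iters):
        s = alpha * g + (1 - alpha) * s[neighbors].mean(axis=1)
    return s

def select_coreset(emb, grad_norms, k=5, budget=10):
    """Rank points by propagated score and keep the top `budget`."""
    nbrs = knn_graph(emb, k)
    scores = propagate_scores(grad_norms, nbrs)
    return np.argsort(-scores)[:budget]
```

The brute-force n×n similarity is only for illustration; at LLM-dataset scale an approximate index (e.g., the GPU similarity search of reference [36]) would stand in for `knn_graph`.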
Load-bearing premise
That combining representation diversity with gradient-based importance via k-NN graph propagation accurately identifies informative data points that adapt to evolving training dynamics without bias or loss of critical examples.
What would settle it
A controlled experiment in which an LLM trained on a GRACE-selected coreset performs worse on downstream tasks than the same model trained on a random subset of identical size would disprove the claimed performance benefits.
Original abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their immense number of parameters and complex transformer-based architectures result in significant resource demands and computational complexity during training, making it challenging to optimize them efficiently on large datasets. To reduce training costs while preserving performance, researchers have investigated coreset selection techniques, which aim to identify small, representative subsets of the entire training dataset to accelerate LLM training. However, existing coreset selection methods fail to adapt to the dynamic nature of LLM training and often struggle with scalability for models of this size. To address these limitations, we propose a graph-guided adaptive and dynamic coreset selection framework for LLMs, namely GRACE. GRACE dynamically constructs and updates coresets by combining representation diversity with gradient-based importance metrics, ensuring both informativeness and efficiency. To mitigate the computational cost of frequent updates, GRACE leverages a $k$-NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics. Extensive experiments on three benchmarks demonstrate that GRACE significantly improves training efficiency and downstream performance across diverse LLMs and tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GRACE, a graph-guided adaptive dynamic coreset selection framework for LLM training. It dynamically builds and refreshes coresets by fusing representation diversity with gradient-based importance scores, using k-NN graph propagation together with selective updates of embeddings and scores to keep computational cost manageable while adapting to evolving training dynamics. The authors state that experiments on three benchmarks show GRACE yields substantial gains in training efficiency and downstream task performance across multiple LLMs and tasks.
Significance. If the empirical claims are borne out by the full results, the work addresses a practically important bottleneck in LLM training by offering a scalable, dynamic alternative to static coreset methods. The combination of diversity and gradient signals via graph propagation is a reasonable engineering choice for balancing informativeness and cost. No machine-checked proofs, parameter-free derivations, or reproducible code artifacts are mentioned, so the contribution rests entirely on the experimental evidence.
major comments (1)
- [Method description of k-NN graph-based propagation and selective updates] The central claim that GRACE improves efficiency and performance rests on the k-NN graph propagation correctly identifying informative examples by fusing representation diversity with gradient importance while adapting to training dynamics. The method description states that scores and embeddings are only selectively updated to control cost; this implicitly assumes that local neighborhood structure remains sufficiently stable between updates. If embeddings or gradients change substantially (as is common in the first few epochs of LLM training), the propagated importance scores can become stale, causing the coreset to retain non-informative points or drop critical ones. No parameter-free derivation or worst-case bound is supplied to quantify how large a drift can be tolerated before selection quality degrades.
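One way to probe the stability assumption flagged above is to measure how much each point's k-NN neighborhood overlaps between two embedding snapshots; this is a sketch of our own diagnostic (the Jaccard-overlap metric is our choice, not the paper's procedure).

```python
import numpy as np

def knn_sets(emb, k):
    """k-NN neighbor sets under cosine similarity (brute force)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)
    idx = np.argsort(-sim, axis=1)[:, :k]
    return [set(row.tolist()) for row in idx]

def neighborhood_stability(emb_old, emb_new, k=10):
    """Mean Jaccard overlap of each point's k-NN set across two embedding
    snapshots; 1.0 means perfectly stable, 0.0 means fully drifted."""
    old, new = knn_sets(emb_old, k), knn_sets(emb_new, k)
    return float(np.mean([len(a & b) / len(a | b) for a, b in zip(old, new)]))
```

If this statistic drops sharply between selective updates (as the comment predicts for early epochs), the propagated scores are likely stale and a full-graph rebuild is warranted.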
minor comments (1)
- [Abstract] The claim of 'significant improvements' is stated without any quantitative metrics, baseline comparisons, or error analysis; supplying these is standard practice for empirical papers in this area, and their absence makes immediate assessment difficult.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the major comment below, providing clarification on the design choices in GRACE while remaining faithful to the manuscript's empirical focus.
Point-by-point responses
-
Referee: The central claim that GRACE improves efficiency and performance rests on the k-NN graph propagation correctly identifying informative examples by fusing representation diversity with gradient importance while adapting to training dynamics. The method description states that scores and embeddings are only selectively updated to control cost; this implicitly assumes that local neighborhood structure remains sufficiently stable between updates. If embeddings or gradients change substantially (as is common in the first few epochs of LLM training), the propagated importance scores can become stale, causing the coreset to retain non-informative points or drop critical ones. No parameter-free derivation or worst-case bound is supplied to quantify how large a drift can be tolerated before selection quality degrades.
Authors: We agree that the effectiveness of selective updates relies on the k-NN neighborhoods not drifting too rapidly. GRACE mitigates this through a gradient-magnitude-triggered refresh schedule that prioritizes updating points whose embeddings or importance scores have changed beyond a threshold, combined with periodic full-graph rebuilds at fixed intervals. This design is motivated by the observation that, after the initial epochs, representation shifts slow down in LLM training. While the manuscript does not include a parameter-free worst-case bound on tolerable drift (as the contribution is primarily algorithmic and empirical), the three-benchmark experiments demonstrate that the chosen update policy preserves downstream performance gains relative to static baselines and fully dynamic alternatives. We are prepared to expand the method section with additional ablation results on update frequency and to include a short discussion of observed neighborhood stability in the revised version. revision: partial
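The refresh policy the rebuttal describes (threshold-triggered selective updates plus periodic full rebuilds) might look roughly like the following; `threshold` and `full_every` are hypothetical knobs, not values from the manuscript.

```python
def refresh_plan(step, score_delta, threshold=0.1, full_every=500):
    """Decide which points to re-embed/re-score at this training step:
    a full rebuild on a fixed schedule, otherwise only the points whose
    importance score moved more than `threshold` since their last refresh."""
    if step % full_every == 0:
        return list(range(len(score_delta)))  # periodic full-graph rebuild
    return [i for i, d in enumerate(score_delta) if abs(d) > threshold]
```

Under this policy, the ablation the authors promise would amount to sweeping `threshold` and `full_every` and reporting downstream accuracy against update cost.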
Circularity Check
No circularity detected in derivation or claims
full rationale
The paper proposes an algorithmic framework (GRACE) for dynamic coreset selection using k-NN graph propagation on representations and gradients, with selective updates for efficiency. No equations, first-principles derivations, or predictions are presented that reduce to fitted parameters or self-referential definitions by construction. The central claims rest on empirical results across benchmarks rather than any load-bearing self-citation chain, ansatz smuggling, or renaming of known results as novel unification. The method description in the abstract and skeptic notes highlight practical assumptions about update stability, but these are not framed as mathematical derivations that collapse to inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
GRACE leverages a k-NN graph-based propagation mechanism and selectively updates scores and embeddings, adapting to evolving training dynamics.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and recovery · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
We propose a greedy algorithm... achieves a (1 − 1/e) approximation ratio
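The quoted (1 − 1/e) guarantee is the classic bound for greedily maximizing a monotone submodular objective under a cardinality budget. A minimal max-coverage instantiation is sketched below; this is illustrative only, and the paper's actual objective may differ.

```python
def greedy_max_coverage(sets, budget):
    """Standard greedy for monotone submodular maximization, shown here
    on max coverage: repeatedly add the set with the largest marginal
    gain. Guarantees a (1 - 1/e) approximation of the optimal coverage."""
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(range(len(sets)),
                   key=lambda i: len(sets[i] - covered) if i not in chosen else -1)
        if not sets[best] - covered:
            break                 # no remaining marginal gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

Coreset objectives that mix coverage-style diversity with importance weights typically remain monotone submodular, which is what makes this greedy bound applicable.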
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Abhinab Acharya, Dayou Yu, Qi Yu, and Xumin Liu. 2024. Balancing Feature Similarity and Label Variability for Optimal Size-Aware One-shot Subset Selection. In Forty-First International Conference on Machine Learning
2024
-
[2]
Meta AI. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL] https://arxiv.org/abs/2307.09288
-
[3]
Meta AI. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783
-
[4]
Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang
-
[5]
A Survey on Data Selection for Language Models. Transactions on Machine Learning Research (2024). https://openreview.net/forum?id=XfHWcNTSHp. Survey Certification.
2024
-
[6]
Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, and Mansheej Paul. 2024. Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models. doi:10.48550/ARXIV.2405.20541
-
[7]
Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Proc. VLDB Endow. 17, 2 (Oct. 2023), 92–105. doi:10.14778/3626292.3626294
-
[8]
Tianyi Bai, Ling Yang, Zhen Hao Wong, Jiahui Peng, Xinlin Zhuang, Chi Zhang, Lijun Wu, Jiantao Qiu, Wentao Zhang, Binhang Yuan, and Conghui He. 2024. Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining. arXiv:2410.08102 [cs] doi:10.48550/arXiv.2410.08102
-
[9]
Gantavya Bhatt, Yifang Chen, Arnav M. Das, Jifan Zhang, Sang T. Truong, Stephen Mussmann, Yinglun Zhu, Jeffrey Bilmes, Simon S. Du, Kevin Jamieson, Jordan T. Ash, and Robert D. Nowak. 2024. An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models. arXiv:2401.06692 [cs] doi:10.48550/arXiv.2401.06692
-
[10]
Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural networks: tricks of the trade: second edition. Springer, 421–436
2012
-
[11]
Valérie Castin, Pierre Ablin, and Gabriel Peyré. 2024. How Smooth Is Attention?. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=aP0H8A1ywk
2024
-
[12]
Chengliang Chai, Jiabin Liu, Nan Tang, Ju Fan, Dongjing Miao, Jiayi Wang, Yuyu Luo, and Guoliang Li. 2023. GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data. Proc. ACM Manag. Data 1, 2 (June 2023), 157:1–157:27. doi:10.1145/3589302
-
[13]
Chengliang Chai, Jiayi Wang, Nan Tang, Ye Yuan, Jiabin Liu, Yuhao Deng, and Guoren Wang. 2023. Efficient Coreset Selection with Cluster-based Methods. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23). Association for Computing Machinery, New York, NY, USA, 167–178. doi:10.1145/3580305.3599326
-
[14]
Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, and Junbo Zhao. 2023. Maybe Only 0.5% Data Is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning. arXiv:2305.09246 [cs] doi:10.48550/arXiv.2305.09246
-
[15]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
-
[16]
Yulong Chen, Yang Liu, Liang Chen, and Yue Zhang. 2021. DialogSum: A Real-Life Scenario Dialogue Summarization Dataset. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 5062–5074. doi:10.18653/v1/2021.findings-acl.449
-
[17]
Yeseul Cho, Baekrok Shin, Changmin Kang, and Chulhee Yun. 2025. Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty. arXiv:2502.06905 [cs] doi:10.48550/arXiv.2502.06905
-
[19]
DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL]
-
[20]
Alexander Vladimirovich Demidovskij, Aleksei Trutnev, Artem Tugarev, Igor Salnikov, and Stanislav Pavlov. 2023. DAREL: Data Reduction with Losses for Training Acceleration of Real and Hypercomplex Neural Networks. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023)
2023
-
[21]
Zhiwei Deng, Tao Li, and Yang Li. 2024. Influential Language Data Selection via Gradient Trajectory Pursuit. arXiv:2410.16710 doi:10.48550/arXiv.2410.16710
-
[22]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems 36 (2023), 10088–10115
2023
-
[23]
Ju Fan, Zihui Gu, Songyue Zhang, Yuxin Zhang, Zui Chen, Lei Cao, Guoliang Li, Samuel Madden, Xiaoyong Du, and Nan Tang. 2024. Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL. Proc. VLDB Endow. 17, 11 (July 2024), 2750–2763. doi:10.14778/3681954.3681960
-
[25]
Victor Giannakouris and Immanuel Trummer. 2025. 𝜆-Tune: Harnessing Large Language Models for Automated Database System Tuning. Proc. ACM Manag. Data 3, 1, Article 2 (Feb. 2025), 26 pages. doi:10.1145/3709652
-
[26]
Aviv Hadar, Tova Milo, and Kathy Razmadze. 2024. Datamap-Driven Tabular Coreset Selection for Classifier Training. Proc. VLDB Endow. 18, 3 (Nov. 2024), 876–888. doi:10.14778/3712221.3712249
-
[27]
Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. Towards a Unified View of Parameter-Efficient Transfer Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=0RDcd5Axok
2022
-
[29]
Dorit S Hochbaum. 1997. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. Approximation algorithms for NP-hard problems (1997), 94–143
1997
-
[30]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems. 30016–30030
2022
-
[31]
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International conference on machine learning. PMLR, 2790–2799
2019
-
[32]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. Lora: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3
2022
-
[33]
Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (Dallas, Texas, USA) (STOC ’98). Association for Computing Machinery, New York, NY, USA, 604–613. doi:10.1145/276698.276876
-
[34]
Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, and Sanmi Koyejo. 2024. Scaling Laws for Downstream Task Performance of Large Language Models. doi:10.48550/ARXIV.2402.04177
-
[35]
Ayrton San Joaquin, Bin Wang, Zhengyuan Liu, Nicholas Asher, Brian Lim, Philippe Muller, and Nancy Chen. 2024. In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models. arXiv:2408.03560 [cs, stat] doi:10.48550/arXiv.2408.03560
-
[36]
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547
2019
-
[38]
Krishnateja Killamsetty, Durga S, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. 2021. GRAD-MATCH: Gradient Matching Based Data Subset Selection for Efficient Deep Model Training. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 5464–5474
2021
-
[39]
Teddy Lazebnik, Amit Somech, and Abraham Itzhak Weinberg. 2022. SubStrat: A Subset-Based Optimization Strategy for Faster AutoML. Proc. VLDB Endow. 16, 4 (Dec. 2022), 772–780. doi:10.14778/3574245.3574261
-
[40]
Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The Dawn of Natural Language to SQL: Are We Fully Ready? Proc. VLDB Endow. 17, 11 (July 2024), 3318–3331. doi:10.14778/3681954.3682003
-
[41]
Haoyang Li, Shimin Di, Lei Chen, and Xiaofang Zhou. 2024. E2GCL: Efficient and Expressive Contrastive Learning on Graph Neural Networks. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 859–873
2024
-
[42]
Haoyang Li, Shimin Di, Calvin Hong Yi Li, Lei Chen, and Xiaofang Zhou. 2024. Fight Fire with Fire: Towards Robust Graph Neural Networks on Dynamic Graphs via Actively Defense. Proceedings of the VLDB Endowment 17, 8 (2024), 2050– 2063
2024
-
[44]
Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2024. From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human La...
-
[45]
Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)
-
[46]
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. Textbooks Are All You Need II: phi-1.5 technical report. arXiv:2309.05463 [cs.CL]
-
[47]
Yiming Li, Yanyan Shen, and Lei Chen. 2022. Camel: Managing Data for Efficient Stream Learning. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA, 1271–1285. doi:10.1145/3514221.3517836
-
[48]
Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, and Lidong Bing
-
[49]
LLM-R2: A Large Language Model Enhanced Rule-Based Rewrite System for Boosting Query Efficiency. Proc. VLDB Endow. 18, 1 (Sept. 2024), 53–65. doi:10.14778/3696435.3696440
-
[50]
Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, and Yufa Zhou. 2024. Multi-Layer Transformers Gradient Can Be Approximated in Almost Linear Time. arXiv:2408.13233 [cs] doi:10.48550/arXiv.2408.13233
-
[51]
Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013/
2004
-
[52]
Hanmo Liu, Shimin Di, Haoyang Li, Shuangyin Li, Lei Chen, and Xiaofang Zhou
-
[53]
Task-oriented gnns training on large knowledge graphs for accurate and efficient modeling.
Effective Data Selection and Replay for Unsupervised Continual Learning. In 40th IEEE International Conference on Data Engineering, ICDE 2024, Utrecht, The Netherlands, May 13-16, 2024. IEEE, 1449–1463. doi:10.1109/ICDE60146.2024.00119
-
[54]
Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE transactions on information theory 28, 2 (1982), 129–137
1982
-
[55]
Yuze Lou, Chuan Lei, Xiao Qin, Zichen Wang, Christos Faloutsos, Rishita Anubhai, and Huzefa Rangwala. 2024. DATALORE: Can a Large Language Model Find All Lost Scrolls in a Data Repository?. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). 5170–5176. doi:10.1109/ICDE60146.2024.00388
-
[56]
Lei Ma, Lei Cao, Peter M. VanNostrand, Dennis M. Hofmann, Yao Su, and Elke A. Rundensteiner. 2024. Pluto: Sample Selection for Robust Anomaly Detection on Polluted Log Data. Proc. ACM Manag. Data 2, 4, Article 203 (Sept. 2024), 25 pages. doi:10.1145/3677139
-
[57]
Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker. 2023. When Less Is More: Investigating Data Pruning for Pretraining LLMs at Scale. arXiv:2309.04564 [cs] doi:10.48550/arXiv.2309.04564
-
[58]
Dheeraj Mekala, Alex Nguyen, and Jingbo Shang. 2024. Smaller Language Models Are Capable of Selecting Instruction-Tuning Training Data for Larger Language Models. doi:10.48550/ARXIV.2402.10430
-
[60]
Avanika Narayan, Ines Chami, Laurel Orr, and Christopher Ré. 2022. Can Foundation Models Wrangle Your Data? Proc. VLDB Endow. 16, 4 (Dec. 2022), 738–746. doi:10.14778/3574245.3574258
-
[61]
Dang Nguyen, Wenhan Yang, Rathul Anand, Yu Yang, and Baharan Mirzasoleiman. 2024. Memory-Efficient Training of LLMs with Larger Mini-batches. arXiv:2407.19580 [cs] doi:10.48550/arXiv.2407.19580
-
[62]
OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
-
[63]
Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. 2021. Deep Learning on a Data Diet: Finding Important Examples Early in Training. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 20596–20607
2021
-
[64]
Tonghui Ren, Yuankai Fan, Zhenying He, Ren Huang, Jiaqi Dai, Can Huang, Yinan Jing, Kai Zhang, Yifan Yang, and X Sean Wang. 2024. Purple: Making a large language model a better sql writer. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 15–28
2024
-
[65]
Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. 2022. Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning. Advances in Neural Information Processing Systems 35 (Dec. 2022), 19523–19536
2022
-
[66]
Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm.github.io/blog/qwen2.5/
2024
-
[68]
Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari Morcos. 2023. D4: Improving LLM Pretraining via Document De-Duplication and Diversification. Advances in Neural Information Processing Systems 36 (Dec. 2023), 53983–53995
2023
-
[69]
Hieu Tran, Zhichao Yang, Zonghai Yao, and Hong Yu. 2024. BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing. Journal of the American Medical Informatics Association 31, 9 (June 2024), 1821–1832. arXiv:https://academic.oup.com/jamia/article-pdf/31/9/1821/58868340/ocae122.pdf doi:10.1093/jamia/ocae122
-
[70]
A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems (2017)
2017
-
[71]
Jiayi Wang, Chengliang Chai, Nan Tang, Jiabin Liu, and Guoliang Li. 2022. Core- sets over multiple tables for feature-rich and data-efficient machine learning. Proc. VLDB Endow. 16, 1 (Sept. 2022), 64–76. doi:10.14778/3561261.3561267
-
[72]
Shaobo Wang, Xiangqi Jin, Ziming Wang, Jize Wang, Jiajun Zhang, Kaixin Li, Zichen Wen, Zhong Li, Conghui He, Xuming Hu, and Linfeng Zhang. 2025. Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning. arXiv:2505.12212 [cs.CL] https://arxiv.org/abs/2505.12212
-
[73]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837
2022
-
[74]
Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. 2024. LESS: Selecting Influential Data for Targeted Instruction Tuning. In Forty-First International Conference on Machine Learning
2024
-
[75]
Xinyang Zhao, Xuanhe Zhou, and Guoliang Li. 2024. Chat2Data: An Interactive Data Analysis System with RAG, Vector Databases and LLMs
2024
-
[76]
Yu Yang, Hao Kang, and Baharan Mirzasoleiman. 2023. Towards Sustainable Learning: Coresets for Data-efficient Deep Learning. In Proceedings of the 40th International Conference on Machine Learning. PMLR, 39314–39330. https://proceedings.mlr.press/v202/yang23g.html
2023
-
[79]
Simon Yu, Liangyu Chen, Sara Ahmadian, and Marzieh Fadaee. 2024. Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement. arXiv:2409.11378 [cs] doi:10.48550/arXiv.2409.11378