Online Data Selection for Instruction Tuning via Gaussian Processes

Julien Monteil; Jun Wang; Quoc Phong Nguyen; Vu Nguyen

arxiv: 2606.30077 · v1 · pith:4YKN2PYMnew · submitted 2026-06-29 · 💻 cs.LG · cs.AI

Online Data Selection for Instruction Tuning via Gaussian Processes

Jun Wang , Quoc Phong Nguyen , Julien Monteil , Vu Nguyen This is my paper

Pith reviewed 2026-06-30 06:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords data selectioninstruction tuningGaussian processesonline learningLLM fine-tuningdynamic regrethedge algorithm

0 comments

The pith

GAIA uses global Gaussian Process estimation to select high-utility data during online LLM instruction tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to move online data selection for LLM instruction tuning beyond methods that only look inside random batches. It introduces GAIA, which treats data utility as a continuous function over the full semantic space and estimates that function with Gaussian Process regression. An adaptive fusion step then prioritizes samples according to the current estimate. The update rule is rewritten as an instance of the fixed-share Hedge algorithm, which supplies a dynamic-regret bound that holds even when data quality changes during training. A sympathetic reader would care because modern LLM work has shifted emphasis from data volume to data quality, so a method that reliably finds the better samples could reduce compute while raising final performance.

Core claim

GAIA formulates data valuation as a global estimation process that employs Gaussian Process regression to model continuous utility manifolds across semantic space. It employs an adaptive strategy fusion mechanism to dynamically prioritize high-utility samples. By casting the strategy-posterior update as an instance of the classical fixed-share Hedge framework, the method inherits a dynamic-regret guarantee that characterizes robustness under non-stationary quality scores, and empirical results on three datasets show significant outperformance over baselines such as GREATs.

What carries the argument

Gaussian Process regression that models the continuous utility manifold in semantic space, updated through an adaptive fusion rule derived from the fixed-share Hedge algorithm.

If this is right

GAIA significantly outperforms state-of-the-art batch-constrained baselines on three instruction-tuning datasets.
The method remains robust when quality scores change during the course of training.
It supplies a scalable procedure for efficient instruction tuning that still carries a dynamic-regret bound.
The regret analysis directly quantifies how well the selector tracks the best data under non-stationary conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same global-manifold view could be tested on other sequential selection problems where the value of each item drifts over time.
If the manifold assumption weakens at very large model scales, replacing the Gaussian Process with a neural surrogate would be a direct next experiment.
The approach suggests a general template for turning any online expert-tracking algorithm into a data-selection procedure by reinterpreting posterior weights as utility estimates.

Load-bearing premise

Data utility forms a continuous manifold in semantic space that Gaussian Process regression can reliably estimate from the observed samples during training.

What would settle it

On a new dataset the data points selected by GAIA produce no measurable gain in final model accuracy or loss compared with random or batch-constrained selection, or the Gaussian Process predictions show no correlation with the actual performance contribution of each sample.

Figures

Figures reproduced from arXiv: 2606.30077 by Julien Monteil, Jun Wang, Quoc Phong Nguyen, Vu Nguyen.

**Figure 1.** Figure 1: Workflow. Our framework operates in two distinct phases to decouple valuation from training: (1) Warm-up & GP Training: A lightweight Embeding Model maps the raw data pool into a semantic embedding space. Concurrently, a warm-up training phase on a small dataset set collects initial utility scores. These scores and embeddings are used to fit the Gaussian Process (GP) model, establishing an initial global u… view at source ↗

**Figure 2.** Figure 2: Training dynamics on MMLU-Sociology (top row) and SAMSUM (bottom row). We benchmark GAIA-GREATS against full-data training (Regular) and various selection baselines. The left and middle columns report validation and test perplexity (lower is better), while the right column presents downstream task performance (Accuracy for MMLU, ROUGE for SAMSUM). GAIAGREATS demonstrates superior convergence speed and rob… view at source ↗

**Figure 3.** Figure 3: Performance comparison of four different scoring functions on [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Cumulative data value scores (LLAMA-2-7B, MMLU-Sociology). As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Impact of strategy number on test perplexity for [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Training dynamics on MMLU-Sociology using the LLAMA-2-7B backbone, comparing downstream performance when using alternative surrogate models for data sampling [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: illustrates the convergence and downstream performance on the TYDIQA dataset. Although GP-GREATS exhibits a slightly more gradual decline in evaluation and test perplexity during the initial training phase, it eventually achieves a superior convergence floor compared to GREATS. Specifically, the final perplexity levels (both eval and test) of our method are marginally lower than those of the strongest base… view at source ↗

**Figure 8.** Figure 8: Training dynamics on MMLU-Sociology and SAMSUM using the Qwen3-4B backbone. We compare GAIA-GREATSagainst Regular training and the GREATS baseline to evaluate crossarchitecture generalization. over both Regular training and the strong GREATS baseline. On the MMLU-Sociology task, GAIA achieves a noticeably faster reduction in both evaluation and test perplexity, which directly translates to a higher and mo… view at source ↗

**Figure 9.** Figure 9: Performance comparison on additional MMLU subjects (Abstract Algebra, Anatomy, US Foreign Policy, Astronomy). We evaluate GAIA-GREATS against Regular training and the GREATS baseline, reporting both test perplexity (PPL) and accuracy. Notably, our method achieves a significant and consistent reduction in PPL across all tasks. While the accuracy metrics exhibit some variance, typical for the inherent stocha… view at source ↗

**Figure 10.** Figure 10: Visualization of sampling dynamics on MMLU-Sociology over the course of training. The scatter plot maps sampled data indices (y-axis) against training steps (x-axis), with colors representing the distinct strategies from which samples were drawn. The distinct clustering observed in the initial phase demonstrates that our method prioritizes high-value data first. As training progresses, the selection becom… view at source ↗

**Figure 11.** Figure 11: Impact of sampling temperature on test perplexity for [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 13.** Figure 13: Impact of warm-up dataset size on test perplexity ( [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗

read the original abstract

With Large Language Model (LLM) pre-training and fine-tuning shifting its focus from data volume to data quality, quality data selection has emerged as a critical research topic. Existing online data selection methods for LLM training are typically "batch-constrained", limiting optimization to local utility within random batches. To overcome this, we propose GAIA (Global Adaptive Instruction tuning via GAussian processes), a framework that formulates data valuation as a global estimation process. GAIA employs Gaussian Process regression to model continuous utility manifolds across the semantic space, utilizing an adaptive strategy fusion mechanism to dynamically prioritize high-utility samples. By casting the strategy-posterior update as an instance of the classical fixed-share Hedge framework for tracking the best expert, we inherit a dynamic-regret guarantee that characterizes GAIA's robustness under non-stationary quality scores during training. Empirical evaluations on three datasets demonstrate that GAIA significantly outperforms state-of-the-art baselines like \greats, establishing our method as a scalable and robust solution for efficient instruction tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GAIA's GP-plus-Hedge framing for global online data selection is a clean reduction but rests on an untested smoothness assumption that may not hold for sparse samples.

read the letter

The main takeaway is that this paper recasts online instruction data selection as global Gaussian process regression over a semantic utility manifold, then reduces the adaptive fusion step to fixed-share Hedge so it can inherit a dynamic regret bound. That specific combination is not in the baselines it cites.

The reduction itself is the part that works cleanly: once the posterior update is written as an instance of the classical Hedge expert problem, the regret guarantee follows without new analysis. Treating selection as global rather than batch-local also directly targets the limitation the authors identify in prior online methods.

The soft spot is exactly the one the stress-test flags. The method needs the utility surface to be continuous and smooth enough for GP extrapolation from the limited points observed during training. If the kernel or the observed samples do not support that, the estimates revert to something no better than local heuristics and the regret bound applies to a noisy expert. The abstract supplies no derivation details, kernel ablations, or analysis of observation density, so it is impossible to judge how often the assumption holds on real instruction data. The reported gains over GREATs on three datasets are therefore hard to attribute specifically to the global formulation.

This is for researchers working on data-efficient LLM fine-tuning who care about online methods with some theoretical grounding. A reader already familiar with GP regression or Hedge reductions will see the value in the framing. It deserves peer review because the core reduction is coherent and the problem is practical, even though the empirical and modeling details will need expansion.

Referee Report

1 major / 0 minor

Summary. The paper proposes GAIA for online data selection in LLM instruction tuning. It models data utility as a continuous manifold in semantic space via Gaussian Process regression for global estimation, uses an adaptive strategy fusion mechanism to prioritize samples, and reduces the strategy-posterior update to the classical fixed-share Hedge framework to inherit a dynamic-regret guarantee under non-stationary quality scores. Empirical results on three datasets claim significant outperformance over baselines including GREATs.

Significance. If the GP regression reliably produces global utility estimates from the limited online observations and the reduction to fixed-share Hedge is exact (without hidden bias from the GP posterior), the work would supply a theoretically grounded global alternative to batch-constrained selection methods together with a dynamic-regret bound. The explicit inheritance of an existing regret guarantee is a positive feature when the mapping is shown to preserve the required conditions.

major comments (1)

[Abstract] Abstract: the central claim that Gaussian Process regression yields reliable global utility estimates across semantic space (used both for prioritization and to justify the exact fixed-share Hedge reduction) rests on the unexamined assumption that the utility surface is sufficiently smooth and that sparse online samples suffice for accurate extrapolation. No derivation, kernel specification, or error analysis is supplied to support this step; if the GP estimates are no better than local heuristics, the dynamic-regret bound applies only to a noisy expert and the reported gains over GREATs cannot be attributed to the global formulation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the Gaussian Process assumptions. We address the concern point-by-point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that Gaussian Process regression yields reliable global utility estimates across semantic space (used both for prioritization and to justify the exact fixed-share Hedge reduction) rests on the unexamined assumption that the utility surface is sufficiently smooth and that sparse online samples suffice for accurate extrapolation. No derivation, kernel specification, or error analysis is supplied to support this step; if the GP estimates are no better than local heuristics, the dynamic-regret bound applies only to a noisy expert and the reported gains over GREATs cannot be attributed to the global formulation.

Authors: We agree the abstract is concise and omits explicit kernel details and error bounds. The method section specifies a Gaussian Process with RBF kernel over semantic embeddings, where the posterior mean provides the global utility estimate used for prioritization and strategy fusion. The fixed-share Hedge reduction is exact on the sequence of fused posteriors and the dynamic-regret guarantee holds irrespective of GP accuracy (it bounds regret against the best dynamic expert sequence). Empirical gains over GREATs are shown via ablations that isolate the global component. In revision we will add a subsection with kernel specification, the standard GP smoothness assumption, and a reference to posterior convergence rates under accumulating observations to justify extrapolation from sparse samples. revision: yes

Circularity Check

0 steps flagged

No circularity: external inheritance of regret bound and independent GP modeling

full rationale

The paper's derivation casts the strategy-posterior update explicitly as an instance of the classical fixed-share Hedge framework (an external result) to inherit its dynamic-regret guarantee, while using Gaussian Process regression to model utility as a continuous manifold. Neither step reduces a claimed prediction or result to its own inputs by definition, nor relies on load-bearing self-citation, ansatz smuggling, or renaming of known results. The outperformance claim is presented as an empirical outcome rather than a mathematical necessity forced by the method's construction. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are extractable beyond the high-level modeling choice of continuous utility manifold.

pith-pipeline@v0.9.1-grok · 5702 in / 928 out tokens · 25355 ms · 2026-06-30T06:58:34.948127+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 10 canonical work pages · 8 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 9

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

arXiv preprint arXiv:2504.00993 (2025)

Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, et al. Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993, 2025

work page arXiv 2025
[4]

Should chatgpt be biased? challenges and risks of bias in large language models

Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. Challenges and Risks of Bias in Large Language Models (October 26, 2023), 2023

2023
[5]

Data shapley: Equitable valuation of data for machine learning

Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. InICML, 2019

2019
[6]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InICML, 2017

2017
[7]

Greats: Online selection of high-quality data for llm training in every iteration.Advances in Neural Information Processing Systems, 37:131197–131223, 2024

Jiachen T Wang, Tong Wu, Dawn Song, Prateek Mittal, and Ruoxi Jia. Greats: Online selection of high-quality data for llm training in every iteration.Advances in Neural Information Processing Systems, 37:131197–131223, 2024

2024
[8]

Lava: Data valuation without pre-specified learning algorithms

Hoang Anh Just, Feiyang Kang, Tianhao Wang, Yi Zeng, Myeongseob Ko, Ming Jin, and Ruoxi Jia. Lava: Data valuation without pre-specified learning algorithms. InThe Eleventh International Conference on Learning Representations. OpenReview, 2023

2023
[9]

Sava: Scalable learning-agnostic data valuation

Samuel Kessler, Tam Le, and Vu Nguyen. Sava: Scalable learning-agnostic data valuation. In The Thirteenth International Conference on Learning Representations, 2025

2025
[10]

Kairos: Scalable model-agnostic data valuation.Advances in Neural Information Processing Systems, 2025

Jiongli Zhu, Parjanya Prajakta Prashant, Alex Cloninger, and Babak Salimi. Kairos: Scalable model-agnostic data valuation.Advances in Neural Information Processing Systems, 2025

2025
[11]

Shapley-based data valuation for weighted k-nearest neighbors

Guangyi Zhang, Qiyu Liu, and Aristides Gionis. Shapley-based data valuation for weighted k-nearest neighbors. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
[12]

What is your data worth to gpt? llm-scale data valuation with influence functions

Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, et al. What is your data worth to gpt? llm-scale data valuation with influence functions. 2025

2025
[13]

Dataset distillation

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. In ICML, 2018

2018
[14]

Coresets for data-efficient training of machine learning models

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. InInternational Conference on Machine Learning, pages 6950–6960. PMLR, 2020

2020
[15]

Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 33: 19920–19930, 2020

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 33: 19920–19930, 2020

2020
[16]

Rethinking data shapley for data selection tasks: Misleads and merits

Jiachen T Wang, Tianji Yang, James Zou, Yongchan Kwon, and Ruoxi Jia. Rethinking data shapley for data selection tasks: Misleads and merits. InInternational Conference on Machine Learning, pages 52033–52063. PMLR, 2024

2024
[17]

Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36: 69798–69818, 2023

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36: 69798–69818, 2023

2023
[18]

Less: selecting influential data for targeted instruction tuning

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: selecting influential data for targeted instruction tuning. InProceedings of the 41st International Conference on Machine Learning, pages 54104–54132, 2024

2024
[19]

Qurating: Selecting high- quality data for training language models

Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. Qurating: Selecting high- quality data for training language models. InForty-first International Conference on Machine Learning. 10
[20]

An empirical study of example forgetting during deep neural network learning

Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. InInternational Conference on Learning Representations, 2019

2019
[21]

Deep learning on a data diet: Finding important examples early in training.Advances in neural information processing systems, 34:20596–20607, 2021

Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training.Advances in neural information processing systems, 34:20596–20607, 2021

2021
[22]

Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

2022
[23]

Tracking the best expert.Machine Learning, 32(2): 151–178, 1998

Mark Herbster and Manfred K Warmuth. Tracking the best expert.Machine Learning, 32(2): 151–178, 1998

1998
[24]

Cambridge University Press, 2006

Nicolò Cesa-Bianchi and Gábor Lugosi.Prediction, Learning, and Games. Cambridge University Press, 2006

2006
[25]

Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021
[26]

SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. InProceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79. Association for Computational Linguistics, November 2019

2019
[27]

Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages.Transactions of the Association for Computational Linguistics, 2020

2020
[28]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chap- lot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.CoRR, abs/2310.06825,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

doi: 10.48550/ARXIV .2310.06825

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
[33]

Sentence-bert: Sentence embeddings using siamese bert- networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

2019
[34]

MIT press Cambridge, MA, 2006

Christopher KI Williams and Carl Edward Rasmussen.Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006

2006
[35]

Tighter bounds on the log marginal likelihood of gaussian process regression using conjugate gradients

Artem Artemev, David R Burt, and Mark van der Wilk. Tighter bounds on the log marginal likelihood of gaussian process regression using conjugate gradients. InInternational Conference on Machine Learning, pages 362–372. PMLR, 2021

2021
[36]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 11

2023
[37]

The flan collection: Designing data and methods for effective instruction tuning

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. InInternational Conference on Machine Learning, pages 22631–22648. PMLR, 2023

2023
[38]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022
[39]

Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

Conover Mike, Hayes Matt, Mathur Ankit, Xie Jianwei, Wan Jun, Shah Sam, Ghodsi Ali, Wendell Patrick, Zaharia Matei, and Xin Reynold. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

2023
[40]

Openassistant conversations-democratizing large language model alignment.Advances in neural information processing systems, 36:47669–47681, 2023

Andreas Köpf, Yannic Kilcher, Dimitri V on Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. Openassistant conversations-democratizing large language model alignment.Advances in neural information processing systems, 36:47669–47681, 2023

2023
[41]

Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration.Advances in neural information processing systems, 31, 2018

Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration.Advances in neural information processing systems, 31, 2018

2018
[42]

Not all samples are created equal: Deep learning with importance sampling

Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In Jennifer G. Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80, pages 2530–2539. PMLR, 2018

2018
[43]

Online Batch Selection for Faster Training of Neural Networks

Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. CoRR, abs/1511.06343, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[44]

Gomez, Adrien Morisot, Sebastian Farquhar, and Yarin Gal

Sören Mindermann, Jan Markus Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, and Yarin Gal. Prioritized training on points that are learnable, worth learning, and not yet learnt. InInternational Conference on Machine Learning, ICML 2022, volume 162, pages 15630–1564...

work page arXiv 2022

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

arXiv preprint arXiv:2504.00993 (2025)

Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, et al. Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993, 2025

work page arXiv 2025

[4] [4]

Should chatgpt be biased? challenges and risks of bias in large language models

Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. Challenges and Risks of Bias in Large Language Models (October 26, 2023), 2023

2023

[5] [5]

Data shapley: Equitable valuation of data for machine learning

Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. InICML, 2019

2019

[6] [6]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InICML, 2017

2017

[7] [7]

Greats: Online selection of high-quality data for llm training in every iteration.Advances in Neural Information Processing Systems, 37:131197–131223, 2024

Jiachen T Wang, Tong Wu, Dawn Song, Prateek Mittal, and Ruoxi Jia. Greats: Online selection of high-quality data for llm training in every iteration.Advances in Neural Information Processing Systems, 37:131197–131223, 2024

2024

[8] [8]

Lava: Data valuation without pre-specified learning algorithms

Hoang Anh Just, Feiyang Kang, Tianhao Wang, Yi Zeng, Myeongseob Ko, Ming Jin, and Ruoxi Jia. Lava: Data valuation without pre-specified learning algorithms. InThe Eleventh International Conference on Learning Representations. OpenReview, 2023

2023

[9] [9]

Sava: Scalable learning-agnostic data valuation

Samuel Kessler, Tam Le, and Vu Nguyen. Sava: Scalable learning-agnostic data valuation. In The Thirteenth International Conference on Learning Representations, 2025

2025

[10] [10]

Kairos: Scalable model-agnostic data valuation.Advances in Neural Information Processing Systems, 2025

Jiongli Zhu, Parjanya Prajakta Prashant, Alex Cloninger, and Babak Salimi. Kairos: Scalable model-agnostic data valuation.Advances in Neural Information Processing Systems, 2025

2025

[11] [11]

Shapley-based data valuation for weighted k-nearest neighbors

Guangyi Zhang, Qiyu Liu, and Aristides Gionis. Shapley-based data valuation for weighted k-nearest neighbors. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

[12] [12]

What is your data worth to gpt? llm-scale data valuation with influence functions

Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, et al. What is your data worth to gpt? llm-scale data valuation with influence functions. 2025

2025

[13] [13]

Dataset distillation

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. In ICML, 2018

2018

[14] [14]

Coresets for data-efficient training of machine learning models

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. InInternational Conference on Machine Learning, pages 6950–6960. PMLR, 2020

2020

[15] [15]

Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 33: 19920–19930, 2020

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 33: 19920–19930, 2020

2020

[16] [16]

Rethinking data shapley for data selection tasks: Misleads and merits

Jiachen T Wang, Tianji Yang, James Zou, Yongchan Kwon, and Ruoxi Jia. Rethinking data shapley for data selection tasks: Misleads and merits. InInternational Conference on Machine Learning, pages 52033–52063. PMLR, 2024

2024

[17] [17]

Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36: 69798–69818, 2023

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36: 69798–69818, 2023

2023

[18] [18]

Less: selecting influential data for targeted instruction tuning

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: selecting influential data for targeted instruction tuning. InProceedings of the 41st International Conference on Machine Learning, pages 54104–54132, 2024

2024

[19] [19]

Qurating: Selecting high- quality data for training language models

Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. Qurating: Selecting high- quality data for training language models. InForty-first International Conference on Machine Learning. 10

[20] [20]

An empirical study of example forgetting during deep neural network learning

Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. InInternational Conference on Learning Representations, 2019

2019

[21] [21]

Deep learning on a data diet: Finding important examples early in training.Advances in neural information processing systems, 34:20596–20607, 2021

Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training.Advances in neural information processing systems, 34:20596–20607, 2021

2021

[22] [22]

Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

2022

[23] [23]

Tracking the best expert.Machine Learning, 32(2): 151–178, 1998

Mark Herbster and Manfred K Warmuth. Tracking the best expert.Machine Learning, 32(2): 151–178, 1998

1998

[24] [24]

Cambridge University Press, 2006

Nicolò Cesa-Bianchi and Gábor Lugosi.Prediction, Learning, and Games. Cambridge University Press, 2006

2006

[25] [25]

Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021

[26] [26]

SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. InProceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79. Association for Computational Linguistics, November 2019

2019

[27] [27]

Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages.Transactions of the Association for Computational Linguistics, 2020

2020

[28] [28]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chap- lot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.CoRR, abs/2310.06825,

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

doi: 10.48550/ARXIV .2310.06825

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv

[33] [33]

Sentence-bert: Sentence embeddings using siamese bert- networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

2019

[34] [34]

MIT press Cambridge, MA, 2006

Christopher KI Williams and Carl Edward Rasmussen.Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006

2006

[35] [35]

Tighter bounds on the log marginal likelihood of gaussian process regression using conjugate gradients

Artem Artemev, David R Burt, and Mark van der Wilk. Tighter bounds on the log marginal likelihood of gaussian process regression using conjugate gradients. InInternational Conference on Machine Learning, pages 362–372. PMLR, 2021

2021

[36] [36]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 11

2023

[37] [37]

The flan collection: Designing data and methods for effective instruction tuning

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. InInternational Conference on Machine Learning, pages 22631–22648. PMLR, 2023

2023

[38] [38]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022

[39] [39]

Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

Conover Mike, Hayes Matt, Mathur Ankit, Xie Jianwei, Wan Jun, Shah Sam, Ghodsi Ali, Wendell Patrick, Zaharia Matei, and Xin Reynold. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

2023

[40] [40]

Openassistant conversations-democratizing large language model alignment.Advances in neural information processing systems, 36:47669–47681, 2023

Andreas Köpf, Yannic Kilcher, Dimitri V on Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. Openassistant conversations-democratizing large language model alignment.Advances in neural information processing systems, 36:47669–47681, 2023

2023

[41] [41]

Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration.Advances in neural information processing systems, 31, 2018

Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration.Advances in neural information processing systems, 31, 2018

2018

[42] [42]

Not all samples are created equal: Deep learning with importance sampling

Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In Jennifer G. Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80, pages 2530–2539. PMLR, 2018

2018

[43] [43]

Online Batch Selection for Faster Training of Neural Networks

Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. CoRR, abs/1511.06343, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[44] [44]

Gomez, Adrien Morisot, Sebastian Farquhar, and Yarin Gal

Sören Mindermann, Jan Markus Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, and Yarin Gal. Prioritized training on points that are learnable, worth learning, and not yet learnt. InInternational Conference on Machine Learning, ICML 2022, volume 162, pages 15630–1564...

work page arXiv 2022