The Long-Term Effects of Data Selection in LLM Fine-Tuning

Aoxiong Zeng; Xiangquan Yang; Yuxin Yang

arxiv: 2605.30537 · v1 · pith:LLNNHA5Znew · submitted 2026-05-28 · 💻 cs.LG

Pith reviewed 2026-06-29 08:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords: data selection, LLM fine-tuning, multi-stage training, myopic selection, long-horizon evaluation, catastrophic forgetting, adaptation speed, rank reversal

The pith

Short-term data selectors in multi-stage LLM fine-tuning can slow later learning and increase forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether data selection methods that look best for the immediate fine-tuning stage remain helpful when training proceeds through several stages. It shows that many common selectors, including loss-based and gradient-based ones, produce rank reversal: they raise performance right away yet reduce how quickly the model learns in the next stage and raise forgetting rates. The authors run the comparison under one controlled multi-stage protocol across several selector families, formalize the pattern as myopic selection, and offer a diagnostic objective that adds coverage and future-proxy terms to immediate utility. A reader would care because data selection is widely adopted to cut fine-tuning cost, yet if early choices lock in poorer trajectories the savings may be offset by later inefficiency or capability loss.

Core claim

Short-term selectors exhibit myopic selection: they improve the current stage while slowing subsequent learning and increasing forgetting. Data selection should therefore be evaluated as a training intervention that shapes the model's overall learning trajectory rather than only as a local data-efficiency mechanism.

What carries the argument

The multi-stage evaluation protocol that measures not only immediate task performance but also future adaptation speed, forgetting, capability imbalance, and out-of-distribution robustness, together with the Long-Horizon Aware Selection (LHAS) objective that augments immediate utility with coverage, future-proxy transfer, and anti-concentration terms.
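
The abstract names the ingredients of LHAS but not its functional form. One way such an objective could be written, purely as an illustrative assumption (the symbols $U_t$, $C$, $T$, $H$, the budget $B$, and the weights $\lambda_c, \lambda_f, \lambda_a$ are hypothetical and not taken from the paper), is:

```latex
\[
S_t^{\star} \;=\; \arg\max_{S \subseteq D_t,\; |S| \le B}
\; U_t(S) \;+\; \lambda_c\, C(S) \;+\; \lambda_f\, T(S) \;-\; \lambda_a\, H(S)
\]
```

Here $D_t$ would be the candidate pool at stage $t$, $U_t$ the immediate utility, $C$ a coverage term, $T$ a future-proxy transfer term, and $H$ a concentration penalty. Setting all three weights to zero recovers a purely short-term selector.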

If this is right

Selectors must be scored on future-stage metrics, not only current utility, to avoid rank reversal.
The LHAS objective provides one concrete way to trade a small immediate cost for better long-horizon adaptability.
Diversity-based and random selectors may preserve future learning speed better than pure loss- or gradient-based selectors.
Data selection decisions act as trajectory-shaping interventions whose effects compound across stages.
Forgetting and out-of-distribution robustness become first-class evaluation criteria for selection methods.
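
A rank reversal check of the kind the first point implies can be sketched as follows. This is a minimal illustration, assuming each selector has one immediate score and one future-stage score where higher is better; the function name and all numbers are made up and are not results from the paper.

```python
from itertools import combinations


def rank_reversals(immediate, future):
    """Return (winner_now, winner_later) pairs whose order flips.

    Both arguments map selector name -> score, higher is better.
    """
    flips = []
    for a, b in combinations(sorted(immediate), 2):
        d_now = immediate[a] - immediate[b]
        d_later = future[a] - future[b]
        if d_now * d_later < 0:  # strictly opposite signs: the order flipped
            winner_now = a if d_now > 0 else b
            winner_later = a if d_later > 0 else b
            flips.append((winner_now, winner_later))
    return flips


# Hypothetical scores for illustration only.
immediate = {"random": 0.61, "loss": 0.68, "diversity": 0.63}
future = {"random": 0.55, "loss": 0.49, "diversity": 0.57}
reversals = rank_reversals(immediate, future)
```

In this toy input the loss-based selector wins both of its comparisons on the immediate metric and loses both on the future metric, which is the pattern the pith calls rank reversal.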

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the pattern holds, selection pipelines could incorporate cheap future-stage proxies during the choice step itself.
The same myopic risk may appear in other staged training settings such as continual learning or curriculum design.
Testing whether the reversal persists when stage boundaries are soft rather than hard would clarify the scope of the finding.

Load-bearing premise

The controlled multi-stage protocol used in the experiments accurately captures the dynamics of real-world staged LLM fine-tuning without confounding factors from model scale or data distribution shifts.

What would settle it

A follow-up experiment that applies the same selectors to production-scale models in an actual multi-stage pipeline and finds no rank reversal or difference in later-stage adaptation speed would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.30537 by Aoxiong Zeng, Xiangquan Yang, Yuxin Yang.

**Figure 2.** Experimental summary. (a) Immediate current-stage gains can reverse under future ...

**Figure 3.** Ablation and diagnostic dashboard. (a) Larger budgets improve all selectors but preserve ...

**Figure 4.** Additional diagnostics. (a) Task-order sensitivity across four stage orders. (b) Update ...

Original abstract

Data selection is increasingly used to reduce the cost of large language model (LLM) fine-tuning, with recent methods prioritizing samples by current utility, diversity, quality, or influence. This paper studies a different question: when fine-tuning occurs over multiple stages, can selection strategies that look optimal now make the model less adaptable later? We introduce a long-horizon view of LLM data selection in which a selector is evaluated not only by immediate task performance, but also by future adaptation speed, forgetting, capability imbalance, and out-of-distribution robustness. We compare representative random, loss-based, gradient-based, diversity-based, quality-based, and utility-diversity selection families under a unified multi-stage protocol. Through controlled experiments designed to instantiate this protocol, we show how short-term selectors can exhibit rank reversal: they improve the current stage while slowing subsequent learning and increasing forgetting. We formalize this behavior as myopic selection, provide a simple local analysis of why it can occur, and propose a diagnostic Long-Horizon Aware Selection (LHAS) objective that augments immediate utility with coverage, future-proxy transfer, and anti-concentration terms. The study argues that data selection should be evaluated as a training intervention that shapes the model's learning trajectory, rather than only as a local data-efficiency mechanism.
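
Two of the long-horizon metrics named in the abstract, forgetting and future adaptation speed, can be made concrete with a short sketch. The abstract does not define either metric, so the definitions below are assumptions borrowed from common continual-learning practice: forgetting as the average accuracy drop on earlier tasks, and adaptation speed as the first logged step at which a new task reaches a target accuracy. All names and numbers are hypothetical.

```python
def forgetting(acc_after_own_stage, acc_after_final_stage):
    """Average accuracy drop on earlier tasks by the end of training.

    Both arguments map task name -> accuracy; assumed definition.
    """
    drops = [
        acc_after_own_stage[task] - acc_after_final_stage[task]
        for task in acc_after_own_stage
    ]
    return sum(drops) / len(drops)


def steps_to_threshold(eval_curve, threshold):
    """First logged step at which accuracy reaches the threshold.

    eval_curve is a list of (step, accuracy) pairs in increasing
    step order. Returns None if the threshold is never reached.
    """
    for step, acc in eval_curve:
        if acc >= threshold:
            return step
    return None


# Made-up numbers for illustration only; not results from the paper.
peak = {"math": 0.70, "code": 0.60}
final = {"math": 0.62, "code": 0.58}
avg_forgetting = forgetting(peak, final)  # about 0.05

curve = [(100, 0.40), (200, 0.52), (300, 0.61)]
first_hit = steps_to_threshold(curve, 0.50)  # 200
```

Under these definitions a myopic selector would show up as a larger `avg_forgetting` and a later `first_hit` than a baseline trained with the same budget.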

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Short-term data selectors can reverse rank across LLM fine-tuning stages, but the multi-stage protocol needs tighter controls to rule out confounds.

The central observation is that selectors tuned for immediate gains can slow later stages and raise forgetting. The paper shows this reversal when comparing random, loss-based, gradient-based, diversity, quality, and utility-diversity methods inside one multi-stage protocol.

The new element is the explicit long-horizon framing plus the LHAS objective that adds coverage, future-proxy transfer, and anti-concentration terms to standard utilities. The local analysis of myopic selection is clear and the unified protocol lets them line up the families without switching setups. That is useful work.

The soft spot is the experimental description. The abstract claims controlled multi-stage runs but gives no numbers on model sizes, stage definitions, data volumes, statistical tests, or how distribution shifts were handled. If reversal appears mainly because the chosen scales or splits create artificial bottlenecks, the result may not travel. The stress-test concern about scale and shift confounds is therefore on point until the methods section is checked.

This is for groups that already run staged fine-tuning and care about downstream adaptability rather than single-task efficiency. A reader who needs to decide between current selection libraries will get concrete comparisons. The question is practical and the proposal is concrete, so it deserves a serious referee even if the controls require tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims that data selection strategies in multi-stage LLM fine-tuning that optimize for immediate performance can exhibit rank reversal, improving the current stage while slowing subsequent learning, increasing forgetting, and harming long-term adaptability. It introduces a long-horizon evaluation framework incorporating future adaptation speed, forgetting, capability imbalance, and OOD robustness; compares random, loss-based, gradient-based, diversity-based, quality-based, and utility-diversity selectors under a unified multi-stage protocol; formalizes myopic selection with a local analysis; and proposes the LHAS objective augmenting immediate utility with coverage, future-proxy transfer, and anti-concentration terms.

Significance. If the empirical results on rank reversal hold under the described protocol, the work would be significant for reframing data selection as a training intervention that shapes learning trajectories rather than a purely local efficiency tool. The LHAS proposal offers a practical diagnostic augmentation to existing utilities, and the cross-family comparisons provide a basis for trajectory-aware evaluation in staged LLM training.

major comments (2)

[Abstract] The central claim of rank reversal under short-term selectors rests on 'controlled experiments designed to instantiate' the multi-stage protocol, yet the text provides no information on model sizes, stage definitions (task sequence or data volume per stage), metrics, statistical tests, data exclusion rules, or controls for scale-dependent effects. This is load-bearing for the claim that observed myopic behavior is a general property of selection rather than an artifact of the protocol.
[Abstract, multi-stage protocol description] Without details on whether distribution shifts between stages are controlled or induced, or on statistical controls for scale, it is unclear whether reversal would persist at larger scales or with smoother distributions, which undermines the generality of the myopic selection formalization.
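
As one example of the kind of statistical control these comments ask for, a paired test across random seeds could be used to check whether a difference in later-stage adaptation speed is more than seed noise. The sketch below is an exact two-sided sign-flip test on per-seed paired differences. It is an assumption about a reasonable analysis, not the paper's procedure, and the step counts are invented.

```python
from itertools import product


def paired_sign_flip_pvalue(metric_a, metric_b):
    """Exact two-sided sign-flip test on per-seed paired differences.

    metric_a[i] and metric_b[i] must come from the same seed.
    Enumerates all 2**n sign assignments, so keep n small.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float inputs
            hits += 1
    return hits / total


# Hypothetical steps-to-threshold in the next stage, one value per seed.
loss_based = [320, 340, 310, 350, 330]
random_sel = [280, 300, 290, 310, 285]
p_value = paired_sign_flip_pvalue(loss_based, random_sel)
```

With five seeds the smallest achievable two-sided p-value is 2/32, which is itself a reminder that a handful of seeds cannot support strong claims about generality.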

minor comments (1)

[Abstract] The abstract introduces LHAS but does not specify the relative weighting of the added terms or how future-proxy transfer is operationalized; a concrete equation or pseudocode would improve clarity.
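
To illustrate what such pseudocode might look like, here is a minimal greedy sketch of an LHAS-style selector. Everything in it is an assumption made for illustration: the per-example utility and future-proxy scores, the use of cluster labels to stand in for coverage and concentration, the additive form of the score, and the default weights. It is not the authors' method.

```python
from collections import Counter


def lhas_greedy(candidates, budget, w_cov=1.0, w_fut=1.0, w_conc=1.0):
    """Greedily pick up to `budget` candidate ids under an LHAS-style score.

    candidates maps id -> (utility, future_proxy, cluster).
    """
    chosen = []
    per_cluster = Counter()  # how many picks each cluster already has
    remaining = dict(candidates)

    def gain(item):
        _, (utility, future_proxy, cluster) = item
        # Coverage bonus only for the first pick from a cluster.
        coverage = 1.0 if per_cluster[cluster] == 0 else 0.0
        # Anti-concentration penalty grows with picks from that cluster.
        concentration = per_cluster[cluster] / budget
        return (utility + w_cov * coverage
                + w_fut * future_proxy - w_conc * concentration)

    while remaining and len(chosen) < budget:
        best_id, (_, _, best_cluster) = max(remaining.items(), key=gain)
        chosen.append(best_id)
        per_cluster[best_cluster] += 1
        del remaining[best_id]
    return chosen


# Toy pool with made-up scores; not data from the paper.
pool = {
    "a": (0.9, 0.1, "math"),
    "b": (0.8, 0.1, "math"),
    "c": (0.5, 0.4, "code"),
    "d": (0.3, 0.2, "law"),
}
long_horizon_pick = lhas_greedy(pool, budget=2)
myopic_pick = lhas_greedy(pool, budget=2, w_cov=0.0, w_fut=0.0, w_conc=0.0)
```

With all weights at zero the selector takes the two highest-utility examples from the same cluster. With the extra terms switched on it gives up some immediate utility to cover a second cluster, which is the trade the review describes.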

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract regarding the experimental protocol. We agree these details are important for evaluating the generality of the rank reversal and myopic selection claims, and we will revise the abstract to incorporate key elements from the methods while preserving brevity.

Referee: [Abstract] The central claim of rank reversal under short-term selectors rests on 'controlled experiments designed to instantiate' the multi-stage protocol, yet the text provides no information on model sizes, stage definitions (task sequence or data volume per stage), metrics, statistical tests, data exclusion rules, or controls for scale-dependent effects. This is load-bearing for the claim that observed myopic behavior is a general property of selection rather than an artifact of the protocol.

Authors: We acknowledge that the abstract omits these specifics. The full manuscript details them in Section 3 (model scales, task sequences with per-stage volumes, evaluation metrics including adaptation speed and forgetting, statistical reporting, filtering rules, and multi-scale controls). To make the central claim more robust, we will expand the abstract with a concise summary of these protocol elements.

Revision: yes.

Referee: [Abstract, multi-stage protocol description] Without details on whether distribution shifts between stages are controlled or induced, or on statistical controls for scale, it is unclear whether reversal would persist at larger scales or with smoother distributions, which undermines the generality of the myopic selection formalization.

Authors: The manuscript specifies the protocol in Section 3.2, including how shifts are induced via task changes with controlled overlap and volume, plus scale variations across experiments. We will update the abstract to note that shifts are task-induced under controlled conditions and that results are reported across tested scales, thereby clarifying the scope of the myopic selection analysis.

Revision: yes.

Circularity Check

0 steps flagged

No circularity; claims rest on experimental protocol and conceptual augmentation

The provided abstract and description contain no equations, fitted parameters, or self-citations that reduce any prediction or formalization to inputs by construction. The long-horizon view, myopic selection concept, and LHAS objective are presented as an augmentation of existing utilities rather than a redefinition or renaming that forces equivalence. The multi-stage protocol is described as a controlled experimental design whose results are offered as empirical observations, not as a derivation that collapses to its own assumptions. This is the most common honest finding for papers whose core contribution is experimental comparison rather than a closed mathematical chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a multi-stage protocol can isolate the long-term effects of selection without other training dynamics dominating; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: A multi-stage fine-tuning protocol can be designed to measure future adaptation speed, forgetting, and robustness independently of immediate utility.
Invoked when the paper states that it compares selectors under a unified multi-stage protocol.
