pith. machine review for the scientific record.

arxiv: 2604.00001 · v2 · submitted 2026-03-08 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Filter-then-Weight: Online Data Selection and Reweighting for LLM Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords online data selection · LLM fine-tuning · gradient-based selection · data reweighting · optimizer-aware · filter-then-weight

The pith

An optimizer-aware Filter-then-Weight method improves convergence in online LLM fine-tuning by matching updates to the current optimizer state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a gradient-based framework for selecting and reweighting training samples during online fine-tuning of large language models, where data arrives one step at a time. It reframes selection as constructing the next effective update that aligns with a target direction given the adaptive optimizer's current state, rather than ranking samples in isolation. This leads to an update-matching formulation linked to second-order utility measures, which explicitly handles redundancy and interactions within selected subsets. A practical two-stage algorithm first filters geometrically promising candidates and then solves for their optimal coefficients, supported by factorized gradient representations that scale to long-context data. Experiments show faster convergence and stronger downstream results than prior online baselines when the total data budget is held fixed.
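The two-stage procedure described above can be sketched in a few lines. This is an editorial illustration under assumed interfaces (flat per-sample gradients, a diagonal Adam-style preconditioner summarizing optimizer state, a non-negativity clip on the weights), not the paper's implementation:

```python
# Hedged sketch of one Filter-then-Weight step, not the paper's exact algorithm.
# Assumptions: per-sample gradients and a target gradient arrive as flat vectors;
# the adaptive optimizer state is summarized by a diagonal preconditioner
# (e.g., Adam's 1/(sqrt(v)+eps)); all names here are illustrative.
import numpy as np

def filter_then_weight(sample_grads, target_grad, precond, k):
    """Stage 1: keep the k candidates whose preconditioned gradients best align
    with the target direction. Stage 2: jointly solve for coefficients so the
    weighted update matches the preconditioned target gradient."""
    G = sample_grads * precond          # (n, d): preconditioned per-sample grads
    t = target_grad * precond           # (d,): preconditioned target direction
    # Stage 1: geometric filter by cosine similarity to the target.
    scores = G @ t / (np.linalg.norm(G, axis=1) * np.linalg.norm(t) + 1e-12)
    keep = np.argsort(scores)[-k:]
    # Stage 2: solve sum_i w_i * G_i ≈ t jointly, so redundant samples that
    # point the same way share weight instead of each being counted in full.
    w, *_ = np.linalg.lstsq(G[keep].T, t, rcond=None)
    w = np.clip(w, 0.0, None)           # editorial choice: non-negative weights
    return keep, w
```

The joint least-squares solve in stage 2 is what distinguishes this from independent ranking: a sample's weight depends on which other samples were kept.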

Core claim

We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the current optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data.

What carries the argument

The optimizer-aware update-matching problem, which treats selection as shaping the next target-oriented update under the current optimizer state, solved by a two-stage Filter-then-Weight procedure that filters candidates and then optimizes their coefficients.

If this is right

  • Faster convergence and higher downstream task performance than prior online selection methods under the same data budget.
  • Explicit accounting for sample interactions and redundancy during subset construction.
  • Efficient scaling to long-context data through factorized outer-product gradient representations.
  • A direct link from online selection to second-order target utility estimates.
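The factorized-representation bullet admits a concrete illustration. For a linear layer, the per-sample gradient is a sum of token-level outer products, G = Δᵀ X, and inner products between sample gradients can be computed from the (T, d)-shaped factors without ever materializing the (d_out, d_in) gradient matrices. Whether this matches the paper's exact factorization is an assumption; the identity itself is standard:

```python
import numpy as np

def grad_inner(Xi, Di, Xj, Dj):
    """Inner product <G_i, G_j> of two linear-layer per-sample gradients
    G = D^T X, computed from the token factors alone.
    Uses <G_i, G_j> = sum_{s,t} (d_is . d_jt)(x_is . x_jt)."""
    return np.sum((Di @ Dj.T) * (Xi @ Xj.T))

rng = np.random.default_rng(0)
T, d_in, d_out = 64, 32, 16
Xi, Xj = rng.normal(size=(T, d_in)), rng.normal(size=(T, d_in))   # inputs
Di, Dj = rng.normal(size=(T, d_out)), rng.normal(size=(T, d_out)) # output grads
# Check against the explicitly materialized gradients.
Gi, Gj = Di.T @ Xi, Dj.T @ Xj
assert np.isclose(grad_inner(Xi, Di, Xj, Dj), np.sum(Gi * Gj))
```

Storing the factors costs O(T(d_in + d_out)) per sample instead of O(d_in · d_out), which is the kind of saving the long-context claim depends on.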

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could lower the total data volume needed to reach target performance in sequential training pipelines.
  • It may generalize to other adaptive-optimizer settings such as reinforcement learning from human feedback.
  • Integration into existing training code would require only modest changes to the data loader and optimizer state tracking.

Load-bearing premise

That the optimizer-aware update-matching formulation correctly captures sample utility and that the two-stage filter-plus-weight procedure can be computed efficiently without introducing new biases for long-context LLM data.

What would settle it

Running the Filter-then-Weight method on a standard online LLM fine-tuning benchmark and finding convergence and downstream scores no better than, or worse than, existing online selection baselines under the same fixed data budget.

Figures

Figures reproduced from arXiv: 2604.00001 by Fangxin Wang, Henry Peng Zou, Langzhou He, Peyman Baghershahi, Philip S. Yu, Sourav Medya.

Figure 1
Figure 1. TyDiQA performance (F1) as a function of the training data ratio (x-axis: Steps, y-axis: F1). Legend: Hard Filter, Optimizer-Aware Filter Only, Vanilla Filter Only, Vanilla Filter+Reweight, Unbounded Weight, Token-Level Score, Ours, Full Data.
[PITH_FULL_IMAGE:figures/full_fig_p010_1.png]
Original abstract

Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the current optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes an optimizer-aware framework for online data selection and reweighting during LLM fine-tuning. It frames selection as shaping the next target-oriented update under the current optimizer state via an update-matching objective, establishes a connection to second-order target utility, and introduces a two-stage Filter-then-Weight algorithm. To scale to long contexts, it employs a factorized outer-product gradient representation together with optimized matrix computations. Experiments claim consistent gains in convergence speed and downstream performance over existing online baselines under fixed data budgets.

Significance. If the update-matching formulation and its second-order link hold, and if the factorized approximation preserves sufficient fidelity, the work could supply a practical, optimizer-aware method for online data curation in LLM training. This would be valuable for sequential data arrival scenarios where static offline selection is inapplicable, potentially reducing wasted compute on low-utility samples while respecting adaptive optimizer geometry.

major comments (3)
  1. [factorized outer-product section] The factorized outer-product gradient representation (introduced to handle long-context sequences): this approximation necessarily omits cross-term interactions among tokens and samples. For context lengths beyond a few thousand tokens these terms are typically non-negligible; their omission can systematically bias both the filtered candidate set and the subsequent weight optimization away from the claimed target-oriented update, weakening the theoretical connection to second-order utility.
  2. [theoretical formulation] The claimed connection between the optimizer-aware update-matching objective and second-order target utility is asserted in the abstract but no derivation, expansion of the Hessian approximation, or error analysis is supplied. Without this step-by-step justification it is impossible to verify whether the subset-level construction correctly accounts for interactions and redundancy as stated.
  3. [experiments] Experimental results report consistent improvements in convergence and downstream metrics, yet the manuscript provides neither error bars across multiple runs, nor explicit controls isolating the contribution of the filter stage versus the weight-optimization stage, nor ablation of the factorization under varying context lengths. These omissions make it difficult to judge whether the reported gains are robust or reproducible.
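One hedged way to visualize major comment 1: if a factorization aggregated tokens before taking the outer product (an assumption about the approximation, not a claim from the paper), the discarded cross-token structure can dominate even on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_in, d_out = 512, 64, 32
X = rng.normal(size=(T, d_in))      # token inputs to a linear layer
D = rng.normal(size=(T, d_out))     # token output-gradients
G_full = D.T @ X                    # exact gradient: sum_t d_t x_t^T
G_fact = np.outer(D.sum(0), X.sum(0))  # rank-1 aggregation over tokens
# The discrepancy is exactly the sum of cross-token terms d_t x_s^T (t != s).
rel_err = np.linalg.norm(G_fact - G_full) / np.linalg.norm(G_full)
print(f"relative error from cross-token terms: {rel_err:.2f}")
```

On random inputs the relative error grows with context length T, which is the regime the referee flags; whether the paper's factorization actually aggregates this way is what the requested ablation would settle.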
minor comments (2)
  1. [methods] Notation for the factorized gradient matrices should be introduced with explicit dimensions and a clear statement of what is discarded by the factorization.
  2. [abstract] The abstract is information-dense; consider separating the algorithmic contribution from the experimental claims for easier parsing.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications on the theoretical and practical aspects of the Filter-then-Weight framework while committing to revisions that strengthen the presentation without altering the core contributions.

Point-by-point responses
  1. Referee: [factorized outer-product section] The factorized outer-product gradient representation (introduced to handle long-context sequences): this approximation necessarily omits cross-term interactions among tokens and samples. For context lengths beyond a few thousand tokens these terms are typically non-negligible; their omission can systematically bias both the filtered candidate set and the subsequent weight optimization away from the claimed target-oriented update, weakening the theoretical connection to second-order utility.

    Authors: We acknowledge that the factorized outer-product representation is an approximation that omits certain cross-term interactions to achieve scalability for long-context LLM fine-tuning. This design choice trades some fidelity for computational tractability, as full outer-product computations are prohibitive at scale. In the revised manuscript, we will add a dedicated error analysis section bounding the approximation error relative to the full gradient outer product and its propagation into the update-matching objective. We will also include new experiments ablating the factorization across varying context lengths (e.g., 2k to 16k tokens) to quantify any systematic bias and confirm that the target-oriented update remains sufficiently preserved within practical regimes. revision: yes

  2. Referee: [theoretical formulation] The claimed connection between the optimizer-aware update-matching objective and second-order target utility is asserted in the abstract but no derivation, expansion of the Hessian approximation, or error analysis is supplied. Without this step-by-step justification it is impossible to verify whether the subset-level construction correctly accounts for interactions and redundancy as stated.

    Authors: Section 3.2 of the manuscript derives the connection by showing that the update-matching objective corresponds to minimizing a quadratic approximation of the target loss under the current optimizer state, using a Hessian-based second-order expansion. However, we agree that the presentation would benefit from greater explicitness. In the revision, we will expand this section with a complete step-by-step derivation, including the precise Hessian approximation employed, an error analysis for the subset-level objective, and explicit discussion of how interactions and redundancy among samples are captured through the joint optimization of coefficients. revision: yes
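For readers, the quadratic link the authors point to in Section 3.2 plausibly has the following shape; the symbols (preconditioner P, target Hessian H_T, weighted update u) are assumed notation for this sketch, not taken from the paper:

```latex
% Assumed notation: g_T and H_T are the target-loss gradient and Hessian at
% \theta; P is the optimizer's preconditioner; u = \sum_{i \in S} w_i g_i is
% the weighted update built from the selected sample gradients.
L_T\!\left(\theta - \eta P u\right)
  \;\approx\; L_T(\theta)
  \;-\; \eta\, g_T^{\top} P u
  \;+\; \tfrac{\eta^2}{2}\, u^{\top} P^{\top} H_T P u .
```

Minimizing the right-hand side over the weights couples pairs (g_i, g_j) through the quadratic term, which is one reading of why subset-level interactions and redundancy must be handled jointly rather than by independent ranking.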

  3. Referee: [experiments] Experimental results report consistent improvements in convergence and downstream metrics, yet the manuscript provides neither error bars across multiple runs, nor explicit controls isolating the contribution of the filter stage versus the weight-optimization stage, nor ablation of the factorization under varying context lengths. These omissions make it difficult to judge whether the reported gains are robust or reproducible.

    Authors: We agree that these experimental details are necessary to establish robustness. In the revised manuscript, we will report all main results with error bars computed over at least three independent runs using different random seeds. We will add explicit ablation studies that isolate the filter stage from the weight-optimization stage, as well as experiments varying context length to evaluate the factorization's impact. These additions will allow readers to assess the individual contributions and reproducibility of the observed gains. revision: yes

Circularity Check

0 steps flagged

The derivation chain is self-contained; no step reduces to its inputs by construction.

full rationale

The paper introduces an optimizer-aware update-matching formulation as a new conceptual framing for online selection, derives its link to second-order target utility through explicit mathematical reasoning on update geometry, and presents the Filter-then-Weight procedure plus factorized outer-product representation as algorithmic choices for tractability. These steps do not redefine the target utility in terms of the selected weights or vice versa, nor do they rename a fitted quantity as a prediction. No self-citation chain is invoked to justify the core premise, and the experiments compare against external baselines rather than internal fits. The derivation therefore supplies independent content beyond its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that gradient-based utility estimation remains valid in the online sequential setting and that second-order interactions among samples can be approximated via the proposed matching objective.

axioms (2)
  • domain assumption Gradient-based data selection offers a principled framework for estimating sample utility
    Opening sentence of the abstract treats this as established background.
  • domain assumption Subset-level construction must account for interactions and redundancy among selected samples
    Stated as a necessary consequence of the update-matching view.

pith-pipeline@v0.9.0 · 5525 in / 1288 out tokens · 40347 ms · 2026-05-15T14:20:09.700915+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Let the Target Select for Itself: Data Selection via Target-Aligned Paths

    cs.LG 2026-05 unverdicted novelty 6.0

    Target-aligned data selection via normalized endpoint loss drop on a validation-induced reference path achieves competitive performance with reduced computational overhead.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · cited by 1 Pith paper

  1. [1] Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.

  2. [2] Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

  3. [3] Yulei Qin, Yuncheng Yang, Pengcheng Guo, Gang Li, Hang Shao, Yuchen Shi, Zihan Xu, Yun Gu, Ke Li, and Xing Sun. Unleashing the power of data tsunami: A comprehensive survey on data assessment and selection for instruction tuning of language models.

  4. [4] Jipeng Zhang, Yaxuan Qin, Renjie Pi, Weizhong Zhang, Rui Pan, and Tong Zhang. TAGCOS: Task-agnostic gradient clustered coreset selection for instruction tuning data. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4671–4686.
