pith. machine review for the scientific record.

arxiv: 2605.06166 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM fine-tuning · parameter selection · data selection · gradient interaction matrix · bilevel optimization · DualSFT · validation improvement approximation

The pith

Parameter importance and data utility for LLM fine-tuning both emerge as column-wise and row-wise sums from one shared gradient interaction matrix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats parameter mask selection and data subset selection as two bilevel problems that share the same validation objective. Under first- and second-order approximations of how each choice improves validation performance, both signals reduce to aggregations of the same gradient interaction matrix. Column sums supply parameter importance while row sums supply data utility, giving an exact row-column correspondence. This structure supports a single-pass algorithm that extracts a mask and a subset together from one set of gradient statistics, removing the need for separate scoring passes. On 3B-to-9B models the resulting DualSFT variants improve task accuracy and stability-plasticity balance relative to sequential baselines at matched compute budgets.

Core claim

Under first- and second-order validation-improvement approximations, parameter importance and data utility emerge as column-wise and row-wise aggregations of a single gradient interaction matrix, yielding a closed-form row-column correspondence for co-extracting both signals. Building on this structure, DualSFT produces a parameter mask and data subset from shared gradient statistics in one shot.
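To make the surrogate concrete, here is one generic form such an expansion can take, reconstructed from the abstract's description rather than from the paper's own equations; the notation (per-example training gradients g_i, validation gradient g_val, step size η) is an illustrative assumption, not the paper's verbatim definition.

```latex
% Hedged reconstruction, not the paper's exact expression.
% First-order estimate of the validation change from updating
% parameter j on training example i:
\Delta\mathcal{L}_{\mathrm{val}}(i,j) \;\approx\; -\eta\,[g_i]_j\,[g_{\mathrm{val}}]_j \;=\; -\eta\,G_{ij}
% Both selection signals are then marginal sums of the same matrix:
u_i = \sum_j G_{ij} \quad(\text{data utility}), \qquad
s_j = \sum_i G_{ij} \quad(\text{parameter importance}).
```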

What carries the argument

The gradient interaction matrix, whose column-wise sums yield parameter importance scores and whose row-wise sums yield data utility scores.
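A minimal NumPy sketch of that structure, assuming the first-order form above, in which entry (i, j) is the product of training-example i's gradient and the validation gradient at parameter j; the construction is an illustrative assumption rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_params = 8, 16

# Placeholder gradients; in practice these come from backprop on the
# candidate training examples and on a held-out validation batch.
train_grads = rng.normal(size=(n_examples, n_params))  # one row per example
val_grad = rng.normal(size=n_params)                   # pooled validation gradient

# Gradient interaction matrix: entry (i, j) scores how much updating
# parameter j on example i is predicted to improve validation loss.
G = train_grads * val_grad                             # broadcasts to (n_examples, n_params)

param_importance = G.sum(axis=0)  # column-wise sums: one score per parameter
data_utility = G.sum(axis=1)      # row-wise sums: one score per example
```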

If this is right

  • Single-axis DualSFT variants raise target-task performance and improve stability-plasticity trade-offs within their comparison groups.
  • Full DualSFT produces a more favorable joint-constrained trade-off than sequential hybrid baselines under identical budgets.
  • The shared matrix eliminates redundant gradient computations that separate parameter and data scorers normally require.
  • Closed-form row-column correspondence allows simultaneous extraction without iterative re-scoring (see the sketch after this list).
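Continuing the sketch above, the co-extraction then reduces to two top-k reads of the same matrix; the function name and thresholds below are illustrative, not the paper's algorithm.

```python
import numpy as np

def one_shot_dual_select(G, k_params, m_examples):
    """Extract a parameter mask and a data subset from one set of
    gradient statistics, with no second scoring pass."""
    param_scores = G.sum(axis=0)                    # column sums
    data_scores = G.sum(axis=1)                     # row sums
    mask = np.argsort(param_scores)[-k_params:]     # indices of top-k parameters
    subset = np.argsort(data_scores)[-m_examples:]  # indices of top-m examples
    return mask, subset

# Illustrative stand-in for the interaction matrix from the previous sketch:
G = np.random.default_rng(0).normal(size=(8, 16))
mask, subset = one_shot_dual_select(G, k_params=4, m_examples=3)
```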

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same matrix structure could be reused for other paired selection tasks such as pruning plus quantization or task selection plus model selection.
  • If the first-order term dominates in practice, even cheaper first-order-only variants might retain most of the benefit on larger models.
  • The approach might extend to continual learning settings where both parameter retention and replay data choice must be decided together.

Load-bearing premise

The first- and second-order approximations of validation improvement remain accurate enough for the bilevel selection problems that appear in restricted fine-tuning.

What would settle it

Measure actual validation loss reduction after fine-tuning with the DualSFT mask and subset versus the reduction predicted by the matrix aggregations; a large mismatch would show the approximations have broken down.
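A sketch of that check; `fine_tune` and `val_loss` are hypothetical stand-ins for the user's training and evaluation routines, and the first-order prediction below is one plausible reading of "predicted by the matrix aggregations."

```python
import numpy as np

def surrogate_fidelity(G, mask, subset, eta, val_loss, fine_tune):
    """Compare the matrix-predicted validation improvement against the
    improvement actually realized after restricted fine-tuning
    (hypothetical harness; fine_tune and val_loss are user-supplied)."""
    # Predicted improvement: interaction mass over the chosen
    # (example, parameter) block, scaled by the step size.
    predicted = eta * G[np.ix_(subset, mask)].sum()

    loss_before = val_loss()             # validation loss before fine-tuning
    fine_tune(mask=mask, subset=subset)  # restricted fine-tuning run
    actual = loss_before - val_loss()    # realized improvement

    return predicted, actual, abs(predicted - actual)
```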

Figures

Figures reproduced from arXiv: 2605.06166 by Liu Yang, Ou Wu, Xinrui Chen.

Figure 1: DualSFT overview. Separate parameter and data bilevel problems share a local surrogate. view at source ↗
Figure 2: Efficiency frontiers comparing against various baselines on Magicoder with Llama-3.2-3B. view at source ↗
Figure 4: Sensitivity to key hyperparameters on Llama-3.2-3B (Magicoder). view at source ↗
Figure 3: Component ablation on Llama-3.2-3B. view at source ↗
Figure 5: Layer-wise parameter selection on Llama-3.2- … view at source ↗
Figure 6: Forgetting-aware baseline comparison on Magicoder using Llama-3.2-3B. view at source ↗
Figure 7: Additional layer-wise parameter selection patterns across models and datasets. Cells show … view at source ↗
Figure 8: Reverse-selection ablation on Llama-3.2-3B (Magicoder). D/P denote data/parameter … view at source ↗
Figure 9: Fixed-side re-selection diagnostics. (a) Multi-round re-selection saturates after the first … view at source ↗
read the original abstract

In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly determine restricted fine-tuning, this separation incurs redundant overhead and makes coordinated selection difficult. We cast parameter and data selection as two bilevel selection problems under a common validation objective and derive a shared local response-surrogate scoring rule. Under first- and second-order validation-improvement approximations, parameter importance and data utility emerge as column-wise and row-wise aggregations of a single gradient interaction matrix, yielding a closed-form row-column correspondence for co-extracting both signals. Building on this structure, we propose DualSFT (Dual-Selection Fine-Tuning), a one-shot dual-scoring algorithm that produces a parameter mask and data subset from shared gradient statistics. On 3B-9B LLMs, single-axis DualSFT variants strengthen target-task performance and stability-plasticity trade-offs within their comparison groups, while full DualSFT yields a more favorable joint-constrained trade-off than sequential hybrid baselines under matched budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that parameter and data selection in LLM fine-tuning can be unified via a single gradient interaction matrix derived from first- and second-order approximations of validation improvement. Under these local surrogates, parameter importance scores arise as column-wise aggregations and data utility scores as row-wise aggregations, enabling a one-shot DualSFT algorithm that jointly extracts both signals and yields improved task performance and stability-plasticity trade-offs compared to sequential baselines on 3B-9B models.

Significance. If the approximations hold, the work provides an efficient, closed-form unification of two common fine-tuning reduction strategies under a shared bilevel objective, reducing redundant computation while improving joint-constrained outcomes. The reported empirical gains on models up to 9B parameters, including stronger single-axis variants and favorable full DualSFT trade-offs, constitute a practical strength that could be extended if the surrogate fidelity is confirmed.

major comments (2)
  1. Abstract and derivation of the scoring rule: the central row-column correspondence is obtained only under first- and second-order Taylor approximations of the validation objective; no direct quantification of the approximation error (e.g., ||actual post-selection validation change - surrogate prediction||) or comparison against exact bilevel optimization is provided, leaving the load-bearing link between the gradient interaction matrix and the claimed co-extraction unverified in non-convex LLM landscapes.
  2. §4 (empirical evaluation): while gains versus sequential hybrids are shown under matched budgets, there is no ablation of approximation order (first vs. second) nor measurement of how the one-shot mask/subset alters the subsequent optimization trajectory relative to the assumed local surrogate, which is required to substantiate that the bilevel coupling does not amplify higher-order terms.
minor comments (2)
  1. Notation section: the gradient interaction matrix entries should be given an explicit equation number and a short derivation sketch to clarify how the row and column aggregations are computed from the same statistics.
  2. Experimental tables: add error bars or multiple random seeds for the reported performance and stability metrics to allow assessment of whether the observed improvements are statistically reliable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the work's significance. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: Abstract and derivation of the scoring rule: the central row-column correspondence is obtained only under first- and second-order Taylor approximations of the validation objective; no direct quantification of the approximation error (e.g., ||actual post-selection validation change - surrogate prediction||) or comparison against exact bilevel optimization is provided, leaving the load-bearing link between the gradient interaction matrix and the claimed co-extraction unverified in non-convex LLM landscapes.

    Authors: We agree that the row-column correspondence is derived under local first- and second-order Taylor approximations of the validation objective, as is standard for rendering bilevel selection tractable in high-dimensional non-convex settings. Exact bilevel optimization remains computationally prohibitive for LLMs, which is why the surrogate approach is adopted. The empirical gains on 3B-9B models provide indirect support for the utility of the gradient interaction matrix. To directly address the concern, we will add a new subsection in the theoretical analysis discussing the approximation assumptions and their potential limitations in non-convex landscapes. We will also include a targeted experiment on a smaller-scale model to quantify surrogate prediction error by comparing predicted versus actual post-selection validation changes. revision: yes

  2. Referee: §4 (empirical evaluation): while gains versus sequential hybrids are shown under matched budgets, there is no ablation of approximation order (first vs. second) nor measurement of how the one-shot mask/subset alters the subsequent optimization trajectory relative to the assumed local surrogate, which is required to substantiate that the bilevel coupling does not amplify higher-order terms.

    Authors: We concur that an ablation on approximation order and trajectory analysis would provide stronger substantiation. In the revised manuscript, we will expand the empirical section with an ablation comparing first-order, second-order, and combined scoring variants under identical budgets. We will additionally report optimization trajectory metrics, including validation loss curves and stability-plasticity indicators during fine-tuning, to examine how the one-shot selections interact with the assumed local surrogate and to check for amplification of higher-order effects. revision: yes
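A minimal sketch of the promised order ablation, assuming a diagonal curvature vector `h` as the second-order ingredient; the paper's actual second-order term may differ, so this is a scaffold for the comparison rather than its method.

```python
import numpy as np

def interaction_matrix(train_grads, val_grad, eta, h=None):
    """First-order interaction matrix, optionally with a diagonal
    second-order correction (an assumed form, for ablation only)."""
    G = eta * train_grads * val_grad                 # first-order term
    if h is not None:
        G = G - 0.5 * eta**2 * (train_grads**2) * h  # diagonal curvature term
    return G

# Ablation variants under identical budgets:
# G1 = interaction_matrix(tg, vg, eta)            # first-order only
# G2 = interaction_matrix(tg, vg, eta, h=h_diag)  # first plus second order
```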

Circularity Check

0 steps flagged

Derivation from Taylor approximations is self-contained and independent

full rationale

The paper derives the dual-scoring rule by applying first- and second-order Taylor expansions directly to the validation-improvement objective and then extracting row- and column-wise aggregations of the resulting gradient interaction matrix. This is a standard local-surrogate construction whose output follows mathematically from the chosen expansion order and the definition of the bilevel objective; it does not rename a fitted quantity as a prediction, invoke a self-citation as the sole justification, or smuggle an ansatz through prior work. No load-bearing step reduces to its own inputs by construction. The computation of the matrix on training data is the usual prerequisite for any gradient-based method and does not create circular dependence on the final selection masks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on a single domain assumption, that low-order (first- and second-order) Taylor approximations of the validation loss change are valid; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption First- and second-order approximations of validation-improvement are valid for the bilevel parameter and data selection problems
    Invoked to obtain the closed-form row-column correspondence from the gradient interaction matrix.

pith-pipeline@v0.9.0 · 5500 in / 1407 out tokens · 41407 ms · 2026-05-08T13:37:04.434667+00:00 · methodology

discussion (0)

