pith. machine review for the scientific record.

arxiv: 2605.06166 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM fine-tuning · parameter selection · data selection · gradient interaction matrix · bilevel optimization · DualSFT · validation improvement approximation

The pith

Parameter importance and data utility for LLM fine-tuning both emerge as column-wise and row-wise sums from one shared gradient interaction matrix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats parameter mask selection and data subset selection as two bilevel problems that share the same validation objective. Under first- and second-order approximations of how each choice improves validation performance, both signals reduce to aggregations of the same gradient interaction matrix. Column sums supply parameter importance while row sums supply data utility, giving an exact row-column correspondence. This structure supports a single-pass algorithm that extracts a mask and a subset together from one set of gradient statistics, removing the need for separate scoring passes. On 3B-to-9B models the resulting DualSFT variants improve task accuracy and stability-plasticity balance relative to sequential baselines at matched compute budgets.

Core claim

Under first- and second-order validation-improvement approximations, parameter importance and data utility emerge as column-wise and row-wise aggregations of a single gradient interaction matrix, yielding a closed-form row-column correspondence for co-extracting both signals. Building on this structure, DualSFT produces a parameter mask and data subset from shared gradient statistics in one shot.
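To make the surrogate concrete, here is one generic form such an expansion can take, reconstructed from the abstract's description rather than from the paper's own equations; the notation (per-example training gradients g_i, validation gradient g_val, step size η) is an illustrative assumption, not the paper's verbatim definition.

```latex
% Hedged reconstruction, not the paper's exact expression.
% First-order estimate of the validation change from updating
% parameter j on training example i:
\Delta\mathcal{L}_{\mathrm{val}}(i,j) \;\approx\; -\eta\,[g_i]_j\,[g_{\mathrm{val}}]_j \;=\; -\eta\,G_{ij}
% Both selection signals are then marginal sums of the same matrix:
u_i = \sum_j G_{ij} \quad(\text{data utility}), \qquad
s_j = \sum_i G_{ij} \quad(\text{parameter importance}).
```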

What carries the argument

The gradient interaction matrix, whose column-wise sums yield parameter importance scores and whose row-wise sums yield data utility scores.
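A minimal NumPy sketch of that structure, assuming the first-order form above, in which entry (i, j) is the product of training-example i's gradient and the validation gradient at parameter j; the construction is an illustrative assumption rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_params = 8, 16

# Placeholder gradients; in practice these come from backprop on the
# candidate training examples and on a held-out validation batch.
train_grads = rng.normal(size=(n_examples, n_params))  # one row per example
val_grad = rng.normal(size=n_params)                   # pooled validation gradient

# Gradient interaction matrix: entry (i, j) scores how much updating
# parameter j on example i is predicted to improve validation loss.
G = train_grads * val_grad                             # broadcasts to (n_examples, n_params)

param_importance = G.sum(axis=0)  # column-wise sums: one score per parameter
data_utility = G.sum(axis=1)      # row-wise sums: one score per example
```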

If this is right

  • Single-axis DualSFT variants raise target-task performance and improve stability-plasticity trade-offs within their comparison groups.
  • Full DualSFT produces a more favorable joint-constrained trade-off than sequential hybrid baselines under identical budgets.
  • The shared matrix eliminates redundant gradient computations that separate parameter and data scorers normally require.
  • Closed-form row-column correspondence allows simultaneous extraction without iterative re-scoring (see the sketch after this list).
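Continuing the sketch above, the co-extraction then reduces to two top-k reads of the same matrix; the function name and thresholds below are illustrative, not the paper's algorithm.

```python
import numpy as np

def one_shot_dual_select(G, k_params, m_examples):
    """Extract a parameter mask and a data subset from one set of
    gradient statistics, with no second scoring pass."""
    param_scores = G.sum(axis=0)                    # column sums
    data_scores = G.sum(axis=1)                     # row sums
    mask = np.argsort(param_scores)[-k_params:]     # indices of top-k parameters
    subset = np.argsort(data_scores)[-m_examples:]  # indices of top-m examples
    return mask, subset

# Illustrative stand-in for the interaction matrix from the previous sketch:
G = np.random.default_rng(0).normal(size=(8, 16))
mask, subset = one_shot_dual_select(G, k_params=4, m_examples=3)
```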

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same matrix structure could be reused for other paired selection tasks such as pruning plus quantization or task selection plus model selection.
  • If the first-order term dominates in practice, even cheaper first-order-only variants might retain most of the benefit on larger models.
  • The approach might extend to continual learning settings where both parameter retention and replay data choice must be decided together.

Load-bearing premise

The first- and second-order approximations of validation improvement remain accurate enough for the bilevel selection problems that appear in restricted fine-tuning.

What would settle it

Measure actual validation loss reduction after fine-tuning with the DualSFT mask and subset versus the reduction predicted by the matrix aggregations; a large mismatch would show the approximations have broken down.
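A sketch of that check; `fine_tune` and `val_loss` are hypothetical stand-ins for the user's training and evaluation routines, and the first-order prediction below is one plausible reading of "predicted by the matrix aggregations."

```python
import numpy as np

def surrogate_fidelity(G, mask, subset, eta, val_loss, fine_tune):
    """Compare the matrix-predicted validation improvement against the
    improvement actually realized after restricted fine-tuning
    (hypothetical harness; fine_tune and val_loss are user-supplied)."""
    # Predicted improvement: interaction mass over the chosen
    # (example, parameter) block, scaled by the step size.
    predicted = eta * G[np.ix_(subset, mask)].sum()

    loss_before = val_loss()             # validation loss before fine-tuning
    fine_tune(mask=mask, subset=subset)  # restricted fine-tuning run
    actual = loss_before - val_loss()    # realized improvement

    return predicted, actual, abs(predicted - actual)
```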

Figures

Figures reproduced from arXiv: 2605.06166 by Liu Yang, Ou Wu, Xinrui Chen.

Figure 1: DualSFT overview. Separate parameter and data bilevel problems share a local surrogate. view at source ↗
Figure 2: Efficiency frontiers comparing against various baselines on Magicoder with Llama-3.2-3B. view at source ↗
Figure 4: Sensitivity to key hyperparameters on Llama-3.2-3B (Magicoder). view at source ↗
Figure 3: Component ablation on Llama-3.2-3B. view at source ↗
Figure 5: Layer-wise parameter selection on Llama-3.2- … view at source ↗
Figure 6: Forgetting-aware baseline comparison on Magicoder using Llama-3.2-3B. view at source ↗
Figure 7: Additional layer-wise parameter selection patterns across models and datasets. Cells show … view at source ↗
Figure 8: Reverse-selection ablation on Llama-3.2-3B (Magicoder). D/P denote data/parameter … view at source ↗
Figure 9: Fixed-side re-selection diagnostics. (a) Multi-round re-selection saturates after the first … view at source ↗
read the original abstract

In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly determine restricted fine-tuning, this separation incurs redundant overhead and makes coordinated selection difficult. We cast parameter and data selection as two bilevel selection problems under a common validation objective and derive a shared local response-surrogate scoring rule. Under first- and second-order validation-improvement approximations, parameter importance and data utility emerge as column-wise and row-wise aggregations of a single gradient interaction matrix, yielding a closed-form row-column correspondence for co-extracting both signals. Building on this structure, we propose DualSFT (Dual-Selection Fine-Tuning), a one-shot dual-scoring algorithm that produces a parameter mask and data subset from shared gradient statistics. On 3B-9B LLMs, single-axis DualSFT variants strengthen target-task performance and stability-plasticity trade-offs within their comparison groups, while full DualSFT yields a more favorable joint-constrained trade-off than sequential hybrid baselines under matched budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that parameter and data selection in LLM fine-tuning can be unified via a single gradient interaction matrix derived from first- and second-order approximations of validation improvement. Under these local surrogates, parameter importance scores arise as column-wise aggregations and data utility scores as row-wise aggregations, enabling a one-shot DualSFT algorithm that jointly extracts both signals and yields improved task performance and stability-plasticity trade-offs compared to sequential baselines on 3B-9B models.

Significance. If the approximations hold, the work provides an efficient, closed-form unification of two common fine-tuning reduction strategies under a shared bilevel objective, reducing redundant computation while improving joint-constrained outcomes. The reported empirical gains on models up to 9B parameters, including stronger single-axis variants and favorable full DualSFT trade-offs, constitute a practical strength that could be extended if the surrogate fidelity is confirmed.

major comments (2)
  1. Abstract and derivation of the scoring rule: the central row-column correspondence is obtained only under first- and second-order Taylor approximations of the validation objective; no direct quantification of the approximation error (e.g., ||actual post-selection validation change - surrogate prediction||) or comparison against exact bilevel optimization is provided, leaving the load-bearing link between the gradient interaction matrix and the claimed co-extraction unverified in non-convex LLM landscapes.
  2. §4 (empirical evaluation): while gains versus sequential hybrids are shown under matched budgets, there is no ablation of approximation order (first vs. second) nor measurement of how the one-shot mask/subset alters the subsequent optimization trajectory relative to the assumed local surrogate, which is required to substantiate that the bilevel coupling does not amplify higher-order terms.
minor comments (2)
  1. Notation section: the gradient interaction matrix entries should be given an explicit equation number and a short derivation sketch to clarify how the row and column aggregations are computed from the same statistics.
  2. Experimental tables: add error bars or multiple random seeds for the reported performance and stability metrics to allow assessment of whether the observed improvements are statistically reliable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the work's significance. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: Abstract and derivation of the scoring rule: the central row-column correspondence is obtained only under first- and second-order Taylor approximations of the validation objective; no direct quantification of the approximation error (e.g., ||actual post-selection validation change - surrogate prediction||) or comparison against exact bilevel optimization is provided, leaving the load-bearing link between the gradient interaction matrix and the claimed co-extraction unverified in non-convex LLM landscapes.

    Authors: We agree that the row-column correspondence is derived under local first- and second-order Taylor approximations of the validation objective, as is standard for rendering bilevel selection tractable in high-dimensional non-convex settings. Exact bilevel optimization remains computationally prohibitive for LLMs, which is why the surrogate approach is adopted. The empirical gains on 3B-9B models provide indirect support for the utility of the gradient interaction matrix. To directly address the concern, we will add a new subsection in the theoretical analysis discussing the approximation assumptions and their potential limitations in non-convex landscapes. We will also include a targeted experiment on a smaller-scale model to quantify surrogate prediction error by comparing predicted versus actual post-selection validation changes. revision: yes

  2. Referee: §4 (empirical evaluation): while gains versus sequential hybrids are shown under matched budgets, there is no ablation of approximation order (first vs. second) nor measurement of how the one-shot mask/subset alters the subsequent optimization trajectory relative to the assumed local surrogate, which is required to substantiate that the bilevel coupling does not amplify higher-order terms.

    Authors: We concur that an ablation on approximation order and trajectory analysis would provide stronger substantiation. In the revised manuscript, we will expand the empirical section with an ablation comparing first-order, second-order, and combined scoring variants under identical budgets. We will additionally report optimization trajectory metrics, including validation loss curves and stability-plasticity indicators during fine-tuning, to examine how the one-shot selections interact with the assumed local surrogate and to check for amplification of higher-order effects. revision: yes
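A minimal sketch of the promised order ablation, assuming a diagonal curvature vector `h` as the second-order ingredient; the paper's actual second-order term may differ, so this is a scaffold for the comparison rather than its method.

```python
import numpy as np

def interaction_matrix(train_grads, val_grad, eta, h=None):
    """First-order interaction matrix, optionally with a diagonal
    second-order correction (an assumed form, for ablation only)."""
    G = eta * train_grads * val_grad                 # first-order term
    if h is not None:
        G = G - 0.5 * eta**2 * (train_grads**2) * h  # diagonal curvature term
    return G

# Ablation variants under identical budgets:
# G1 = interaction_matrix(tg, vg, eta)            # first-order only
# G2 = interaction_matrix(tg, vg, eta, h=h_diag)  # first plus second order
```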

Circularity Check

0 steps flagged

Derivation from Taylor approximations is self-contained and independent

full rationale

The paper derives the dual-scoring rule by applying first- and second-order Taylor expansions directly to the validation-improvement objective and then extracting row- and column-wise aggregations of the resulting gradient interaction matrix. This is a standard local-surrogate construction whose output follows mathematically from the chosen expansion order and the definition of the bilevel objective; it does not rename a fitted quantity as a prediction, invoke a self-citation as the sole justification, or smuggle an ansatz through prior work. No load-bearing step reduces to its own inputs by construction. The computation of the matrix on training data is the usual prerequisite for any gradient-based method and does not create circular dependence on the final selection masks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on a single domain assumption, that low-order (first- and second-order) Taylor approximations of the validation loss change are valid; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption First- and second-order approximations of validation-improvement are valid for the bilevel parameter and data selection problems
    Invoked to obtain the closed-form row-column correspondence from the gradient interaction matrix.

pith-pipeline@v0.9.0 · 5500 in / 1407 out tokens · 41407 ms · 2026-05-08T13:37:04.434667+00:00 · methodology

discussion (0)

