pith. machine review for the scientific record.

arxiv: 2605.09608 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.IT · math.IT

Recognition: no theorem link

Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

Yuanyi Wang, Yifan Yang, Su Lu, Yanggan Gu, Pengkai Wang, Wenjun Wang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Jialun Cao, Shing-Chi Cheung, Hongxia Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT
keywords continual post-training · catastrophic forgetting · LLM parameter updates · covariance geometry · model state alignment · update integration · geometry conflict · data-free merging

The pith

Forgetting during sequential LLM updates occurs when the covariance geometry of a new task misaligns with the current model state geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why some updates to large language models during continual post-training allow new capabilities to build on old ones while others erase prior knowledge. It frames forgetting as an integration failure driven by geometric mismatch: each task induces a covariance structure in its parameter changes, and this structure either fits the geometry already present in the model or clashes with it. When the new update stays compatible with the state left by earlier tasks, knowledge transfers; when conflict rises, interference grows. The authors use this view to build a control mechanism that decides how to merge updates without access to previous training data. This matters because it turns an opaque problem of forgetting into a measurable property of update geometries that can guide integration decisions.

Core claim

Forgetting can be considered a state-relative update-integration failure: it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high.

What carries the argument

The covariance geometry induced by a task's parameter update, which acts as a descriptor of whether the update will integrate compatibly with the geometry of the current model state.
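
To make this descriptor concrete, here is a minimal NumPy sketch of one plausible instantiation: treat a layer's update matrix as samples, form its covariance, and score conflict as a normalized Bures-Wasserstein discrepancy against a running state covariance. The function names, the normalization, and the choice of discrepancy are illustrative assumptions, not the paper's exact definitions (the paper defines geometry conflict in its Sec. 2.2).

```python
import numpy as np

def sqrtm_psd(A, eps=1e-10):
    # Symmetric PSD matrix square root via eigendecomposition.
    w, V = np.linalg.eigh(A)
    w = np.clip(w, eps, None)
    return (V * np.sqrt(w)) @ V.T

def update_covariance(delta_W):
    # Treat the rows of a layer's update matrix as samples and take their
    # (uncentered) second-moment matrix as the task's geometry descriptor.
    return delta_W.T @ delta_W / delta_W.shape[0]

def bures_wasserstein_sq(C1, C2):
    # Squared 2-Wasserstein distance between zero-mean Gaussians N(0, C1), N(0, C2).
    s1 = sqrtm_psd(C1)
    cross = sqrtm_psd(s1 @ C2 @ s1)
    return np.trace(C1) + np.trace(C2) - 2.0 * np.trace(cross)

def geometry_conflict(delta_W_new, state_cov):
    # Normalized discrepancy between the new update's covariance and the
    # covariance geometry accumulated in the model state so far.
    C_new = update_covariance(delta_W_new)
    d2 = bures_wasserstein_sq(C_new, state_cov)
    return d2 / (np.trace(C_new) + np.trace(state_cov) + 1e-12)

# Hypothetical usage with random layer updates (shapes are placeholders):
rng = np.random.default_rng(0)
state_cov = update_covariance(rng.normal(size=(256, 64)))
conflict = geometry_conflict(rng.normal(size=(256, 64)), state_cov)
```

On a sequence of tasks, state_cov could be maintained as a running weighted combination of the covariances of updates already integrated, which is one way to read "the geometry of the evolving model state".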

If this is right

  • Sequential updates transfer when their induced covariance geometry stays compatible with the model state shaped by earlier updates.
  • Interference and forgetting increase when state-relative geometry conflict becomes high.
  • Geometry conflict can serve as a control signal to gate how updates are integrated or corrected (a merge-time sketch follows this list).
  • A geometry-aware merging procedure improves retention and final performance on domain-continual and capability-continual tasks without replay data.
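
How the gating bullet could look at merge time, as a minimal sketch: a scalar conflict score (like the one above) selects between full integration and an attenuated correction. The threshold and shrink factor are made-up illustrative hyperparameters; GCWM's actual integration rule applies a geometry-aware correction in a shared Wasserstein metric rather than simple shrinkage.

```python
import numpy as np

def conflict_gated_merge(theta_state, delta_new, conflict, threshold=0.5, shrink=0.25):
    """Integrate a new task update into the current model state.

    theta_state, delta_new: dicts mapping parameter names to arrays.
    Low-conflict updates are added at full strength; high-conflict updates
    are attenuated. Both hyperparameters are assumptions for illustration.
    """
    scale = 1.0 if conflict < threshold else shrink
    return {name: theta_state[name] + scale * delta_new[name] for name in theta_state}
```

A per-layer variant would compute conflict tensor by tensor and gate each tensor separately, which is closer in spirit to the layer-level diagnostics in Figures 17 and 18.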

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If geometry conflict is the decisive factor, then task ordering could be chosen in advance by computing pairwise geometry alignments to reduce expected forgetting (sketched after this list).
  • The same compatibility diagnostic might apply to other continual adaptation settings such as instruction tuning or multi-domain fine-tuning.
  • Preprocessing updates to reduce their geometry conflict before merging could offer an additional lever for preserving capabilities.
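
A sketch of how the first extension might be operationalized, assuming a precomputed symmetric matrix of pairwise geometry conflict between task updates. The greedy criterion and the starting task are arbitrary illustrative choices; the paper itself does not propose a task-ordering procedure.

```python
import numpy as np

def greedy_low_conflict_order(conflict_matrix):
    """Order tasks so each newly added task has low conflict with those already chosen.

    conflict_matrix: symmetric (n_tasks, n_tasks) array of pairwise geometry conflict.
    Returns a list of task indices. Purely a heuristic built on the paper's diagnostic.
    """
    C = np.asarray(conflict_matrix, dtype=float)
    n = C.shape[0]
    # Start from the task with the lowest total conflict against all others.
    order = [int(np.argmin(C.sum(axis=1)))]
    remaining = set(range(n)) - set(order)
    while remaining:
        # Pick the remaining task whose mean conflict with the chosen prefix is smallest.
        nxt = min(remaining, key=lambda j: C[j, order].mean())
        order.append(int(nxt))
        remaining.remove(nxt)
    return order
```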

Load-bearing premise

The covariance geometry of a task's parameter update is a sufficient and stable descriptor of whether that task will integrate compatibly with the current model state.

What would settle it

Measuring geometry conflict on a sequence of tasks and finding no reliable correlation between higher conflict values and greater forgetting rates would falsify the central claim.
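
This falsification test is straightforward to run once per-step conflict values and forgetting measurements are logged. A minimal sketch, assuming Spearman rank correlation as the association measure (the |ρs| statistic reported in Figure 2):

```python
from scipy.stats import spearmanr

def conflict_forgetting_association(conflict_per_step, forgetting_per_step):
    # Rank correlation between measured geometry conflict and the drop in
    # old-task performance at each continual step. Consistently near-zero or
    # negative correlations across methods and scales would undercut the claim.
    rho, p_value = spearmanr(conflict_per_step, forgetting_per_step)
    return rho, p_value
```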

Figures

Figures reproduced from arXiv: 2605.09608 by Congkai Xie, Hongxia Yang, Jialun Cao, Jianmin Wu, Pengkai Wang, Shing-Chi Cheung, Su Lu, Wenjun Wang, Yanggan Gu, Yifan Yang, Yuanyi Wang, Zhaoyi Yan.

Figure 1. State-relative geometry tracks forgetting across continual steps and scales.
Figure 2. Global and method-level associations. Top: global |ρs|. Bottom: signed method-level ρs; FVR denotes FOREVER. Subspace overlap is a natural compatibility proxy: if two updates act on similar directions, they may be easier to integrate. We therefore compare SAR with geometry conflict (Sec. 2.2).
Figure 3. Pairwise compatibility and conflict complementarity. (a)–(c) SAR and geometry conflict stratify task-pair transfer regimes, while pairwise conflict alone weakly predicts forgetting. GC-drop is the signed association with the immediate old-task delta; GC-forget measures degradation from each old task's best prior score. (d)–(f) reveal complementary failure modes: top-layer share is the fraction of top-ranked…
Figure 4. GCWM ablation on MMLU-Pro. We ablate two merge-time components of GCWM: the conflict gate and the shared Wasserstein metric. All variants use the same Qwen3-0.6B domain-continual task experts and evaluation protocol, differing only in the integration rule. The w/o gate variant removes conflict-conditioned gating and applies the geometry-aware branch uniformly, while w/o Wasserstein barycenter replaces…
Figure 5. Statistical confidence for Sec. 3. Error bars show run-cluster bootstrap 95% confidence…
Figure 6. Drift and geometry-discrepancy signals versus forgetting. The four panels are arranged…
Figure 7. Pairwise subspace and geometry compatibility. The left panel compares SAR with geometry…
Figure 8. Step-level correlation summary by scale and method. Norm, active-pair conflict, state gap,…
Figure 9. Pairwise correlation summary by method and scale. SAR-GC measures the relation between…
Figure 10. Full state-relative geometry diagnostic. This view expands the state-relative analysis…
Figure 11. Step-level continual post-training dynamics by method. We show downstream retention…
Figure 12. Method-level correlation heatmaps. We compare update norm, SAR, geometry conflict,…
Figure 13. Global step-level explanation heatmap. Each cell reports the Spearman correlation between…
Figure 14. Task-pair selective forgetting by method. Each heatmap reports the old-task score change…
Figure 15. Task-pair geometry conflict by method. Pairwise geometry conflict reveals compatibility…
Figure 16. Most harmful task transitions. We visualize the largest old-task drops after introducing…
Figure 17. Family-level mechanism profile. Geometry conflict and gradient conflict emphasize…
Figure 18. Method-wise top-layer family distribution. We decompose top geometry-conflict layers…
Figure 19. Final performance across model scales and continual post-training methods.
Figure 20. Capability ablation breakdown. Full GCWM is compared with variants that remove…
Figure 21. GCWM merge-time runtime and memory profiling. Runtime is decomposed by major…
Figure 22. GCWM hyperparameter sensitivity on Qwen3-8B. Each sweep changes one parameter…
original abstract

Continual post-training aims to extend large language models (LLMs) with new knowledge, skills, and behaviors, yet it remains unclear when sequential updates enable capability transfer and when they cause catastrophic forgetting. Existing methods mitigate forgetting through sequential fine-tuning, replay, regularization, or model merging, but offer limited criteria for determining when incorporating new updates is beneficial or harmful. In this work, we study LLM continual post-training through three questions: What drives forgetting? When do sequentially acquired capabilities transfer or interfere? How can compatibility be used to control update integration? We address these questions through task geometry: we represent each post-training task by its parameter update and study the covariance geometry induced by the update. Our central finding is that: forgetting can be considered as a state-relative update-integration failure, it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high. Motivated by this finding, we propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free update-integration method that constructs a shared Wasserstein metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate geometry-aware correction. Across Qwen3 0.6B--14B on domain-continual and capability-continual settings, GCWM consistently outperforms data-free baselines, improving retention and final performance without replay data. These results identify geometry conflict as both an explanatory signal for forgetting and a practical control signal for LLM continual post-training.
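
The abstract's "shared Wasserstein metric via Gaussian Wasserstein barycenters" most plausibly builds on the fixed-point barycenter construction of Álvarez-Esteban et al. (reference 52 in the list below). The sketch that follows implements only that barycenter step for zero-mean Gaussians, i.e. for a set of SPD covariance matrices; how GCWM derives a correction from the barycenter and gates it by geometry conflict is not reproduced here.

```python
import numpy as np

def sqrtm_psd(A, eps=1e-10):
    # Symmetric PSD square root via eigendecomposition.
    w, V = np.linalg.eigh(A)
    w = np.clip(w, eps, None)
    return (V * np.sqrt(w)) @ V.T

def gaussian_wasserstein_barycenter(covs, weights=None, n_iter=50, tol=1e-8):
    """Bures-Wasserstein barycenter of zero-mean Gaussians with covariances `covs`.

    Fixed-point iteration of Alvarez-Esteban et al. (2016):
        S <- S^{-1/2} ( sum_i w_i (S^{1/2} C_i S^{1/2})^{1/2} )^2 S^{-1/2}
    """
    covs = [np.asarray(C, dtype=float) for C in covs]
    k = len(covs)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    S = sum(wi * Ci for wi, Ci in zip(w, covs))  # start from the Euclidean mean
    for _ in range(n_iter):
        S_half = sqrtm_psd(S)
        S_half_inv = np.linalg.inv(S_half)
        M = sum(wi * sqrtm_psd(S_half @ Ci @ S_half) for wi, Ci in zip(w, covs))
        S_new = S_half_inv @ M @ M @ S_half_inv
        if np.linalg.norm(S_new - S) <= tol * np.linalg.norm(S):
            S = S_new
            break
        S = S_new
    return S
```

Under this reading, the barycenter of the task-update covariances would supply the shared metric in which per-task corrections are expressed; that interpretation is an assumption of this sketch, not a statement of the authors' implementation.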

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that forgetting in LLM continual post-training arises as a state-relative update-integration failure when covariance geometries induced by task parameter updates misalign with the geometry of the evolving model state. Sequential updates transfer when compatible with the prior state and interfere at high geometry conflict. The authors propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free method using Gaussian Wasserstein barycenters to build a shared metric and gate geometry-aware corrections. On Qwen3 models (0.6B–14B) in domain-continual and capability-continual settings, GCWM outperforms data-free baselines in retention and final performance.

Significance. If the geometry-conflict interpretation is shown to be load-bearing rather than a proxy for simpler signals, the work would supply both an explanatory account of when updates integrate or interfere and a practical data-free control mechanism. The direct linkage of analysis to the GCWM algorithm and the use of optimal-transport barycenters are technical strengths that could influence continual-learning research beyond replay or regularization heuristics.

major comments (3)
  1. [Abstract] Abstract (central finding paragraph): the claim that covariance geometry is a sufficient and stable descriptor of state compatibility is presented as following from the analysis, yet no evidence is given that geometry conflict adds explanatory power beyond proxies such as ||Δθ|| or task embedding similarity. Ablation experiments that isolate the geometry term are required to substantiate the sufficiency assumption.
  2. [Method] GCWM construction (method section): the method relies on Gaussian approximations and Wasserstein barycenters whose covariance parameters are estimated from the same updates used to measure conflict; the manuscript must show that these parameters are independent of the evaluation data or provide a derivation demonstrating that the circularity does not affect the reported gains.
  3. [Experiments] Experimental results (tables/figures): no description of covariance estimation procedure, run-to-run variance, or statistical significance tests is supplied, so it is impossible to determine whether the observed improvements over baselines reliably support the geometry-conflict explanation rather than implementation details.
minor comments (1)
  1. [Abstract] The abstract packs the central claim, method, and results into a single dense paragraph; splitting the finding into two sentences would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of our geometry-conflict analysis and the GCWM method.

point-by-point responses
  1. Referee: [Abstract] Abstract (central finding paragraph): the claim that covariance geometry is a sufficient and stable descriptor of state compatibility is presented as following from the analysis, yet no evidence is given that geometry conflict adds explanatory power beyond proxies such as ||Δθ|| or task embedding similarity. Ablation experiments that isolate the geometry term are required to substantiate the sufficiency assumption.

    Authors: We agree that additional evidence is needed to demonstrate that geometry conflict provides explanatory power beyond simpler proxies. In the revised manuscript we will add ablation experiments that directly compare geometry conflict against ||Δθ|| and task-embedding cosine similarity as predictors of forgetting and interference. These ablations will quantify the incremental predictive value of the geometry term on held-out validation sets. We will also revise the abstract to state the findings more precisely as supported by both the geometric analysis and the new ablations. revision: yes

  2. Referee: [Method] GCWM construction (method section): the method relies on Gaussian approximations and Wasserstein barycenters whose covariance parameters are estimated from the same updates used to measure conflict; the manuscript must show that these parameters are independent of the evaluation data or provide a derivation demonstrating that the circularity does not affect the reported gains.

    Authors: We will expand the method section to clarify that covariance matrices are estimated solely from the task-specific parameter updates (via sample covariance of the delta vectors or mini-batch gradients collected during fine-tuning). These statistics are computed before any evaluation on downstream tasks and do not incorporate test or validation data. We will include a short derivation showing that the Wasserstein barycenter construction uses only these pre-computed update covariances to define the shared metric, ensuring the conflict measurement and subsequent merging step remain independent of the reported performance metrics. revision: yes

  3. Referee: [Experiments] Experimental results (tables/figures): no description of covariance estimation procedure, run-to-run variance, or statistical significance tests is supplied, so it is impossible to determine whether the observed improvements over baselines reliably support the geometry-conflict explanation rather than implementation details.

    Authors: We acknowledge the omission of these experimental details. In the revised manuscript we will add: (i) a precise description of the covariance estimation procedure (sample covariance over update vectors with explicit batch size and regularization), (ii) mean and standard deviation of all metrics over at least three independent runs with different random seeds, and (iii) statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) comparing GCWM against each baseline. Updated tables and figures will report these statistics. revision: yes
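
The significance protocol promised in point (iii) is standard; a minimal SciPy sketch, with the choice of tests taken from the rebuttal text and everything else (pairing of runs, array shapes) assumed:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def compare_to_baseline(scores_gcwm, scores_baseline):
    # Paired comparisons over matched cells (same seed, task sequence, benchmark).
    gcwm = np.asarray(scores_gcwm, dtype=float)
    base = np.asarray(scores_baseline, dtype=float)
    t_stat, t_p = ttest_rel(gcwm, base)
    w_stat, w_p = wilcoxon(gcwm, base)
    return {"mean_gain": float(np.mean(gcwm - base)),
            "paired_t_p": float(t_p), "wilcoxon_p": float(w_p)}
```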

Circularity Check

0 steps flagged

No significant circularity; the central claim is an interpretive observation from the geometry analysis, not reduced by construction to its inputs.

full rationale

The paper's derivation chain begins with representing tasks via parameter updates and analyzing induced covariance geometries, leading to the interpretive claim that forgetting arises from state-relative misalignment. This is presented as a finding from the geometry study rather than a mathematical derivation or fitted prediction. GCWM is motivated by the finding and applies Gaussian Wasserstein barycenters with geometry conflict gating, but the abstract and description provide no equations showing that conflict metrics or barycenter parameters are fitted directly to forgetting outcomes or reduce the explanatory claim to the input updates by definition. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are evident in the provided text. The analysis remains self-contained as an empirical geometry-based interpretation without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that covariance geometry of parameter updates is a faithful proxy for task compatibility; no free parameters are named in the abstract, but the Wasserstein barycenter construction implicitly introduces modeling choices.

axioms (1)
  • domain assumption: Covariance geometry induced by a task's parameter update reflects the compatibility of that task with the current model state
    Invoked directly in the central finding sentence of the abstract.
invented entities (1)
  • Geometry conflict signal (no independent evidence)
    purpose: Quantitative measure of misalignment between task covariance geometry and evolving model state used both to explain forgetting and to gate merging
    Newly introduced explanatory and control quantity; no independent falsifiable handle outside the paper is described in the abstract.

pith-pipeline@v0.9.0 · 5629 in / 1497 out tokens · 56330 ms · 2026-05-12T02:28:17.884464+00:00 · methodology


Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 7 internal anchors

  1. [1]

    Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

    Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

  2. [2]

    arXiv preprint arXiv:2502.21321

    Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip HS Torr, Fahad Shahbaz Khan, and Salman Khan. Llm post-training: A deep dive into reasoning large language models.arXiv preprint arXiv:2502.21321, 2025

  3. [3]

    Demystifying domain-adaptive post-training for financial llms

    Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, and Shafiq Joty. Demystifying domain-adaptive post-training for financial llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 31021–31047, 2025

  4. [4]

    Redone: Revealing domain-specific llm post-training in social networking services

    Fei Zhao, Chonggang Lu, Zheyong Xie, Ziyan Liu, Haofu Qian, Jianzhao Huang, Fangcheng Shi, Zijie Meng, Hongcheng Guo, Mingqian He, et al. Redone: Revealing domain-specific llm post-training in social networking services. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2648–2674, 2025

  5. [5]

    Synthesizing post-training data for llms through multi-agent simulation

    Shuo Tang, Xianghe Pang, Zexi Liu, Bohan Tang, Rui Ye, Tian Jin, Xiaowen Dong, Yanfeng Wang, and Siheng Chen. Synthesizing post-training data for llms through multi-agent simulation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23306–23335, 2025

  6. [6]

    Lamdagent: An autonomous framework for post-training pipeline optimization via llm agents

    Taro Yano, Yoichi Ishibashi, and Masafumi Oyamada. Lamdagent: An autonomous framework for post-training pipeline optimization via llm agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30066–30083, 2025

  7. [7]

    Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

    Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, et al. Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning.arXiv preprint arXiv:2509.25300, 2025

  8. [8]

    How post-training reshapes llms: A mechanistic view on knowledge, truthfulness, refusal, and confidence

    Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Yizhou Sun, Himabindu Lakkaraju, and Shichang Zhang. How post-training reshapes llms: A mechanistic view on knowledge, truthfulness, refusal, and confidence. InThe First Workshop on the Application of LLM Explainability to Reasoning and Planning, 2025

  9. [9]

    Continual learning and catastrophic forgetting.arXiv preprint arXiv:2403.05175, 2024

    Gido M Van de Ven, Nicholas Soures, and Dhireesha Kudithipudi. Continual learning and catastrophic forgetting.arXiv preprint arXiv:2403.05175, 2024

  10. [10]

    Overcoming catastrophic forgetting in neural networks.arXiv preprint arXiv:2507.10485, 2025

    Brandon Shuen Yi Loke, Filippo Quadri, Gabriel Vivanco, Maximilian Casagrande, and Saúl Fenollosa. Overcoming catastrophic forgetting in neural networks.arXiv preprint arXiv:2507.10485, 2025

  11. [11]

    Continual training of language models for few-shot learning

    Zixuan Ke, Haowei Lin, Yijia Shao, Hu Xu, Lei Shu, and Bing Liu. Continual training of language models for few-shot learning. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10205–10216, 2022

  12. [12]

    See: Continual fine-tuning with sequential ensemble of experts

    Zhilin Wang, Yafu Li, Xiaoye Qu, and Yu Cheng. See: Continual fine-tuning with sequential ensemble of experts. InFindings of the Association for Computational Linguistics: ACL 2025, pages 7418–7432, 2025

  13. [13]

    Scalable strategies for continual learning with replay.arXiv preprint arXiv:2505.12512, 2025

    Truman Hickok. Scalable strategies for continual learning with replay.arXiv preprint arXiv:2505.12512, 2025

  14. [14]

    Experience replay for continual learning. Advances in neural information processing systems, 32, 2019

    David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in neural information processing systems, 32, 2019

  15. [15]

    Uncertainty-based continual learning with adaptive regularization.Advances in neural information processing systems, 32, 2019

    Hongjoon Ahn, Sungmin Cha, Donggyu Lee, and Taesup Moon. Uncertainty-based continual learning with adaptive regularization. Advances in neural information processing systems, 32, 2019

  16. [16]

    Efficient continual learning in neural networks with embedding regularization. Neurocomputing, 397:139–148, 2020

    Jary Pomponi, Simone Scardapane, Vincenzo Lomonaco, and Aurelio Uncini. Efficient continual learning in neural networks with embedding regularization. Neurocomputing, 397:139–148, 2020

  17. [17]

    Aimmerging: Adaptive iterative model merging using training trajectories for language model continual learning

    Yujie Feng, Jian Li, Xiaoyu Dong, Pengfei Xu, Xiaohui Zhou, Yujia Zhang, Zexin Lu, Yasha Wang, Alan Zhao, Xu Chu, et al. Aimmerging: Adaptive iterative model merging using training trajectories for language model continual learning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13431–13448, 2025

  18. [18]

    Merge then realign: Simple and effective modality-incremental continual learning for multimodal llms

    Dingkun Zhang, Shuhan Qi, Xinyu Xiao, Kehai Chen, and Xuan Wang. Merge then realign: Simple and effective modality-incremental continual learning for multimodal llms. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13159–13175, 2025

  19. [19]

    Model Merging Scaling Laws in Large Language Models

    Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, and Hongxia Yang. Model merging scaling laws in large language models.arXiv preprint arXiv:2509.24244, 2025

  20. [20]

    On the bures–wasserstein distance between positive definite matrices.Expositiones mathematicae, 37(2):165–191, 2019

    Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the bures–wasserstein distance between positive definite matrices.Expositiones mathematicae, 37(2):165–191, 2019

  21. [21]

    Task singular vectors: Reducing task interference in model merging

    Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodola. Task singular vectors: Reducing task interference in model merging. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18695–18705, 2025

  22. [22]

    Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models

    Zirui Wang and Yulia Tsvetkov. Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. InProceedings of the International Conference on Learning Representations (ICLR), 2021

  23. [23]

    Editing models with task arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations

  24. [24]

    No task left behind: Isotropic model merging with common and task-specific subspaces

    Daniel Marczak, Simone Magistri, Sebastian Cygert, Bartłomiej Twardowski, Andrew D Bagdanov, and Joost van de Weijer. No task left behind: Isotropic model merging with common and task-specific subspaces. In 39th International Conference on Machine Learning. Proceedings of Machine Learning Research (PMLR), 2025

  25. [25]

    Gradient surgery for multi-task learning.Advances in neural information processing systems, 33:5824–5836, 2020

    Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning.Advances in neural information processing systems, 33:5824–5836, 2020

  26. [26]

    Udapdr: unsupervised domain adaptation via llm prompting and distillation of rerankers

    Jon Saad-Falcon, Omar Khattab, Keshav Santhanam, Radu Florian, Martin Franz, Salim Roukos, Avirup Sil, Md Sultan, and Christopher Potts. Udapdr: unsupervised domain adaptation via llm prompting and distillation of rerankers. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 11265–11279, 2023

  27. [27]

    Exploring the effectiveness of llm domain adaptation for business it machine translation

    Johannes Eschbach-Dymanus, Frank Essenberger, Bianka Buschbeck, and Miriam Exel. Exploring the effectiveness of llm domain adaptation for business it machine translation. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), pages 610–622, 2024

  28. [28]

    Enhancing llm capabilities beyond scaling up

    Wenpeng Yin, Muhao Chen, Rui Zhang, Ben Zhou, Fei Wang, and Dan Roth. Enhancing llm capabilities beyond scaling up. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pages 1–10, 2024

  29. [29]

    Llm augmented llms: Expanding capabilities through composition

    Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, and Partha Talukdar. Llm augmented llms: Expanding capabilities through composition. In The Twelfth International Conference on Learning Representations, 2024

  30. [30]

    Behavior alignment: a new perspective of evaluating llm-based conversational recommendation systems

    Dayu Yang, Fumian Chen, and Hui Fang. Behavior alignment: a new perspective of evaluating llm-based conversational recommendation systems. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2286– 2290, 2024

  31. [31]

    Align 3gr: Unified multi-level alignment for llm-based generative recommendation

    Wencai Ye, Mingjie Sun, Shuhang Chen, Wenjin Wu, and Peng Jiang. Align 3gr: Unified multi-level alignment for llm-based generative recommendation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 16154–16162, 2026

  32. [32]

    Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference.Advances in Neural Information Processing Systems, 37:12581–12611, 2024

    Jiabao Ji, Yujian Liu, Yang Zhang, Gaowen Liu, Ramana R Kompella, Sijia Liu, and Shiyu Chang. Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference.Advances in Neural Information Processing Systems, 37:12581–12611, 2024

  33. [33]

    Learn more, but bother less: parameter efficient continual learning.Advances in Neural Information Processing Systems, 37:97476–97498, 2024

    Fuli Qiao and Mehrdad Mahdavi. Learn more, but bother less: parameter efficient continual learning.Advances in Neural Information Processing Systems, 37:97476–97498, 2024

  34. [34]

    Gere: Towards efficient anti-forgetting in continual learning of llm via general samples replay.arXiv preprint arXiv:2508.04676, 2025

    Yunan Zhang, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, and Qingcai Chen. Gere: Towards efficient anti-forgetting in continual learning of llm via general samples replay.arXiv preprint arXiv:2508.04676, 2025

  35. [35]

    FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning

    Yujie Feng, Hao Wang, Jian Li, Xu Chu, Zhaolu Kang, Yiran Liu, Yasha Wang, Philip S Yu, and Xiao-Ming Wu. Forever: Forgetting curve-inspired memory replay for language model continual learning.arXiv preprint arXiv:2601.03938, 2026

  36. [36]

    Controlled low- rank adaptation with subspace regularization for continued training on large language models

    Yuheng Lu, Bingshuo Qian, Caixia Yuan, Huixing Jiang, and Xiaojie Wang. Controlled low- rank adaptation with subspace regularization for continued training on large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19165–19181, 2025

  37. [37]

    Magmax: Leveraging model merging for seamless continual learning

    Daniel Marczak, Bartłomiej Twardowski, Tomasz Trzciński, and Sebastian Cygert. Magmax: Leveraging model merging for seamless continual learning. In European Conference on Computer Vision, pages 379–395. Springer, 2024

  38. [38]

    MergePipe: A budget-aware parameter management system for scalable LLM merging.arXiv preprint arXiv:2602.13273, 2026

    Yuanyi Wang, Yanggan Gu, Zihao Wang, Kunxi Li, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, and Hongxia Yang. Mergepipe: A budget-aware parameter management system for scalable llm merging.arXiv preprint arXiv:2602.13273, 2026

  39. [39]

    Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys, 58(8):1–41, 2026

    Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys, 58(8):1–41, 2026

  40. [40]

    Democratizing ai through model fusion: A comprehensive review and future directions.Nexus, 2025

    Qi Zhou, Yiming Zhang, Yanggan Gu, Yuanyi Wang, Zhijie Sang, Zhaoyi Yan, Zhen Li, Shengyu Zhang, Fei Wu, and Hongxia Yang. Democratizing ai through model fusion: A comprehensive review and future directions.Nexus, 2025

  41. [41]

    Became: Bayesian continual learning with adaptive model merging

    Mei Li, Yuxiang Lu, Qinyan Dai, Suizhi Huang, Yue Ding, and Hongtao Lu. Became: Bayesian continual learning with adaptive model merging. InForty-second International Conference on Machine Learning

  42. [42]

    Mergeslide: Continual model merging and task-to-class prompt-aligned inference for lifelong learning on whole slide images

    Doanh C Bui, Ba Hung Ngo, Hoai Luan Pham, Khang Nguyen, Maï K Nguyen, and Yasuhiko Nakashima. Mergeslide: Continual model merging and task-to-class prompt-aligned inference for lifelong learning on whole slide images. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4859–4868, 2026

  43. [43]

    Model fusion for scalable and sustainable artificial intelligence: A review and outlook.Journal of Modern Power Systems and Clean Energy, 14(1):37–49, 2026

    Qi Zhou, Yiming Zhang, Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Zhen Li, Chi Yung Chung, and Hongxia Yang. Model fusion for scalable and sustainable artificial intelligence: A review and outlook.Journal of Modern Power Systems and Clean Energy, 14(1):37–49, 2026

  44. [44]

    Merging on the fly without retraining: A sequential approach to scalable continual model merging

    Anke Tang, Enneng Yang, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Bo Du, and Dacheng Tao. Merging on the fly without retraining: A sequential approach to scalable continual model merging. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  45. [45]

    Mingle: Mixture of null-space gated low-rank experts for test-time continual model merging

    Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, and Hongliang Li. Mingle: Mixture of null-space gated low-rank experts for test-time continual model merging. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  46. [46]

    Null-space filtering for data-free continual model merging: Preserving transparency, promoting fidelity.arXiv preprint arXiv:2509.21413, 2025

    Zihuan Qiu, Lei Wang, Yang Cao, Runtong Zhang, Bing Su, Yi Xu, Fanman Meng, Linfeng Xu, Qingbo Wu, and Hongliang Li. Null-space filtering for data-free continual model merging: Preserving transparency, promoting fidelity.arXiv preprint arXiv:2509.21413, 2025

  47. [47]

    K-merge: Online continual merging of adapters for on-device large language models.arXiv preprint arXiv:2510.13537, 2025

    Donald Shenaj, Ondrej Bohdal, Taha Ceritli, Mete Ozay, Pietro Zanuttigh, and Umberto Michieli. K-merge: Online continual merging of adapters for on-device large language models.arXiv preprint arXiv:2510.13537, 2025

  48. [48]

    Toward a holistic approach to continual model merging.arXiv preprint arXiv:2509.23592, 2025

    Hoang Phan, Sungmin Cha, Tung Lam Tran, and Qi Lei. Toward a holistic approach to continual model merging.arXiv preprint arXiv:2509.23592, 2025

  49. [49]

    From coefficients to directions: Rethinking model merging with directional alignment.arXiv preprint arXiv:2512.00391, 2025

    Zhikang Chen, Sen Cui, Deheng Ye, Min Zhang, Gang Niu, Yu Zhang, Masashi Sugiyama, and Tingting Zhu. From coefficients to directions: Rethinking model merging with directional alignment.arXiv preprint arXiv:2512.00391, 2025

  50. [50]

    Modeling multi-task model merging as adaptive projective gradient descent

    Yongxian Wei, Anke Tang, Li Shen, Zixuan Hu, Chun Yuan, and Xiaochun Cao. Modeling multi-task model merging as adaptive projective gradient descent. InInternational Conference on Machine Learning, pages 66178–66193. PMLR, 2025

  51. [51]

    Merging by matching models in task parameter subspaces.Transactions on Machine Learning Research

    Derek Tam, Mohit Bansal, and Colin Raffel. Merging by matching models in task parameter subspaces.Transactions on Machine Learning Research

  52. [52]

    A fixed-point approach to barycenters in wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744–762, 2016

    Pedro C Álvarez-Esteban, E Del Barrio, JA Cuesta-Albertos, and C Matrán. A fixed-point approach to barycenters in wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744–762, 2016

  53. [53]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  54. [54]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  55. [55]

    Whoever started the interference should end it: Guiding data-free model merging via task vectors

    Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors. InInternational Conference on Machine Learning, pages 10121–10143. PMLR, 2025

  56. [56]

    Mmlu-pro-cot-train-labeled

    UW-Madison-Lee-Lab. Mmlu-pro-cot-train-labeled. https://huggingface.co/datasets/UW-Madison-Lee-Lab/MMLU-Pro-CoT-Train-Labeled, 2025. Hugging Face dataset

  57. [57]

    Nemotron-post-training-dataset-v1

    NVIDIA. Nemotron-post-training-dataset-v1. https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1, 2025. Hugging Face dataset

  58. [58]

    Opencodeinterpreter: Integrating code generation with execution and refinement

    Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. Opencodeinterpreter: Integrating code generation with execution and refinement. InFindings of the Association for Computational Linguistics: ACL 2024, pages 12834–12859, 2024

  59. [59]

    Localize-and-stitch: Efficient model merging via sparse task arithmetic.Transactions on Machine Learning Research, 2024, 2024

    Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, and Han Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic.Transactions on Machine Learning Research, 2024, 2024

  60. [60]

    Ties-merging: Resolving interference when merging models.Advances in neural information processing systems, 36:7093–7115, 2023

    Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models.Advances in neural information processing systems, 36:7093–7115, 2023

  61. [61]

    Language models are super mario: Absorbing abilities from homologous models as a free lunch

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024

  62. [62]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  63. [63]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  64. [64]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  65. [65]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  66. [66]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  67. [67]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling

  68. [68]

    Infigfusion: Graph-on-logits distillation via efficient gromov-wasserstein for model fusion

    Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, and Hongxia Yang. Infigfusion: Graph-on-logits distillation via efficient gromov-wasserstein for model fusion. arXiv preprint arXiv:2505.13893, 2025

  69. [69]

    Infifpo: Implicit model fusion via preference optimization in large language models.arXiv preprint arXiv:2505.13878, 2025

    Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, and Hongxia Yang. Infifpo: Implicit model fusion via preference optimization in large language models.arXiv preprint arXiv:2505.13878, 2025

  70. [70]

    InfiCoEvalChain: A blockchain-based decentralized framework for collaborative LLM evaluation.arXiv preprint arXiv:2602.08229, 2026

    Yifan Yang, Jinjia Li, Kunxi Li, Puhao Zheng, Yuanyi Wang, Zheyan Qu, Yang Yu, Jianmin Wu, Ming Li, and Hongxia Yang. Inficoevalchain: A blockchain-based decentralized framework for collaborative llm evaluation. arXiv preprint arXiv:2602.08229, 2026
