pith. machine review for the scientific record.

arxiv: 2605.09608 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.IT · math.IT

Recognition: no theorem link

Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

Yuanyi Wang, Yifan Yang, Su Lu, Yanggan Gu, Pengkai Wang, Wenjun Wang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Jialun Cao, Shing-Chi Cheung, Hongxia Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT
keywords continual post-training · catastrophic forgetting · LLM parameter updates · covariance geometry · model state alignment · update integration · geometry conflict · data-free merging

The pith

Forgetting during sequential LLM updates occurs when the covariance geometry of a new task misaligns with the current model state geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why some updates to large language models during continual post-training allow new capabilities to build on old ones while others erase prior knowledge. It frames forgetting as an integration failure driven by geometric mismatch: each task induces a covariance structure in its parameter changes, and this structure either fits the geometry already present in the model or clashes with it. When the new update stays compatible with the state left by earlier tasks, knowledge transfers; when conflict rises, interference grows. The authors use this view to build a control mechanism that decides how to merge updates without access to previous training data. This matters because it turns an opaque problem of forgetting into a measurable property of update geometries that can guide integration decisions.

Core claim

Forgetting can be considered a state-relative update-integration failure: it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high.

What carries the argument

The covariance geometry induced by a task's parameter update, which acts as a descriptor of whether the update will integrate compatibly with the geometry of the current model state.
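
To make this descriptor concrete, here is a minimal NumPy sketch of one plausible instantiation: treat a layer's update matrix as samples, form its covariance, and score conflict as a normalized Bures-Wasserstein discrepancy against a running state covariance. The function names, the normalization, and the choice of discrepancy are illustrative assumptions, not the paper's exact definitions (the paper defines geometry conflict in its Sec. 2.2).

```python
import numpy as np

def sqrtm_psd(A, eps=1e-10):
    # Symmetric PSD matrix square root via eigendecomposition.
    w, V = np.linalg.eigh(A)
    w = np.clip(w, eps, None)
    return (V * np.sqrt(w)) @ V.T

def update_covariance(delta_W):
    # Treat the rows of a layer's update matrix as samples and take their
    # (uncentered) second-moment matrix as the task's geometry descriptor.
    return delta_W.T @ delta_W / delta_W.shape[0]

def bures_wasserstein_sq(C1, C2):
    # Squared 2-Wasserstein distance between zero-mean Gaussians N(0, C1), N(0, C2).
    s1 = sqrtm_psd(C1)
    cross = sqrtm_psd(s1 @ C2 @ s1)
    return np.trace(C1) + np.trace(C2) - 2.0 * np.trace(cross)

def geometry_conflict(delta_W_new, state_cov):
    # Normalized discrepancy between the new update's covariance and the
    # covariance geometry accumulated in the model state so far.
    C_new = update_covariance(delta_W_new)
    d2 = bures_wasserstein_sq(C_new, state_cov)
    return d2 / (np.trace(C_new) + np.trace(state_cov) + 1e-12)

# Hypothetical usage with random layer updates (shapes are placeholders):
rng = np.random.default_rng(0)
state_cov = update_covariance(rng.normal(size=(256, 64)))
conflict = geometry_conflict(rng.normal(size=(256, 64)), state_cov)
```

On a sequence of tasks, state_cov could be maintained as a running weighted combination of the covariances of updates already integrated, which is one way to read "the geometry of the evolving model state".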

If this is right

  • Sequential updates transfer when their induced covariance geometry stays compatible with the model state shaped by earlier updates.
  • Interference and forgetting increase when state-relative geometry conflict becomes high.
  • Geometry conflict can serve as a control signal to gate how updates are integrated or corrected (a merge-time sketch follows this list).
  • A geometry-aware merging procedure improves retention and final performance on domain-continual and capability-continual tasks without replay data.
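
How the gating bullet could look at merge time, as a minimal sketch: a scalar conflict score (like the one above) selects between full integration and an attenuated correction. The threshold and shrink factor are made-up illustrative hyperparameters; GCWM's actual integration rule applies a geometry-aware correction in a shared Wasserstein metric rather than simple shrinkage.

```python
import numpy as np

def conflict_gated_merge(theta_state, delta_new, conflict, threshold=0.5, shrink=0.25):
    """Integrate a new task update into the current model state.

    theta_state, delta_new: dicts mapping parameter names to arrays.
    Low-conflict updates are added at full strength; high-conflict updates
    are attenuated. Both hyperparameters are assumptions for illustration.
    """
    scale = 1.0 if conflict < threshold else shrink
    return {name: theta_state[name] + scale * delta_new[name] for name in theta_state}
```

A per-layer variant would compute conflict tensor by tensor and gate each tensor separately, which is closer in spirit to the layer-level diagnostics in Figures 17 and 18.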

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If geometry conflict is the decisive factor, then task ordering could be chosen in advance by computing pairwise geometry alignments to reduce expected forgetting (sketched after this list).
  • The same compatibility diagnostic might apply to other continual adaptation settings such as instruction tuning or multi-domain fine-tuning.
  • Preprocessing updates to reduce their geometry conflict before merging could offer an additional lever for preserving capabilities.
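
A sketch of how the first extension might be operationalized, assuming a precomputed symmetric matrix of pairwise geometry conflict between task updates. The greedy criterion and the starting task are arbitrary illustrative choices; the paper itself does not propose a task-ordering procedure.

```python
import numpy as np

def greedy_low_conflict_order(conflict_matrix):
    """Order tasks so each newly added task has low conflict with those already chosen.

    conflict_matrix: symmetric (n_tasks, n_tasks) array of pairwise geometry conflict.
    Returns a list of task indices. Purely a heuristic built on the paper's diagnostic.
    """
    C = np.asarray(conflict_matrix, dtype=float)
    n = C.shape[0]
    # Start from the task with the lowest total conflict against all others.
    order = [int(np.argmin(C.sum(axis=1)))]
    remaining = set(range(n)) - set(order)
    while remaining:
        # Pick the remaining task whose mean conflict with the chosen prefix is smallest.
        nxt = min(remaining, key=lambda j: C[j, order].mean())
        order.append(int(nxt))
        remaining.remove(nxt)
    return order
```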

Load-bearing premise

The covariance geometry of a task's parameter update is a sufficient and stable descriptor of whether that task will integrate compatibly with the current model state.

What would settle it

Measuring geometry conflict on a sequence of tasks and finding no reliable correlation between higher conflict values and greater forgetting rates would falsify the central claim.
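
This falsification test is straightforward to run once per-step conflict values and forgetting measurements are logged. A minimal sketch, assuming Spearman rank correlation as the association measure (the |ρs| statistic reported in Figure 2):

```python
from scipy.stats import spearmanr

def conflict_forgetting_association(conflict_per_step, forgetting_per_step):
    # Rank correlation between measured geometry conflict and the drop in
    # old-task performance at each continual step. Consistently near-zero or
    # negative correlations across methods and scales would undercut the claim.
    rho, p_value = spearmanr(conflict_per_step, forgetting_per_step)
    return rho, p_value
```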

Figures

Figures reproduced from arXiv: 2605.09608 by Congkai Xie, Hongxia Yang, Jialun Cao, Jianmin Wu, Pengkai Wang, Shing-Chi Cheung, Su Lu, Wenjun Wang, Yanggan Gu, Yifan Yang, Yuanyi Wang, Zhaoyi Yan.

Figure 1. State-relative geometry tracks forgetting across continual steps and scales.
Figure 2. Global and method-level associations. Top: global |ρs|. Bottom: signed method-level ρs; FVR denotes FOREVER. Subspace overlap is a natural compatibility proxy: if two updates act on similar directions, they may be easier to integrate. We therefore compare SAR with geometry conflict (Sec. 2.2).
Figure 3. Pairwise compatibility and conflict complementarity. (a)–(c) SAR and geometry conflict stratify task-pair transfer regimes, while pairwise conflict alone weakly predicts forgetting. GC-drop is the signed association with the immediate old-task delta; GC-forget measures degradation from each old task's best prior score. (d)–(f) reveal complementary failure modes: top-layer share is the fraction of top-ranked…
Figure 4. GCWM ablation on MMLU-Pro. We ablate two merge-time components of GCWM: the conflict gate and the shared Wasserstein metric. All variants use the same Qwen3-0.6B domain-continual task experts and evaluation protocol, differing only in the integration rule. The w/o gate variant removes conflict-conditioned gating and applies the geometry-aware branch uniformly, while w/o Wasserstein barycenter replaces…
Figure 5. Statistical confidence for Sec. 3. Error bars show run-cluster bootstrap 95% confidence…
Figure 6. Drift and geometry-discrepancy signals versus forgetting. The four panels are arranged…
Figure 7. Pairwise subspace and geometry compatibility. The left panel compares SAR with geometry…
Figure 8. Step-level correlation summary by scale and method. Norm, active-pair conflict, state gap,…
Figure 9. Pairwise correlation summary by method and scale. SAR-GC measures the relation between…
Figure 10. Full state-relative geometry diagnostic. This view expands the state-relative analysis…
Figure 11. Step-level continual post-training dynamics by method. We show downstream retention…
Figure 12. Method-level correlation heatmaps. We compare update norm, SAR, geometry conflict,…
Figure 13. Global step-level explanation heatmap. Each cell reports the Spearman correlation between…
Figure 14. Task-pair selective forgetting by method. Each heatmap reports the old-task score change…
Figure 15. Task-pair geometry conflict by method. Pairwise geometry conflict reveals compatibility…
Figure 16. Most harmful task transitions. We visualize the largest old-task drops after introducing…
Figure 17. Family-level mechanism profile. Geometry conflict and gradient conflict emphasize…
Figure 18. Method-wise top-layer family distribution. We decompose top geometry-conflict layers…
Figure 19. Final performance across model scales and continual post-training methods.
Figure 20. Capability ablation breakdown. Full GCWM is compared with variants that remove…
Figure 21. GCWM merge-time runtime and memory profiling. Runtime is decomposed by major…
Figure 22. GCWM hyperparameter sensitivity on Qwen3-8B. Each sweep changes one parameter…
original abstract

Continual post-training aims to extend large language models (LLMs) with new knowledge, skills, and behaviors, yet it remains unclear when sequential updates enable capability transfer and when they cause catastrophic forgetting. Existing methods mitigate forgetting through sequential fine-tuning, replay, regularization, or model merging, but offer limited criteria for determining when incorporating new updates is beneficial or harmful. In this work, we study LLM continual post-training through three questions: What drives forgetting? When do sequentially acquired capabilities transfer or interfere? How can compatibility be used to control update integration? We address these questions through task geometry: we represent each post-training task by its parameter update and study the covariance geometry induced by the update. Our central finding is that: forgetting can be considered as a state-relative update-integration failure, it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high. Motivated by this finding, we propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free update-integration method that constructs a shared Wasserstein metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate geometry-aware correction. Across Qwen3 0.6B--14B on domain-continual and capability-continual settings, GCWM consistently outperforms data-free baselines, improving retention and final performance without replay data. These results identify geometry conflict as both an explanatory signal for forgetting and a practical control signal for LLM continual post-training.
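
The abstract's "shared Wasserstein metric via Gaussian Wasserstein barycenters" most plausibly builds on the fixed-point barycenter construction of Álvarez-Esteban et al. (reference 52 in the list below). The sketch that follows implements only that barycenter step for zero-mean Gaussians, i.e. for a set of SPD covariance matrices; how GCWM derives a correction from the barycenter and gates it by geometry conflict is not reproduced here.

```python
import numpy as np

def sqrtm_psd(A, eps=1e-10):
    # Symmetric PSD square root via eigendecomposition.
    w, V = np.linalg.eigh(A)
    w = np.clip(w, eps, None)
    return (V * np.sqrt(w)) @ V.T

def gaussian_wasserstein_barycenter(covs, weights=None, n_iter=50, tol=1e-8):
    """Bures-Wasserstein barycenter of zero-mean Gaussians with covariances `covs`.

    Fixed-point iteration of Alvarez-Esteban et al. (2016):
        S <- S^{-1/2} ( sum_i w_i (S^{1/2} C_i S^{1/2})^{1/2} )^2 S^{-1/2}
    """
    covs = [np.asarray(C, dtype=float) for C in covs]
    k = len(covs)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    S = sum(wi * Ci for wi, Ci in zip(w, covs))  # start from the Euclidean mean
    for _ in range(n_iter):
        S_half = sqrtm_psd(S)
        S_half_inv = np.linalg.inv(S_half)
        M = sum(wi * sqrtm_psd(S_half @ Ci @ S_half) for wi, Ci in zip(w, covs))
        S_new = S_half_inv @ M @ M @ S_half_inv
        if np.linalg.norm(S_new - S) <= tol * np.linalg.norm(S):
            S = S_new
            break
        S = S_new
    return S
```

Under this reading, the barycenter of the task-update covariances would supply the shared metric in which per-task corrections are expressed; that interpretation is an assumption of this sketch, not a statement of the authors' implementation.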

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that forgetting in LLM continual post-training arises as a state-relative update-integration failure when covariance geometries induced by task parameter updates misalign with the geometry of the evolving model state. Sequential updates transfer when compatible with the prior state and interfere at high geometry conflict. The authors propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free method using Gaussian Wasserstein barycenters to build a shared metric and gate geometry-aware corrections. On Qwen3 models (0.6B–14B) in domain-continual and capability-continual settings, GCWM outperforms data-free baselines in retention and final performance.

Significance. If the geometry-conflict interpretation is shown to be load-bearing rather than a proxy for simpler signals, the work would supply both an explanatory account of when updates integrate or interfere and a practical data-free control mechanism. The direct linkage of analysis to the GCWM algorithm and the use of optimal-transport barycenters are technical strengths that could influence continual-learning research beyond replay or regularization heuristics.

major comments (3)
  1. [Abstract] Abstract (central finding paragraph): the claim that covariance geometry is a sufficient and stable descriptor of state compatibility is presented as following from the analysis, yet no evidence is given that geometry conflict adds explanatory power beyond proxies such as ||Δθ|| or task embedding similarity. Ablation experiments that isolate the geometry term are required to substantiate the sufficiency assumption.
  2. [Method] GCWM construction (method section): the method relies on Gaussian approximations and Wasserstein barycenters whose covariance parameters are estimated from the same updates used to measure conflict; the manuscript must show that these parameters are independent of the evaluation data or provide a derivation demonstrating that the circularity does not affect the reported gains.
  3. [Experiments] Experimental results (tables/figures): no description of covariance estimation procedure, run-to-run variance, or statistical significance tests is supplied, so it is impossible to determine whether the observed improvements over baselines reliably support the geometry-conflict explanation rather than implementation details.
minor comments (1)
  1. [Abstract] The abstract packs the central claim, method, and results into a single dense paragraph; splitting the finding into two sentences would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of our geometry-conflict analysis and the GCWM method.

point-by-point responses
  1. Referee: [Abstract] Abstract (central finding paragraph): the claim that covariance geometry is a sufficient and stable descriptor of state compatibility is presented as following from the analysis, yet no evidence is given that geometry conflict adds explanatory power beyond proxies such as ||Δθ|| or task embedding similarity. Ablation experiments that isolate the geometry term are required to substantiate the sufficiency assumption.

    Authors: We agree that additional evidence is needed to demonstrate that geometry conflict provides explanatory power beyond simpler proxies. In the revised manuscript we will add ablation experiments that directly compare geometry conflict against ||Δθ|| and task-embedding cosine similarity as predictors of forgetting and interference. These ablations will quantify the incremental predictive value of the geometry term on held-out validation sets. We will also revise the abstract to state the findings more precisely as supported by both the geometric analysis and the new ablations. revision: yes

  2. Referee: [Method] GCWM construction (method section): the method relies on Gaussian approximations and Wasserstein barycenters whose covariance parameters are estimated from the same updates used to measure conflict; the manuscript must show that these parameters are independent of the evaluation data or provide a derivation demonstrating that the circularity does not affect the reported gains.

    Authors: We will expand the method section to clarify that covariance matrices are estimated solely from the task-specific parameter updates (via sample covariance of the delta vectors or mini-batch gradients collected during fine-tuning). These statistics are computed before any evaluation on downstream tasks and do not incorporate test or validation data. We will include a short derivation showing that the Wasserstein barycenter construction uses only these pre-computed update covariances to define the shared metric, ensuring the conflict measurement and subsequent merging step remain independent of the reported performance metrics. revision: yes

  3. Referee: [Experiments] Experimental results (tables/figures): no description of covariance estimation procedure, run-to-run variance, or statistical significance tests is supplied, so it is impossible to determine whether the observed improvements over baselines reliably support the geometry-conflict explanation rather than implementation details.

    Authors: We acknowledge the omission of these experimental details. In the revised manuscript we will add: (i) a precise description of the covariance estimation procedure (sample covariance over update vectors with explicit batch size and regularization), (ii) mean and standard deviation of all metrics over at least three independent runs with different random seeds, and (iii) statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) comparing GCWM against each baseline. Updated tables and figures will report these statistics. revision: yes
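
The significance protocol promised in point (iii) is standard; a minimal SciPy sketch, with the choice of tests taken from the rebuttal text and everything else (pairing of runs, array shapes) assumed:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def compare_to_baseline(scores_gcwm, scores_baseline):
    # Paired comparisons over matched cells (same seed, task sequence, benchmark).
    gcwm = np.asarray(scores_gcwm, dtype=float)
    base = np.asarray(scores_baseline, dtype=float)
    t_stat, t_p = ttest_rel(gcwm, base)
    w_stat, w_p = wilcoxon(gcwm, base)
    return {"mean_gain": float(np.mean(gcwm - base)),
            "paired_t_p": float(t_p), "wilcoxon_p": float(w_p)}
```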

Circularity Check

0 steps flagged

No significant circularity; the central claim is an interpretive observation from the geometry analysis, not reduced by construction to its inputs.

full rationale

The paper's derivation chain begins with representing tasks via parameter updates and analyzing induced covariance geometries, leading to the interpretive claim that forgetting arises from state-relative misalignment. This is presented as a finding from the geometry study rather than a mathematical derivation or fitted prediction. GCWM is motivated by the finding and applies Gaussian Wasserstein barycenters with geometry conflict gating, but the abstract and description provide no equations showing that conflict metrics or barycenter parameters are fitted directly to forgetting outcomes or reduce the explanatory claim to the input updates by definition. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are evident in the provided text. The analysis remains self-contained as an empirical geometry-based interpretation without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that covariance geometry of parameter updates is a faithful proxy for task compatibility; no free parameters are named in the abstract, but the Wasserstein barycenter construction implicitly introduces modeling choices.

axioms (1)
  • domain assumption: Covariance geometry induced by a task's parameter update reflects the compatibility of that task with the current model state
    Invoked directly in the central finding sentence of the abstract.
invented entities (1)
  • Geometry conflict signal (no independent evidence)
    purpose: Quantitative measure of misalignment between task covariance geometry and evolving model state used both to explain forgetting and to gate merging
    Newly introduced explanatory and control quantity; no independent falsifiable handle outside the paper is described in the abstract.

pith-pipeline@v0.9.0 · 5629 in / 1497 out tokens · 56330 ms · 2026-05-12T02:28:17.884464+00:00 · methodology


Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 7 internal anchors

  1. [1]

    Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

    Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

  2. [2]

    arXiv preprint arXiv:2502.21321

    Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip HS Torr, Fahad Shahbaz Khan, and Salman Khan. Llm post-training: A deep dive into reasoning large language models.arXiv preprint arXiv:2502.21321, 2025

  3. [3]

    Demystifying domain-adaptive post-training for financial llms

    Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, and Shafiq Joty. Demystifying domain-adaptive post-training for financial llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 31021–31047, 2025

  4. [4]

    Redone: Revealing domain-specific llm post-training in social networking services

    Fei Zhao, Chonggang Lu, Zheyong Xie, Ziyan Liu, Haofu Qian, Jianzhao Huang, Fangcheng Shi, Zijie Meng, Hongcheng Guo, Mingqian He, et al. Redone: Revealing domain-specific llm post-training in social networking services. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2648–2674, 2025

  5. [5]

    Synthesizing post-training data for llms through multi-agent simulation

    Shuo Tang, Xianghe Pang, Zexi Liu, Bohan Tang, Rui Ye, Tian Jin, Xiaowen Dong, Yanfeng Wang, and Siheng Chen. Synthesizing post-training data for llms through multi-agent simulation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23306–23335, 2025

  6. [6]

    Lamdagent: An autonomous framework for post-training pipeline optimization via llm agents

    Taro Yano, Yoichi Ishibashi, and Masafumi Oyamada. Lamdagent: An autonomous framework for post-training pipeline optimization via llm agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30066–30083, 2025

  7. [7]

    Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

    Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, et al. Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning.arXiv preprint arXiv:2509.25300, 2025

  8. [8]

    How post-training reshapes llms: A mechanistic view on knowledge, truthfulness, refusal, and confidence

    Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Yizhou Sun, Himabindu Lakkaraju, and Shichang Zhang. How post-training reshapes llms: A mechanistic view on knowledge, truthfulness, refusal, and confidence. InThe First Workshop on the Application of LLM Explainability to Reasoning and Planning, 2025

  9. [9]

    Continual learning and catastrophic forgetting.arXiv preprint arXiv:2403.05175, 2024

    Gido M Van de Ven, Nicholas Soures, and Dhireesha Kudithipudi. Continual learning and catastrophic forgetting.arXiv preprint arXiv:2403.05175, 2024

  10. [10]

    Overcoming catastrophic forgetting in neural networks.arXiv preprint arXiv:2507.10485, 2025

    Brandon Shuen Yi Loke, Filippo Quadri, Gabriel Vivanco, Maximilian Casagrande, and Saúl Fenollosa. Overcoming catastrophic forgetting in neural networks.arXiv preprint arXiv:2507.10485, 2025

  11. [11]

    Continual training of language models for few-shot learning

    Zixuan Ke, Haowei Lin, Yijia Shao, Hu Xu, Lei Shu, and Bing Liu. Continual training of language models for few-shot learning. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10205–10216, 2022

  12. [12]

    See: Continual fine-tuning with sequential ensemble of experts

    Zhilin Wang, Yafu Li, Xiaoye Qu, and Yu Cheng. See: Continual fine-tuning with sequential ensemble of experts. InFindings of the Association for Computational Linguistics: ACL 2025, pages 7418–7432, 2025

  13. [13]

    Scalable strategies for continual learning with replay.arXiv preprint arXiv:2505.12512, 2025

    Truman Hickok. Scalable strategies for continual learning with replay.arXiv preprint arXiv:2505.12512, 2025

  14. [14]

    Experience replay for continual learning. Advances in neural information processing systems, 32, 2019

    David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in neural information processing systems, 32, 2019

  15. [15]

    Uncertainty-based continual learning with adaptive regularization.Advances in neural information processing systems, 32, 2019

    Hongjoon Ahn, Sungmin Cha, Donggyu Lee, and Taesup Moon. Uncertainty-based continual learning with adaptive regularization. Advances in neural information processing systems, 32, 2019

  16. [16]

    Efficient continual learning in neural networks with embedding regularization. Neurocomputing, 397:139–148, 2020

    Jary Pomponi, Simone Scardapane, Vincenzo Lomonaco, and Aurelio Uncini. Efficient continual learning in neural networks with embedding regularization. Neurocomputing, 397:139–148, 2020

  17. [17]

    Aimmerging: Adaptive iterative model merging using training trajectories for language model continual learning

    Yujie Feng, Jian Li, Xiaoyu Dong, Pengfei Xu, Xiaohui Zhou, Yujia Zhang, Zexin Lu, Yasha Wang, Alan Zhao, Xu Chu, et al. Aimmerging: Adaptive iterative model merging using training trajectories for language model continual learning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13431–13448, 2025

  18. [18]

    Merge then realign: Simple and effective modality-incremental continual learning for multimodal llms

    Dingkun Zhang, Shuhan Qi, Xinyu Xiao, Kehai Chen, and Xuan Wang. Merge then realign: Simple and effective modality-incremental continual learning for multimodal llms. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13159–13175, 2025

  19. [19]

    Model Merging Scaling Laws in Large Language Models

    Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, and Hongxia Yang. Model merging scaling laws in large language models.arXiv preprint arXiv:2509.24244, 2025

  20. [20]

    On the bures–wasserstein distance between positive definite matrices.Expositiones mathematicae, 37(2):165–191, 2019

    Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the bures–wasserstein distance between positive definite matrices.Expositiones mathematicae, 37(2):165–191, 2019

  21. [21]

    Task singular vectors: Reducing task interference in model merging

    Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodola. Task singular vectors: Reducing task interference in model merging. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18695–18705, 2025

  22. [22]

    Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models

    Zirui Wang and Yulia Tsvetkov. Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. InProceedings of the International Conference on Learning Representations (ICLR), 2021

  23. [23]

    Editing models with task arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations

  24. [24]

    No task left behind: Isotropic model merging with common and task-specific subspaces

    Daniel Marczak, Simone Magistri, Sebastian Cygert, Bartłomiej Twardowski, Andrew D Bagdanov, and Joost van de Weijer. No task left behind: Isotropic model merging with common and task-specific subspaces. In 39th International Conference on Machine Learning. Proceedings of Machine Learning Research (PMLR), 2025

  25. [25]

    Gradient surgery for multi-task learning.Advances in neural information processing systems, 33:5824–5836, 2020

    Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning.Advances in neural information processing systems, 33:5824–5836, 2020

  26. [26]

    Udapdr: unsupervised domain adaptation via llm prompting and distillation of rerankers

    Jon Saad-Falcon, Omar Khattab, Keshav Santhanam, Radu Florian, Martin Franz, Salim Roukos, Avirup Sil, Md Sultan, and Christopher Potts. Udapdr: unsupervised domain adaptation via llm prompting and distillation of rerankers. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 11265–11279, 2023

  27. [27]

    Exploring the effectiveness of llm domain adaptation for business it machine translation

    Johannes Eschbach-Dymanus, Frank Essenberger, Bianka Buschbeck, and Miriam Exel. Exploring the effectiveness of llm domain adaptation for business it machine translation. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), pages 610–622, 2024

  28. [28]

    Enhancing llm capabilities beyond scaling up

    Wenpeng Yin, Muhao Chen, Rui Zhang, Ben Zhou, Fei Wang, and Dan Roth. Enhancing llm capabilities beyond scaling up. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pages 1–10, 2024

  29. [29]

    Llm augmented llms: Expanding capabilities through composition

    Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, and Partha Talukdar. Llm augmented llms: Expanding capabilities through composition. In The Twelfth International Conference on Learning Representations, 2024

  30. [30]

    Behavior alignment: a new perspective of evaluating llm-based conversational recommendation systems

    Dayu Yang, Fumian Chen, and Hui Fang. Behavior alignment: a new perspective of evaluating llm-based conversational recommendation systems. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2286– 2290, 2024

  31. [31]

    Align 3gr: Unified multi-level alignment for llm-based generative recommendation

    Wencai Ye, Mingjie Sun, Shuhang Chen, Wenjin Wu, and Peng Jiang. Align 3gr: Unified multi-level alignment for llm-based generative recommendation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 16154–16162, 2026

  32. [32]

    Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference.Advances in Neural Information Processing Systems, 37:12581–12611, 2024

    Jiabao Ji, Yujian Liu, Yang Zhang, Gaowen Liu, Ramana R Kompella, Sijia Liu, and Shiyu Chang. Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference.Advances in Neural Information Processing Systems, 37:12581–12611, 2024

  33. [33]

    Learn more, but bother less: parameter efficient continual learning.Advances in Neural Information Processing Systems, 37:97476–97498, 2024

    Fuli Qiao and Mehrdad Mahdavi. Learn more, but bother less: parameter efficient continual learning.Advances in Neural Information Processing Systems, 37:97476–97498, 2024

  34. [34]

    Gere: Towards efficient anti-forgetting in continual learning of llm via general samples replay.arXiv preprint arXiv:2508.04676, 2025

    Yunan Zhang, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, and Qingcai Chen. Gere: Towards efficient anti-forgetting in continual learning of llm via general samples replay.arXiv preprint arXiv:2508.04676, 2025

  35. [35]

    FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning

    Yujie Feng, Hao Wang, Jian Li, Xu Chu, Zhaolu Kang, Yiran Liu, Yasha Wang, Philip S Yu, and Xiao-Ming Wu. Forever: Forgetting curve-inspired memory replay for language model continual learning.arXiv preprint arXiv:2601.03938, 2026

  36. [36]

    Controlled low- rank adaptation with subspace regularization for continued training on large language models

    Yuheng Lu, Bingshuo Qian, Caixia Yuan, Huixing Jiang, and Xiaojie Wang. Controlled low- rank adaptation with subspace regularization for continued training on large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19165–19181, 2025

  37. [37]

    Magmax: Leveraging model merging for seamless continual learning

    Daniel Marczak, Bartłomiej Twardowski, Tomasz Trzciński, and Sebastian Cygert. Magmax: Leveraging model merging for seamless continual learning. In European Conference on Computer Vision, pages 379–395. Springer, 2024

  38. [38]

    MergePipe: A budget-aware parameter management system for scalable LLM merging.arXiv preprint arXiv:2602.13273, 2026

    Yuanyi Wang, Yanggan Gu, Zihao Wang, Kunxi Li, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, and Hongxia Yang. Mergepipe: A budget-aware parameter management system for scalable llm merging.arXiv preprint arXiv:2602.13273, 2026

  39. [39]

    Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys, 58(8):1–41, 2026

    Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys, 58(8):1–41, 2026

  40. [40]

    Democratizing ai through model fusion: A comprehensive review and future directions.Nexus, 2025

    Qi Zhou, Yiming Zhang, Yanggan Gu, Yuanyi Wang, Zhijie Sang, Zhaoyi Yan, Zhen Li, Shengyu Zhang, Fei Wu, and Hongxia Yang. Democratizing ai through model fusion: A comprehensive review and future directions.Nexus, 2025

  41. [41]

    Became: Bayesian continual learning with adaptive model merging

    Mei Li, Yuxiang Lu, Qinyan Dai, Suizhi Huang, Yue Ding, and Hongtao Lu. Became: Bayesian continual learning with adaptive model merging. InForty-second International Conference on Machine Learning

  42. [42]

    Mergeslide: Continual model merging and task-to-class prompt-aligned inference for lifelong learning on whole slide images

    Doanh C Bui, Ba Hung Ngo, Hoai Luan Pham, Khang Nguyen, Maï K Nguyen, and Yasuhiko Nakashima. Mergeslide: Continual model merging and task-to-class prompt-aligned inference for lifelong learning on whole slide images. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4859–4868, 2026

  43. [43]

    Model fusion for scalable and sustainable artificial intelligence: A review and outlook.Journal of Modern Power Systems and Clean Energy, 14(1):37–49, 2026

    Qi Zhou, Yiming Zhang, Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Zhen Li, Chi Yung Chung, and Hongxia Yang. Model fusion for scalable and sustainable artificial intelligence: A review and outlook.Journal of Modern Power Systems and Clean Energy, 14(1):37–49, 2026

  44. [44]

    Merging on the fly without retraining: A sequential approach to scalable continual model merging

    Anke Tang, Enneng Yang, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Bo Du, and Dacheng Tao. Merging on the fly without retraining: A sequential approach to scalable continual model merging. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  45. [45]

    Mingle: Mixture of null-space gated low-rank experts for test-time continual model merging

    Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, and Hongliang Li. Mingle: Mixture of null-space gated low-rank experts for test-time continual model merging. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  46. [46]

    Null-space filtering for data-free continual model merging: Preserving transparency, promoting fidelity.arXiv preprint arXiv:2509.21413, 2025

    Zihuan Qiu, Lei Wang, Yang Cao, Runtong Zhang, Bing Su, Yi Xu, Fanman Meng, Linfeng Xu, Qingbo Wu, and Hongliang Li. Null-space filtering for data-free continual model merging: Preserving transparency, promoting fidelity.arXiv preprint arXiv:2509.21413, 2025

  47. [47]

    K-merge: Online continual merging of adapters for on-device large language models.arXiv preprint arXiv:2510.13537, 2025

    Donald Shenaj, Ondrej Bohdal, Taha Ceritli, Mete Ozay, Pietro Zanuttigh, and Umberto Michieli. K-merge: Online continual merging of adapters for on-device large language models.arXiv preprint arXiv:2510.13537, 2025

  48. [48]

    Toward a holistic approach to continual model merging.arXiv preprint arXiv:2509.23592, 2025

    Hoang Phan, Sungmin Cha, Tung Lam Tran, and Qi Lei. Toward a holistic approach to continual model merging.arXiv preprint arXiv:2509.23592, 2025

  49. [49]

    From coefficients to directions: Rethinking model merging with directional alignment.arXiv preprint arXiv:2512.00391, 2025

    Zhikang Chen, Sen Cui, Deheng Ye, Min Zhang, Gang Niu, Yu Zhang, Masashi Sugiyama, and Tingting Zhu. From coefficients to directions: Rethinking model merging with directional alignment.arXiv preprint arXiv:2512.00391, 2025

  50. [50]

    Modeling multi-task model merging as adaptive projective gradient descent

    Yongxian Wei, Anke Tang, Li Shen, Zixuan Hu, Chun Yuan, and Xiaochun Cao. Modeling multi-task model merging as adaptive projective gradient descent. InInternational Conference on Machine Learning, pages 66178–66193. PMLR, 2025

  51. [51]

    Merging by matching models in task parameter subspaces.Transactions on Machine Learning Research

    Derek Tam, Mohit Bansal, and Colin Raffel. Merging by matching models in task parameter subspaces.Transactions on Machine Learning Research

  52. [52]

    A fixed-point approach to barycenters in wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744–762, 2016

    Pedro C Álvarez-Esteban, E Del Barrio, JA Cuesta-Albertos, and C Matrán. A fixed-point approach to barycenters in wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744–762, 2016

  53. [53]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  54. [54]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  55. [55]

    Whoever started the interference should end it: Guiding data-free model merging via task vectors

    Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors. InInternational Conference on Machine Learning, pages 10121–10143. PMLR, 2025

  56. [56]

    Mmlu-pro-cot-train-labeled

    UW-Madison-Lee-Lab. Mmlu-pro-cot-train-labeled. https://huggingface.co/datasets/UW-Madison-Lee-Lab/MMLU-Pro-CoT-Train-Labeled, 2025. Hugging Face dataset

  57. [57]

    Nemotron-post-training-dataset-v1

    NVIDIA. Nemotron-post-training-dataset-v1. https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1, 2025. Hugging Face dataset

  58. [58]

    Opencodeinterpreter: Integrating code generation with execution and refinement

    Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. Opencodeinterpreter: Integrating code generation with execution and refinement. InFindings of the Association for Computational Linguistics: ACL 2024, pages 12834–12859, 2024

  59. [59]

    Localize-and-stitch: Efficient model merging via sparse task arithmetic.Transactions on Machine Learning Research, 2024, 2024

    Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, and Han Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic.Transactions on Machine Learning Research, 2024, 2024

  60. [60]

    Ties-merging: Resolving interference when merging models.Advances in neural information processing systems, 36:7093–7115, 2023

    Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models.Advances in neural information processing systems, 36:7093–7115, 2023

  61. [61]

    Language models are super mario: Absorbing abilities from homologous models as a free lunch

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024

  62. [62]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  63. [63]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  64. [64]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  65. [65]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  66. [66]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  67. [67]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling

  68. [68]

    Infigfusion: Graph-on-logits distillation via efficient gromov-wasserstein for model fusion

    Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, and Hongxia Yang. Infigfusion: Graph-on-logits distillation via efficient gromov-wasserstein for model fusion. arXiv preprint arXiv:2505.13893, 2025

  69. [69]

    Infifpo: Implicit model fusion via preference optimization in large language models.arXiv preprint arXiv:2505.13878, 2025

    Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, and Hongxia Yang. Infifpo: Implicit model fusion via preference optimization in large language models.arXiv preprint arXiv:2505.13878, 2025

  70. [70]

    InfiCoEvalChain: A blockchain-based decentralized framework for collaborative LLM evaluation.arXiv preprint arXiv:2602.08229, 2026

    Yifan Yang, Jinjia Li, Kunxi Li, Puhao Zheng, Yuanyi Wang, Zheyan Qu, Yang Yu, Jianmin Wu, Ming Li, and Hongxia Yang. Inficoevalchain: A blockchain-based decentralized framework for collaborative llm evaluation. arXiv preprint arXiv:2602.08229, 2026
