pith. machine review for the scientific record.

arxiv: 2408.07666 · v5 · pith:S4H6UNB3new · submitted 2024-08-14 · 💻 cs.LG · cs.AI · cs.CL · cs.CV

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

Pith reviewed 2026-05-17 22:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CV
keywords model merging · large language models · multimodal models · survey · taxonomy · continual learning · multi-task learning · few-shot learning

The pith

Model merging combines trained models without new data or heavy retraining, and this survey organizes the methods into a fresh taxonomy while mapping their uses in language models and many other settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to close the gap in systematic understanding of model merging by introducing a new taxonomy that sorts existing techniques exhaustively. It then traces how these techniques appear in large language models, multimodal large language models, and more than ten other machine learning areas including continual learning and multi-task learning. A reader would care because merging offers a low-cost way to build capable systems once separate models already exist. The survey closes by naming open challenges and sketching future research paths.

Core claim

Model merging is an efficient empowerment technique that requires neither raw training data nor expensive computation. The survey proposes a new taxonomy that exhaustively classifies the methods, reviews their theories, shows applications across large language models, multimodal large language models, and more than ten machine learning subfields, and highlights remaining challenges and future directions.
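
To make the core claim concrete, here is a minimal sketch of the simplest form of merging: a coefficient-weighted average of parameters from models that share an architecture. It is an illustration only, not a method taken from the survey. The `Weights` type, the `average_weights` helper, and the toy parameter values are all hypothetical, and real checkpoints would hold tensors rather than Python lists.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def average_weights(models: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Return the coefficient-weighted average of same-shaped parameter sets."""
    if not models or len(models) != len(coeffs):
        raise ValueError("need at least one model and one coefficient per model")
    total = sum(coeffs)
    if total == 0:
        raise ValueError("coefficients must not sum to zero")
    merged: Weights = {}
    for name in models[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        merged[name] = [
            sum(c * v for c, v in zip(coeffs, col)) / total for col in columns
        ]
    return merged


# Toy example (assumed values): two "models" with identical parameter names and shapes.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = average_weights([model_a, model_b], [0.5, 0.5])
print(merged)  # {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```

Unequal coefficients turn the same helper into a linear interpolation between two checkpoints. No training data or gradient step is involved, which is the property the core claim emphasizes.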

What carries the argument

The new taxonomic approach that exhaustively classifies existing model merging methods and serves as the organizing structure for the review of techniques, applications, and open problems.

If this is right

  • Merging supports continual learning by letting models gain new abilities without erasing prior ones.
  • Multi-task learning can use merging to handle several tasks with one combined model rather than separate training runs.
  • Few-shot learning gains from merging as a route to quick adaptation using limited examples.
  • The same methods apply across more than ten machine learning subfields, indicating wide practical reach.
  • Open challenges in scaling and compatibility point to concrete next steps for theory and practice.
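
One way the multi-task bullet could play out is task arithmetic, a technique named among the search keywords in the simulated rebuttal and in one of the citing papers. The sketch below assumes the common formulation: a task vector is the fine-tuned weights minus the base weights, and the merged model adds a scaled sum of task vectors to the base. The helper names, the `scale` value, and the toy weights are hypothetical, and nothing here is taken from the survey's own text.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Parameter-wise difference: fine-tuned weights minus base weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(base: Weights, vectors: Sequence[Weights], scale: float) -> Weights:
    """Add the scaled sum of task vectors onto the base weights."""
    if not vectors:
        # Nothing to merge: return a copy of the base weights.
        return {k: list(v) for k, v in base.items()}
    merged: Weights = {}
    for k, base_vals in base.items():
        summed = [sum(vals) for vals in zip(*(v[k] for v in vectors))]
        merged[k] = [b + scale * s for b, s in zip(base_vals, summed)]
    return merged


# Toy example (assumed values): one base model and two fine-tuned variants.
base = {"w": [1.0, 1.0]}
task_a = {"w": [2.0, 1.0]}
task_b = {"w": [1.0, 3.0]}
vectors = [task_vector(base, task_a), task_vector(base, task_b)]
merged = apply_task_vectors(base, vectors, scale=0.5)
print(merged)  # {'w': [1.5, 2.0]}
```

The single combined model replaces separate training runs only to the extent that the task vectors do not interfere with each other; choosing `scale` is left open in this sketch.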

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The taxonomy could serve as a checklist for spotting which merging strategies remain under-tested in new domains.
  • Links to related ideas such as model editing may suggest hybrid methods that combine merging with other lightweight updates.
  • Applying the taxonomy to papers published after the survey would test how durable the classification remains.

Load-bearing premise

The proposed taxonomy covers every current model merging method and the reviewed literature accurately represents the state of the field without major omissions.

What would settle it

A search that turns up several well-known model merging papers that fall outside the new taxonomy or were missed in the review would show the survey is incomplete.

Original abstract

Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and more than ten machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey on model merging techniques. It claims to fill a literature gap by providing a comprehensive overview of methods and theories, introducing a new taxonomic approach that exhaustively classifies existing techniques, reviewing applications in LLMs, MLLMs, and over ten ML subfields (e.g., continual learning, multi-task learning, few-shot learning), discussing challenges, and outlining future directions. It includes a linked GitHub repository curating cited papers.

Significance. If the taxonomy proves exhaustive and the summaries of methods/theories/applications are accurate without major omissions, the survey would be a useful organizational contribution in a fast-growing area. The public, updatable GitHub list and cross-domain application coverage add practical value for researchers working on efficient model adaptation without retraining.

major comments (1)
  1. [Taxonomy section] The assertion that the proposed taxonomy 'exhaustively discusses existing model merging methods' is central to the survey's contribution but lacks explicit justification or a completeness argument (e.g., search strategy, cutoff date, or handling of edge cases like merging in non-transformer architectures). This risks undercutting the claim of systematic coverage.
minor comments (2)
  1. [Abstract] The phrase 'more than ten machine learning subfields' is vague; enumerating the subfields or providing a table of coverage would improve reader orientation.
  2. [Introduction / Conclusion] GitHub repository: confirm that the linked resource (https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications) is actively maintained and includes all cited works with DOIs or arXiv IDs for reproducibility.
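
The check requested in the second minor comment could be partly automated. The sketch below is a hypothetical helper, not part of the paper or its repository: it flags reference strings that carry neither a new-style arXiv identifier nor a DOI, using two regular expressions that are approximations rather than full validators. The two sample entries are abbreviated from works the paper cites; the format of the repository's own list is not known here, so the input is assumed to be plain strings.

```python
import re
from typing import List

# New-style arXiv identifiers look like 2403.13257, optionally with a version suffix.
ARXIV_ID = re.compile(r"\b\d{4}\.\d{4,5}(?:v\d+)?\b")
# Loose DOI pattern: "10.", a registrant code, a slash, then a non-space suffix.
DOI = re.compile(r"\b10\.\d{4,9}/\S+")


def entries_missing_ids(entries: List[str]) -> List[str]:
    """Return the entries that contain neither an arXiv ID nor a DOI."""
    return [e for e in entries if not (ARXIV_ID.search(e) or DOI.search(e))]


# Sample entries (abbreviated): one arXiv preprint, one conference paper without an ID.
entries = [
    "Arcee's MergeKit: A Toolkit for Merging Large Language Models. arXiv preprint arXiv:2403.13257 (2024)",
    "Layerwise linear mode connectivity. ICLR (2024)",
]
missing = entries_missing_ids(entries)
print(missing)  # ['Layerwise linear mode connectivity. ICLR (2024)']
```

A flagged entry is only a prompt for manual follow-up: conference papers often have an identifier that the citation string simply omits.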

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address the major comment below and will revise the manuscript to incorporate the suggested clarifications.

Point-by-point responses
  1. Referee: Taxonomy section: the assertion that the proposed taxonomy 'exhaustively discusses existing model merging methods' is central to the survey's contribution but lacks explicit justification or a completeness argument (e.g., search strategy, cutoff date, or handling of edge cases like merging in non-transformer architectures). This risks undercutting the claim of systematic coverage.

    Authors: We agree that an explicit justification for the taxonomy's coverage would strengthen the manuscript. In the revised version, we will add a dedicated paragraph (or subsection) in the Taxonomy section describing our literature review process. This will include the search strategy (keywords such as 'model merging', 'weight interpolation', 'task arithmetic' on arXiv and Google Scholar), the inclusion criteria, and a cutoff date (papers up to July 2024). For edge cases such as non-transformer architectures, we will clarify that the taxonomy is intentionally architecture-agnostic and derived from core merging operations rather than model-specific details; however, the bulk of published work applies these methods to transformers. We will briefly note any existing examples involving CNNs or other architectures and acknowledge that coverage of non-transformer cases remains limited in the current literature, marking it as an opportunity for future work.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity in this literature survey

Full rationale

This paper is a survey that reviews existing model merging techniques, proposes an organizational taxonomy, covers applications across domains, and lists future directions. No derivations, predictions, fitted parameters, or mathematical claims are present that could reduce to self-definition or self-citation. The taxonomy is an explicit organizational contribution rather than a derived result, and the GitHub repository serves as a supporting reference list without creating load-bearing circularity. The work is self-contained as a review with no internal equations or predictive steps to inspect for equivalence to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no new mathematical derivations, free parameters, axioms, or invented entities introduced by the authors.

pith-pipeline@v0.9.0 · 5512 in / 1066 out tokens · 16478 ms · 2026-05-17T22:12:26.890443+00:00 · methodology



Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  2. Differentially Private Model Merging

    cs.LG 2026-04 unverdicted novelty 7.0

    Post-processing via random selection or linear combination generates differentially private models for arbitrary privacy parameters from pre-trained models on the same dataset.

  3. Understanding and Enforcing Weight Disentanglement in Task Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    Task-Feature Specialization explains weight disentanglement in task arithmetic and leads to orthogonality, which OrthoReg enforces to enhance performance of model composition methods.

  4. From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence

    cs.SE 2026-04 conditional novelty 7.0

    Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.

  5. MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaMoE unifies domain-specialized experts into a single MoE via diversity-aware public proxy selection that approximates private data distributions for router training and expert alignment.

  6. Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem

    cs.CL 2026-04 unverdicted novelty 6.0

    Treating retention as the dominant task and using constructive gradient synthesis like SAGO allows LLM unlearning to achieve higher general performance recovery without weakening the forgetting effect.

  7. Zero-Shot Synthetic-to-Real Handwritten Text Recognition via Task Analogies

    cs.CV 2026-04 unverdicted novelty 6.0

    A method learns synthetic-to-real parameter corrections from source languages and transfers them to target languages without any real target data, improving HTR across five languages and six models.

  8. Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

    cs.CL 2026-02 unverdicted novelty 6.0

    Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.

  9. AP-BMM: Approximating Capability-Cost Pareto Sets of LLMs via Asynchronous Prior-Guided Bayesian Model Merging

    cs.LG 2025-12 conditional novelty 6.0

    AP-BMM approximates Pareto sets of layer-wise merged LLMs for accuracy-cost trade-offs via prior-guided asynchronous Bayesian optimization and reranking.

  10. TRINITY: An Evolved LLM Coordinator

    cs.LG 2025-12 unverdicted novelty 6.0

    A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.

  11. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  12. ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

    cs.CL 2026-05 unverdicted novelty 5.0

    ORBIT preserves foundational language capabilities during generative retrieval fine-tuning by using origin-regulated weight averaging to constrain parameter drift beyond a distance threshold.

  13. Black-Box Optimization of Mixed Binary-Continuous Variables: Challenges and Opportunities in Evolutionary Model Merging

    cs.NE 2026-05 unverdicted novelty 5.0

    Data flow space model merging is formalized as a mixed binary-continuous black-box optimization problem, where a structured approach respecting variable dependencies achieves 6.7% higher accuracy and 51.4% smaller sea...

  14. Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

    cs.CL 2026-04 unverdicted novelty 5.0

    Continual pre-training on a German medical corpus lets 7B models close much of the performance gap with 24B general models on medical benchmarks, though merging introduces some language mixing and verbosity.

  15. Domain-Adaptive Model Merging Across Disconnected Modes

    cs.DC 2026-03 unverdicted novelty 5.0

    DMM merges highly divergent domain-specific models without data sharing by synthesizing pseudo-data from normalization statistics and distilling knowledge, achieving state-of-the-art performance on unimodal and multim...

  16. Retrofit: Continual Learning with Controlled Forgetting for Binary Security Detection and Analysis

    cs.LG 2025-11 unverdicted novelty 5.0

    RETROFIT enables continual learning for malware detection and binary summarization by retrospective-free parameter merging with low-rank sparse updates and confidence-guided arbitration, improving retention and genera...

  17. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 17 Pith papers · 15 internal anchors
