arXiv: 2408.07666 · v5 · pith:S4H6UNB3new · submitted 2024-08-14 · cs.LG · cs.AI · cs.CL · cs.CV

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, Dacheng Tao

Pith reviewed 2026-05-17 22:12 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL · cs.CV

keywords: model merging · large language models · multimodal models · survey · taxonomy · continual learning · multi-task learning · few-shot learning


The pith

Model merging combines trained models without new data or heavy retraining, and this survey organizes the methods into a fresh taxonomy while mapping their uses in language models and many other settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to close the gap in systematic understanding of model merging by introducing a new taxonomy that sorts existing techniques exhaustively. It then traces how these techniques appear in large language models, multimodal large language models, and more than ten other machine learning areas including continual learning and multi-task learning. A reader would care because merging offers a low-cost way to build capable systems once separate models already exist. The survey closes by naming open challenges and sketching future research paths.

Core claim

Model merging is an efficient empowerment technique that requires neither raw training data nor expensive computation. The survey introduces a new taxonomic approach that exhaustively classifies the methods, reviews their theories, shows applications across large language models, multimodal large language models, and more than ten machine learning subfields, and highlights remaining challenges and future directions.

What carries the argument

The new taxonomic approach that exhaustively classifies existing model merging methods and serves as the organizing structure for the review of techniques, applications, and open problems.

If this is right

Merging supports continual learning by letting models gain new abilities without erasing prior ones.
Multi-task learning can use merging to handle several tasks with one combined model rather than separate training runs.
Few-shot learning gains from merging as a route to quick adaptation using limited examples.
The same methods apply across more than ten machine learning subfields, indicating wide practical reach.
Open challenges in scaling and compatibility point to concrete next steps for theory and practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could serve as a checklist for spotting which merging strategies remain under-tested in new domains.
Links to related ideas such as model editing may suggest hybrid methods that combine merging with other lightweight updates.
Applying the taxonomy to papers published after the survey would test how durable the classification remains.

Load-bearing premise

The proposed taxonomy covers every current model merging method and the reviewed literature accurately represents the state of the field without major omissions.

What would settle it

A search that turns up several well-known model merging papers that fall outside the new taxonomy or were missed in the review would show the survey is incomplete.

Original abstract

Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and more than ten machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications.
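To make the abstract's central idea concrete, here is a minimal sketch of what is probably the simplest merging operation: uniform parameter averaging of models that share an architecture. It is not taken from the paper. The function name `average_state_dicts` and the toy parameter dictionaries are hypothetical, and plain Python lists stand in for weight tensors.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def average_state_dicts(models: List[StateDict]) -> StateDict:
    """Uniformly average the parameters of models with identical parameter names."""
    if not models:
        raise ValueError("need at least one model")
    keys = models[0].keys()
    for m in models[1:]:
        if m.keys() != keys:
            raise ValueError("models must have identical parameter names")
    merged: StateDict = {}
    for name in keys:
        # Walk the same parameter position across all models and average it.
        columns = zip(*(m[name] for m in models))
        merged[name] = [sum(col) / len(models) for col in columns]
    return merged


# Two toy "models" with the same architecture (illustrative values only).
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = average_state_dicts([model_a, model_b])
```

No training data or gradient step is involved: the merged parameters are computed directly from the existing ones, which is the property the abstract emphasizes. The sketch assumes parameters of the same name have the same length and does not check this.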

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a useful survey that organizes model merging work into a new taxonomy and reviews applications, but it mainly compiles existing papers rather than adding new results.


This paper is a survey that collects and organizes recent work on model merging, a technique for combining trained models without new data or heavy retraining. The main point is that it offers a structured overview of methods, theories, and uses across large models and other areas, plus a GitHub repo to track the papers. That repo is a practical addition for anyone who wants to avoid digging through arXiv on their own.

The taxonomy they propose breaks down the approaches and then maps them onto LLMs, multimodal models, and subfields like continual learning, multi-task learning, and few-shot learning. The breadth helps show where merging has been tried and what gaps remain, which can save time for people working on efficient model reuse. The coverage of applications and future directions looks reasonable from the structure, and the paper stays focused on the organizational task rather than claiming new theorems or experiments.

The soft spots are the usual ones for a survey. The taxonomy is presented as exhaustive, but any such framing in a fast-moving field will need updates soon, and completeness depends on how thoroughly the citations were checked. Summaries of prior work could have small inaccuracies that only show up on close reading, though nothing in the high-level framing suggests major misclassifications. There are no new derivations or data here, just the synthesis.

This paper is for machine learning researchers and engineers who need a starting point on model merging to avoid redundant experiments, especially those working with large models where compute is limited. A reader looking for an organized entry into the literature or ideas for extensions would get value from the applications and challenges sections. It deserves a serious referee because the topic is timely and the organization fills a stated gap in the literature.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey on model merging techniques. It claims to fill a literature gap by providing a comprehensive overview of methods and theories, introducing a new taxonomic approach that exhaustively classifies existing techniques, reviewing applications in LLMs, MLLMs, and over ten ML subfields (e.g., continual learning, multi-task learning, few-shot learning), discussing challenges, and outlining future directions. It includes a linked GitHub repository curating cited papers.

Significance. If the taxonomy proves exhaustive and the summaries of methods/theories/applications are accurate without major omissions, the survey would be a useful organizational contribution in a fast-growing area. The public, updatable GitHub list and cross-domain application coverage add practical value for researchers working on efficient model adaptation without retraining.

major comments (1)

[Taxonomy section] The assertion that the proposed taxonomy 'exhaustively discusses existing model merging methods' is central to the survey's contribution but lacks explicit justification or a completeness argument (e.g., search strategy, cutoff date, or handling of edge cases like merging in non-transformer architectures). This risks undercutting the claim of systematic coverage.

minor comments (2)

[Abstract] The phrase 'more than ten machine learning subfields' is vague; enumerating the subfields or providing a table of coverage would improve reader orientation.
[Introduction / Conclusion] GitHub repository: confirm that the linked resource (https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications) is actively maintained and includes all cited works with DOIs or arXiv IDs for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address the major comment below and will revise the manuscript to incorporate the suggested clarifications.


Referee: Taxonomy section: the assertion that the proposed taxonomy 'exhaustively discusses existing model merging methods' is central to the survey's contribution but lacks explicit justification or a completeness argument (e.g., search strategy, cutoff date, or handling of edge cases like merging in non-transformer architectures). This risks undercutting the claim of systematic coverage.

Authors: We agree that an explicit justification for the taxonomy's coverage would strengthen the manuscript. In the revised version, we will add a dedicated paragraph (or subsection) in the Taxonomy section describing our literature review process. This will include the search strategy (keywords such as 'model merging', 'weight interpolation', 'task arithmetic' on arXiv and Google Scholar), the inclusion criteria, and a cutoff date (papers up to July 2024). For edge cases such as non-transformer architectures, we will clarify that the taxonomy is intentionally architecture-agnostic and derived from core merging operations rather than model-specific details; however, the bulk of published work applies these methods to transformers. We will briefly note any existing examples involving CNNs or other architectures and acknowledge that coverage of non-transformer cases remains limited in the current literature, marking it as an opportunity for future work.

Revision: yes
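For readers unfamiliar with the search terms named in the response, the sketch below illustrates two of them on toy parameter vectors. Weight interpolation is shown as a convex combination of two models. Task arithmetic is shown in its commonly cited form, where the merged model is the pretrained parameters plus a scaled sum of task vectors (each fine-tuned model minus the pretrained one). The function names, the scaling coefficient `lam`, and the example numbers are illustrative assumptions, not details from the survey or the rebuttal.

```python
from typing import List

Vector = List[float]  # toy stand-in for a flattened parameter tensor


def interpolate(theta_a: Vector, theta_b: Vector, alpha: float) -> Vector:
    """Weight interpolation: (1 - alpha) * theta_a + alpha * theta_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def task_vector(finetuned: Vector, pretrained: Vector) -> Vector:
    """Task vector: fine-tuned parameters minus pretrained parameters."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def task_arithmetic(pretrained: Vector, finetuned_models: List[Vector], lam: float) -> Vector:
    """Add the scaled task vector of every fine-tuned model to the pretrained parameters."""
    merged = list(pretrained)
    for ft in finetuned_models:
        tv = task_vector(ft, pretrained)
        merged = [m + lam * t for m, t in zip(merged, tv)]
    return merged


# Illustrative values only: two models fine-tuned from the same pretrained start.
pretrained = [1.0, 1.0, 1.0]
task_a = [2.0, 1.0, 1.0]
task_b = [1.0, 3.0, 1.0]

midpoint = interpolate(task_a, task_b, alpha=0.5)
merged = task_arithmetic(pretrained, [task_a, task_b], lam=1.0)
```

In this toy case the two task vectors touch different coordinates, so adding them keeps both changes intact. When task vectors overlap, the choice of `lam` and any conflict handling matter, which is where the more elaborate methods a survey would catalogue come in.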

Circularity Check

0 steps flagged

No significant circularity in this literature survey


This paper is a survey that reviews existing model merging techniques, proposes an organizational taxonomy, covers applications across domains, and lists future directions. No derivations, predictions, fitted parameters, or mathematical claims are present that could reduce to self-definition or self-citation. The taxonomy is an explicit organizational contribution rather than a derived result, and the GitHub repository serves as a supporting reference list without creating load-bearing circularity. The work is self-contained as a review with no internal equations or predictive steps to inspect for equivalence to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no new mathematical derivations, free parameters, axioms, or invented entities introduced by the authors.

pith-pipeline@v0.9.0 · 5512 in / 1066 out tokens · 16478 ms · 2026-05-17T22:12:26.890443+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We first propose a new taxonomic approach that exhaustively discusses existing model merging methods... pre-merging and during-merging phases (§2)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG · 2026-05 · unverdicted · novelty 7.0
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Differentially Private Model Merging
cs.LG · 2026-04 · unverdicted · novelty 7.0
Post-processing via random selection or linear combination generates differentially private models for arbitrary privacy parameters from pre-trained models on the same dataset.

Understanding and Enforcing Weight Disentanglement in Task Arithmetic
cs.AI · 2026-04 · unverdicted · novelty 7.0
Task-Feature Specialization explains weight disentanglement in task arithmetic and leads to orthogonality, which OrthoReg enforces to enhance performance of model composition methods.

From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence
cs.SE · 2026-04 · conditional · novelty 7.0
Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.

MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
cs.LG · 2026-05 · unverdicted · novelty 6.0
MetaMoE unifies domain-specialized experts into a single MoE via diversity-aware public proxy selection that approximates private data distributions for router training and expert alignment.

Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem
cs.CL · 2026-04 · unverdicted · novelty 6.0
Treating retention as the dominant task and using constructive gradient synthesis like SAGO allows LLM unlearning to achieve higher general performance recovery without weakening the forgetting effect.

Zero-Shot Synthetic-to-Real Handwritten Text Recognition via Task Analogies
cs.CV · 2026-04 · unverdicted · novelty 6.0
A method learns synthetic-to-real parameter corrections from source languages and transfers them to target languages without any real target data, improving HTR across five languages and six models.

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
cs.CL · 2026-02 · unverdicted · novelty 6.0
Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.

AP-BMM: Approximating Capability-Cost Pareto Sets of LLMs via Asynchronous Prior-Guided Bayesian Model Merging
cs.LG · 2025-12 · conditional · novelty 6.0
AP-BMM approximates Pareto sets of layer-wise merged LLMs for accuracy-cost trade-offs via prior-guided asynchronous Bayesian optimization and reranking.

TRINITY: An Evolved LLM Coordinator
cs.LG · 2025-12 · unverdicted · novelty 6.0
A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.

Muon is Scalable for LLM Training
cs.LG · 2025-02 · unverdicted · novelty 6.0
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
cs.CL · 2026-05 · unverdicted · novelty 5.0
ORBIT preserves foundational language capabilities during generative retrieval fine-tuning by using origin-regulated weight averaging to constrain parameter drift beyond a distance threshold.

Black-Box Optimization of Mixed Binary-Continuous Variables: Challenges and Opportunities in Evolutionary Model Merging
cs.NE · 2026-05 · unverdicted · novelty 5.0
Data flow space model merging is formalized as a mixed binary-continuous black-box optimization problem, where a structured approach respecting variable dependencies achieves 6.7% higher accuracy and 51.4% smaller sea...

Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?
cs.CL · 2026-04 · unverdicted · novelty 5.0
Continual pre-training on a German medical corpus lets 7B models close much of the performance gap with 24B general models on medical benchmarks, though merging introduces some language mixing and verbosity.

Domain-Adaptive Model Merging Across Disconnected Modes
cs.DC · 2026-03 · unverdicted · novelty 5.0
DMM merges highly divergent domain-specific models without data sharing by synthesizing pseudo-data from normalization statistics and distilling knowledge, achieving state-of-the-art performance on unimodal and multim...

Retrofit: Continual Learning with Controlled Forgetting for Binary Security Detection and Analysis
cs.LG · 2025-11 · unverdicted · novelty 5.0
RETROFIT enables continual learning for malware detection and binary summarization by retrospective-free parameter merging with low-rank sparse updates and confidence-guided arbitration, improving retention and genera...

World Simulation with Video Foundation Models for Physical AI
cs.CV · 2025-10 · unverdicted · novelty 4.0
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 17 Pith papers · 15 internal anchors
