arxiv: 2603.05957 · v2 · submitted 2026-03-06 · 💻 cs.DC · cs.AI

Domain-Adaptive Model Merging Across Disconnected Modes

Junming Liu, Yusen Zhang, Rongchao Zhang, Wenkai Zhu, Tian Wu

Pith reviewed 2026-05-15 15:43 UTC · model grok-4.3

classification 💻 cs.DC cs.AI

keywords model merging · data-free learning · domain adaptation · knowledge distillation · normalization statistics · multimodal models · privacy-preserving AI

The pith

DMM merges highly divergent domain-specific models without data sharing: it first merges the most similar models, then distills knowledge from the remaining divergent ones using pseudo-data synthesized from normalization statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When data from different domains cannot be centralized due to privacy or heterogeneity constraints, separate models can be trained locally and then combined to consolidate their knowledge. DMM does this in three steps: independent training of domain models, standard merging of the most similar ones to keep the process stable, and synthesis of pseudo-data solely from each model's normalization statistics to guide a lightweight distillation that transfers knowledge from the remaining divergent models. The goal is to retain rare specialized knowledge that direct merging would lose while avoiding full retraining or data movement. Experiments across unimodal and multimodal benchmarks indicate this staged method outperforms prior merging techniques. If the approach holds, it offers a practical route to unified models when raw data sharing is impossible.

Core claim

DMM is a data-free framework that merges models from disconnected domains through independent training, similarity-based merging with standard techniques for stability, and a final lightweight distillation step that uses pseudo-data synthesized from normalization statistics to incorporate knowledge from highly divergent models, thereby preserving rare knowledge without compromising overall stability.
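The similarity-based merging step can be pictured with a small sketch. The abstract does not say how similarity is measured or which standard merging technique is applied, so the code below assumes cosine similarity between flattened parameter offsets, a fixed threshold, and plain averaging. All names (`cosine`, `split_by_similarity`, `average_merge`, `threshold`) and all numbers are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flat parameter-offset vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def split_by_similarity(offsets, threshold=0.5):
    """Split model indices into a 'similar' group and a 'divergent' group.

    Assumption (not from the paper): a model counts as similar when its
    offset has cosine similarity >= threshold with the mean offset.
    """
    dim = len(offsets[0])
    mean = [sum(o[i] for o in offsets) / len(offsets) for i in range(dim)]
    similar, divergent = [], []
    for idx, off in enumerate(offsets):
        (similar if cosine(off, mean) >= threshold else divergent).append(idx)
    return similar, divergent

def average_merge(base, offsets, indices):
    """Merge the selected models by averaging their offsets onto the base."""
    return [base[i] + sum(offsets[k][i] for k in indices) / len(indices)
            for i in range(len(base))]

base = [0.0, 0.0, 0.0]              # toy pretrained parameters
offsets = [
    [1.0, 0.9, 0.0],                # domain A
    [0.9, 1.0, 0.1],                # domain B, close to A
    [-1.0, -0.8, 0.5],              # domain C, pointing elsewhere
]
similar, divergent = split_by_similarity(offsets)
merged = average_merge(base, offsets, similar)
```

In this toy setup, models A and B land in the similar group and are averaged, while C is set aside for the later distillation step.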

What carries the argument

The lightweight distillation step guided by pseudo-data synthesized from normalization statistics, which transfers knowledge from divergent models into the merged result.
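The distillation objective quoted further down this page is L_KD = E_{x~D_pseudo}[KL(p_Mt(y|x) || p_M(y|x))]. A minimal sketch of that quantity over a batch of pseudo-samples follows, assuming M_t is a divergent teacher and M is the merged student. The logits are hand-written placeholders rather than model outputs, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def kd_loss(teacher_logits, student_logits):
    """Average KL(teacher || student) over a batch of pseudo-samples."""
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)

# Placeholder logits for two pseudo-samples and three classes.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student_far = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # uninformed student
student_near = [[1.9, 0.6, -1.0], [0.0, 0.2, 2.9]]   # student close to teacher

loss_far = kd_loss(teacher, student_far)
loss_near = kd_loss(teacher, student_near)
```

The loss shrinks as the student's predictive distribution approaches the teacher's on the pseudo-samples, which is the only signal this step has about the divergent model.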

If this is right

DMM achieves state-of-the-art performance over existing merging methods on unimodal and multimodal benchmarks.
The staged process maintains stability while incorporating knowledge from models that differ substantially.
Rare but critical knowledge from individual domain models is retained in the final merged model.
No raw data sharing or centralized retraining is required to combine models across domains.
The framework applies to both single-modality and multimodal settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could let separate organizations combine locally trained models without exposing private data.
Performance may improve if the pseudo-data generation incorporated statistics from additional layers beyond basic normalization.
The approach suggests a general template for merging models when direct averaging fails due to mode disconnection.

Load-bearing premise

Pseudo-data synthesized solely from normalization statistics supplies enough guidance for the distillation step to preserve rare knowledge from divergent models without introducing instability or bias.
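One known way to synthesize inputs from normalization statistics, used by DeepInversion (reference [26] in the list below), is to optimize random inputs until their batch statistics match the stored running mean and variance. The toy below applies that idea to a single channel with no network in between, so it illustrates only the statistic-matching loss and is not the paper's procedure. The function names, step size, and target values are assumptions.

```python
import random

def batch_stats(xs):
    """Return the mean and (biased) variance of a list of scalars."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var

def synthesize(target_mean, target_var, n=64, steps=500, lr_scale=0.02, seed=0):
    """Gradient descent on (mean - target_mean)^2 + (var - target_var)^2.

    Hypothetical toy: the pseudo-data is a batch of n scalars whose batch
    statistics are pushed toward one channel's stored running statistics.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    lr = lr_scale * n  # gradients carry a 1/n factor, so scale the step by n
    for _ in range(steps):
        mean, var = batch_stats(xs)
        g_mean = 2.0 * (mean - target_mean) / n      # d/dx_i of the mean term
        g_var_coeff = 4.0 * (var - target_var) / n   # multiplies (x_i - mean)
        xs = [x - lr * (g_mean + g_var_coeff * (x - mean)) for x in xs]
    return xs

pseudo = synthesize(target_mean=1.5, target_var=4.0)
mean, var = batch_stats(pseudo)
```

The resulting batch reproduces the target mean and variance, but nothing in the loss constrains how the samples relate to classes or to other channels, which is exactly what the premise leaves open.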

What would settle it

A benchmark experiment on highly divergent models where DMM's final accuracy on rare classes drops below that of the individual domain models or below standard merging baselines.
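The test described above amounts to comparing accuracy restricted to rare classes. A minimal sketch of that check follows, with made-up labels and predictions. The function names and the choice of which classes count as rare are hypothetical.

```python
def class_accuracy(labels, predictions, classes):
    """Accuracy restricted to samples whose true label is in `classes`."""
    pairs = [(y, p) for y, p in zip(labels, predictions) if y in classes]
    if not pairs:
        raise ValueError("no samples from the requested classes")
    return sum(1 for y, p in pairs if y == p) / len(pairs)

def fails_rare_class_check(labels, merged_preds, reference_preds, rare_classes):
    """True if the merged model is worse than any reference on rare classes."""
    merged_acc = class_accuracy(labels, merged_preds, rare_classes)
    return any(
        merged_acc < class_accuracy(labels, preds, rare_classes)
        for preds in reference_preds.values()
    )

labels = [0, 0, 0, 0, 7, 7, 9, 9]      # classes 7 and 9 are treated as rare
merged = [0, 0, 0, 0, 7, 0, 9, 9]      # made-up merged-model predictions
references = {
    "domain_model": [0, 0, 1, 0, 7, 7, 9, 9],
    "standard_merge": [0, 0, 0, 0, 0, 0, 9, 0],
}
rare = {7, 9}
merged_rare_acc = class_accuracy(labels, merged, rare)
failed = fails_rare_class_check(labels, merged, references, rare)
```

Here the merged model beats the standard-merge baseline on rare classes but falls short of the domain model, so the check reports a failure.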

Original abstract

Learning across domains is challenging when data cannot be centralized due to privacy or heterogeneity, which limits the ability to train a single comprehensive model. Model merging provides an appealing alternative by consolidating knowledge from multiple specialized models into one, avoiding data sharing and reducing retraining cost. In this work, we present DMM, a data-free model merging framework designed to handle highly divergent models. DMM proceeds in three steps. First, domain-specific models are trained independently. Second, models with high similarity are merged using standard techniques to ensure stability. Third, we synthesize pseudo-data from normalization statistics and distill knowledge from divergent models into the merged model through a lightweight refinement guided by these samples. This approach preserves rare but critical knowledge while maintaining stability. Extensive experiments on unimodal and multimodal benchmarks show that DMM achieves state-of-the-art performance over existing merging methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DMM gives a straightforward three-step merging recipe that uses normalization statistics to build pseudo-data for divergent models, but the SOTA claim comes with no visible numbers or details to back it up.

The main takeaway is that this paper describes DMM as a data-free merging method that first combines similar models with standard techniques, then generates pseudo-samples from per-layer normalization statistics to distill knowledge from the remaining divergent ones. The goal is to keep rare features without centralizing data, which fits federated or privacy-heavy settings. That three-step structure is the concrete contribution here, extending existing merging and distillation ideas to handle disconnected modes more explicitly than prior work mentioned in the abstract. It frames the stability problem well and explains why direct merging breaks down on highly divergent models. The procedural outline is easy to follow and the motivation around avoiding retraining costs is practical.

The soft spots are clear from the abstract alone. No quantitative results, baselines, ablation studies, or specifics on how the pseudo-data is sampled or how divergence gets measured appear anywhere. The central SOTA claim therefore sits unsupported. The stress-test point holds: normalization moments only give marginal per-channel stats and cannot reconstruct joint distributions or long-tail examples that define those disconnected modes, so the lightweight distillation step risks missing or biasing the very knowledge it aims to preserve. If the full paper supplies the missing experiments and shows the pseudo-data actually works on rare features, that would strengthen it considerably.

This is aimed at people working on model merging for distributed or privacy-sensitive applications. A reader already familiar with merging techniques could pick up the pipeline idea quickly, but without the numbers it is hard to judge whether it delivers on the claims. I would send it to peer review because the problem is relevant and the approach is simple enough to test, even if it needs substantial added evidence and controls before publication.

Referee Report

2 major / 0 minor

Summary. The paper presents DMM, a data-free model merging framework for consolidating knowledge from highly divergent domain-specific models without data sharing. It consists of independent training of domain-specific models, merging similar models with standard techniques for stability, and a refinement step that synthesizes pseudo-data from normalization statistics to distill knowledge from divergent models into the merged model via lightweight guidance. The authors claim this preserves rare knowledge and achieves state-of-the-art performance over existing merging methods on unimodal and multimodal benchmarks.

Significance. If the empirical claims hold, the work could meaningfully advance privacy-preserving model merging by enabling consolidation across disconnected modes without data access or heavy retraining. The procedural three-step design and use of normalization statistics for pseudo-data offer a lightweight, data-free alternative to existing methods, with potential applicability in distributed or federated settings where heterogeneity is high.

major comments (2)

[Abstract] Abstract: The central claim that 'DMM achieves state-of-the-art performance over existing merging methods' is stated without any quantitative results, baseline comparisons, ablation studies, metrics for divergence, or details on pseudo-data generation and distillation. This leaves the primary empirical assertion unsupported by visible evidence.
[Method (third step)] Third step (distillation via pseudo-data): The refinement of divergent models relies on pseudo-samples drawn solely from per-layer normalization statistics to preserve rare/mode-specific knowledge. Normalization moments provide only marginal per-channel means and variances and do not reconstruct joint distributions, class-conditional structure, or long-tail examples; if the synthesized distribution deviates from the original support on these regions, the lightweight distillation cannot reliably recover the claimed knowledge without instability or bias.
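The second comment rests on a standard statistical fact: per-channel means and variances do not determine the joint distribution. A minimal sketch with two made-up two-channel datasets shows the gap. The data and helper names are illustrative assumptions, not taken from the paper.

```python
def channel_stats(rows):
    """Per-channel mean and (biased) variance for a list of feature rows."""
    n = len(rows)
    dims = range(len(rows[0]))
    means = [sum(r[d] for r in rows) / n for d in dims]
    variances = [sum((r[d] - means[d]) ** 2 for r in rows) / n for d in dims]
    return means, variances

def covariance(rows):
    """Covariance between channel 0 and channel 1."""
    n = len(rows)
    (m0, m1), _ = channel_stats(rows)
    return sum((r[0] - m0) * (r[1] - m1) for r in rows) / n

# Two toy "domains": channels move together in A and in opposition in B.
domain_a = [(-1.0, -1.0), (1.0, 1.0), (-1.0, -1.0), (1.0, 1.0)]
domain_b = [(-1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, -1.0)]

stats_a = channel_stats(domain_a)
stats_b = channel_stats(domain_b)
cov_a = covariance(domain_a)
cov_b = covariance(domain_b)
```

Both domains report the same per-channel statistics, yet their cross-channel covariances have opposite signs, so a generator that sees only the marginal moments cannot tell them apart.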

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to the manuscript.

Referee: [Abstract] Abstract: The central claim that 'DMM achieves state-of-the-art performance over existing merging methods' is stated without any quantitative results, baseline comparisons, ablation studies, metrics for divergence, or details on pseudo-data generation and distillation. This leaves the primary empirical assertion unsupported by visible evidence.

Authors: We agree the abstract would be strengthened by including key quantitative highlights. In the revised version we will add specific performance gains (e.g., average accuracy improvements over baselines on the reported unimodal and multimodal benchmarks) while keeping the abstract concise. Full baseline comparisons, ablation studies, divergence metrics, and pseudo-data details remain in the experimental and method sections.

revision: yes

Referee: [Method (third step)] Third step (distillation via pseudo-data): The refinement of divergent models relies on pseudo-samples drawn solely from per-layer normalization statistics to preserve rare/mode-specific knowledge. Normalization moments provide only marginal per-channel means and variances and do not reconstruct joint distributions, class-conditional structure, or long-tail examples; if the synthesized distribution deviates from the original support on these regions, the lightweight distillation cannot reliably recover the claimed knowledge without instability or bias.

Authors: We acknowledge that per-layer normalization statistics capture only marginal moments and cannot fully reconstruct joint or class-conditional distributions. Nevertheless, our experiments show that the resulting pseudo-samples, when used with the lightweight distillation objective, suffice to transfer rare knowledge without measurable instability or bias on the evaluated benchmarks. To address the concern we will expand the method section with a more explicit description of the pseudo-sample generation process and add ablation results that quantify performance under controlled distribution mismatch.

revision: partial

Circularity Check

0 steps flagged

Procedural framework with no self-referential derivations or load-bearing self-citations

The paper describes DMM as a three-step procedural method (independent training, similarity-based merging, pseudo-data distillation from normalization statistics) without equations, fitted parameters presented as predictions, or derivations that reduce to inputs by construction. No self-citation chains justify uniqueness theorems or ansatzes. The central claim rests on empirical benchmark results rather than any definitional equivalence or renamed known result. This is a standard non-circular empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that normalization statistics alone suffice to create useful pseudo-data for distillation across divergent models.

axioms (1)

domain assumption: Normalization statistics from independently trained models can be used to synthesize pseudo-data that guides effective knowledge distillation for divergent models.
Invoked in the third step of the DMM procedure as described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: DMM proceeds in three steps... synthesize pseudo-data from normalization statistics and distill knowledge from divergent models... L_KD = E_{x ~ D_pseudo}[KL(p_{M_t}(y|x) || p_M(y|x))]

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_equivNat · unclear

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: buffer-level merging... theoretical guarantees of its effectiveness in capturing global statistics

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 2 internal anchors

[1]

INTRODUCTION The rapid expansion of machine learning applications across diverse domains has led to a growing demand for methods that can efficiently adapt knowledge without relying on centralized training [1, 2]. In many practical scenarios, data remains fragmented due to privacy regulations [3], acquisition costs [4], or domain heterogeneity [5], ma...

[2]

PRELIMINARY Notations. Let $W_0 = \{W_0^l\}_{l=1}^{L}$ denote the parameters of a pretrained network with $L$ layers, where $W_0^l$ represents the parameters of the $l$-th layer. The model is fine-tuned independently on $K$ domains with datasets $D_1, \ldots, D_K$, yielding domain-specific parameters $W_1, \ldots, W_K$. For domain $k$, the parameter offset is defined as the difference betwe...

[3]

METHODS Our method consists of three components. First, we train unimodal and multimodal models on their respective tasks to obtain well-initialized models. Second, we perform model merging using both parameter aggregation and buffer-level statistics alignment, followed by normalization inversion to synthesize proxy data. Finally, to mitigate knowledge co...

[4]

EXPERIMENTS 4.1. Experimental Setup 4.1.1. Datasets We evaluate our approach on three benchmarks. CIFAR-10 and CIFAR-100 [22] are standard image classification datasets containing 10 and 100 classes, respectively, with 50,000 training and 10,000 test images each. CrisisMMD

[5]

is a multimodal dataset consisting of images and accompanying textual reports collected from 18 real-world crisis events. It contains a total of 18,036 image–text pairs annotated with humanitarian categories, enabling evaluation of cross-modal classification tasks. 4.1.2. Baselines We compare our method against representative federated learning and mo...

[6]

CONCLUSION In this work, we introduced DMM, a data-free model merging framework tailored for scenarios with strong domain heterogeneity. By combining buffer-guided pseudo-data generation with selective knowledge distillation from divergent models, DMM effectively reconciles both common and rare domain-specific knowledge while preserving stability. Our...

[7]

ACKNOWLEDGMENTS This work was supported in part by Jiangxi Provincial Natural Science Foundation under No. 20252BAC200613; in part by Jiangxi Provincial Early-Career Youth Science and Technology Talent Cultivation Project under No. 20244BCE52007

[8]

Iqbal H. Sarker, “Machine learning: Algorithms, real-world applications and research directions,” SN Computer Science, vol. 2, no. 3, pp. 160, Mar. 2021

[9]

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao, “Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities,” arXiv preprint arXiv:2408.07666, 2024

[10]

Reza Shokri and Vitaly Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 2015, CCS ’15, pp. 1310–1321, Association for Computing Machinery

[11]

Yifan Li, Xiaohui Yu, and Nick Koudas, “Data acquisition for improving machine learning models,” Proc. VLDB Endow., vol. 14, no. 10, pp. 1832–1844, June 2021

[12]

Feng Liu, Guangquan Zhang, and Jie Lu, “Heterogeneous domain adaptation: An unsupervised approach,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 12, pp. 5588–5602, 2020

[13]

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282

[14]

Junming Liu, Guosun Zeng, Ding Wang, Yanting Gao, and Yufei Jin, “Fedrecon: Missing modality reconstruction in distributed heterogeneous environments,” arXiv preprint arXiv:2504.09941, 2025

[15]

Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, and Guosun Zeng, “Mosaic: Data-free knowledge distillation via mixture-of-experts for heterogeneous distributed environments,” arXiv preprint arXiv:2505.19699, 2025

[16]

Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal, “Ties-merging: Resolving interference when merging models,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds. 2023, vol. 36, pp. 7093–7115, Curran Associates, Inc

[17]

Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa, “Git re-basin: Merging models modulo permutation symmetries,” in The Eleventh International Conference on Learning Representations, 2023

[18]

Zhengqi Xu, Ke Yuan, Huiqiong Wang, Yong Wang, Mingli Song, and Jie Song, “Training-free pretrained model merging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 5915–5925

[19]

Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao, “Representation surgery for multi-task model merging,” in Forty-first International Conference on Machine Learning, 2024

[20]

Anshul Nasery, Jonathan Hayase, Pang Wei Koh, and Sewoong Oh, “Pleas - merging models with permutations and least squares,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 30493–30502

[21]

Wenju Sun, Qingyong Li, Yangliao Geng, and Boyang Li, “CAT merging: A training-free approach for resolving conflicts in model merging,” in Forty-second International Conference on Machine Learning, 2025

[22]

Qi Wei, Shuo He, Enneng Yang, Tingcong Liu, Haobo Wang, Lei Feng, and Bo An, “Representation surgery in model merging with probabilistic modeling,” in Forty-second International Conference on Machine Learning, 2025

[23]

Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou, “Data-free knowledge distillation for heterogeneous federated learning,” in Proceedings of the 38th International Conference on Machine Learning, Marina Meila and Tong Zhang, Eds. 18–24 Jul 2021, vol. 139 of Proceedings of Machine Learning Research, pp. 12878–12889, PMLR

[24]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

[25]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Do...

[26]

Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz, “Dreaming to distill: Data-free knowledge transfer via deepinversion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

[27]

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith, “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems, I. Dhillon, D. Papailiopoulos, and V. Sze, Eds., 2020, vol. 2, pp. 429–450

[28]

Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou, “Fedbn: Federated learning on non-iid features via local batch normalization,” arXiv preprint arXiv:2102.07623, 2021

[29]

Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” Tech. Rep., University of Toronto, 2009

[30]

Firoj Alam, Ferda Ofli, and Muhammad Imran, “CrisisMMD: Multimodal twitter datasets from natural disasters,” Proceedings of the International AAAI Conference on Web and Social Media, vol. 12, no. 1, Jun. 2018