Domain-Adaptive Model Merging Across Disconnected Modes
Pith reviewed 2026-05-15 15:43 UTC · model grok-4.3
The pith
DMM merges highly divergent domain-specific models without data sharing: it first combines similar models, then distills knowledge from the divergent ones using pseudo-data synthesized from normalization statistics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DMM is a data-free framework for merging models from disconnected domains. It proceeds in three stages: independent training; similarity-based merging with standard techniques, for stability; and a final lightweight distillation step that uses pseudo-data synthesized from normalization statistics to incorporate knowledge from highly divergent models. The stated result is that rare knowledge is preserved without compromising overall stability.
What carries the argument
The lightweight distillation step guided by pseudo-data synthesized from normalization statistics, which transfers knowledge from divergent models into the merged result.
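As an illustration of the distillation step, the objective quoted further down this page, L_KD = E_{x~D_pseudo} [KL(p_Mt(y|x) || p_M(y|x))], can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes M_t denotes a divergent (teacher) model and M the merged (student) model, and the function names `softmax`, `kl_divergence`, and `distillation_loss` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability vector (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def distillation_loss(teacher_logits_batch, student_logits_batch):
    """Average KL(teacher || student) over a batch of pseudo-samples.

    Assumption: each list entry holds the logits that one model produces
    for one pseudo-sample; how those logits are obtained is out of scope.
    """
    losses = [kl_divergence(softmax(t), softmax(s))
              for t, s in zip(teacher_logits_batch, student_logits_batch)]
    return sum(losses) / len(losses)

# Hypothetical logits for two pseudo-samples and three classes.
teacher = [[1.0, 2.0, 3.0], [0.5, 0.5, 2.0]]
student = [[3.0, 2.0, 1.0], [0.5, 0.5, 2.0]]
loss = distillation_loss(teacher, student)
```

The loss is zero when the merged model reproduces the teacher's predictive distribution on every pseudo-sample and grows as the two disagree, so minimizing it pulls the merged model toward the divergent model only on inputs the pseudo-data actually covers.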
If this is right
- DMM achieves state-of-the-art performance over existing merging methods on unimodal and multimodal benchmarks.
- The staged process maintains stability while incorporating knowledge from models that differ substantially.
- Rare but critical knowledge from individual domain models is retained in the final merged model.
- No raw data sharing or centralized retraining is required to combine models across domains.
- The framework applies to both single-modality and multimodal settings.
Where Pith is reading between the lines
- The method could let separate organizations combine locally trained models without exposing private data.
- Performance may improve if the pseudo-data generation incorporated statistics from additional layers beyond basic normalization.
- The approach suggests a general template for merging models when direct averaging fails due to mode disconnection.
Load-bearing premise
Pseudo-data synthesized solely from normalization statistics supplies enough guidance for the distillation step to preserve rare knowledge from divergent models without introducing instability or bias.
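To make the premise concrete, here is a minimal sketch of what "synthesizing from normalization statistics" could look like at the feature level. It is an assumption-laden illustration, not the paper's procedure: it draws pseudo-features from independent per-channel Gaussians and scores a batch with a moment-matching penalty of the kind used in DeepInversion-style methods. The names `channel_stats`, `stat_matching_loss`, and `sample_from_stats`, and the stored values, are hypothetical.

```python
import random

def channel_stats(batch):
    """Per-channel (mean, variance) of a batch of feature vectors."""
    n = len(batch)
    stats = []
    for channel in zip(*batch):
        mu = sum(channel) / n
        var = sum((x - mu) ** 2 for x in channel) / n
        stats.append((mu, var))
    return stats

def stat_matching_loss(batch, stored_mean, stored_var):
    """Sum of squared gaps between batch moments and stored moments."""
    loss = 0.0
    for (mu, var), m, v in zip(channel_stats(batch), stored_mean, stored_var):
        loss += (mu - m) ** 2 + (var - v) ** 2
    return loss

def sample_from_stats(stored_mean, stored_var, n, rng):
    """Draw n pseudo-feature vectors, one independent Gaussian per channel."""
    return [[rng.gauss(m, v ** 0.5) for m, v in zip(stored_mean, stored_var)]
            for _ in range(n)]

rng = random.Random(0)
stored_mean, stored_var = [2.0, -1.0], [0.25, 4.0]  # hypothetical running statistics
pseudo = sample_from_stats(stored_mean, stored_var, 2000, rng)
noise = [[rng.gauss(0.0, 1.0) for _ in stored_mean] for _ in range(2000)]

matched_loss = stat_matching_loss(pseudo, stored_mean, stored_var)
noise_loss = stat_matching_loss(noise, stored_mean, stored_var)
```

Samples drawn from the stored moments score far lower than unit Gaussian noise, which is the sense in which the statistics "guide" synthesis. The penalty says nothing about which classes or rare cases the samples represent, which is exactly the part of the premise that remains open.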
What would settle it
A benchmark experiment on highly divergent models in which DMM's final accuracy on rare classes falls below that of the individual domain models or of standard merging baselines.
Original abstract
Learning across domains is challenging when data cannot be centralized due to privacy or heterogeneity, which limits the ability to train a single comprehensive model. Model merging provides an appealing alternative by consolidating knowledge from multiple specialized models into one, avoiding data sharing and reducing retraining cost. In this work, we present DMM, a data-free model merging framework designed to handle highly divergent models. DMM proceeds in three steps. First, domain-specific models are trained independently. Second, models with high similarity are merged using standard techniques to ensure stability. Third, we synthesize pseudo-data from normalization statistics and distill knowledge from divergent models into the merged model through a lightweight refinement guided by these samples. This approach preserves rare but critical knowledge while maintaining stability. Extensive experiments on unimodal and multimodal benchmarks show that DMM achieves state-of-the-art performance over existing merging methods.
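The second step of the abstract (merge the similar models, set the divergent ones aside for distillation) can be sketched as follows. Everything specific here is an assumption made for illustration: the abstract does not state the similarity measure, the threshold, or the merging rule, so this sketch uses cosine similarity between parameter offsets from a shared pretrained model, a fixed threshold, and plain offset averaging as a stand-in for "standard techniques". The name `split_and_merge` is hypothetical.

```python
import math

def offset(weights, base):
    """Parameter offset of a fine-tuned model relative to the pretrained one."""
    return [w - b for w, b in zip(weights, base)]

def cosine(u, v):
    """Cosine similarity of two flat vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def split_and_merge(base, models, threshold=0.5):
    """Average the models whose offsets agree with the mean offset.

    Returns (merged_weights, similar_indices, divergent_indices).
    Divergent models are left untouched so they can act as teachers later.
    """
    offsets = [offset(m, base) for m in models]
    mean_off = [sum(col) / len(offsets) for col in zip(*offsets)]
    similar = [i for i, o in enumerate(offsets) if cosine(o, mean_off) >= threshold]
    divergent = [i for i in range(len(models)) if i not in similar]
    if not similar:
        return list(base), similar, divergent
    merged_off = [sum(offsets[i][j] for i in similar) / len(similar)
                  for j in range(len(base))]
    merged = [b + d for b, d in zip(base, merged_off)]
    return merged, similar, divergent

# Toy example: two models move in roughly the same direction, one does not.
base = [0.0, 0.0]
models = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]]
merged, similar, divergent = split_and_merge(base, models)
# similar == [0, 1], divergent == [2]
```

Keeping the divergent model out of the average is what protects stability in this toy case: averaging all three offsets would nearly cancel the first coordinate, whereas averaging only the first two preserves it.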
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DMM, a data-free model merging framework for consolidating knowledge from highly divergent domain-specific models without data sharing. It consists of independent training of domain-specific models, merging similar models with standard techniques for stability, and a refinement step that synthesizes pseudo-data from normalization statistics to distill knowledge from divergent models into the merged model via lightweight guidance. The authors claim this preserves rare knowledge and achieves state-of-the-art performance over existing merging methods on unimodal and multimodal benchmarks.
Significance. If the empirical claims hold, the work could meaningfully advance privacy-preserving model merging by enabling consolidation across disconnected modes without data access or heavy retraining. The procedural three-step design and use of normalization statistics for pseudo-data offer a lightweight, data-free alternative to existing methods, with potential applicability in distributed or federated settings where heterogeneity is high.
Major comments (2)
- [Abstract] The central claim that 'DMM achieves state-of-the-art performance over existing merging methods' is stated without any quantitative results, baseline comparisons, ablation studies, metrics for divergence, or details on pseudo-data generation and distillation. This leaves the primary empirical assertion unsupported by visible evidence.
- [Method (third step)] Third step (distillation via pseudo-data): The refinement of divergent models relies on pseudo-samples drawn solely from per-layer normalization statistics to preserve rare/mode-specific knowledge. Normalization moments provide only marginal per-channel means and variances and do not reconstruct joint distributions, class-conditional structure, or long-tail examples; if the synthesized distribution deviates from the original support on these regions, the lightweight distillation cannot reliably recover the claimed knowledge without instability or bias.
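The referee's point about marginal moments can be checked with a small worked example. The sketch below builds two toy two-channel feature sets (hypothetical data, not from the paper) that share identical per-channel means and variances yet have opposite joint structure; the helper names are illustrative.

```python
def per_channel_moments(samples):
    """Return [(mean, variance), ...] for each channel of the samples."""
    n = len(samples)
    moments = []
    for channel in zip(*samples):
        mu = sum(channel) / n
        var = sum((x - mu) ** 2 for x in channel) / n
        moments.append((mu, var))
    return moments

def cross_covariance(samples):
    """Covariance between channel 0 and channel 1."""
    n = len(samples)
    mu0 = sum(s[0] for s in samples) / n
    mu1 = sum(s[1] for s in samples) / n
    return sum((s[0] - mu0) * (s[1] - mu1) for s in samples) / n

# Channels move together in one set and in opposition in the other.
aligned = [(-1, -1), (1, 1), (-1, -1), (1, 1)]
opposed = [(-1, 1), (1, -1), (-1, 1), (1, -1)]

same_marginals = per_channel_moments(aligned) == per_channel_moments(opposed)  # True
cov_aligned = cross_covariance(aligned)   # 1.0
cov_opposed = cross_covariance(opposed)   # -1.0
```

A normalization layer that stores only per-channel means and variances would record the same statistics for both sets, so statistics alone cannot tell a synthesizer which joint structure the original data had. Whether this matters in practice for DMM depends on evidence the abstract does not provide.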
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to the manuscript.
Point-by-point responses
- Referee: [Abstract] The central claim that 'DMM achieves state-of-the-art performance over existing merging methods' is stated without any quantitative results, baseline comparisons, ablation studies, metrics for divergence, or details on pseudo-data generation and distillation. This leaves the primary empirical assertion unsupported by visible evidence.
  Authors: We agree the abstract would be strengthened by including key quantitative highlights. In the revised version we will add specific performance gains (e.g., average accuracy improvements over baselines on the reported unimodal and multimodal benchmarks) while keeping the abstract concise. Full baseline comparisons, ablation studies, divergence metrics, and pseudo-data details remain in the experimental and method sections.
  Revision: yes
- Referee: [Method (third step)] Third step (distillation via pseudo-data): The refinement of divergent models relies on pseudo-samples drawn solely from per-layer normalization statistics to preserve rare/mode-specific knowledge. Normalization moments provide only marginal per-channel means and variances and do not reconstruct joint distributions, class-conditional structure, or long-tail examples; if the synthesized distribution deviates from the original support on these regions, the lightweight distillation cannot reliably recover the claimed knowledge without instability or bias.
  Authors: We acknowledge that per-layer normalization statistics capture only marginal moments and cannot fully reconstruct joint or class-conditional distributions. Nevertheless, our experiments show that the resulting pseudo-samples, when used with the lightweight distillation objective, suffice to transfer rare knowledge without measurable instability or bias on the evaluated benchmarks. To address the concern we will expand the method section with a more explicit description of the pseudo-sample generation process and add ablation results that quantify performance under controlled distribution mismatch.
  Revision: partial
Circularity Check
Procedural framework with no self-referential derivations or load-bearing self-citations
Full rationale
The paper describes DMM as a three-step procedural method (independent training, similarity-based merging, pseudo-data distillation from normalization statistics) without equations, fitted parameters presented as predictions, or derivations that reduce to inputs by construction. No self-citation chains justify uniqueness theorems or ansatzes. The central claim rests on empirical benchmark results rather than any definitional equivalence or renamed known result. This is a standard non-circular empirical method paper.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Normalization statistics from independently trained models can be used to synthesize pseudo-data that guides effective knowledge distillation for divergent models.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: DMM proceeds in three steps... synthesize pseudo-data from normalization statistics and distill knowledge from divergent models... L_KD = E_{x~D_pseudo} [KL(p_Mt(y|x) || p_M(y|x))]
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_equivNat · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: buffer-level merging... theoretical guarantees of its effectiveness in capturing global statistics
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION The rapid expansion of machine learning applications across diverse domains has led to a growing demand for methods that can efficiently adapt knowledge without relying on centralized training [1, 2]. In many practical scenarios, data remains fragmented due to privacy regulations [3], acquisition costs [4], or domain heterogeneity [5], ma...
-
[2]
PRELIMINARY Notations. Let W_0 = {W_0^l}_{l=1}^L denote the parameters of a pretrained network with L layers, where W_0^l represents the parameters of the l-th layer. The model is fine-tuned independently on K domains with datasets D_1, ..., D_K, yielding domain-specific parameters W_1, ..., W_K. For domain k, the parameter offset is defined as the difference betwe...
-
[3]
METHODS Our method consists of three components. First, we train unimodal and multimodal models on their respective tasks to obtain well-initialized models. Second, we perform model merging using both parameter aggregation and buffer-level statistics alignment, followed by normalization inversion to synthesize proxy data. Finally, to mitigate knowledge co...
-
[4]
EXPERIMENTS 4.1. Experimental Setup 4.1.1. Datasets We evaluate our approach on three benchmarks. CIFAR-10 and CIFAR-100 [22] are standard image classification datasets containing 10 and 100 classes, respectively, with 50,000 training and 10,000 test images each. CrisisMMD
-
[5]
is a multimodal dataset consisting of images and accompanying textual reports collected from 18 real-world crisis events. It contains a total of 18,036 image–text pairs annotated with humanitarian categories, enabling evaluation of cross-modal classification tasks. 4.1.2. Baselines We compare our method against representative federated learning and mo...
-
[6]
CONCLUSION In this work, we introduced DMM, a data-free model merging framework tailored for scenarios with strong domain heterogeneity. By combining buffer-guided pseudo-data generation with selective knowledge distillation from divergent models, DMM effectively reconciles both common and rare domain-specific knowledge while preserving stability. Our...
-
[7]
ACKNOWLEDGMENTS This work was supported in part by Jiangxi Provincial Natural Science Foundation under No. 20252BAC200613; in part by Jiangxi Provincial Early-Career Youth Science and Technology Talent Cultivation Project under No. 20244BCE52007
-
[8]
Iqbal H. Sarker, “Machine learning: Algorithms, real-world applications and research directions,” SN Computer Science, vol. 2, no. 3, pp. 160, Mar. 2021
-
[9]
Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao, “Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities,” arXiv preprint arXiv:2408.07666, 2024
-
[10]
Reza Shokri and Vitaly Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 2015, CCS ’15, pp. 1310–1321, Association for Computing Machinery
-
[11]
Yifan Li, Xiaohui Yu, and Nick Koudas, “Data acquisition for improving machine learning models,” Proc. VLDB Endow., vol. 14, no. 10, pp. 1832–1844, June 2021
-
[12]
Feng Liu, Guangquan Zhang, and Jie Lu, “Heterogeneous domain adaptation: An unsupervised approach,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 12, pp. 5588–5602, 2020
-
[13]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282
-
[14]
Junming Liu, Guosun Zeng, Ding Wang, Yanting Gao, and Yufei Jin, “Fedrecon: Missing modality reconstruction in distributed heterogeneous environments,” arXiv preprint arXiv:2504.09941, 2025
-
[15]
Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, and Guosun Zeng, “Mosaic: Data-free knowledge distillation via mixture-of-experts for heterogeneous distributed environments,” arXiv preprint arXiv:2505.19699, 2025
-
[16]
Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal, “Ties-merging: Resolving interference when merging models,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds. 2023, vol. 36, pp. 7093–7115, Curran Associates, Inc
-
[17]
Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa, “Git re-basin: Merging models modulo permutation symmetries,” in The Eleventh International Conference on Learning Representations, 2023
-
[18]
Zhengqi Xu, Ke Yuan, Huiqiong Wang, Yong Wang, Mingli Song, and Jie Song, “Training-free pretrained model merging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 5915–5925
-
[19]
Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao, “Representation surgery for multi-task model merging,” in Forty-first International Conference on Machine Learning, 2024
-
[20]
Anshul Nasery, Jonathan Hayase, Pang Wei Koh, and Sewoong Oh, “Pleas - merging models with permutations and least squares,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 30493–30502
-
[21]
Wenju Sun, Qingyong Li, Yangliao Geng, and Boyang Li, “CAT merging: A training-free approach for resolving conflicts in model merging,” in Forty-second International Conference on Machine Learning, 2025
-
[22]
Qi Wei, Shuo He, Enneng Yang, Tingcong Liu, Haobo Wang, Lei Feng, and Bo An, “Representation surgery in model merging with probabilistic modeling,” in Forty-second International Conference on Machine Learning, 2025
-
[23]
Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou, “Data-free knowledge distillation for heterogeneous federated learning,” in Proceedings of the 38th International Conference on Machine Learning, Marina Meila and Tong Zhang, Eds. 18–24 Jul 2021, vol. 139 of Proceedings of Machine Learning Research, pp. 12878–12889, PMLR
-
[24]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
-
[25]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Do...
-
[26]
Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz, “Dreaming to distill: Data-free knowledge transfer via deepinversion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020
-
[27]
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith, “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems, I. Dhillon, D. Papailiopoulos, and V. Sze, Eds., 2020, vol. 2, pp. 429–450
-
[28]
Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou, “Fedbn: Federated learning on non-iid features via local batch normalization,” arXiv preprint arXiv:2102.07623, 2021
-
[29]
Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” Tech. Rep., University of Toronto, 2009
-
[30]
Firoj Alam, Ferda Ofli, and Muhammad Imran, “Crisismmd: Multimodal twitter datasets from natural disasters,” Proceedings of the International AAAI Conference on Web and Social Media, vol. 12, no. 1, Jun. 2018