AP-BMM: Approximating Capability-Cost Pareto Sets of LLMs via Asynchronous Prior-Guided Bayesian Model Merging
Pith reviewed 2026-05-16 23:16 UTC · model grok-4.3
The pith
Asynchronous prior-guided Bayesian merging finds higher-quality Pareto sets of accuracy-cost trade-offs for LLMs than synchronous layer-wise or model-level baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AP-BMM approximates the capability-cost Pareto set with three components. First, it uses differences in parameters and reasoning activations between a reasoning-oriented source model and a cheaper base model to prioritize which Transformer layers receive merge weights first. Second, an asynchronous Bayesian optimization loop launches new candidates without waiting for pending evaluations to finish. Third, a lightweight reranking step spreads the final candidates across the accuracy-cost plane.
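To make "layer-wise merge weights" concrete, here is a minimal sketch that assumes the merge operator is plain per-layer linear interpolation. The page does not state the paper's actual operator, so `merge_layerwise`, the `Params` type, and the toy layer names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: layer name -> flattened weight values.
Params = Dict[str, List[float]]


def merge_layerwise(base: Params, reasoning: Params, weights: Dict[str, float]) -> Params:
    """Interpolate each layer separately: (1 - w) * base + w * reasoning."""
    merged: Params = {}
    for name, base_vals in base.items():
        # Assumption: a layer without an explicit weight stays at the base model.
        w = weights.get(name, 0.0)
        reason_vals = reasoning[name]
        merged[name] = [(1.0 - w) * b + w * r for b, r in zip(base_vals, reason_vals)]
    return merged


# Toy two-layer "models"; the values are invented for illustration.
base = {"layer0": [0.0, 2.0], "layer1": [1.0, 1.0]}
reasoning = {"layer0": [4.0, 2.0], "layer1": [3.0, -1.0]}
merged = merge_layerwise(base, reasoning, {"layer0": 0.25, "layer1": 1.0})
print(merged)
```

A model-level merge corresponds to passing the same weight for every layer, which is why the layer-wise search space is so much larger.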
What carries the argument
Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM), which injects source-model difference signals into Bayesian optimization as a layer-importance prior and decouples evaluation scheduling from completion order.
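The scheduling idea can be sketched with Python's standard `concurrent.futures`: a free worker is refilled as soon as any evaluation finishes, rather than after a whole batch completes. This is only an illustration under stated assumptions. `evaluate` is a toy stand-in with a sleep, and `propose` is a random placeholder, not a Bayesian optimizer and not the authors' implementation.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def evaluate(candidate: float) -> tuple:
    """Toy stand-in for an LLM evaluation whose runtime varies per candidate."""
    time.sleep(0.01 + 0.05 * candidate)   # assumption: larger weight -> slower run
    accuracy = candidate                  # invented objective values
    token_cost = 1000 + 8000 * candidate
    return candidate, accuracy, token_cost


def propose(finished: list, pending: list, rng: random.Random) -> float:
    """Placeholder proposer. A real loop would query a surrogate model here and
    account for both finished results and still-pending candidates."""
    return rng.random()


def async_search(budget: int, workers: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    finished, running, launched = [], {}, 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(finished) < budget:
            # Refill free workers immediately instead of waiting for a full batch.
            while launched < budget and len(running) < workers:
                cand = propose(finished, list(running.values()), rng)
                running[pool.submit(evaluate, cand)] = cand
                launched += 1
            # Block only until the first pending evaluation completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                running.pop(fut)
                finished.append(fut.result())
    return finished


results = async_search(budget=8, workers=3)
print(len(results), "evaluations finished")
```

A synchronous variant would wait for all workers in a batch before proposing again, so the slowest evaluation in each batch would set the pace.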
If this is right
- Under a fixed number of model evaluations the method returns a set of merged models whose accuracy-cost curve dominates the curves obtained from synchronous layer-wise search and from standard model-level merging.
- Wall-clock time to reach a given Pareto quality drops because the asynchronous scheduler keeps GPUs occupied instead of idling on the slowest evaluations.
- The final reranked set covers a wider range of operating points, allowing a practitioner to select a model whose inference cost matches a target latency or throughput constraint.
- The same search procedure can be applied to any pair of source models where one is stronger but more expensive and the other is cheaper but weaker.
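The notion of one accuracy-cost point dominating another, used in the list above, can be written down directly. The sketch below filters a set of evaluated models to its non-dominated subset, assuming accuracy is maximized and token cost is minimized. The numbers are invented toy values, not results from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (accuracy, token_cost)


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_set(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented toy evaluations: (accuracy, average tokens per answer).
candidates = [(0.62, 4000), (0.66, 9000), (0.45, 800), (0.60, 5000), (0.45, 1200)]
front = pareto_set(candidates)
print(front)
```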
Where Pith is reading between the lines
- If the prior signals remain informative across model families, the same guidance could accelerate merging for non-Transformer architectures without hand-crafted heuristics.
- The asynchronous schedule might be combined with early-stopping or multi-fidelity evaluation to further reduce total compute when many candidates are cheap to score.
- Once a high-quality Pareto set exists, downstream systems could dynamically swap among the merged models at inference time based on current load or user-specified cost limits.
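The last point, swapping among merged models under a cost limit, reduces to a small lookup once a Pareto set exists. The following is a hypothetical sketch: `pick_model`, the model names, and the numbers are illustrative and do not come from the paper.

```python
from typing import List, Optional, Tuple

Model = Tuple[str, float, float]  # (name, accuracy, tokens_per_query)


def pick_model(front: List[Model], max_tokens: float) -> Optional[Model]:
    """Return the most accurate model whose token cost fits the budget, if any."""
    affordable = [m for m in front if m[2] <= max_tokens]
    return max(affordable, key=lambda m: m[1]) if affordable else None


# Invented Pareto set of merged models.
front = [("m_cheap", 0.45, 800), ("m_mid", 0.62, 4000), ("m_strong", 0.66, 9000)]
print(pick_model(front, max_tokens=5000))
print(pick_model(front, max_tokens=500))
```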
Load-bearing premise
Differences in parameters and reasoning activations between the two source models give reliable early signals about which layers should receive non-default merge weights.
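One way such a signal could be computed is sketched below: score each layer by its mean absolute parameter difference and mean absolute activation difference, normalize both, and rank layers by a weighted sum. The scoring rule, the mixing weight `alpha`, and all names are assumptions made for illustration. The page does not specify how the paper combines the two signals.

```python
from typing import Dict, List


def mean_abs_diff(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores so they sum to 1 (all zeros if there is no difference at all)."""
    total = sum(scores.values())
    return {k: (v / total if total > 0 else 0.0) for k, v in scores.items()}


def rank_layers(base_params, reason_params, base_acts, reason_acts, alpha=0.5):
    """Rank layers from most to least different; alpha is a hypothetical mixing weight."""
    param_score = normalize(
        {k: mean_abs_diff(base_params[k], reason_params[k]) for k in base_params}
    )
    act_score = normalize(
        {k: mean_abs_diff(base_acts[k], reason_acts[k]) for k in base_acts}
    )
    combined = {k: alpha * param_score[k] + (1 - alpha) * act_score[k] for k in param_score}
    return sorted(combined, key=combined.get, reverse=True)


# Invented toy values for three layers.
base_params = {"layer0": [0.0, 0.0], "layer1": [1.0, 1.0], "layer2": [2.0, 2.0]}
reason_params = {"layer0": [0.1, 0.1], "layer1": [2.0, 2.0], "layer2": [2.0, 2.5]}
base_acts = {"layer0": [1.0], "layer1": [1.0], "layer2": [1.0]}
reason_acts = {"layer0": [1.2], "layer1": [1.5], "layer2": [2.0]}

order = rank_layers(base_params, reason_params, base_acts, reason_acts)
print(order)
```

Whether a ranking like this predicts which layers matter for the Pareto front is exactly the premise under test.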
What would settle it
Run the same search budget with the prior-ranking step replaced by random layer ordering and measure whether the resulting Pareto front quality falls measurably below the guided version on the same evaluation metrics.
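A common way to put a number on "Pareto front quality" for such a comparison is the hypervolume indicator. The sketch below computes it for two objectives (maximize accuracy, minimize token cost) relative to a reference point. The reference point and both point sets are invented; they only show how a guided run and a random-order run could be compared on the same metric.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (accuracy, token_cost)


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by the points and bounded by ref = (min accuracy, max cost)."""
    ref_acc, ref_cost = ref
    # Keep only points that beat the reference point on both objectives.
    valid = [p for p in points if p[0] > ref_acc and p[1] < ref_cost]
    # Sweep from cheapest to most expensive; at equal cost, higher accuracy first.
    valid.sort(key=lambda p: (p[1], -p[0]))
    area, best_acc = 0.0, ref_acc
    for acc, cost in valid:
        if acc > best_acc:  # dominated points add no area and are skipped
            area += (ref_cost - cost) * (acc - best_acc)
            best_acc = acc
    return area


ref = (0.0, 10000.0)  # hypothetical reference point
guided = [(0.45, 800), (0.62, 4000), (0.66, 9000)]        # invented values
random_order = [(0.45, 1200), (0.60, 5000), (0.64, 9500)]  # invented values

hv_guided = hypervolume_2d(guided, ref)
hv_random = hypervolume_2d(random_order, ref)
print(hv_guided > hv_random)
```

In a real ablation the comparison would be repeated over several seeds, since a single pair of runs says little about a stochastic search.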
Original abstract
Serving Large Language Models (LLMs) often requires choosing between stronger reasoning and lower inference cost. Model merging offers a practical way to build several models between a reasoning-oriented model and a cheaper base model, but common model-level merging methods usually control this trade-off with only one or two global knobs. We study this setting as a multi-objective optimization problem: instead of producing one merged model, the goal is to find a set of merged models that cover different accuracy–token-cost preferences. Layer-wise merging is more flexible because it can assign different merge weights to different Transformer layers. However, it introduces two practical challenges. First, the layer-wise search space is large, and existing methods often search it without using helpful signals from the source models. Second, LLM evaluations can take very different amounts of time, so synchronous batch optimization wastes GPU time while waiting for slow evaluations. We propose Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM). AP-BMM uses parameter and reasoning-activation differences between the source models to suggest which layers should matter early in the search. It also uses an asynchronous Bayesian optimization loop that accounts for candidate models already being evaluated. A lightweight reranking step further spreads candidates across the accuracy–cost trade-off. Under fixed evaluation budgets, AP-BMM achieves stronger Pareto-set quality and broader trade-off coverage than synchronous layer-wise baselines and representative model-level merging baselines. Compared with the synchronous Bayesian baseline, it also reduces wall-clock time by improving GPU utilization. Code: https://github.com/MiLab-HITSZ/AP-BMM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM) to approximate capability-cost Pareto sets for LLMs by layer-wise merging of a reasoning-oriented model with a cheaper base model. It computes parameter and reasoning-activation differences to rank layers for early prioritization within a Bayesian optimizer, employs an asynchronous evaluation loop to improve GPU utilization, and applies a lightweight reranking step to spread candidates across the trade-off surface. Under fixed evaluation budgets, AP-BMM is claimed to produce higher-quality Pareto sets with broader coverage than synchronous layer-wise and model-level baselines while also reducing wall-clock time.
Significance. If the empirical superiority holds, the work addresses a practical need for efficient generation of multiple merged LLMs spanning accuracy-cost preferences, which is relevant for serving scenarios. The public code repository is a positive factor for reproducibility. The core innovation—the use of source-model differences as a prior—could reduce search cost in large layer-wise spaces, but its contribution requires clearer isolation from the asynchronous scheduler.
major comments (2)
- [§3.2] The prior derived from parameter and reasoning-activation differences is presented as supplying reliable early signals for layer prioritization, yet no ablation, correlation analysis, or direct measurement is shown linking high-difference layers to changes in the accuracy-cost Pareto front. Without this evidence, observed gains could be driven solely by the asynchronous scheduler or reranking, undermining the claim of prior-guided superiority over synchronous layer-wise baselines.
- [Results, assumed §4–5] The abstract asserts stronger Pareto-set quality and broader trade-off coverage under fixed budgets, but the manuscript description supplies no quantitative tables, statistical significance tests, or per-baseline ablation breakdowns. Soundness therefore rests on external code rather than verifiable in-text derivations or figures, which is insufficient for a central empirical claim.
minor comments (1)
- [Method] The description of the lightweight reranking step and its exact contribution to spread should be expanded with pseudocode or a dedicated paragraph to allow replication.
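Since the reranking rule itself is not described on this page, the following is only a generic example of what a spread-oriented selection can look like: greedy max-min (farthest-point) selection on range-normalized objectives. It is not the paper's reranking step. `spread_select` and the data are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (accuracy, token_cost)


def spread_select(points: List[Point], k: int) -> List[Point]:
    """Greedily pick k points that are far apart in normalized objective space."""
    if k >= len(points):
        return list(points)
    accs = [p[0] for p in points]
    costs = [p[1] for p in points]
    # Normalize by each objective's range so neither axis dominates the distance.
    acc_span = (max(accs) - min(accs)) or 1.0
    cost_span = (max(costs) - min(costs)) or 1.0

    def dist(a: Point, b: Point) -> float:
        return (((a[0] - b[0]) / acc_span) ** 2 + ((a[1] - b[1]) / cost_span) ** 2) ** 0.5

    # Assumption: start from the cheapest point, then add the farthest remaining one.
    chosen = [min(points, key=lambda p: p[1])]
    while len(chosen) < k:
        remaining = [p for p in points if p not in chosen]
        chosen.append(max(remaining, key=lambda p: min(dist(p, c) for c in chosen)))
    return chosen


# Invented candidate set with two near-duplicate cheap models.
pool = [(0.45, 800), (0.46, 900), (0.55, 3000), (0.62, 4000), (0.66, 9000)]
selected = spread_select(pool, k=3)
print(selected)
```

The near-duplicate cheap model is skipped in favor of points that cover other regions of the trade-off, which is the behavior a spread-oriented reranker is meant to encourage.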
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will strengthen the manuscript accordingly to better isolate the prior's contribution and make the empirical results self-contained.
- Referee: [§3.2] The prior derived from parameter and reasoning-activation differences is presented as supplying reliable early signals for layer prioritization, yet no ablation, correlation analysis, or direct measurement is shown linking high-difference layers to changes in the accuracy-cost Pareto front. Without this evidence, observed gains could be driven solely by the asynchronous scheduler or reranking, undermining the claim of prior-guided superiority over synchronous layer-wise baselines.
  Authors: We agree that clearer isolation of the prior's effect is needed. In the revision we will add an ablation comparing runs with and without the difference-based prior (keeping the asynchronous scheduler and reranking fixed), plus a correlation analysis between per-layer parameter/activation differences and their measured impact on Pareto-front quality. These additions will be placed in §3.2 and the results section.
  Revision: yes
- Referee: [Results, assumed §4–5] The abstract asserts stronger Pareto-set quality and broader trade-off coverage under fixed budgets, but the manuscript description supplies no quantitative tables, statistical significance tests, or per-baseline ablation breakdowns. Soundness therefore rests on external code rather than verifiable in-text derivations or figures, which is insufficient for a central empirical claim.
  Authors: We acknowledge the current text lacks sufficient in-paper quantitative detail. The revised manuscript will include expanded results tables reporting hypervolume, coverage, and spread metrics for all methods, Wilcoxon signed-rank tests for statistical significance, and explicit per-component ablation breakdowns (prior, async scheduler, reranking). Key tables and figures will be added to §§4–5 so that the central claims are verifiable directly from the paper.
  Revision: yes
Circularity Check
No significant circularity: AP-BMM is an external-search algorithm whose Pareto quality is measured on held-out accuracy and cost metrics.
The manuscript describes a Bayesian optimization procedure that uses parameter/activation differences only to initialize layer priorities and an asynchronous scheduler to improve GPU utilization. All reported gains are quantified via external evaluation budgets on standard benchmarks; no equation, fitted parameter, or self-citation reduces the claimed Pareto-set quality to a quantity defined by the method itself. The prior is a heuristic input, not a self-referential definition, and the optimizer's outputs are assessed against independent accuracy-cost measurements. This is the normal case of a search algorithm whose performance is externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Bayesian optimization efficiently explores high-dimensional merge-weight spaces when guided by cheap priors.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose a Structural Importance Prior (SIP) that leverages layer-wise task-vector differences to guide Bayesian optimization. By converting architectural sensitivity into lengthscale priors, SIP enables effective warm-starting..."
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we adopt the hierarchical SAAS prior on inverse squared lengthscales ρ_d = 1/ℓ²_d"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.