arXiv: 2512.09972 · v6 · submitted 2025-12-10 · cs.LG · cs.CL · cs.NE

AP-BMM: Approximating Capability-Cost Pareto Sets of LLMs via Asynchronous Prior-Guided Bayesian Model Merging

Kesheng Chen, Yamin Hu, Zhenqian Zhu, Yiya Diao, Wenjian Luo

Pith reviewed 2026-05-16 23:16 UTC · model grok-4.3

classification cs.LG · cs.CL · cs.NE

keywords model merging · Pareto optimization · Bayesian optimization · LLM efficiency · multi-objective optimization · layer-wise merging · asynchronous evaluation · trade-off coverage

The pith

Asynchronous prior-guided Bayesian merging finds higher-quality Pareto sets of accuracy-cost trade-offs for LLMs than synchronous layer-wise or model-level baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames LLM serving as a multi-objective problem of producing families of merged models that span different accuracy versus token-cost preferences rather than a single compromise. It shows that layer-wise merging can create more flexible trade-offs but requires searching a large space efficiently. AP-BMM supplies early guidance by ranking layers according to parameter and activation differences between the source models and runs Bayesian optimization asynchronously so that fast evaluations do not wait for slow ones. Under a fixed budget of evaluations this produces merged models that cover more of the desired trade-off surface and reach better points on it than either synchronous Bayesian search or conventional global merging methods.

Core claim

AP-BMM approximates the capability-cost Pareto set by using differences in parameters and reasoning activations between a reasoning-oriented source model and a cheaper base model to prioritize which Transformer layers receive merge weights first, combined with an asynchronous Bayesian optimization loop that launches new candidates without waiting for pending evaluations to finish, plus a lightweight reranking step that spreads the final candidates across the accuracy-cost plane.
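To make the layer-wise search space concrete, the sketch below shows what "one merge weight per layer" can look like. It assumes plain linear interpolation between the two checkpoints and a toy dictionary representation of the weights; the merge operator actually used in the paper is not stated on this page, and `merge_layerwise`, `Weights`, `base`, and `expert` are hypothetical names.

```python
from typing import Dict, List

# Hypothetical toy representation: layer index -> flattened parameter list.
Weights = Dict[int, List[float]]


def merge_layerwise(base: Weights, expert: Weights, alphas: List[float]) -> Weights:
    """Per-layer linear interpolation: (1 - a) * base + a * expert."""
    layers = sorted(base)
    if layers != sorted(expert) or len(alphas) != len(layers):
        raise ValueError("both models and `alphas` must cover the same layers")
    merged: Weights = {}
    for layer, a in zip(layers, alphas):
        merged[layer] = [
            (1.0 - a) * b + a * e for b, e in zip(base[layer], expert[layer])
        ]
    return merged


base = {0: [0.0, 0.0], 1: [1.0, 1.0]}
expert = {0: [2.0, 4.0], 1: [3.0, 5.0]}
# Layer 0 is blended halfway; layer 1 stays at the base model.
print(merge_layerwise(base, expert, [0.5, 0.0]))
# → {0: [1.0, 2.0], 1: [1.0, 1.0]}
```

Under this reading, a model-level merge is the special case where every entry of `alphas` is the same, which is why the layer-wise space is both more flexible and much larger to search.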

What carries the argument

Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM), which injects source-model difference signals into the acquisition function of Bayesian optimization and decouples evaluation scheduling from completion order.
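The scheduling idea, launching a new candidate as soon as any worker frees up, can be sketched with the standard library alone. This is an illustration under loud assumptions: `evaluate` is a toy stand-in for a benchmark run, and `propose` is a placeholder that merely avoids points already finished or pending; it is not a Gaussian-process acquisition function and not the paper's optimizer.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def evaluate(alpha):
    """Stand-in for a benchmark run whose duration depends on the candidate."""
    time.sleep(0.005 + 0.02 * alpha)   # toy: costlier candidates take longer
    return alpha, 1.0 - alpha          # toy (accuracy, cheapness) scores


def propose(done, pending, rng):
    """Placeholder proposer. A Bayesian optimizer would fit a surrogate on
    `done` and discount regions near `pending`; this one only avoids
    re-sampling close to points that are finished or still running."""
    taken = [a for a, _ in done] + list(pending)
    for _ in range(100):
        a = rng.random()
        if all(abs(a - t) > 0.02 for t in taken):
            return a
    return rng.random()


def async_search(budget=12, workers=3, seed=0):
    rng = random.Random(seed)
    done, in_flight, launched = [], {}, 0   # in_flight: future -> candidate
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while launched < budget or in_flight:
            # Refill idle workers right away; no barrier at the end of a batch.
            while launched < budget and len(in_flight) < workers:
                a = propose(done, in_flight.values(), rng)
                in_flight[pool.submit(evaluate, a)] = a
                launched += 1
            finished, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in finished:
                done.append((in_flight.pop(fut), fut.result()))
    return done
```

A synchronous variant would wait for all `workers` evaluations before proposing the next batch, so the slowest candidate sets the pace of every round; the loop above removes that barrier.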

If this is right

Under a fixed number of model evaluations the method returns a set of merged models whose accuracy-cost curve dominates the curves obtained from synchronous layer-wise search and from standard model-level merging.
Wall-clock time to reach a given Pareto quality drops because the asynchronous scheduler keeps GPUs occupied instead of idling on the slowest evaluations.
The final reranked set covers a wider range of operating points, allowing a practitioner to select a model whose inference cost matches a target latency or throughput constraint.
The same search procedure can be applied to any pair of source models where one is stronger but more expensive and the other is cheaper but weaker.
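The implications above all rest on the notion of one accuracy-cost point dominating another. A minimal sketch of that relation and of the non-dominated filter follows; the tuples are invented for illustration and are not results from the paper.

```python
def dominates(p, q):
    """True if p is at least as good as q on both objectives and strictly
    better on one. Points are (accuracy, tokens): accuracy is maximized,
    token cost is minimized."""
    return p[0] >= q[0] and p[1] <= q[1] and (p[0] > q[0] or p[1] < q[1])


def pareto_set(points):
    """Keep the points that no other point dominates (O(n^2), fine for small n)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented (accuracy, tokens) pairs for five hypothetical merged models.
candidates = [(0.62, 4000), (0.66, 9100), (0.45, 800), (0.64, 3400), (0.60, 5000)]
print(pareto_set(candidates))
# → [(0.66, 9100), (0.45, 800), (0.64, 3400)]
```

Here `(0.64, 3400)` is both more accurate and cheaper than `(0.62, 4000)` and `(0.60, 5000)`, so those two drop out, while the remaining three represent genuinely different operating points.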

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the prior signals remain informative across model families, the same guidance could accelerate merging for non-Transformer architectures without hand-crafted heuristics.
The asynchronous schedule might be combined with early-stopping or multi-fidelity evaluation to further reduce total compute when many candidates are cheap to score.
Once a high-quality Pareto set exists, downstream systems could dynamically swap among the merged models at inference time based on current load or user-specified cost limits.

Load-bearing premise

Differences in parameters and reasoning activations between the two source models give reliable early signals about which layers should receive non-default merge weights.
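One way to picture this premise is as a per-layer score built from the two difference signals. The sketch below is an assumption-heavy illustration: it uses the L2 norm of the difference for both signals, normalizes each signal to sum to one, and mixes them with an invented weight `mix`. The page does not specify how the paper combines the signals, so `layer_prior` and its arguments are hypothetical.

```python
import math


def l2_diff(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalize(xs):
    """Scale non-negative scores to sum to 1 (uniform if all are zero)."""
    total = sum(xs)
    return [x / total for x in xs] if total > 0 else [1.0 / len(xs)] * len(xs)


def layer_prior(base_params, expert_params, base_acts, expert_acts, mix=0.5):
    """Return per-layer scores and the layers ordered from most to least
    different. `mix` (hypothetical) weights parameter vs. activation signal."""
    p = normalize([l2_diff(b, e) for b, e in zip(base_params, expert_params)])
    a = normalize([l2_diff(b, e) for b, e in zip(base_acts, expert_acts)])
    scores = [mix * pi + (1.0 - mix) * ai for pi, ai in zip(p, a)]
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return scores, ranking


# Toy three-layer example: the deepest layer differs most on both signals.
scores, ranking = layer_prior(
    base_params=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    expert_params=[[0.1, 0.0], [0.0, 1.0], [3.0, 4.0]],
    base_acts=[[0.0], [0.0], [0.0]],
    expert_acts=[[1.0], [1.0], [2.0]],
)
print(ranking)
# → [2, 1, 0]
```

Whether such a ranking actually predicts which layers move the accuracy-cost frontier is exactly the premise under question; the score itself is cheap to compute because it needs no benchmark evaluations.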

What would settle it

Run the same search budget with the prior-ranking step replaced by random layer ordering and measure whether the resulting Pareto front quality falls measurably below the guided version on the same evaluation metrics.
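Running that comparison requires a scalar measure of front quality. Hypervolume is one common choice, sketched here for the two-objective case; the reference point and both fronts are invented for illustration, and in practice the objectives would usually be rescaled to comparable ranges first.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref`.

    Each point is (accuracy, tokens): accuracy is maximized, tokens minimized.
    `ref` is a (worst_accuracy, worst_tokens) corner chosen by the analyst.
    """
    # Convert to gains over the reference so that both axes are maximized.
    gains = [(acc - ref[0], ref[1] - tok) for acc, tok in points]
    gains = [g for g in gains if g[0] > 0 and g[1] > 0]
    gains.sort(key=lambda g: (-g[0], -g[1]))   # highest accuracy gain first
    area, best_tok_gain = 0.0, 0.0
    for acc_gain, tok_gain in gains:
        if tok_gain > best_tok_gain:           # otherwise the point is dominated
            area += acc_gain * (tok_gain - best_tok_gain)
            best_tok_gain = tok_gain
    return area


ref = (0.0, 10000)                                        # invented reference corner
guided = [(0.6, 4000), (0.5, 1000), (0.4, 6000)]          # invented front, prior on
shuffled = [(0.55, 5000)]                                 # invented front, random order
print(hypervolume_2d(guided, ref) > hypervolume_2d(shuffled, ref))
```

The proposed ablation would then repeat both configurations over several seeds and check whether the guided runs score consistently higher under the same budget.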

Figures

Figures reproduced from arXiv: 2512.09972 by Kesheng Chen, Wenjian Luo, Yamin Hu, Yiya Diao, Zhenqian Zhu.

**Figure 1.** Schematic comparison of Pareto fronts constructed via different merging strategies. SIP-BMM leverages structural importance…

**Figure 2.** Distribution of merged models in objective space.

**Figure 3.** Visualization of the Structural Importance Prior (SIP).

**Figure 5.** Pareto front on benchmarks (Tokens vs. Accuracy).

**Figure 6.** Ablation study results in objective space.

**Figure 8.** Layer-wise ℓ1-norm and ℓ2-norm differences between the base model and expert models. We observe that the parameter differences are primarily concentrated in deeper layers, while shallow layers remain relatively stable. This aligns with the common understanding that deeper layers contribute more to high-level semantic processing (e.g., logical reasoning), whereas earlier layers capture mo…

Original abstract

Serving Large Language Models (LLMs) often requires choosing between stronger reasoning and lower inference cost. Model merging offers a practical way to build several models between a reasoning-oriented model and a cheaper base model, but common model-level merging methods usually control this trade-off with only one or two global knobs. We study this setting as a multi-objective optimization problem: instead of producing one merged model, the goal is to find a set of merged models that cover different accuracy–token-cost preferences. Layer-wise merging is more flexible because it can assign different merge weights to different Transformer layers. However, it introduces two practical challenges. First, the layer-wise search space is large, and existing methods often search it without using helpful signals from the source models. Second, LLM evaluations can take very different amounts of time, so synchronous batch optimization wastes GPU time while waiting for slow evaluations. We propose Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM). AP-BMM uses parameter and reasoning-activation differences between the source models to suggest which layers should matter early in the search. It also uses an asynchronous Bayesian optimization loop that accounts for candidate models already being evaluated. A lightweight reranking step further spreads candidates across the accuracy–cost trade-off. Under fixed evaluation budgets, AP-BMM achieves stronger Pareto-set quality and broader trade-off coverage than synchronous layer-wise baselines and representative model-level merging baselines. Compared with the synchronous Bayesian baseline, it also reduces wall-clock time by improving GPU utilization. Code: https://github.com/MiLab-HITSZ/AP-BMM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AP-BMM gives a practical async Bayesian recipe for layer-wise LLM merging that improves GPU use and Pareto coverage, but the value of its parameter/activation priors still needs direct checks.

The paper's main offering is AP-BMM: it ranks layers by parameter and reasoning-activation differences between a strong and a cheap source model, then runs asynchronous Bayesian optimization over merge weights to produce a set of merged models along the accuracy-token cost curve. The async loop and final reranking step are meant to cut wall-clock time and spread the trade-offs better than synchronous layer-wise or model-level baselines. Code is released, which helps anyone who wants to try it on their own models.

That combination is new enough in the cited literature and addresses real deployment pain points where you want several variants without retraining from scratch. The async scheduler looks like the part most likely to deliver measurable time savings under fixed evaluation budgets. The prior guidance is a reasonable engineering bet if the differences really flag the layers that move the frontier.

The soft spot is exactly there. The abstract and stress-test note do not show a direct check that high-difference layers are the ones whose weights most affect accuracy or cost; the reported gains could trace more to the async schedule or reranking than to the prior. Without ablations that isolate the prior or tables with statistical tests, it is hard to know how much it adds versus standard async BO. Experiments are described only at a high level, so the claims rest on the linked code and full results.

This work is for groups already doing model merging or building cost-aware LLM serving stacks. A practitioner who needs multiple accuracy-cost points from existing checkpoints will find the method and code worth testing. It is solid enough on the engineering side and grounded enough in a clear problem to deserve a serious referee who can look at the full experiments and ablations.

Referee Report

2 major / 1 minor

Summary. The paper introduces Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM) to approximate capability-cost Pareto sets for LLMs by layer-wise merging of a reasoning-oriented model with a cheaper base model. It computes parameter and reasoning-activation differences to rank layers for early prioritization within a Bayesian optimizer, employs an asynchronous evaluation loop to improve GPU utilization, and applies a lightweight reranking step to spread candidates across the trade-off surface. Under fixed evaluation budgets, AP-BMM is claimed to produce higher-quality Pareto sets with broader coverage than synchronous layer-wise and model-level baselines while also reducing wall-clock time.

Significance. If the empirical superiority holds, the work addresses a practical need for efficient generation of multiple merged LLMs spanning accuracy-cost preferences, which is relevant for serving scenarios. The public code repository is a positive factor for reproducibility. The core innovation—the use of source-model differences as a prior—could reduce search cost in large layer-wise spaces, but its contribution requires clearer isolation from the asynchronous scheduler.

major comments (2)

[§3.2] The prior derived from parameter and reasoning-activation differences is presented as supplying reliable early signals for layer prioritization, yet no ablation, correlation analysis, or direct measurement is shown linking high-difference layers to changes in the accuracy-cost Pareto front. Without this evidence, observed gains could be driven solely by the asynchronous scheduler or reranking, undermining the claim of prior-guided superiority over synchronous layer-wise baselines.
[Results, assumed §4–5] The abstract asserts stronger Pareto-set quality and broader trade-off coverage under fixed budgets, but the manuscript description supplies no quantitative tables, statistical significance tests, or per-baseline ablation breakdowns. Soundness therefore rests on external code rather than verifiable in-text derivations or figures, which is insufficient for a central empirical claim.

minor comments (1)

[Method] The description of the lightweight reranking step and its exact contribution to spread should be expanded with pseudocode or a dedicated paragraph to allow replication.
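For orientation, one generic way to spread a candidate set across the trade-off plane is greedy farthest-point selection in normalized objective space. The sketch below shows that generic technique only; it is not the paper's reranking step, whose details are what the comment above asks for, and `spread_rerank` and the sample points are hypothetical.

```python
import math


def spread_rerank(points, k):
    """Greedily pick up to k points so that each new pick maximizes its
    distance to the closest point already chosen (max-min selection).
    Points are (accuracy, tokens); both axes are min-max normalized first."""
    if not points or k <= 0:
        return []
    lo = [min(p[d] for p in points) for d in range(2)]
    hi = [max(p[d] for p in points) for d in range(2)]

    def norm(p):
        return tuple(
            (p[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
            for d in range(2)
        )

    z = [norm(p) for p in points]
    # Start from the most accurate candidate (an arbitrary but fixed choice).
    chosen = [max(range(len(points)), key=lambda i: z[i][0])]
    while len(chosen) < min(k, len(points)):
        rest = [i for i in range(len(points)) if i not in chosen]
        nxt = max(rest, key=lambda i: min(math.dist(z[i], z[j]) for j in chosen))
        chosen.append(nxt)
    return [points[i] for i in chosen]


# Invented candidates: two of them sit almost on top of each other.
front = [(0.66, 9100), (0.65, 8800), (0.64, 3400), (0.45, 800)]
print(spread_rerank(front, 3))
# → [(0.66, 9100), (0.45, 800), (0.64, 3400)]
```

The near-duplicate `(0.65, 8800)` is skipped because it adds almost no coverage once `(0.66, 9100)` is selected, which is the behavior a spread-oriented reranker is meant to produce.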

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will strengthen the manuscript accordingly to better isolate the prior's contribution and make the empirical results self-contained.

Referee: [§3.2] The prior derived from parameter and reasoning-activation differences is presented as supplying reliable early signals for layer prioritization, yet no ablation, correlation analysis, or direct measurement is shown linking high-difference layers to changes in the accuracy-cost Pareto front. Without this evidence, observed gains could be driven solely by the asynchronous scheduler or reranking, undermining the claim of prior-guided superiority over synchronous layer-wise baselines.

Authors: We agree that clearer isolation of the prior's effect is needed. In the revision we will add an ablation comparing runs with and without the difference-based prior (keeping the asynchronous scheduler and reranking fixed), plus a correlation analysis between per-layer parameter/activation differences and their measured impact on Pareto-front quality. These additions will be placed in §3.2 and the results section.
revision: yes

Referee: [Results, assumed §4–5] The abstract asserts stronger Pareto-set quality and broader trade-off coverage under fixed budgets, but the manuscript description supplies no quantitative tables, statistical significance tests, or per-baseline ablation breakdowns. Soundness therefore rests on external code rather than verifiable in-text derivations or figures, which is insufficient for a central empirical claim.

Authors: We acknowledge the current text lacks sufficient in-paper quantitative detail. The revised manuscript will include expanded results tables reporting hypervolume, coverage, and spread metrics for all methods, Wilcoxon signed-rank tests for statistical significance, and explicit per-component ablation breakdowns (prior, async scheduler, reranking). Key tables and figures will be added to §§4–5 so that the central claims are verifiable directly from the paper.
revision: yes
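For readers who want to see what the promised significance test involves, here is a small exact Wilcoxon signed-rank test written with the standard library. It is a sketch for small paired samples (it enumerates all sign patterns, so roughly n ≤ 20); the inputs are arbitrary integers chosen to avoid floating-point ties and are not data from the paper. With SciPy available, `scipy.stats.wilcoxon` would be the usual route.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped; tied absolute differences share their
    average rank. Returns (W_plus, p_value)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1            # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2              # expected W_plus under the null
    observed = abs(w_plus - center)
    # Under the null every sign pattern is equally likely: enumerate all 2^n.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Arbitrary paired scores, e.g. one metric per seed for two methods.
w, p = wilcoxon_signed_rank([10, 12, 15, 11, 14, 13], [7, 11, 10, 9, 8, 9])
print(w, p)
# → 21.0 0.03125
```

With six pairs that all favor the first method, the smallest achievable two-sided p-value is 2/64, which illustrates why the number of seeds or benchmarks matters for such tests.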

Circularity Check

0 steps flagged

No significant circularity: AP-BMM is an external-search algorithm whose Pareto quality is measured on held-out accuracy and cost metrics.

The manuscript describes a Bayesian optimization procedure that uses parameter/activation differences only to initialize layer priorities and an asynchronous scheduler to improve GPU utilization. All reported gains are quantified via external evaluation budgets on standard benchmarks; no equation, fitted parameter, or self-citation reduces the claimed Pareto-set quality to a quantity defined by the method itself. The prior is a heuristic input, not a self-referential definition, and the optimizer's outputs are assessed against independent accuracy-cost measurements. This is the normal case of a search algorithm whose performance is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard Bayesian optimization assumptions and the domain premise that layer-wise merge weights can be searched efficiently when guided by source-model differences; no new entities or ad-hoc constants are introduced.

axioms (1)

domain assumption: Bayesian optimization efficiently explores high-dimensional merge-weight spaces when guided by cheap priors.
Invoked to justify early prioritization of layers via parameter and activation differences.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

We propose a Structural Importance Prior (SIP) that leverages layer-wise task-vector differences to guide Bayesian optimization. By converting architectural sensitivity into lengthscale priors, SIP enables effective warm-starting...

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

we adopt the hierarchical SAAS prior on inverse squared lengthscales ρ_d = 1/ℓ²_d
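For readers unfamiliar with the prior named in the quoted passage, the hierarchical SAAS prior is usually written as below, with HC denoting a half-Cauchy distribution and an RBF kernel shown for concreteness. This is the standard formulation as this editor understands it, offered as background; how the paper's Structural Importance Prior modifies these distributions is not stated in the passages quoted on this page.

```latex
\tau \sim \mathcal{HC}(\alpha), \qquad
\rho_d \mid \tau \sim \mathcal{HC}(\tau), \qquad
\rho_d = \frac{1}{\ell_d^{2}}, \quad d = 1, \dots, D,
\qquad
k(x, x') = \sigma_k^{2} \exp\!\Big( -\tfrac{1}{2} \sum_{d=1}^{D} \rho_d \, (x_d - x'_d)^{2} \Big)
```

A small global scale τ pushes most ρ_d toward zero, which corresponds to long lengthscales ℓ_d and therefore to dimensions (here, layers) that the surrogate treats as nearly irrelevant unless the data say otherwise.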

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
