pith. machine review for the scientific record.

arXiv: 2512.09972 · v6 · submitted 2025-12-10 · cs.LG · cs.CL · cs.NE

AP-BMM: Approximating Capability-Cost Pareto Sets of LLMs via Asynchronous Prior-Guided Bayesian Model Merging

Pith reviewed 2026-05-16 23:16 UTC · model grok-4.3

keywords: model merging · Pareto optimization · Bayesian optimization · LLM efficiency · multi-objective optimization · layer-wise merging · asynchronous evaluation · trade-off coverage

The pith

Asynchronous prior-guided Bayesian merging finds higher-quality Pareto sets of accuracy-cost trade-offs for LLMs than synchronous layer-wise or model-level baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames LLM serving as a multi-objective problem of producing families of merged models that span different accuracy versus token-cost preferences rather than a single compromise. It shows that layer-wise merging can create more flexible trade-offs but requires searching a large space efficiently. AP-BMM supplies early guidance by ranking layers according to parameter and activation differences between the source models and runs Bayesian optimization asynchronously so that fast evaluations do not wait for slow ones. Under a fixed budget of evaluations this produces merged models that cover more of the desired trade-off surface and reach better points on it than either synchronous Bayesian search or conventional global merging methods.

Core claim

AP-BMM approximates the capability-cost Pareto set by using differences in parameters and reasoning activations between a reasoning-oriented source model and a cheaper base model to prioritize which Transformer layers receive merge weights first, combined with an asynchronous Bayesian optimization loop that launches new candidates without waiting for pending evaluations to finish, plus a lightweight reranking step that spreads the final candidates across the accuracy-cost plane.
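
To make "layer-wise merge weights" concrete, here is a minimal sketch, assuming each layer is merged by linear interpolation between the base and reasoning-oriented weights. The interpolation form, the toy dictionary models, and the name `merge_layerwise` are illustrative assumptions; the page does not state which merge operator the paper uses.

```python
from typing import Dict, List

# Hypothetical toy "model": maps a layer name to a flat list of weights.
Model = Dict[str, List[float]]


def merge_layerwise(base: Model, expert: Model, alphas: Dict[str, float]) -> Model:
    """Merge each layer as (1 - a) * base + a * expert with a per-layer weight a.

    Assumption: linear interpolation. A model-level merge is the special case
    where every layer shares the same a.
    """
    merged: Model = {}
    for name, base_w in base.items():
        a = alphas.get(name, 0.0)  # assumed default: keep the base layer unchanged
        expert_w = expert[name]
        merged[name] = [(1.0 - a) * b + a * e for b, e in zip(base_w, expert_w)]
    return merged


base = {"layer0": [0.0, 2.0], "layer1": [1.0, 1.0]}
expert = {"layer0": [1.0, 4.0], "layer1": [3.0, 5.0]}

# layer0 stays at the base weights, layer1 moves halfway toward the expert.
merged = merge_layerwise(base, expert, {"layer0": 0.0, "layer1": 0.5})
```

With one weight per Transformer layer, the search space has as many dimensions as the model has layers, which is why the search needs guidance.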

What carries the argument

Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM), which injects source-model difference signals into the acquisition function of Bayesian optimization and decouples evaluation scheduling from completion order.
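
The scheduling idea can be sketched with the Python standard library. This is a sketch under stated assumptions, not the authors' implementation: `evaluate` is a hypothetical stand-in for a benchmark run of variable duration, and `propose` is a random placeholder where a real optimizer would score candidates with an acquisition function conditioned on both finished and pending evaluations.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def evaluate(x: float) -> tuple:
    """Hypothetical evaluation whose duration depends on the candidate."""
    time.sleep(0.01 * (1 + int(x * 3)))  # larger x takes longer
    return (x, 1.0 - x)                  # toy (accuracy, cost score) pair


def propose(observed: list, pending: list, rng: random.Random) -> float:
    """Placeholder acquisition step (random). It receives the pending
    candidates so that a real optimizer could account for them."""
    return rng.random()


def async_search(budget: int, workers: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    observed = []   # finished (candidate, objectives) pairs
    running = {}    # future -> candidate still being evaluated
    launched = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(observed) < budget:
            # Refill idle workers right away instead of waiting for a full batch.
            while launched < budget and len(running) < workers:
                x = propose(observed, list(running.values()), rng)
                running[pool.submit(evaluate, x)] = x
                launched += 1
            # Resume as soon as any single evaluation finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                observed.append((running.pop(fut), fut.result()))
    return observed


results = async_search(budget=8, workers=3)
```

A synchronous variant would wait for all workers before proposing the next batch, so every batch takes as long as its slowest evaluation.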

If this is right

  • Under a fixed number of model evaluations the method returns a set of merged models whose accuracy-cost curve dominates the curves obtained from synchronous layer-wise search and from standard model-level merging.
  • Wall-clock time to reach a given Pareto quality drops because the asynchronous scheduler keeps GPUs occupied instead of idling on the slowest evaluations.
  • The final reranked set covers a wider range of operating points, allowing a practitioner to select a model whose inference cost matches a target latency or throughput constraint.
  • The same search procedure can be applied to any pair of source models where one is stronger but more expensive and the other is cheaper but weaker.
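
The notion of one accuracy-cost curve dominating another rests on Pareto dominance between individual models. A minimal sketch, assuming two objectives (accuracy to maximize, generated tokens to minimize) and made-up candidate values that are not taken from the paper:

```python
from typing import List, Tuple

# (accuracy, tokens): higher accuracy is better, fewer tokens is better.
Point = Tuple[float, float]


def dominates(p: Point, q: Point) -> bool:
    """True if p is at least as good as q on both objectives and strictly
    better on at least one."""
    return p[0] >= q[0] and p[1] <= q[1] and (p[0] > q[0] or p[1] < q[1])


def pareto_set(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative, invented candidates.
candidates = [(0.62, 4000.0), (0.66, 9000.0), (0.45, 800.0), (0.60, 5000.0)]
front = pareto_set(candidates)
```

Here the last candidate is dropped because the first one is both more accurate and cheaper; the remaining three are mutually non-dominated and form the trade-off set a practitioner would choose from.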

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the prior signals remain informative across model families, the same guidance could accelerate merging for non-Transformer architectures without hand-crafted heuristics.
  • The asynchronous schedule might be combined with early-stopping or multi-fidelity evaluation to further reduce total compute when many candidates are cheap to score.
  • Once a high-quality Pareto set exists, downstream systems could dynamically swap among the merged models at inference time based on current load or user-specified cost limits.

Load-bearing premise

Differences in parameters and reasoning activations between the two source models give reliable early signals about which layers should receive non-default merge weights.
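
One way such a ranking could be computed is sketched below. The weighted-sum combination, the `weight` parameter, and the precomputed `act_diff` values are hypothetical; the page says only that parameter and reasoning-activation differences are used, not how they are combined.

```python
import math
from typing import Dict, List

Model = Dict[str, List[float]]  # hypothetical toy model: layer name -> flat weights


def layer_scores(base: Model, expert: Model,
                 act_diff: Dict[str, float], weight: float = 0.5) -> Dict[str, float]:
    """Score each layer by a weighted sum of its L2 parameter difference and
    a precomputed activation difference (assumed combination rule)."""
    scores = {}
    for name in base:
        l2 = math.sqrt(sum((e - b) ** 2 for b, e in zip(base[name], expert[name])))
        scores[name] = weight * l2 + (1.0 - weight) * act_diff.get(name, 0.0)
    return scores


def rank_layers(scores: Dict[str, float]) -> List[str]:
    """Layers with the largest differences come first."""
    return sorted(scores, key=scores.get, reverse=True)


base = {"layer0": [0.0, 0.0], "layer1": [0.0, 0.0], "layer2": [0.0, 0.0]}
expert = {"layer0": [0.1, 0.0], "layer1": [3.0, 4.0], "layer2": [1.0, 0.0]}
act_diff = {"layer0": 0.0, "layer1": 1.0, "layer2": 2.0}  # invented values

ranking = rank_layers(layer_scores(base, expert, act_diff))
```

Whether a ranking of this kind predicts which layers actually move the Pareto front is exactly what the premise takes for granted.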

What would settle it

Run the same search budget with the prior-ranking step replaced by random layer ordering and measure whether the resulting Pareto front quality falls measurably below the guided version on the same evaluation metrics.
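
Front quality in such a comparison is commonly summarized by hypervolume. The sketch below computes it for two maximized objectives against a reference point; the choice of hypervolume, the reference point, and the two example fronts are assumptions for illustration, not results from the paper.

```python
from typing import Iterable, Tuple


def hypervolume_2d(points: Iterable[Tuple[float, float]],
                   ref: Tuple[float, float] = (0.0, 0.0)) -> float:
    """Area dominated by `points` (both objectives maximized) above `ref`."""
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            # Add the horizontal strip this point contributes beyond earlier ones.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Invented fronts standing in for the guided and random-order runs.
guided_front = [(0.9, 0.3), (0.6, 0.8)]
random_front = [(0.8, 0.3), (0.5, 0.7)]

hv_guided = hypervolume_2d(guided_front)
hv_random = hypervolume_2d(random_front)
```

Repeating both runs over several seeds and comparing the resulting hypervolume values would show whether the gap exceeds run-to-run noise.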

Figures

Figures reproduced from arXiv: 2512.09972 by Kesheng Chen, Wenjian Luo, Yamin Hu, Yiya Diao, Zhenqian Zhu.

Figure 1: Schematic comparison of Pareto fronts constructed via different merging strategies. SIP-BMM leverages structural importance …
Figure 2: Distribution of merged models in objective space.
Figure 3: Visualization of the Structural Importance Prior (SIP).
Figure 5: Pareto front on benchmarks (Tokens vs. Accuracy).
Figure 6: Ablation study results in objective space.
Figure 8 visualizes the layer-wise ℓ1-norm and ℓ2-norm differences between the base model and expert models. We observe that the parameter differences are primarily concentrated in deeper layers, while shallow layers remain relatively stable. This aligns with the common understanding that deeper layers contribute more to high-level semantic processing (e.g., logical reasoning), whereas earlier layers capture mo…
Original abstract

Serving Large Language Models (LLMs) often requires choosing between stronger reasoning and lower inference cost. Model merging offers a practical way to build several models between a reasoning-oriented model and a cheaper base model, but common model-level merging methods usually control this trade-off with only one or two global knobs. We study this setting as a multi-objective optimization problem: instead of producing one merged model, the goal is to find a set of merged models that cover different accuracy–token-cost preferences. Layer-wise merging is more flexible because it can assign different merge weights to different Transformer layers. However, it introduces two practical challenges. First, the layer-wise search space is large, and existing methods often search it without using helpful signals from the source models. Second, LLM evaluations can take very different amounts of time, so synchronous batch optimization wastes GPU time while waiting for slow evaluations. We propose Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM). AP-BMM uses parameter and reasoning-activation differences between the source models to suggest which layers should matter early in the search. It also uses an asynchronous Bayesian optimization loop that accounts for candidate models already being evaluated. A lightweight reranking step further spreads candidates across the accuracy–cost trade-off. Under fixed evaluation budgets, AP-BMM achieves stronger Pareto-set quality and broader trade-off coverage than synchronous layer-wise baselines and representative model-level merging baselines. Compared with the synchronous Bayesian baseline, it also reduces wall-clock time by improving GPU utilization. Code: https://github.com/MiLab-HITSZ/AP-BMM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM) to approximate capability-cost Pareto sets for LLMs by layer-wise merging of a reasoning-oriented model with a cheaper base model. It computes parameter and reasoning-activation differences to rank layers for early prioritization within a Bayesian optimizer, employs an asynchronous evaluation loop to improve GPU utilization, and applies a lightweight reranking step to spread candidates across the trade-off surface. Under fixed evaluation budgets, AP-BMM is claimed to produce higher-quality Pareto sets with broader coverage than synchronous layer-wise and model-level baselines while also reducing wall-clock time.

Significance. If the empirical superiority holds, the work addresses a practical need in LLM serving: efficiently generating multiple merged models that span accuracy-cost preferences. The public code repository is a positive factor for reproducibility. The core innovation, the use of source-model differences as a prior, could reduce search cost in large layer-wise spaces, but its contribution needs to be isolated more clearly from that of the asynchronous scheduler.

major comments (2)
  1. [§3.2] The prior derived from parameter and reasoning-activation differences is presented as supplying reliable early signals for layer prioritization, yet no ablation, correlation analysis, or direct measurement is shown linking high-difference layers to changes in the accuracy-cost Pareto front. Without this evidence, observed gains could be driven solely by the asynchronous scheduler or reranking, undermining the claim of prior-guided superiority over synchronous layer-wise baselines.
  2. [Results, assumed §4–5] The abstract asserts stronger Pareto-set quality and broader trade-off coverage under fixed budgets, but the manuscript description supplies no quantitative tables, statistical significance tests, or per-baseline ablation breakdowns. Soundness therefore rests on external code rather than verifiable in-text derivations or figures, which is insufficient for a central empirical claim.
minor comments (1)
  1. [Method] The description of the lightweight reranking step and its exact contribution to spread should be expanded with pseudocode or a dedicated paragraph to allow replication.
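
For orientation, a spread-oriented reranking could look like the greedy max-min selection below. This is a generic diversity heuristic offered as an assumption about what such a step might do; it is not the authors' procedure, and `spread_rerank`, the min-max normalization, and the choice of starting point are all hypothetical.

```python
from typing import List, Sequence, Tuple


def spread_rerank(points: Sequence[Tuple[float, ...]], k: int) -> List[Tuple[float, ...]]:
    """Pick up to k points that are far apart in normalized objective space."""
    if not points or k <= 0:
        return []
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]

    def norm(p):
        # Min-max normalize so no objective dominates the distance by scale.
        return tuple((p[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
                     for d in dims)

    normed = [norm(p) for p in points]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    chosen = [0]  # assumed: the first candidate is the preferred starting point
    while len(chosen) < min(k, len(points)):
        # Add the candidate whose nearest already-chosen point is farthest away.
        best = max((i for i in range(len(points)) if i not in chosen),
                   key=lambda i: min(dist2(normed[i], normed[j]) for j in chosen))
        chosen.append(best)
    return [points[i] for i in chosen]


candidates = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.5, 0.5)]
selected = spread_rerank(candidates, k=3)
```

In this toy run the near-duplicate candidate (0.1, 0.1) is skipped in favor of points that widen coverage, which is the behavior the referee asks the authors to document precisely.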

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will strengthen the manuscript accordingly to better isolate the prior's contribution and make the empirical results self-contained.

  1. Referee: [§3.2] The prior derived from parameter and reasoning-activation differences is presented as supplying reliable early signals for layer prioritization, yet no ablation, correlation analysis, or direct measurement is shown linking high-difference layers to changes in the accuracy-cost Pareto front. Without this evidence, observed gains could be driven solely by the asynchronous scheduler or reranking, undermining the claim of prior-guided superiority over synchronous layer-wise baselines.

    Authors: We agree that clearer isolation of the prior's effect is needed. In the revision we will add an ablation comparing runs with and without the difference-based prior (keeping the asynchronous scheduler and reranking fixed), plus a correlation analysis between per-layer parameter/activation differences and their measured impact on Pareto-front quality. These additions will be placed in §3.2 and the results section. revision: yes

  2. Referee: [Results, assumed §4–5] The abstract asserts stronger Pareto-set quality and broader trade-off coverage under fixed budgets, but the manuscript description supplies no quantitative tables, statistical significance tests, or per-baseline ablation breakdowns. Soundness therefore rests on external code rather than verifiable in-text derivations or figures, which is insufficient for a central empirical claim.

    Authors: We acknowledge the current text lacks sufficient in-paper quantitative detail. The revised manuscript will include expanded results tables reporting hypervolume, coverage, and spread metrics for all methods, Wilcoxon signed-rank tests for statistical significance, and explicit per-component ablation breakdowns (prior, async scheduler, reranking). Key tables and figures will be added to §§4–5 so that the central claims are verifiable directly from the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity: AP-BMM is an external-search algorithm whose Pareto quality is measured on held-out accuracy and cost metrics.


The manuscript describes a Bayesian optimization procedure that uses parameter/activation differences only to initialize layer priorities and an asynchronous scheduler to improve GPU utilization. All reported gains are quantified via external evaluation budgets on standard benchmarks; no equation, fitted parameter, or self-citation reduces the claimed Pareto-set quality to a quantity defined by the method itself. The prior is a heuristic input, not a self-referential definition, and the optimizer's outputs are assessed against independent accuracy-cost measurements. This is the normal case of a search algorithm whose performance is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard Bayesian optimization assumptions and the domain premise that layer-wise merge weights can be searched efficiently when guided by source-model differences; no new entities or ad-hoc constants are introduced.

axioms (1)
  • Domain assumption: Bayesian optimization efficiently explores high-dimensional merge-weight spaces when guided by cheap priors.
    Invoked to justify early prioritization of layers via parameter and activation differences.

pith-pipeline@v0.9.0 · 5610 in / 1194 out tokens · 29101 ms · 2026-05-16T23:16:31.593218+00:00 · methodology


