Model Merging by Output-Space Projection

Benjamin Etheridge; Bethan Evans; Jared Tanner; Stephen Roberts

arXiv: 2605.29101 · v1 · pith:IASG4K6Bnew · submitted 2026-05-27 · 💻 cs.LG · cs.IT · math.IT

Pith reviewed 2026-06-29 13:37 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT

keywords model merging · quadratic programming · convex optimization · task arithmetic · multi-task learning · fine-tuned checkpoints · residual updates · calibration objective

The pith

Model merging reduces to solving a convex quadratic program that matches fine-tuned outputs on calibration inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that combining fine-tuned models can be expressed as a convex quadratic program whose variables are coefficients multiplying residual updates from a base model. The program chooses those coefficients to minimize the squared difference between the merged model's outputs and the fine-tuned models' outputs on a calibration set of inputs. Because the program is convex, every minimizer is globally optimal, and the minimizer is unique when the Hessian is positive definite. Existing merging rules such as task arithmetic, model soups, TIES and DARE arise as special cases obtained by restricting the basis or adding particular penalties. The same setup produces a simple diagnostic, the fraction of residual energy captured by the chosen basis, that is reported to predict downstream merge quality using only the calibration data. The authors solve the program layer by layer in sequence and report consistent gains over prior methods on both language and vision benchmarks.

Core claim

Merging can be formulated as a convex quadratic programme over residual updates, yielding weights that minimise a squared-output calibration objective using calibration inputs and fine-tuned model outputs, and subsuming existing methods as special cases. The framework yields a closed-form diagnostic—the fraction of residual energy captured by a chosen basis—that predicts downstream merge quality using only the calibration set.

What carries the argument

Convex quadratic program over coefficients of residual updates that minimizes squared output discrepancy on calibration inputs.
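The program described above can be illustrated with a minimal single-layer sketch. It assumes the simplest possible parameterization, which this summary does not confirm is the paper's: one scalar coefficient `d[k]` per fine-tuned model, residuals `deltas[k] = W_k - W_0`, and per-task calibration inputs `calib_inputs[k]`. All names, shapes, and the random data are hypothetical, and the objective is written as ½ dᵀHd + fᵀd + c so that the minimizer takes the form −H⁺f.

```python
import numpy as np

def build_qp(deltas, calib_inputs):
    """Return (H, f, c) with 0.5*d@H@d + f@d + c equal to
    sum_k || (sum_j d[j]*deltas[j] - deltas[k]) @ X_k ||_F^2."""
    K = len(deltas)
    H, f, c = np.zeros((K, K)), np.zeros(K), 0.0
    for k, X in enumerate(calib_inputs):
        # Column j is vec(deltas[j] @ X): the output change that model j's
        # residual produces on task k's calibration inputs.
        A = np.stack([(D @ X).ravel() for D in deltas], axis=1)
        b = A[:, k]  # target: fine-tuned model k's own residual output
        H += 2.0 * A.T @ A
        f -= 2.0 * A.T @ b
        c += float(b @ b)
    return H, f, c

def solve_qp(H, f):
    # Pseudo-inverse so the same call also covers a singular H.
    return -np.linalg.pinv(H) @ f

def calibration_loss(d, deltas, calib_inputs):
    merged = sum(dj * D for dj, D in zip(d, deltas))
    return sum(np.linalg.norm((merged - deltas[k]) @ X) ** 2
               for k, X in enumerate(calib_inputs))

# Toy data (assumption: random residuals and inputs, for illustration only).
rng = np.random.default_rng(0)
m, n, N, K = 8, 6, 20, 3
deltas = [rng.normal(size=(m, n)) for _ in range(K)]
calib_inputs = [rng.normal(size=(n, N)) for _ in range(K)]

H, f, c = build_qp(deltas, calib_inputs)
d_star = solve_qp(H, f)
loss_qp = calibration_loss(d_star, deltas, calib_inputs)
loss_soup = calibration_loss(np.full(K, 1.0 / K), deltas, calib_inputs)
loss_ta = calibration_loss(np.ones(K), deltas, calib_inputs)
print(d_star, loss_qp, loss_soup, loss_ta)
```

In this toy setting, uniform averaging and task arithmetic with λ = 1 are just particular points `d`, so the QP solution cannot do worse than either on the calibration loss. This says nothing about held-out accuracy.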

If this is right

Existing methods such as task arithmetic, model soups, TIES and DARE arise as particular choices of basis or regularization inside the same quadratic program.
The residual-energy diagnostic computed from calibration data alone ranks candidate merges by expected downstream performance.
In the single-layer case the quadratic-program solution matches or exceeds the accuracy of prior heuristics.
Sequential layer-wise application of the program produces merged models that improve over single-layer baselines on language and vision benchmarks.
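The simplest instances of the first point can be checked by hand. The derivation below is a sketch under an assumption this summary does not confirm: one scalar coefficient $d_k$ per fine-tuned model and an objective of the form $\tfrac12 d^\top H d + f^\top d$ with $H \succeq 0$. The symbols $W_0$, $W_k$, $\Delta_k$, $\lambda$ and $\mathbf{1}$ (the all-ones vector) are introduced here for illustration.

```latex
\begin{align*}
W(d) &= W_0 + \sum_{k=1}^{K} d_k \Delta_k, \qquad \Delta_k = W_k - W_0, \\
\text{model soup (uniform average):}\quad d &= \tfrac{1}{K}\mathbf{1}, \\
\text{task arithmetic:}\quad d &= \lambda \mathbf{1}, \\
\lambda^\star &= \arg\min_{\lambda}\; \tfrac12 \lambda^2\, \mathbf{1}^\top H \mathbf{1} + \lambda\, \mathbf{1}^\top f
  = -\frac{\mathbf{1}^\top f}{\mathbf{1}^\top H \mathbf{1}}
  \qquad (\mathbf{1}^\top H \mathbf{1} > 0).
\end{align*}
```

Restricting the program to the line $d = \lambda \mathbf{1}$ therefore gives a calibration-optimal task-arithmetic scale in closed form. TIES and DARE also alter the residuals themselves (trimming and sign election, random dropping with rescaling), so they presumably correspond to a different choice of the $\Delta_k$ or of the penalty; how the paper encodes them is not stated here.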

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same output-space projection could be applied when merging models whose fine-tuning distributions differ markedly from the calibration inputs.
Enlarging the residual basis to include updates from additional fine-tuned models would extend the formulation directly; the MNIST experiment already merges K = 5 task models, so the setting is not limited to two.
For very large models an approximate solver for the quadratic program could still be validated by checking whether the energy diagnostic remains predictive of final accuracy.

Load-bearing premise

A calibration set of inputs together with the outputs of the fine-tuned models is sufficient to determine a merge whose performance on downstream tasks can be reliably predicted by the residual-energy diagnostic.
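One plausible reading of the diagnostic in this premise can be sketched as follows. The definition used here is an assumption, since the summary does not give the paper's formula: stack the residual outputs on the calibration inputs into a matrix `R`, take an orthonormal output basis `Q`, and report the share of ‖R‖²_F that lies in span(Q). The data is random and every name is hypothetical.

```python
import numpy as np

def captured_energy(R, Q):
    """Fraction of ||R||_F^2 inside span(Q); Q must have orthonormal columns."""
    return np.linalg.norm(Q.T @ R) ** 2 / np.linalg.norm(R) ** 2

def relative_projection_error(R, Q):
    """||R - Q Q^T R||_F^2 / ||R||_F^2, the normalized loss of the projection."""
    return np.linalg.norm(R - Q @ (Q.T @ R)) ** 2 / np.linalg.norm(R) ** 2

# Toy residuals and calibration inputs (assumption: random, for illustration).
rng = np.random.default_rng(0)
m, n, N, K = 8, 6, 20, 3
deltas = [rng.normal(size=(m, n)) for _ in range(K)]
calib_inputs = [rng.normal(size=(n, N)) for _ in range(K)]

# Residual outputs of every fine-tuned model on its own inputs, shape (m, K*N).
R = np.hstack([D @ X for D, X in zip(deltas, calib_inputs)])

# Left singular vectors give the output directions that capture the most
# energy for each basis dimension p.
U, s, _ = np.linalg.svd(R, full_matrices=False)
fractions = np.array([captured_energy(R, U[:, :p]) for p in range(1, m + 1)])
errors = np.array([relative_projection_error(R, U[:, :p]) for p in range(1, m + 1)])
print(fractions)
```

Under this definition the captured fraction equals one minus the normalized projection error, which is why such a diagnostic is a monotonic function of a calibration loss.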

What would settle it

Compute the residual-energy fraction on the calibration set and check whether it fails to correlate with measured accuracy of the resulting merged model on held-out downstream tasks across multiple benchmarks.

Figures

Figures reproduced from arXiv: 2605.29101 by Benjamin Etheridge, Bethan Evans, Jared Tanner, Stephen Roberts.

**Figure 1.** Performance on ViT-32 across single-layer merging methods. …nal QP achieves the lowest MSE across all datasets and the best average accuracy among the methods considered. The cross-entropy results in …

**Figure 2.** MSE against captured energy for increasing basis dimensions p on MNIST.

**Figure 3.** ViT-32 cross-entropy across benchmarks. (a) MSE on ViT-B/32. (b) Accuracy on ViT-B/32.

**Figure 4.** Performance on ViT-32 across sequential multi-layer merging. …benchmark methods in both MSE and accuracy across all datasets, while the remaining methods perform substantially worse. D.7. MNIST MLP. 3-layer MLP with ReLU activations trained on MNIST, split into K = 5 non-overlapping digit-pair tasks: {0-1, 2-3, 4-5, 6-7, 8-9}. The base model is a Linear(784, 256) → ReLU → Linear(256, 128) → ReLU → Linear(128…

**Figure 5.** Performance of single-layer merging methods on a 3-layer MNIST MLP across five digit-pair tasks. Our QP method learns task-adaptive row-wise blending weights for layer2, while baselines (Model Soup, Task Arithmetic λ=1, DARE p=0.5, Fisher merging) apply fixed or heuristic mixing. The base model (no merging) is included as a lower bound. D.8. LLaMA 3.1 Single Layer. Now we consider an LLM. We start from the …

**Figure 6.** Performance on LLaMA 3.1 across penultimate-layer merging. D.9. LLaMA 3.1 Multi-Layer Sequential. We extend this to multi-layer merging using the sequential scheme from Section F, merging all linear layers in the model. Results in …

**Figure 7.** Performance on LLaMA 3.1 across sequential multi-layer merging.

**Figure 8.** LLaMA 3.1 multi-layer sequential cross-entropy across benchmarks. E. Merging in a General Basis. Having identified which output subspaces are optimal in the relaxed projection problem, we now turn to the realisable problem: given a fixed basis $\{q_p\}$ and the available residual updates $\{\delta_N^{(k)}\}$, what merge coefficients $d$ minimise the calibration loss? The projection view characterises the best output directi…

Original abstract

Model merging combines fine-tuned checkpoints into a single multi-task model without retraining. Existing methods - such as task arithmetic, model soups, TIES, and DARE - are computationally efficient and empirically successful, but rely on heuristic design choices and lack formal optimality guarantees. We show that merging can be formulated as a convex quadratic programme over residual updates, yielding weights that minimise a squared-output calibration objective using calibration inputs and fine-tuned model outputs, and subsuming existing methods as special cases. Our framework yields a closed-form diagnostic - the fraction of residual energy captured by a chosen basis - that predicts downstream merge quality using only the calibration set. Empirically, the QP matches or outperforms existing methods in the single-layer setting, and we characterise when the optimal basis provides significant gains over the cheaper diagonal QP. We extend to multi-layer merging via a sequential layer-wise algorithm and demonstrate consistent gains across language and vision benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper turns model merging into a convex QP over residuals that minimizes output error on calibration data and recovers prior heuristics as special cases, with a direct diagnostic for quality.

The core contribution is a convex quadratic program that finds merge weights minimizing squared output differences on a calibration set of inputs and fine-tuned model outputs. This is new relative to the heuristic methods cited, and the optimality claim holds by construction because the objective is quadratic and convex. The residual-energy diagnostic follows immediately from the same quadratic form, so it is a closed-form quantity that measures how much of the residual the chosen basis captures on that calibration distribution.

The paper does the formalization cleanly and shows algebraically how task arithmetic, TIES, and similar approaches emerge from particular choices of basis or regularization. The single-layer experiments are reported to match or beat the baselines, and the sequential layer-wise extension is a practical way to scale beyond one layer.

The main limitation is that both the merge and the diagnostic are defined with respect to the calibration set; if that set is not representative of downstream tasks, the predictive power of the energy fraction will degrade. The multi-layer procedure is sequential rather than joint, so it is not guaranteed to be globally optimal across layers. The abstract claims consistent gains on language and vision benchmarks, but without seeing the full experimental controls it is hard to judge how large those gains are once you account for the extra compute of solving the QP versus the diagonal version.

This is useful reading for anyone already working on merging methods who wants an optimization lens rather than another heuristic tweak. It is worth sending to peer review because the QP formulation and the diagnostic are verifiable claims that can be checked against the derivations and the reported numbers.

Referee Report

1 major / 2 minor

Summary. The paper claims that model merging can be cast as a convex quadratic program (QP) over residual updates that exactly minimizes a squared-output calibration objective given inputs and fine-tuned model outputs; this formulation subsumes task arithmetic, TIES, DARE and model soups as special cases, supplies a closed-form residual-energy diagnostic that predicts downstream quality from the calibration set alone, matches or exceeds prior methods in the single-layer case, and extends to multi-layer merging via a sequential layer-wise procedure with empirical gains on language and vision benchmarks.

Significance. If the optimality and subsumption claims hold, the work supplies the first convex optimization framing of output-space model merging together with an algebraic diagnostic that is computable from the same calibration data used for the merge. The explicit recovery of prior heuristics as special cases of basis or regularization choices, the convexity guarantee, and the layer-wise extension constitute concrete technical contributions that could replace ad-hoc design choices with a principled procedure.

major comments (1)

[Abstract] The residual-energy diagnostic is obtained directly from the quadratic objective of the QP (see abstract statement of the diagnostic). Because it is a monotonic function of the same calibration loss that the QP minimizes, its reported correlation with downstream quality on held-out tasks may be an in-sample artifact rather than an independent predictor; the manuscript should either derive a generalization bound or report the diagnostic's correlation on a calibration set disjoint from the one used to solve the QP.

minor comments (2)

The abstract states that the QP 'subsumes existing methods as special cases' but does not list the precise basis or regularization settings that recover each cited method (task arithmetic, TIES, DARE). Adding an explicit table or corollary would make the subsumption claim immediately verifiable.
The free parameter 'calibration set selection' is listed in the axiom ledger; the experimental section should report sensitivity of both the merged model and the energy diagnostic to different choices of calibration inputs.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review, positive assessment of the contributions, and recommendation of minor revision. We address the single major comment below.

Referee: [Abstract] The residual-energy diagnostic is obtained directly from the quadratic objective of the QP (see abstract statement of the diagnostic). Because it is a monotonic function of the same calibration loss that the QP minimizes, its reported correlation with downstream quality on held-out tasks may be an in-sample artifact rather than an independent predictor; the manuscript should either derive a generalization bound or report the diagnostic's correlation on a calibration set disjoint from the one used to solve the QP.

Authors: We agree that the residual-energy diagnostic is a direct (monotonic) function of the QP objective evaluated on the calibration data used to solve for the merge, so the reported correlations with held-out downstream performance could in principle reflect an in-sample artifact. Deriving a non-vacuous generalization bound that accounts for the data-dependent basis selection and the specific form of the diagnostic appears difficult without strong distributional assumptions that would not hold across the language and vision settings considered. We will therefore revise the manuscript to include an additional experiment in which the diagnostic is computed on a calibration set held disjoint from the data used to solve the QP, and its correlation with held-out task performance is re-reported under this protocol. revision: yes
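The disjoint-calibration protocol promised in this response could look roughly like the sketch below. It is one possible protocol, not the authors' experiment. It assumes scalar per-model coefficients and takes the diagnostic to be one minus the normalized calibration loss; the split, the helper names, and the random data are all hypothetical.

```python
import numpy as np

def stack_system(deltas, inputs):
    """Stack the least-squares system A d ~ b over all tasks."""
    A_blocks, b_blocks = [], []
    for k, X in enumerate(inputs):
        # Column j: vec of model j's residual output on task k's inputs.
        A = np.stack([(D @ X).ravel() for D in deltas], axis=1)
        A_blocks.append(A)
        b_blocks.append(A[:, k])  # target is task k's own residual output
    return np.vstack(A_blocks), np.concatenate(b_blocks)

def explained_fraction(d, A, b):
    """1 - normalized squared-output loss on the given inputs."""
    return 1.0 - np.sum((A @ d - b) ** 2) / np.sum(b ** 2)

# Toy data (assumption: random residuals and inputs, for illustration only).
rng = np.random.default_rng(0)
m, n, N, K = 8, 6, 40, 3
deltas = [rng.normal(size=(m, n)) for _ in range(K)]
calib = [rng.normal(size=(n, N)) for _ in range(K)]

# Disjoint halves: one to solve for the merge, one to evaluate the diagnostic.
fit_inputs = [X[:, : N // 2] for X in calib]
held_inputs = [X[:, N // 2:] for X in calib]

A_fit, b_fit = stack_system(deltas, fit_inputs)
A_held, b_held = stack_system(deltas, held_inputs)

d_fit, *_ = np.linalg.lstsq(A_fit, b_fit, rcond=None)
in_sample = explained_fraction(d_fit, A_fit, b_fit)
held_out = explained_fraction(d_fit, A_held, b_held)
print(in_sample, held_out)
```

A large gap between `in_sample` and `held_out` would suggest the diagnostic is partly fitting the calibration sample, which is the referee's concern.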

Circularity Check

0 steps flagged

No significant circularity identified

The paper's core step is to define a convex QP whose objective is exactly the squared-output calibration loss on the given inputs and fine-tuned outputs; optimality therefore holds by construction of the program, which is the intended formulation rather than a hidden reduction. Subsumption of prior methods follows from algebraic substitution of basis/regularization choices into the same QP and requires no external theorem. The residual-energy diagnostic is the normalized value of the same quadratic objective evaluated on the calibration set, so it is tautological on that set, but the paper presents its correlation with downstream task performance as an empirical finding validated on benchmarks rather than a definitional claim. No load-bearing self-citation, imported uniqueness result, or smuggled ansatz appears in the derivation; the argument remains a self-contained re-formulation whose internal statements are algebraically verifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the modeling choice that merging occurs in output space via linear residual combinations and on the availability of a representative calibration set; no new physical entities are introduced.

free parameters (1)

calibration set selection
Choice of calibration inputs directly determines the QP solution and the energy diagnostic.

axioms (1)

domain assumption: Merging can be expressed as a linear combination of residual updates between fine-tuned models
Invoked to set up the quadratic program over residuals.

Reference graph

Works this paper leans on

3 extracted references · 1 canonical work page · 1 internal anchor

[1]

Evaluating Large Language Models Trained on Code

ISBN 978-0-521-83378-3. Google-Books-ID: mYm0bLd3fcoC. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhai...

[2]

If $H \succ 0$, the minimiser is unique and given by $d^* = -H^{-1} f$.
[3]

The general-basis QP defines a single global optimisation over all merge coefficients

If $H \succeq 0$ is singular and $f \in \mathrm{Range}(H)$, then the set of minimisers is the affine subspace $\{d^* = -H^{+} f + w : w \in \ker(H)\}$. The general-basis QP defines a single global optimisation over all merge coefficients. The Hessian $H$ contains cross-terms between models and basis directions, reflecting the fact that their contributions to the output are coupled through shared input...
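The singular case stated above can be checked numerically with a small sketch. The construction is hypothetical and not taken from the paper: a least-squares matrix `A` with a duplicated column makes H = 2AᵀA singular, while f = −2Aᵀb stays in Range(H), so the objective ½ dᵀHd + fᵀd has a whole affine set of minimisers.

```python
import numpy as np

# Toy system (assumption: random data; the last column duplicates the first,
# as if two residual directions produced identical outputs).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
A = np.hstack([A, A[:, :1]])
b = rng.normal(size=50)

H = 2.0 * A.T @ A   # positive semidefinite, rank 3 of 4
f = -2.0 * A.T @ b  # lies in Range(H) because Range(A^T) = Range(A^T A)

def objective(d):
    return 0.5 * d @ H @ d + f @ d

# Minimum-norm minimiser via the pseudo-inverse; the explicit cutoff keeps
# the numerically zero singular value from being inverted.
d_min = -np.linalg.pinv(H, rcond=1e-10) @ f

# Basis of ker(H): eigenvectors whose eigenvalue is numerically zero.
eigvals, eigvecs = np.linalg.eigh(H)
null_basis = eigvecs[:, eigvals < 1e-9 * eigvals.max()]

# Any shift along ker(H) is another minimiser with the same objective value.
w = null_basis @ rng.normal(size=null_basis.shape[1])
d_other = d_min + w
print(objective(d_min), objective(d_other))
```

Among all minimisers, the pseudo-inverse picks the one of smallest Euclidean norm. That is a convenient tie-break, though not necessarily the one the paper uses.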
