Transporting Task Vectors across Different Architectures without Training

Angelo Porrello; Aniello Panariello; Filippo Rinaldi; Giacomo Salici; Simone Calderara

arxiv: 2602.12952 · v2 · pith:XE6Q7MYYnew · submitted 2026-02-13 · 💻 cs.LG · cs.AI · cs.CV

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara

Pith reviewed 2026-05-22 10:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords: task vectors, model adaptation, representation alignment, Procrustes analysis, transfer learning, vision models, language models, training-free methods

The pith

Task updates transfer across neural networks of different widths by matching their effects on internal activations instead of their parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a task-specific update learned on one model can be moved to another model with a different width without any further training. Rather than copying or matching the raw parameter changes, the approach records how the update alters the model's intermediate activations on sample inputs. These activation effects are then aligned between the two models using an orthogonal transformation that finds the best linear mapping between their representation spaces. Once aligned, a closed-form adjustment applies the update to the target model while keeping its original direction and scale. This matters because it removes the need to re-adapt every new model size from scratch, letting the same task behavior be reused across a family of architectures.
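The alignment step described above can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the function name `procrustes_map`, the sizes, and the random stand-in activations are all assumptions. It uses the standard SVD construction, which gives the semi-orthogonal map maximizing the trace alignment between the two activation matrices; when both widths are equal, this is the classical orthogonal Procrustes solution.

```python
import numpy as np

def procrustes_map(H_src, H_tgt):
    """Semi-orthogonal T (d_src x d_tgt) maximizing trace(T.T @ H_src.T @ H_tgt).

    For equal widths this is the classical orthogonal Procrustes solution,
    i.e. the minimizer of ||H_src @ T - H_tgt||_F over orthogonal T.
    """
    M = H_src.T @ H_tgt                           # cross-covariance, d_src x d_tgt
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt                                 # polar factor of M

# Hypothetical sizes and random stand-ins for real activations.
rng = np.random.default_rng(0)
n, d_a, d_b = 256, 8, 5
H_a = rng.standard_normal((n, d_a))   # model A activations on n probe inputs
H_b = rng.standard_normal((n, d_b))   # model B activations on the same inputs

T = procrustes_map(H_a, H_b)
print(T.shape)                              # width of A by width of B
print(np.allclose(T.T @ T, np.eye(d_b)))    # columns of T are orthonormal
```

With unequal widths the map is rectangular, so it can only be orthonormal along its shorter dimension; how the paper handles the remaining directions is not specified in this summary.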

Core claim

We introduce Theseus, a training-free method for transporting task updates across models of different widths. Rather than matching parameters, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update.

What carries the argument

Functional matching of task updates via observed activations, aligned by orthogonal Procrustes analysis to map between representation spaces of different widths while preserving update geometry.

If this is right

Task adaptations learned on one model width become reusable on other widths at no extra training cost.
Task identity is carried more reliably by the functional change in activations than by the specific parameter values.
Vision and language models both show consistent gains over parameter-space baselines when the functional alignment is used.
No backpropagation or optimization step is required to move the update between architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment idea could let practitioners maintain a single library of functional task updates that work across their entire collection of model sizes.
If representation spaces remain alignable, the method might extend to moving updates between models whose architectures differ more substantially than just width.
Model merging or editing pipelines could adopt functional rather than parametric matching as a default when architectures vary.

Load-bearing premise

The way a task update changes a model's internal activations can be stably aligned across different widths by a linear orthogonal mapping that leaves the update's direction and size unchanged.

What would settle it

Apply the transported update to a target model of different width and measure whether downstream task performance stays close to a model that was directly fine-tuned on the target architecture; a large gap or outright failure would falsify the transport.

Original abstract

Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains unexplored. In this work, we introduce Theseus, a training-free method for transporting task updates across heterogeneous-width models. Rather than matching parameters, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically. Code is available at https://github.com/apanariello4/merge-and-rebase.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They show a closed-form way to move task updates between different-width models by matching activation effects after Procrustes alignment, but the projection step risks losing task geometry when dimensions differ.

Hi, the main point is that this paper gives a training-free method to transport task vectors across models with different widths. They define the task update by its functional effect on intermediate activations instead of parameters, align the representation spaces with orthogonal Procrustes, and then solve for the transport in closed form while claiming the geometry stays intact. That extends earlier task-vector work, which stayed within identical architectures, to heterogeneous widths in vision and language models.

They report consistent gains over baselines with no extra training or backprop, and the code is public, which helps verification. The functional framing is a clean way to think about task identity, and the closed-form claim after alignment is a practical plus if it holds.

The soft spot is the alignment itself. When one model is narrower, the SVD-based Procrustes produces a rectangular semi-orthogonal matrix that projects the wider space onto the top directions of the narrower one. If task-relevant variance sits outside those directions, the transported delta gets distorted and downstream performance may not reflect the original update. The abstract does not show ablations on this or full quantitative results with error bars, so it is difficult to judge how often the assumption bites in their runs. The stress-test note on variance truncation is worth checking against the actual experiments.

This is for researchers working on model merging and efficient adaptation who want to reuse fine-tunes without retraining every variant. A reader focused on practical transfer methods would get value from the idea and the code, though the results need the full paper to assess. I would send it to peer review so the alignment details and geometry preservation can be examined properly.
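The truncation concern in the letter can be made concrete with a small diagnostic. The sketch below is an assumption-laden toy, not something taken from the paper: `retained_energy` is a hypothetical helper that reports what fraction of an activation change (squared Frobenius norm) survives projection through a map with orthonormal columns. The map here is simply the first three coordinate axes.

```python
import numpy as np

def retained_energy(delta_H, T):
    """Fraction of ||delta_H||_F^2 kept after projecting with T.

    Assumes T has orthonormal columns, so the result lies in [0, 1].
    """
    total = np.linalg.norm(delta_H) ** 2
    kept = np.linalg.norm(delta_H @ T) ** 2
    return kept / total

# Toy setup: a wide space of 6 dims projected onto its first 3 axes.
d_wide, d_narrow = 6, 3
T = np.eye(d_wide)[:, :d_narrow]

rng = np.random.default_rng(0)
inside = np.zeros((10, d_wide))
inside[:, :d_narrow] = rng.standard_normal((10, d_narrow))            # change lives in kept directions
outside = np.zeros((10, d_wide))
outside[:, d_narrow:] = rng.standard_normal((10, d_wide - d_narrow))  # change lives in discarded directions

print(retained_energy(inside, T), retained_energy(outside, T))
```

A value near one means the update's effect sits inside the retained subspace; a value near zero is the failure case the letter worries about.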

Referee Report

2 major / 2 minor

Summary. The paper introduces Theseus, a training-free method to transport task vectors across pre-trained models of different widths. Task updates are characterized by their functional effects on intermediate activations rather than by parameter differences. Representation spaces are aligned via orthogonal Procrustes analysis, after which a closed-form transport operator is derived that is claimed to preserve the geometry of the update. Experiments on vision and language models report consistent gains over baselines.

Significance. If the central claims hold, the work would enable efficient reuse of task adaptations across architecture variants without retraining or backpropagation, advancing model merging techniques by shifting from parametric to functional task definitions. The closed-form solution and public code release are positive features that support reproducibility.

major comments (2)

[alignment procedure (Section 3)] The orthogonal Procrustes alignment step for unequal widths (d1 < d2) yields a rectangular semi-orthogonal matrix that projects onto the leading d1 directions of the larger space. The manuscript asserts that this alignment permits a geometry-preserving closed-form transport, yet provides no analysis, ablation, or verification that task-relevant variance is retained rather than truncated in the discarded directions. This assumption is load-bearing for the validity of the subsequent transport and all reported performance gains.
[Experiments] The experimental section reports consistent improvements but supplies neither error bars, number of random seeds, nor statistical significance tests for the cross-architecture transfers. Without these, it is difficult to determine whether the gains reliably exceed baseline variability.

minor comments (2)

[Abstract] The abstract states that results show 'consistent improvements' but does not name the specific datasets, model pairs, or quantitative metrics employed.
[Notation] Notation for activations, task deltas, and the transport operator would benefit from an early, explicit definition table or diagram to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We appreciate the positive assessment of the work's potential impact on model merging. We respond to each major comment below and outline the revisions planned for the updated manuscript.

Referee: [alignment procedure (Section 3)] The orthogonal Procrustes alignment step for unequal widths (d1 < d2) yields a rectangular semi-orthogonal matrix that projects onto the leading d1 directions of the larger space. The manuscript asserts that this alignment permits a geometry-preserving closed-form transport, yet provides no analysis, ablation, or verification that task-relevant variance is retained rather than truncated in the discarded directions. This assumption is load-bearing for the validity of the subsequent transport and all reported performance gains.

Authors: We thank the referee for highlighting this aspect of the alignment. The orthogonal Procrustes procedure yields the optimal semi-orthogonal mapping minimizing the Frobenius-norm discrepancy between activation matrices, and the rectangular case naturally retains the leading subspace of the larger model. While the manuscript relies on empirical gains to support that task-relevant geometry is preserved, we agree that an explicit verification of retained variance would strengthen the claims. In the revised version we will add an analysis (new subsection or appendix) that reports the singular values of the Procrustes solution and an ablation measuring downstream task performance as a function of the number of retained dimensions, thereby confirming that the discarded directions contribute little to task identity.
revision: yes

Referee: [Experiments] The experimental section reports consistent improvements but supplies neither error bars, number of random seeds, nor statistical significance tests for the cross-architecture transfers. Without these, it is difficult to determine whether the gains reliably exceed baseline variability.

Authors: We agree that the current experimental presentation would be strengthened by statistical reporting. The numbers in the manuscript reflect single-run evaluations. In the revised manuscript we will rerun the principal cross-architecture experiments with multiple random seeds (minimum of three), report means accompanied by standard-deviation error bars, and include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) comparing Theseus against each baseline to establish that the observed improvements exceed run-to-run variability.
revision: yes
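The seed-level reporting proposed in the second response might look roughly like the sketch below. The accuracy values are placeholders invented purely for illustration and are not results from the paper; `paired_t_statistic` is a hypothetical helper. Only the test statistic is computed here; a p-value would additionally need the t distribution, for which `scipy.stats.ttest_rel` (or `scipy.stats.wilcoxon` for the signed-rank variant) is one option.

```python
import numpy as np

def paired_t_statistic(a, b):
    """t statistic of the paired differences a - b (df = len(a) - 1)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

# Placeholder per-seed accuracies, NOT taken from the paper.
method = [0.71, 0.73, 0.72]
baseline = [0.68, 0.69, 0.70]

for name, runs in (("method", method), ("baseline", baseline)):
    # Mean with sample standard deviation, as used for error bars.
    print(f"{name}: {np.mean(runs):.3f} +/- {np.std(runs, ddof=1):.3f}")

t_stat = paired_t_statistic(method, baseline)
print(f"paired t statistic: {t_stat:.3f} (df = {len(method) - 1})")
```

Pairing by seed matters because both systems share the same random conditions per run, so the test looks at per-seed differences rather than at two independent samples.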

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent external alignment

The paper formalizes task-vector transport as a functional matching problem on observed activations and applies standard orthogonal Procrustes analysis to align representation spaces before deriving a closed-form solution. Procrustes is an established technique from prior literature, not derived from or fitted to the target transport result. No equation reduces the claimed prediction or geometry preservation to a self-defined input, fitted parameter renamed as output, or load-bearing self-citation chain. The method remains self-contained with independent content from the functional characterization.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim rests on the premise that task identity is captured by functional effects on activations and that orthogonal alignment of representation spaces preserves update geometry across widths; no free parameters or invented physical entities are mentioned.

axioms (1)

domain assumption: Representation spaces of models with different widths can be aligned via orthogonal Procrustes analysis to enable stable functional matching of task updates.
Invoked to derive the closed-form transport solution from observed activations.

invented entities (1)

Theseus (no independent evidence)
purpose: Name for the proposed training-free transport procedure.
New label for the method; no independent evidence outside the paper.

pith-pipeline@v0.9.0 · 5722 in / 1278 out tokens · 42192 ms · 2026-05-22T10:55:05.533289+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update... $\tau_B = T_{\mathrm{out}}\,\tau_A\,T_{\mathrm{in}}^{\top}$"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "$\min \lVert H_{\mathrm{in},A}\,\tau_A^{\top} H_{\mathrm{out},A}^{\top} - H_{\mathrm{in},B}\,\tau_B^{\top} H_{\mathrm{out},B}^{\top} \rVert_F$"
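The two quoted expressions, the transport rule τB = T_out τ_A T_in^⊤ and the Frobenius-norm matching objective, can be checked numerically on a toy case. Everything below is an assumption made for illustration: the shapes (activations as rows, τ mapping input width to output width), the helper names, and the idealized setting in which model B's activations are exact isometric embeddings of model A's. Under those assumptions the transported update drives the objective to zero and keeps the Frobenius norm of the update.

```python
import numpy as np

def transport(tau_a, T_out, T_in):
    """Transport rule as quoted: tau_B = T_out @ tau_A @ T_in.T."""
    return T_out @ tau_a @ T_in.T

def functional_gap(H_in_a, H_out_a, tau_a, H_in_b, H_out_b, tau_b):
    """Frobenius norm of the quoted matching objective (an n x n residual)."""
    lhs = H_in_a @ tau_a.T @ H_out_a.T
    rhs = H_in_b @ tau_b.T @ H_out_b.T
    return np.linalg.norm(lhs - rhs)

def random_semi_orthogonal(rows, cols, rng):
    """Random matrix with orthonormal columns (requires rows >= cols)."""
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

rng = np.random.default_rng(0)
n, d_in_a, d_out_a, d_in_b, d_out_b = 32, 4, 3, 6, 5   # hypothetical sizes, B wider than A

H_in_a = rng.standard_normal((n, d_in_a))
H_out_a = rng.standard_normal((n, d_out_a))
tau_a = rng.standard_normal((d_out_a, d_in_a))

T_in = random_semi_orthogonal(d_in_b, d_in_a, rng)
T_out = random_semi_orthogonal(d_out_b, d_out_a, rng)

# Idealized assumption: B's activations are A's, embedded isometrically.
H_in_b = H_in_a @ T_in.T
H_out_b = H_out_a @ T_out.T

tau_b = transport(tau_a, T_out, T_in)
gap = functional_gap(H_in_a, H_out_a, tau_a, H_in_b, H_out_b, tau_b)
print(tau_b.shape, gap)
print(np.linalg.norm(tau_a), np.linalg.norm(tau_b))
```

Real models will not satisfy the exact-embedding assumption, so the gap on actual activations is the quantity of interest; this toy only confirms that the two formulas are mutually consistent in the best case.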

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.