Transporting Task Vectors across Different Architectures without Training
Pith reviewed 2026-05-22 10:55 UTC · model grok-4.3
The pith
Task updates transfer across neural networks of different widths by matching their effects on internal activations instead of their parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Theseus, a training-free method for transporting task updates across models of different widths. Rather than matching parameters, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update.
What carries the argument
Functional matching of task updates via observed activations, aligned by orthogonal Procrustes analysis to map between representation spaces of different widths while preserving update geometry.
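A minimal sketch of this machinery, under stated assumptions: activations are collected on the same inputs from both models and stacked as rows, the source width is no larger than the target width, and a single weight delta is moved with the transport rule the page quotes from the paper, τ_B = T_out τ_A T_in^⊤. The names `procrustes_map` and `transport`, the layer sizes, and the random stand-in activations are illustrative and are not taken from the paper's code.

```python
import numpy as np


def procrustes_map(h_src, h_tgt):
    """Return T (d_tgt x d_src) minimizing ||h_src @ T.T - h_tgt||_F with T.T @ T = I.

    Assumes d_src <= d_tgt and paired rows (the same inputs fed to both models).
    """
    cross = h_tgt.T @ h_src                       # (d_tgt, d_src) cross-covariance
    u, _, vt = np.linalg.svd(cross, full_matrices=False)
    return u @ vt                                 # orthonormal columns if cross has full column rank


def transport(tau_src, t_out, t_in):
    """Carry a weight delta of shape (d_out_src, d_in_src) to (d_out_tgt, d_in_tgt)."""
    return t_out @ tau_src @ t_in.T


rng = np.random.default_rng(0)
n = 256                       # number of probe inputs (illustrative)
d_in_a, d_out_a = 8, 6        # narrow model A, hypothetical layer sizes
d_in_b, d_out_b = 12, 10      # wide model B, hypothetical layer sizes

# Random stand-ins for observed activations; real use would record them from the models.
h_in_a, h_out_a = rng.normal(size=(n, d_in_a)), rng.normal(size=(n, d_out_a))
h_in_b, h_out_b = rng.normal(size=(n, d_in_b)), rng.normal(size=(n, d_out_b))

t_in = procrustes_map(h_in_a, h_in_b)
t_out = procrustes_map(h_out_a, h_out_b)

tau_a = rng.normal(size=(d_out_a, d_in_a))        # stand-in for a task update on model A
tau_b = transport(tau_a, t_out, t_in)

print(tau_b.shape)
print(np.linalg.norm(tau_a), np.linalg.norm(tau_b))
```

Because both maps have orthonormal columns, the transported delta keeps the Frobenius norm of the original, which is one concrete sense in which the update's geometry can be preserved.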
If this is right
- Task adaptations learned on one model width become reusable on other widths at no extra training cost.
- Task identity is carried more reliably by the functional change in activations than by the specific parameter values.
- Vision and language models both show consistent gains over parameter-space baselines when the functional alignment is used.
- No backpropagation or optimization step is required to move the update between architectures.
Where Pith is reading between the lines
- The same alignment idea could let practitioners maintain a single library of functional task updates that work across their entire collection of model sizes.
- If representation spaces remain alignable, the method might extend to moving updates between models whose architectures differ more substantially than just width.
- Model merging or editing pipelines could adopt functional rather than parametric matching as a default when architectures vary.
Load-bearing premise
The way a task update changes a model's internal activations can be stably aligned across different widths by a linear orthogonal mapping that leaves the update's direction and size unchanged.
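One hedged way to probe this premise is to ask how much of the wider model's activation energy lies inside the subspace that a semi-orthogonal map can reach. The sketch below uses a random orthonormal basis as a stand-in for a Procrustes map and synthetic activations; `retained_fraction` is a hypothetical helper, not something the paper defines.

```python
import numpy as np


def retained_fraction(x, t):
    """Share of the squared Frobenius norm of x (rows in the wide space) that lies
    in the column span of t. Assumes t has orthonormal columns."""
    return float(np.linalg.norm(x @ t) ** 2 / np.linalg.norm(x) ** 2)


rng = np.random.default_rng(0)
d_small, d_large, n = 8, 12, 500

# Random orthonormal basis of the wide space, split into "reachable" and "discarded" parts.
q, _ = np.linalg.qr(rng.normal(size=(d_large, d_large)))
t = q[:, :d_small]             # stand-in for a (d_large x d_small) alignment map
t_perp = q[:, d_small:]        # directions the map cannot reach

inside = rng.normal(size=(n, d_small)) @ t.T                  # lives entirely in span(t)
outside = rng.normal(size=(n, d_large - d_small)) @ t_perp.T  # lives entirely outside it
isotropic = rng.normal(size=(n, d_large))                     # no preferred direction

frac_inside = retained_fraction(inside, t)
frac_outside = retained_fraction(outside, t)
frac_isotropic = retained_fraction(isotropic, t)
print(frac_inside, frac_outside, frac_isotropic)
```

On real activations, or on a directly fine-tuned update in the wide model, a fraction near 1 would support the premise. For isotropic data the fraction sits near the width ratio, so a value in that range would suggest the map keeps no more than a random subspace would.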
What would settle it
Apply the transported update to a target model of different width and measure whether downstream task performance stays close to a model that was directly fine-tuned on the target architecture; a large gap or outright failure would falsify the transport.
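To make "stays close" concrete, one option is to express the transported model's score as a share of the gain that direct fine-tuning achieves on the target. This is only a sketch: the metric, the use of the un-adapted target model as a reference point, and the name `recovered_gain` are assumptions introduced here, not the paper's evaluation protocol.

```python
def recovered_gain(score_pretrained, score_transported, score_finetuned):
    """Share of the direct fine-tuning gain recovered by the transported update.

    1.0 means the transported model matches the directly fine-tuned one,
    0.0 means it is no better than the un-adapted target model.
    """
    gain = score_finetuned - score_pretrained
    if gain <= 0:
        raise ValueError("fine-tuning must improve on the pre-trained score")
    return (score_transported - score_pretrained) / gain
```

A value well below 1, or below 0, on a given model pair would be the kind of gap that counts against the transport.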
Original abstract
Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains unexplored. In this work, we introduce Theseus, a training-free method for transporting task updates across heterogeneous-width models. Rather than matching parameters, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically. Code is available at https://github.com/apanariello4/merge-and-rebase.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Theseus, a training-free method to transport task vectors across pre-trained models of different widths. Task updates are characterized by their functional effects on intermediate activations rather than by parameter differences. Representation spaces are aligned via orthogonal Procrustes analysis, after which a closed-form transport operator is derived that is claimed to preserve the geometry of the update. Experiments on vision and language models report consistent gains over baselines.
Significance. If the central claims hold, the work would enable efficient reuse of task adaptations across architecture variants without retraining or backpropagation, advancing model merging techniques by shifting from parametric to functional task definitions. The closed-form solution and public code release are positive features that support reproducibility.
major comments (2)
- [alignment procedure (Section 3)] The orthogonal Procrustes alignment step for unequal widths (d1 < d2) yields a rectangular semi-orthogonal matrix that projects onto the leading d1 directions of the larger space. The manuscript asserts that this alignment permits a geometry-preserving closed-form transport, yet provides no analysis, ablation, or verification that task-relevant variance is retained rather than truncated in the discarded directions. This assumption is load-bearing for the validity of the subsequent transport and all reported performance gains.
- [Experiments] The experimental section reports consistent improvements but supplies neither error bars, number of random seeds, nor statistical significance tests for the cross-architecture transfers. Without these, it is difficult to determine whether the gains reliably exceed baseline variability.
minor comments (2)
- [Abstract] The abstract states that results show 'consistent improvements' but does not name the specific datasets, model pairs, or quantitative metrics employed.
- [Notation] Notation for activations, task deltas, and the transport operator would benefit from an early, explicit definition table or diagram to aid readability.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We appreciate the positive assessment of the work's potential impact on model merging. We respond to each major comment below and outline the revisions planned for the updated manuscript.
point-by-point responses
- Referee: [alignment procedure (Section 3)] The orthogonal Procrustes alignment step for unequal widths (d1 < d2) yields a rectangular semi-orthogonal matrix that projects onto the leading d1 directions of the larger space. The manuscript asserts that this alignment permits a geometry-preserving closed-form transport, yet provides no analysis, ablation, or verification that task-relevant variance is retained rather than truncated in the discarded directions. This assumption is load-bearing for the validity of the subsequent transport and all reported performance gains.
  Authors: We thank the referee for highlighting this aspect of the alignment. The orthogonal Procrustes procedure yields the optimal semi-orthogonal mapping minimizing the Frobenius-norm discrepancy between activation matrices, and the rectangular case naturally retains the leading subspace of the larger model. While the manuscript relies on empirical gains to support that task-relevant geometry is preserved, we agree that an explicit verification of retained variance would strengthen the claims. In the revised version we will add an analysis (new subsection or appendix) that reports the singular values of the Procrustes solution and an ablation measuring downstream task performance as a function of the number of retained dimensions, thereby confirming that the discarded directions contribute little to task identity.
  revision: yes
- Referee: [Experiments] The experimental section reports consistent improvements but supplies neither error bars, number of random seeds, nor statistical significance tests for the cross-architecture transfers. Without these, it is difficult to determine whether the gains reliably exceed baseline variability.
  Authors: We agree that the current experimental presentation would be strengthened by statistical reporting. The numbers in the manuscript reflect single-run evaluations. In the revised manuscript we will rerun the principal cross-architecture experiments with multiple random seeds (minimum of three), report means accompanied by standard-deviation error bars, and include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) comparing Theseus against each baseline to establish that the observed improvements exceed run-to-run variability.
  revision: yes
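The statistical reporting promised in the second response can be sketched as follows. This is a minimal illustration using only the Python standard library: it reports the mean and sample standard deviation of per-seed differences and the paired t statistic. The helper name `paired_summary` is hypothetical and the scores are placeholders, not results from the paper. Turning the statistic into a p-value needs a t distribution with n - 1 degrees of freedom, which a library routine such as `scipy.stats.ttest_rel` provides.

```python
import math
import statistics


def paired_summary(method_scores, baseline_scores):
    """Summarize per-seed paired differences (method minus baseline)."""
    if len(method_scores) != len(baseline_scores) or len(method_scores) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)              # sample standard deviation (n - 1)
    # Paired t statistic; undefined when every seed gives the same difference.
    t_stat = mean_diff / (sd_diff / math.sqrt(n)) if sd_diff > 0 else float("nan")
    return {"n": n, "mean_diff": mean_diff, "sd_diff": sd_diff, "t_stat": t_stat}


# Placeholder accuracies for three seeds, for illustration only (NOT from the paper).
theseus_acc = [0.70, 0.72, 0.71]
baseline_acc = [0.68, 0.69, 0.70]

summary = paired_summary(theseus_acc, baseline_acc)
print(summary)
```

Pairing by seed removes variation shared by both methods in the same run, which is why the paired tests named in the response are a natural fit for this comparison.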
Circularity Check
No significant circularity; derivation uses independent external alignment
full rationale
The paper formalizes task-vector transport as a functional matching problem on observed activations and applies standard orthogonal Procrustes analysis to align representation spaces before deriving a closed-form solution. Procrustes is an established technique from prior literature, not derived from or fitted to the target transport result. No equation reduces the claimed prediction or geometry preservation to a self-defined input, a fitted parameter renamed as output, or a load-bearing self-citation chain. The derivation is therefore self-contained: its content comes from the functional characterization rather than from its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Representation spaces of models with different widths can be aligned via orthogonal Procrustes analysis to enable stable functional matching of task updates.
invented entities (1)
- Theseus: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update... $\tau_B = T_{\mathrm{out}}\,\tau_A\,T_{\mathrm{in}}^{\top}$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: $\min \lVert H_{\mathrm{in},A}\,\tau_A^{\top} H_{\mathrm{out},A}^{\top} - H_{\mathrm{in},B}\,\tau_B^{\top} H_{\mathrm{out},B}^{\top} \rVert_F$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.