Recognition: no theorem link
Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning
Pith reviewed 2026-05-12 01:17 UTC · model grok-4.3
The pith
Model merging in continual learning can proceed without storing prior models or data by optimizing in an augmented trajectory subspace.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that reformulating the merging phase as an optimization process within an augmented trajectory subspace, driven by the simultaneous pursuit of task alignment, prediction consistency, and gradient responsiveness, lets a merged model retain historical stability and re-activate optimization dynamics at the start of each new task without any storage of previous models or data.
What carries the argument
Trajectory Regularized Merging (TRM), an optimization procedure performed inside an augmented trajectory subspace that jointly enforces three objectives to balance stability and responsiveness.
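One way to read this machinery, as an editorial sketch (the weighted-sum form, the subspace parameterization, and the weights λ₁, λ₂ are assumptions, not the paper's stated objective):

```latex
% Editorial sketch: points in the augmented trajectory subspace are
% \theta(v) = \theta_{\mathrm{merge}} + U v, and merging is read here as
% minimizing a combined objective (weighted-sum form and weights assumed):
\mathcal{L}_{\mathrm{TRM}}(v) \;=\;
  \mathcal{L}_{\mathrm{align}}\big(\theta(v)\big)
  \;+\; \lambda_1\, \mathcal{L}_{\mathrm{cons}}\big(\theta(v)\big)
  \;+\; \lambda_2\, \mathcal{L}_{\mathrm{resp}}\big(\theta(v)\big)
```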
If this is right
- Merged models retain task-specific performance without progressive degradation across a long sequence of tasks.
- Gradient signals remain usable at the beginning of each new task, allowing continued training to proceed efficiently.
- The same merged model can serve as the starting point for arbitrary new tasks without replay buffers or saved checkpoints.
- Multi-task unification becomes feasible in memory-constrained continual-learning settings that previously required separate model copies.
Where Pith is reading between the lines
- The subspace-augmentation idea could be grafted onto other merging algorithms to reduce their reliance on replay data.
- If the three objectives prove separable, practitioners might drop one or two of them in low-resource deployments while keeping most of the benefit.
- Similar trajectory-based regularization might help in federated or distributed settings where old model versions cannot be retained.
Load-bearing premise
That the main obstacles to merging under CL constraints are error accumulation from global alignment and vanishing gradients at task onset, and that the three objectives can be jointly optimized in the augmented subspace to correct both without storing any prior models or data.
What would settle it
On a standard continual-learning benchmark sequence, if the TRM-merged model still shows rapid error growth from early tasks or requires many more steps than storage-based methods to reach low loss on a new task, the central claim would be refuted.
Original abstract
Model merging provides a compelling paradigm for integrating specialized expertise into a unified multi-task model, a goal that aligns naturally with the sequential knowledge acquisition in continual learning (CL). However, the requirement for preserving diverse forms of previous knowledge conflicts with the storage limitations inherent to CL. In this paper, we systematically analyze existing model merging methods under the constraints of CL. We find that current methods prioritize global alignment, which often leads to the accumulation and amplification of task-specific errors within the continuous data stream; and the vanishing gradients at the onset of subsequent tasks frequently cause optimization to stagnate. These leave the merged model in a suboptimal state at the beginning of the next training phase. To address these challenges, we propose Trajectory Regularized Merging (TRM), a framework that reformulates the merging phase as an optimization process within an augmented trajectory subspace. Our framework integrates three synergistic objectives including task alignment, prediction consistency, and gradient responsiveness to concurrently preserve merged model's historical stability and re-activate optimization dynamics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes model merging methods under continual learning (CL) constraints, identifying that global alignment causes error accumulation and amplification in sequential data streams, while vanishing gradients at new task onsets lead to optimization stagnation and suboptimal merged models. It proposes Trajectory Regularized Merging (TRM), which reformulates merging as an optimization process inside an augmented trajectory subspace. TRM integrates three objectives—task alignment, prediction consistency, and gradient responsiveness—to preserve historical stability while reactivating dynamics, all without storing prior models or data. Experiments claim state-of-the-art results across multiple CL benchmarks.
Significance. If TRM's no-storage guarantee holds and the three objectives can be jointly optimized from current-task information alone, the work would meaningfully advance CL by enabling practical model merging without the memory overhead that currently limits its adoption. The focus on restoring early-task optimization dynamics addresses a practical bottleneck in sequential training pipelines.
Major comments (3)
- §3 (TRM framework description): The construction of the 'augmented trajectory subspace' and the explicit loss formulations for prediction consistency and gradient responsiveness must be shown to depend only on the current task's data and model parameters. Prediction consistency normally requires comparing against prior outputs or logits, and gradient responsiveness typically needs historical gradient statistics; if either references stored trajectories or samples, the central no-storage claim fails even if optimization converges.
- §3.2–3.3 (objective definitions): The paper must provide the precise mathematical definitions (e.g., the forms of the consistency and responsiveness terms) and prove or demonstrate that they are computable without any prior-model access. Without these equations, it is impossible to verify that the three objectives are synergistic yet storage-independent.
- §4 (Experiments and ablations): The reported SOTA results should be accompanied by controlled ablations that isolate each objective and explicitly confirm that no hidden storage of previous models, gradients, or samples occurs during merging or evaluation. Current high-level claims leave open the possibility that gains arise from mechanisms that violate the storage-avoidance premise.
Minor comments (3)
- Abstract: The high-level description of the three objectives would be clearer if it briefly indicated the information each objective uses (current-task only vs. historical).
- §3 (Notation): Ensure consistent use of symbols for the augmented subspace and the three loss terms across the method and experiment sections to avoid reader confusion.
- §2 (Related work): A short paragraph contrasting TRM with prior merging methods that also claim reduced storage (e.g., those using parameter-efficient adapters) would strengthen positioning.
Simulated Author's Rebuttal
We thank the referee for the careful reading and valuable comments that help clarify the storage-independence claims of TRM. We address each major point below and will incorporate the requested details and ablations in the revised manuscript.
Point-by-point responses
Referee: §3 (TRM framework description): The construction of the 'augmented trajectory subspace' and the explicit loss formulations for prediction consistency and gradient responsiveness must be shown to depend only on the current task's data and model parameters. Prediction consistency normally requires comparing against prior outputs or logits, and gradient responsiveness typically needs historical gradient statistics; if either references stored trajectories or samples, the central no-storage claim fails even if optimization converges.
Authors: The augmented trajectory subspace is constructed exclusively from the current model's parameters and the current task's data via a low-rank augmentation of the parameter space. The prediction consistency objective is implemented as a self-consistency regularizer that penalizes divergence between predictions on the original current-task samples and their augmented versions generated on-the-fly; no prior logits or models are involved. The gradient responsiveness term is the expected norm of the gradient of the merging loss with respect to the subspace coordinates, again computed solely from current-task forward and backward passes. We will expand §3 with explicit pseudocode and equations to demonstrate this dependence. revision: yes
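An editorial sketch (not the authors' code) of what a current-task-only, low-rank augmentation of the parameter space could look like; the random orthonormal basis and the helper name `make_subspace` are assumptions, since the paper's actual construction is not specified here:

```python
# Editorial sketch: build an augmented trajectory subspace from nothing but the
# current merged model. A random orthonormal basis is assumed; nothing from
# previous tasks or models is stored or accessed.
import torch

def make_subspace(model: torch.nn.Module, rank: int = 16):
    """Return the flattened current parameters theta and a basis U whose columns
    span the augmented subspace; candidate points are theta + U @ v."""
    theta = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
    u = torch.randn(theta.numel(), rank, device=theta.device)
    u, _ = torch.linalg.qr(u)  # orthonormal columns, derived only from the current model
    return theta, u
```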
Referee: §3.2–3.3 (objective definitions): The paper must provide the precise mathematical definitions (e.g., the forms of the consistency and responsiveness terms) and prove or demonstrate that they are computable without any prior-model access. Without these equations, it is impossible to verify that the three objectives are synergistic yet storage-independent.
Authors: We agree that the current presentation is insufficiently precise. In the revision we will insert the exact loss expressions: task alignment is the standard cross-entropy on current data; prediction consistency is the KL divergence between f_θ(x) and f_θ(x+δ) for current x and on-the-fly perturbations δ; gradient responsiveness is ||∇_v L(θ + v)|| where v lies in the current-task-derived subspace. A short paragraph will argue that each term requires only the current model and current mini-batches, with no external storage. revision: yes
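A minimal editorial reading of these three terms as code, building on the hypothetical `make_subspace` sketch above. The perturbation scale, the KL direction, the Gaussian form of δ, and the use of (alignment + consistency) as the merging loss inside the responsiveness term are all assumptions, not the paper's exact formulation:

```python
# Editorial sketch: compute the three loss terms from one current-task mini-batch
# only, with the forward pass parameterized by theta + U v so that gradients with
# respect to the subspace coordinates v are available.
import torch
import torch.nn.functional as F

def trm_losses(model, theta, u, v, x, y, eps=0.05):
    flat = theta + u @ v
    params, offset = {}, 0
    for name, p in model.named_parameters():
        params[name] = flat[offset:offset + p.numel()].view_as(p)
        offset += p.numel()

    # Task alignment: standard cross-entropy on the current mini-batch.
    logits = torch.func.functional_call(model, params, (x,))
    l_align = F.cross_entropy(logits, y)

    # Prediction consistency: KL between predictions on x and an on-the-fly
    # perturbation x + delta (Gaussian delta is an assumption).
    logits_pert = torch.func.functional_call(model, params, (x + eps * torch.randn_like(x),))
    l_cons = F.kl_div(F.log_softmax(logits_pert, dim=-1),
                      F.softmax(logits, dim=-1), reduction="batchmean")

    # Gradient responsiveness: ||grad_v L(theta + U v)||, kept differentiable;
    # whether it is maximized, constrained, or penalized is not stated in the text.
    (g,) = torch.autograd.grad(l_align + l_cons, v, create_graph=True)
    l_resp = g.norm()

    return l_align, l_cons, l_resp
```

Under this reading, merging would start from v = torch.zeros(rank, requires_grad=True) (the plain merged model) and optimize the alignment and consistency terms while keeping the responsiveness term from vanishing; every quantity above is computed from the current model and current mini-batch alone.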
Referee: §4 (Experiments and ablations): The reported SOTA results should be accompanied by controlled ablations that isolate each objective and explicitly confirm that no hidden storage of previous models, gradients, or samples occurs during merging or evaluation. Current high-level claims leave open the possibility that gains arise from mechanisms that violate the storage-avoidance premise.
Authors: We will add a dedicated ablation subsection in §4 that reports performance when each of the three objectives is removed in turn, together with a memory-footprint table showing that peak memory during merging equals that of a single forward-backward pass on the current task. We will also state explicitly in the experimental protocol that no prior models, gradients, or replay buffers are retained or accessed at any stage. revision: yes
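An editorial sketch of how the promised memory-footprint comparison could be instrumented on GPU; the callables `single_forward_backward` and `trm_merge_step` are hypothetical placeholders, not functions from the paper:

```python
# Editorial sketch: compare peak CUDA memory of one merging step against one plain
# forward-backward pass on the current task.
import torch

def peak_mem_mb(step_fn, *args) -> float:
    """Run one step and report peak allocated CUDA memory in MiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    step_fn(*args)
    return torch.cuda.max_memory_allocated() / 2**20

# baseline = peak_mem_mb(single_forward_backward, model, batch)
# merging  = peak_mem_mb(trm_merge_step, model, batch)
# A no-storage merge should satisfy merging <= baseline (up to allocator noise).
```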
Circularity Check
No derivation chain or equations present; proposal is algorithmic design without deductive reduction
Full rationale
The provided abstract and description contain no equations, derivations, or first-principles steps. The paper introduces TRM as a conceptual framework reformulating merging via an augmented subspace and three objectives, supported by experimental claims. No load-bearing mathematical argument reduces to its inputs by construction, self-definition, or fitted renaming. The no-storage premise is a design claim verified (or not) empirically rather than deduced circularly. This is the common case for method-proposal papers lacking formal proofs.