Unlocking the Potential of Continual Model Merging: An ODE Perspective

Haidong Kang; Lihong Lin

arxiv: 2605.19409 · v3 · pith:2HT32N3Wnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 08:28 UTC · model grok-4.3

keywords: continual model merging · ODE perspective · mode connectivity · low-loss paths · catastrophic forgetting · parameter space · velocity field · barrier constraints

The pith

Continual model merging follows time-dependent ODE paths in parameter space to avoid forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing merging methods use fixed algebraic combinations on isolated model parameters, causing accumulated forgetting as tasks arrive sequentially. The paper instead assumes desirable merged models sit on low-loss connecting paths between task models, as suggested by mode connectivity. It proposes ODE-M to trace these paths by integrating a time-dependent velocity field while adding barrier constraints that block any loss-increasing moves. This gives explicit control over how much capacity is kept for old tasks versus new ones. A reader would care because it offers a scalable way to customize foundation models for new tasks without full retraining.

Core claim

The central claim is that continual model merging should follow low-loss connecting paths in parameter space without crossing loss barriers. ODE-M does this by integrating a time-dependent velocity field to trace the path and by enforcing barrier constraints that prevent loss-increasing steps. The paper reports state-of-the-art performance on mainstream CMM benchmarks.

What carries the argument

ODE-driven Merging (ODE-M), which traces low-loss paths using a time-dependent velocity field with barrier constraints.
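To make the mechanism concrete, here is a minimal sketch of barrier-aware path tracing on a two-dimensional toy problem. It is not the authors' implementation. The ring-shaped `loss`, the straight-line `base_velocity`, the projection rule in `rectify`, and the explicit Euler integrator are all assumptions chosen for illustration; the paper's actual velocity field and damping rule are not specified in this review.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(theta):
    # Toy loss (assumption): every point on the unit circle is a minimum.
    return (math.hypot(theta[0], theta[1]) - 1.0) ** 2

def grad(theta):
    r = math.hypot(theta[0], theta[1])
    if r == 0.0:
        return [0.0, 0.0]
    scale = 2.0 * (r - 1.0) / r
    return [scale * x for x in theta]

def base_velocity(theta, target, t):
    # Hypothetical base field u_t: straight-line transport that would reach
    # the target at t = 1 if left unrectified.
    return [(b - a) / (1.0 - t) for a, b in zip(theta, target)]

def rectify(u, g, eps=1e-12):
    # If u has a loss-increasing component (<g, u> > 0), project it out so the
    # step is loss-neutral to first order; otherwise keep u unchanged.
    gg = dot(g, g)
    gu = dot(g, u)
    if gg < eps or gu <= 0.0:
        return list(u)
    return [ui - (gu / gg) * gi for ui, gi in zip(u, g)]

def integrate(theta0, target, steps=100, use_barrier=True):
    # Explicit Euler integration of d(theta)/dt = v_t(theta) on [0, 1].
    h = 1.0 / steps
    theta, path = list(theta0), [list(theta0)]
    for k in range(steps):
        u = base_velocity(theta, target, k * h)
        v = rectify(u, grad(theta)) if use_barrier else u
        theta = [x + h * vi for x, vi in zip(theta, v)]
        path.append(theta)
    return path

theta0, theta1 = [1.0, 0.0], [0.0, 1.0]
chord = integrate(theta0, theta1, use_barrier=False)   # one-shot style straight path
curved = integrate(theta0, theta1, use_barrier=True)   # barrier-aware path
max_chord = max(loss(p) for p in chord)
max_curved = max(loss(p) for p in curved)
```

On this toy problem the straight chord cuts through the higher-loss interior of the ring, while the rectified trajectory stays close to the ring of minima and still ends near the target. Any intermediate entry of `curved` can serve as an operating point θ(t).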

If this is right

Merging allocates capacity more consistently between old and new capabilities.
Forgetting is reduced in sequences with heterogeneous task importance.
Merged models maintain performance better over many sequential tasks.
The method provides a controllable alternative to repeated retraining for foundation model customization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The ODE perspective could link merging to continuous optimization dynamics in training.
It might apply to merging in other domains like reinforcement learning policies.
Scaling the method to very large models could test if low-loss paths remain accessible.

Load-bearing premise

Desirable merged models lie on low-loss connecting paths in parameter space, and continual merging must follow these paths without crossing loss barriers.

What would settle it

Run ODE-M on a benchmark with and without the barrier constraints. The premise is undermined if the loss still increases at some step along the constrained path, or if the unconstrained variant matches or exceeds the constrained one in final performance.
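The first half of that check can be sketched as a small audit over losses recorded along a trajectory. The function name `audit_losses`, the tolerance, and the example loss values are hypothetical; the only assumption is that a loss value has been logged at every integration step.

```python
def audit_losses(losses, tol=1e-8):
    """Scan a recorded loss trajectory for steps where the loss went up."""
    # Each violation is (step index k, increase from step k to step k + 1).
    violations = [
        (k, losses[k + 1] - losses[k])
        for k in range(len(losses) - 1)
        if losses[k + 1] - losses[k] > tol
    ]
    return {
        "violations": violations,
        "worst_increase": max((d for _, d in violations), default=0.0),
        "net_change": losses[-1] - losses[0],
        "peak": max(losses),
    }

# Made-up example values: the loss rises once, between steps 1 and 2.
report = audit_losses([0.30, 0.28, 0.29, 0.25, 0.25])
```

A non-empty `violations` list means the per-step barrier did not hold on that run, even if `net_change` is negative overall.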

Figures

Figures reproduced from arXiv: 2605.19409 by Haidong Kang, Lihong Lin.

**Figure 1.** Mode connectivity suggests that two independently trained models can often be joined by a continuous path in parameter space along which the loss remains low (nearly unchanged).

**Figure 2.** Per-task performance on 8-task continual merging. Radar plots show the per-task accuracy of different continual merging methods on an 8-task stream for three CLIP ViT backbones: (a) ViT-B/32, (b) ViT-B/16, and (c) ViT-L/14.

**Figure 3.** Conventional vs. ODE-based CMM. Conventional CMM merges two models in a single step, whereas ODE-based CMM follows a continuous trajectory between endpoints and selects operating points θ(t) along the path for controllable updates.

**Figure 4.** Barrier-aware rectification of the velocity field. Left: the base field u_t(θ) induces an unconstrained transport toward the target model θ_1. Right: after gradient projection and adaptive damping, the rectified field v_t(θ) suppresses loss-increasing components, yielding trajectories that avoid high-loss (barrier) regions while still progressing toward θ_1.

**Figure 5.** Empirical statistics of the adaptive rectification coefficient γ(θ, t) along the ODE-induced trajectory. Colors show log-frequency over integration steps aggregated across runs.

**Figure 6.** Empirical statistics of the alignment ratio ∥P_{g_t}(u_t)∥ / ∥u_t∥ along the ODE trajectory. Colors show log-frequency over integration steps aggregated across runs.

**Figure 7.** Per-task performance on 20-task continual merging. Radar plots show the per-task accuracy of different continual merging methods on a 20-task stream for three CLIP ViT backbones: (a) ViT-B/32, (b) ViT-B/16, and (c) ViT-L/14.

**Figure 8.** Per-task performance on 20-task continual merging (heterogeneous utility). Radar plots show the per-task accuracy of different continual merging methods on a 20-task stream for three CLIP ViT backbones: (a) ViT-B/32, (b) ViT-B/16, and (c) ViT-L/14.

Original abstract

Continual Model Merging (CMM) enables rapid customization of foundation models by sequentially incorporating task-adapted models without repeated retraining. However, existing merging rules usually update the deployed model through fixed algebraic or projection-based operations, providing limited control over how much previously accumulated knowledge should be retained relative to the incoming task model. This limitation leads to unstable retention and performance degradation in long task streams, and becomes more pronounced when tasks have heterogeneous utilities. We propose ODE-driven Merging (ODE-M), a controllable framework that formulates each continual merge as a trajectory in parameter space rather than a one-step endpoint update. Motivated by mode connectivity, ODE-M constructs a barrier-aware trajectory using a rectified time-dependent velocity field, where lightweight first-order feedback from a small calibration set suppresses loss-increasing motion while preserving progress toward the incoming model. The next merged model is then obtained by selecting an operating point along this trajectory through a utility-aware time schedule, providing an explicit mechanism for balancing retained historical knowledge and incoming task expertise. Extensive experiments on standard CMM benchmarks show that ODE-M consistently improves over strong continual merging baselines across CLIP ViT backbones, stream lengths, and heterogeneous task-utility settings.
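The abstract's last step, choosing an operating point along the trajectory, can be illustrated with a small sketch. The selection rule below (pick the sampled point with the best utility-weighted calibration score) is an assumption made for illustration; the paper's utility-aware time schedule may be defined differently. The function name and the score table are hypothetical.

```python
def select_operating_point(path_scores, utilities):
    """Pick the trajectory sample with the best utility-weighted score.

    path_scores[i][j]: score of trajectory sample i on task j (higher is better),
                       with samples evenly spaced over t in [0, 1].
    utilities[j]:      non-negative importance weight of task j.
    """
    total = sum(utilities)
    weighted = [
        sum(u * s for u, s in zip(utilities, row)) / total
        for row in path_scores
    ]
    best = max(range(len(weighted)), key=weighted.__getitem__)
    t_star = best / (len(path_scores) - 1)
    return best, t_star, weighted[best]

# Made-up scores for (old task, new task) at t = 0, 0.25, 0.5, 0.75, 1.
scores = [
    [0.90, 0.50],
    [0.89, 0.60],
    [0.86, 0.70],
    [0.80, 0.77],
    [0.70, 0.78],
]
_, t_equal, _ = select_operating_point(scores, [1.0, 1.0])
_, t_old_heavy, _ = select_operating_point(scores, [3.0, 1.0])
```

With these made-up numbers, weighting the old task more heavily moves the chosen operating time earlier along the path, which corresponds to retaining more of the previous merged model.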

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The ODE framing adds a time-dependent path idea to continual merging but the local barrier checks leave the global forgetting claim unproven.

The core new piece is the ODE-M construction: instead of fixed algebraic merges they integrate a time-dependent velocity field while adding a local barrier term that rejects loss-increasing steps at each integration point. That gives explicit control over how capacity is allocated across sequential tasks, which is a practical gap in prior CMM work. The motivation from mode connectivity is laid out cleanly and the barrier idea is a straightforward way to operationalize staying on low-loss paths during the merge process. Experiments are reported as SOTA on standard benchmarks, which suggests the method moves the needle in practice at least on the tasks they tested.

The soft spot is exactly the one the stress-test flags. The barrier is enforced only at discrete steps during integration, so there is no certificate that the final merged weights after many tasks still sit on a globally low-loss connecting path. In high-dimensional space a sequence of locally safe steps can still accumulate drift or loss increase once the next task arrives, and nothing in the setup appears to audit the entire trajectory or provide a Lyapunov-style guarantee. Without that, the reduced-forgetting claim rests on extrapolation from local behavior.

The paper is aimed at researchers who already work on model merging or continual adaptation of foundation models and who are comfortable thinking in parameter-space geometry. A reader who wants concrete recipes for controllable merging will find the velocity-field framing useful even if they end up modifying the barrier. It deserves a serious referee because the idea is cleanly motivated, the math is not circular on its face, and the experimental claims are falsifiable; the local-to-global gap is fixable with additional analysis or audits rather than fatal to the contribution.

Referee Report

2 major / 2 minor

Summary. The paper introduces ODE-M for Continual Model Merging (CMM). Motivated by mode connectivity, it assumes desirable merged models lie on low-loss connecting paths in parameter space. The method constructs a transition by integrating a time-dependent velocity field v(t, θ) while enforcing barrier constraints that reject loss-increasing steps at each integration step. This is claimed to provide explicit controllability over capacity allocation, reduce forgetting relative to fixed algebraic merging rules, and achieve state-of-the-art performance on mainstream CMM benchmarks.

Significance. If the central construction and empirical claims hold, the work supplies a dynamical-systems framing for CMM that moves beyond static linear combinations, potentially enabling more consistent performance allocation across heterogeneous tasks. The explicit use of an ODE trajectory with local barrier enforcement is a concrete technical contribution that could be extended to other continual-learning settings.

major comments (2)

[Method section (velocity field and barrier term)] The barrier constraint is described as a local, per-step rejection of loss-increasing moves during integration of the time-dependent velocity field. No global certificate (e.g., Lyapunov function, a-posteriori loss audit along the full trajectory, or invariance argument) is supplied showing that the final parameter vector after T sequential merges remains on a connecting path whose loss is no higher than the individual task minima. This local-to-global extrapolation is load-bearing for the claim that ODE-M “prevents loss-increasing steps” and thereby reduces forgetting.
[Experiments / Abstract] The abstract asserts SOTA results across mainstream CMM benchmarks, yet the manuscript supplies no quantitative tables, error bars, dataset sizes, task sequences, or implementation details for the velocity field and constraint enforcement. Without these, the performance claim cannot be verified and the comparison to prior algebraic merging methods remains ungrounded.

minor comments (2)

[§3] Clarify the precise functional form of the time-dependent velocity field v(t, θ) and the exact mathematical statement of the barrier constraint (e.g., whether it is a hard rejection or a soft penalty term).
[Implementation details] Add a short discussion of how the integration step size and discretization scheme affect the fidelity of the path-following argument.
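The step-size concern in the last comment can be illustrated with a toy experiment. This is a sketch under stated assumptions, not the paper's setup: the quadratic `loss`, the base field u = θ_1 − θ, the projection rule, and explicit Euler stepping are all illustrative choices. Each projected step is loss-neutral to first order, yet the finite step still raises the loss at second order, and the accumulated drift depends on the step size.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(theta):
    # Toy quadratic loss (assumption); its gradient at theta is theta itself.
    return 0.5 * dot(theta, theta)

def drift(steps, theta0=(1.0, 0.0), target=(1.0, 2.0)):
    """Euler-integrate a projected field on [0, 1]; return (loss drift, worst slope)."""
    h = 1.0 / steps
    theta = list(theta0)
    worst_slope = 0.0  # largest first-order loss rate <g, v> seen at any step
    for _ in range(steps):
        u = [b - a for a, b in zip(theta, target)]  # hypothetical base field
        g = list(theta)                             # gradient of the toy loss
        gu, gg = dot(g, u), dot(g, g)
        # Remove the loss-increasing component when <g, u> > 0.
        v = [ui - (gu / gg) * gi for ui, gi in zip(u, g)] if gu > 0.0 else u
        worst_slope = max(worst_slope, dot(g, v))
        theta = [x + h * vi for x, vi in zip(theta, v)]
    return loss(theta) - loss(theta0), worst_slope

coarse_drift, coarse_slope = drift(steps=10)
fine_drift, fine_slope = drift(steps=100)
```

In both runs the first-order check passes at every step (the slope stays at zero up to rounding), but the loss still ends higher than it started, and the coarse run drifts more than the fine one. That is the kind of discretization effect a per-step barrier cannot see without an additional audit of the actual loss values.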

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback on our manuscript. We address the major comments point by point below, clarifying our approach and committing to revisions where appropriate to enhance the clarity and rigor of the work.

Referee: [Method section (velocity field and barrier term)] The barrier constraint is described as a local, per-step rejection of loss-increasing moves during integration of the time-dependent velocity field. No global certificate (e.g., Lyapunov function, a-posteriori loss audit along the full trajectory, or invariance argument) is supplied showing that the final parameter vector after T sequential merges remains on a connecting path whose loss is no higher than the individual task minima. This local-to-global extrapolation is load-bearing for the claim that ODE-M “prevents loss-increasing steps” and thereby reduces forgetting.

Authors: We thank the referee for highlighting this important aspect of our theoretical grounding. The barrier constraint is indeed enforced locally at each integration step to reject loss-increasing moves, which is intended to keep the trajectory on low-loss paths incrementally. We acknowledge that a formal global certificate, such as a Lyapunov function or invariance argument, is not provided in the current manuscript. To address the concern, we will add an a-posteriori loss audit along sampled trajectories in the revised version to empirically demonstrate that the final parameter vectors do not incur higher loss than the individual task minima. We will also expand the discussion to clarify the rationale behind local enforcement and its relation to the ODE perspective.
revision: yes
Referee: [Experiments / Abstract] The abstract asserts SOTA results across mainstream CMM benchmarks, yet the manuscript supplies no quantitative tables, error bars, dataset sizes, task sequences, or implementation details for the velocity field and constraint enforcement. Without these, the performance claim cannot be verified and the comparison to prior algebraic merging methods remains ungrounded.

Authors: We agree with the referee that the experimental validation requires more explicit presentation to support the SOTA claims. The current manuscript's experiments section will be augmented in the revision with detailed quantitative tables, including performance metrics with error bars from multiple runs, specifications of dataset sizes, task sequences used in the continual merging benchmarks, and implementation details regarding the time-dependent velocity field and the barrier constraint enforcement mechanism. This will ground the comparisons to prior algebraic merging methods and allow for independent verification.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent ODE construction

The paper motivates its proposal by citing the established mode-connectivity literature and stating an assumption that desirable merged models lie on low-loss paths. It then defines ODE-M as a distinct technical construction that integrates a time-dependent velocity field subject to local barrier constraints. No quoted equation or step reduces the claimed path-tracing behavior to the input assumption by definition, renames a fitted quantity as a prediction, or relies on a self-citation chain for uniqueness. The central derivation therefore remains self-contained and does not collapse to its motivating premise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central proposal rests on the domain assumption of low-loss connecting paths from mode connectivity; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Desirable merged models lie on low-loss connecting paths, and continual merging should follow such paths without crossing loss barriers that induce forgetting.
Invoked in the motivation paragraph immediately before proposing ODE-M; grounded in mode connectivity literature.

pith-pipeline@v0.9.0 · 5731 in / 1218 out tokens · 49421 ms · 2026-05-21T08:28:15.356287+00:00 · methodology
