MAny: Merge Anything for Multimodal Continual Instruction Tuning
Pith reviewed 2026-05-10 13:52 UTC · model grok-4.3
The pith
Merging cross-modal projections and low-rank parameters prevents dual forgetting in multimodal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAny resolves the dual-forgetting phenomenon in MCIT by merging task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). CPM adaptively merges visual representations using visual-prototype guidance to recover perceptual alignment during inference. LPM recursively merges low-rank weight matrices via recursive least squares, providing a closed-form optimal fusion that eliminates interference and preserves reasoning stability. The entire process is training-free and relies on efficient CPU-based operations, yielding superior performance across multiple MLLMs and benchmarks, including leads of up to 8.57% in final average accuracy on the UCIT benchmark.
What carries the argument
Cross-modal Projection Merging guided by visual prototypes combined with recursive least-squares Low-rank Parameter Merging, which together fuse task knowledge without further training or interference.
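The CPM step, as described, admits a compact sketch. The following is a reconstruction under assumed notation (cosine similarity to per-task prototypes, a temperature-scaled softmax over tasks; the names `cpm_merge` and `eta` are ours, drawn loosely from the paper's algorithm listing), not the paper's implementation:

```python
import numpy as np

def cpm_merge(feat, prototypes, projectors, eta=0.1):
    """Sketch of prototype-guided projection merging (reconstruction).

    feat:        (d,) visual feature for the current input
    prototypes:  list of (d,) per-task visual prototypes mu_i
    projectors:  list of (d_out, d) per-task projection matrices P_i
    eta:         softmax temperature (hypothetical name)
    """
    # Cosine similarity of the input feature to each task prototype.
    sims = np.array([feat @ mu / (np.linalg.norm(feat) * np.linalg.norm(mu))
                     for mu in prototypes])
    # Temperature-scaled softmax weights over tasks.
    w = np.exp(sims / eta)
    w /= w.sum()
    # Adaptively merged projector, applied to the incoming feature.
    merged = sum(wi * P for wi, P in zip(w, projectors))
    return merged @ feat
```

In this reading, each task contributes its projector in proportion to how close the incoming visual feature sits to that task's prototype, so inference needs only dot products and a softmax, consistent with the CPU-only claim.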
If this is right
- Models retain both perceptual and reasoning capabilities across arbitrary task sequences.
- No additional gradient-based optimization is needed after initial task tuning.
- The approach delivers accuracy gains of up to 8.57% and 2.85% over prior methods on the UCIT benchmark for two MLLMs.
- Merging occurs via closed-form algebraic operations executable on CPU.
Where Pith is reading between the lines
- Similar merging techniques might apply to other forms of continual learning in non-multimodal settings.
- The closed-form nature of the fusion could allow for provable bounds on forgetting rates in future theoretical work.
- This could lower barriers to deploying adaptable multimodal systems in dynamic environments like robotics or personalized assistants.
- Testing on even longer task sequences would reveal if the interference-free property holds indefinitely.
Load-bearing premise
That the visual-prototype guided merging of projections and the recursive least-squares merging of low-rank matrices will recover alignment and stability for any task sequence without new interference or task-specific tuning.
What would settle it
A direct test would be to apply MAny to a sequence of 10 or more diverse multimodal tasks and measure whether average accuracy across all tasks remains above baselines and whether feature-similarity metrics show no drift in the merged projections.
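The two metrics in such a protocol can be pinned down concretely. The helpers below are illustrative definitions (our names), assuming "final average accuracy" means mean accuracy over all tasks after the last one is learned, and drift is measured by cosine similarity between per-task and merged projection outputs:

```python
import numpy as np

def projection_drift(task_feats, merged_feats):
    """Mean cosine similarity between features projected by each task's
    own projector and by the merged projector; values near 1.0 indicate
    no perceptual drift."""
    sims = [a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in zip(task_feats, merged_feats)]
    return float(np.mean(sims))

def final_average_accuracy(acc):
    """acc[t][i] = accuracy on task i after learning task t.
    Final average accuracy = mean over all tasks after the last one."""
    return float(np.mean(np.asarray(acc)[-1]))
```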
read the original abstract
Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present \textbf{MAny} (\textbf{M}erge \textbf{Any}thing), a framework that merges task-specific knowledge through \textbf{C}ross-modal \textbf{P}rojection \textbf{M}erging (\textbf{CPM}) and \textbf{L}ow-rank \textbf{P}arameter \textbf{M}erging (\textbf{LPM}). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57\% and 2.85\% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MAny, a training-free framework for Multimodal Continual Instruction Tuning (MCIT) of MLLMs that addresses dual-forgetting (perception drift in cross-modal projection space and reasoning collapse in low-rank parameter space) via two merging operations: Cross-modal Projection Merging (CPM) using visual-prototype guidance for adaptive feature recovery, and Low-rank Parameter Merging (LPM) that recursively fuses task-specific low-rank matrices with recursive least squares to yield a closed-form optimal fusion trajectory claimed to guarantee reasoning stability. It reports empirical leads of up to 8.57% and 2.85% final average accuracy on the UCIT benchmark over SOTA methods for two MLLMs, operating entirely via CPU-based algebraic operations after initial tuning.
Significance. If the mathematical guarantee and empirical robustness hold, the work would be significant for enabling efficient, gradient-free continual adaptation of multimodal models at scale, directly tackling both perceptual and reasoning interference without task-specific hyperparameters or retraining. The training-free algebraic merging paradigm, if reproducible and generalizable, could reduce compute barriers in sequential MLLM deployment.
major comments (3)
- [Abstract] Abstract: The assertion that LPM via recursive least squares 'provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability' is load-bearing for the central claim but lacks any derivation, normal-equation expansion, or proof sketch showing how the RLS quadratic loss on low-rank updates corresponds to the MCIT objective of preserving prior-task performance and perceptual alignment across arbitrary sequences; without this, the guarantee does not follow from standard RLS properties.
- [Abstract] Abstract and evaluation sections: The reported gains of 8.57% and 2.85% on UCIT are presented without error bars, standard deviations across runs, or ablation studies isolating CPM vs. LPM contributions, making it impossible to verify robustness or rule out benchmark-specific artifacts in the post-hoc selection of tasks and MLLMs.
- [Method (LPM)] Method description (LPM): The claim that recursive merging eliminates mutual interference assumes low-rank modules remain sufficiently orthogonal or that merging order introduces no bias, yet no analysis or counterexample test is provided to confirm the fused matrix preserves earlier-task performance when these conditions are violated.
minor comments (2)
- [Method] Notation for visual prototypes and low-rank matrices should be defined explicitly with dimensions before use in the merging equations.
- [Experiments] The manuscript would benefit from a clear statement of the exact UCIT task sequence and MLLM architectures used for the reported numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below, providing clarifications and indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion that LPM via recursive least squares 'provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability' is load-bearing for the central claim but lacks any derivation, normal-equation expansion, or proof sketch showing how the RLS quadratic loss on low-rank updates corresponds to the MCIT objective of preserving prior-task performance and perceptual alignment across arbitrary sequences; without this, the guarantee does not follow from standard RLS properties.
Authors: We acknowledge that the current manuscript does not provide an explicit derivation or proof sketch linking the RLS quadratic loss to the MCIT objectives. While recursive least squares yields a closed-form solution by solving the normal equations for the minimum of the quadratic objective, we agree that explicitly showing how this preserves prior-task performance and perceptual alignment would better support the claim. In the revised version, we will add a proof sketch in the appendix that expands the normal equations and connects the fusion trajectory to the dual-forgetting mitigation in MCIT. revision: yes
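For concreteness, one standard reconstruction of such a closed-form recursion (our notation; the paper's exact objective may differ) minimizes an accumulated quadratic over the fused task vector:

```latex
\tau_t^{*} = \arg\min_{\tau} \sum_{i=1}^{t} \bigl\| X_i \,(\tau - \tau_i) \bigr\|_F^2,
\qquad H_i = X_i^{\top} X_i .
```

Setting the gradient to zero gives the normal equations and their recursive form:

```latex
\Bigl(\sum_{i=1}^{t} H_i\Bigr)\, \tau_t^{*} = \sum_{i=1}^{t} H_i \tau_i
\quad\Longrightarrow\quad
\tau_t^{*} = \tau_{t-1}^{*} + \Bigl(\sum_{i=1}^{t} H_i\Bigr)^{-1} H_t \,\bigl(\tau_t - \tau_{t-1}^{*}\bigr).
```

Under this reading, the "optimal fusion trajectory" is optimality of each τ*_t for the accumulated quadratic objective; tying that objective to prior-task accuracy is exactly the gap the referee flags.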
-
Referee: [Abstract] Abstract and evaluation sections: The reported gains of 8.57% and 2.85% on UCIT are presented without error bars, standard deviations across runs, or ablation studies isolating CPM vs. LPM contributions, making it impossible to verify robustness or rule out benchmark-specific artifacts in the post-hoc selection of tasks and MLLMs.
Authors: We agree that error bars, standard deviations, and isolating ablations are necessary to demonstrate robustness. The full experimental results were obtained over multiple random seeds, but these statistics were not reported in the abstract or highlighted in the evaluation. We will revise the abstract to include error bars on the reported gains and add ablation studies in the experiments section that isolate the contributions of CPM and LPM. revision: yes
-
Referee: [Method (LPM)] Method description (LPM): The claim that recursive merging eliminates mutual interference assumes low-rank modules remain sufficiently orthogonal or that merging order introduces no bias, yet no analysis or counterexample test is provided to confirm the fused matrix preserves earlier-task performance when these conditions are violated.
Authors: The recursive least squares update in LPM is formulated to minimize the combined quadratic loss over all prior low-rank matrices at each step, which by design reduces mutual interference without requiring strict orthogonality. However, we recognize that additional analysis on merging order and potential biases would strengthen the presentation. In the revision, we will include a discussion of these assumptions along with empirical tests that vary task order and provide counterexample checks to confirm preservation of earlier-task performance. revision: yes
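The order question is checkable numerically for an RLS-style reconstruction of the merge (our notation; `batch_merge` and `recursive_merge` are illustrative, not the paper's code): whenever the recursion reproduces the batch normal-equation solution, the fused result is independent of task order in exact arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_merge(Hs, taus):
    """Batch closed-form fusion: tau* = (sum H_i)^{-1} sum H_i tau_i."""
    S = sum(Hs)
    return np.linalg.solve(S, sum(H @ t for H, t in zip(Hs, taus)))

def recursive_merge(Hs, taus):
    """RLS-style recursion: fold tasks in one at a time."""
    S = np.zeros_like(Hs[0])
    tau = np.zeros_like(taus[0])
    for H, t in zip(Hs, taus):
        S = S + H
        tau = tau + np.linalg.solve(S, H @ (t - tau))
    return tau

# Three synthetic tasks: well-conditioned PSD "covariances" and task vectors.
Hs, taus = [], []
for _ in range(3):
    A = rng.normal(size=(5, 5))
    Hs.append(A @ A.T + 0.1 * np.eye(5))
    taus.append(rng.normal(size=(5, 2)))

fused = recursive_merge(Hs, taus)
# Folding the tasks in a different order gives the same fused result.
perm = [2, 0, 1]
fused_perm = recursive_merge([Hs[i] for i in perm], [taus[i] for i in perm])
```

A useful empirical check along the lines proposed would vary the task order on real low-rank modules and report the deviation between the resulting merged weights.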
Circularity Check
No significant circularity; derivations rely on independent algebraic procedures
full rationale
The paper introduces CPM via visual-prototype guidance and LPM via recursive least-squares fusion of low-rank matrices as new merging operations. These are presented as training-free algebraic steps whose optimality is defined with respect to standard RLS normal equations rather than any quantity fitted from the target continual-learning objective or prior self-citations. No load-bearing self-citation chains, self-definitional loops, or renamings of known results appear in the abstract or high-level claims. Empirical gains on UCIT are reported separately from the algebraic construction, leaving the central derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
-
[2]
Zijian Gao, Kele Xu, Huiping Zhuang, Li Liu, Xinjun Mao, Bo Ding, Dawei Feng, and Huaimin Wang. Less confidence, less forgetting: Learning with a humbler teacher in exemplar-free class-incremental learning. Neural Networks, 179:106513, 2024. ISSN 0893-6080.
-
[3]
Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020.
-
[4]
Editing Models with Task Arithmetic
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
-
[5]
Dataless knowledge fusion by merging weights of language models
Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. arXiv preprint arXiv:2212.09849, 2022.
-
[6]
Minjae Lee, Minhyuk Seo, Tingyu Qu, Tinne Tuytelaars, and Jonghyun Choi. Oasis: Online sample selection for continual visual instruction tuning. arXiv preprint arXiv:2506.02011, 2025.
-
[7]
Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, and Zhongjie Wang. Multimodal continual instruction tuning with dynamic gradient guidance. arXiv preprint arXiv:2511.15164, 2025.
-
[8]
Clevr-math: A dataset for compositional language, visual and mathematical reasoning
Adam Dahlgren Lindström and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022.
-
[9]
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024.
-
[10]
Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning
Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021.
-
[11]
Wenju Sun, Qingyong Li, Yangli-ao Geng, and Boyang Li. Cat merging: A training-free approach for resolving conflicts in model merging. arXiv preprint arXiv:2505.06977, 2025.
-
[12]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
-
[13]
Orthogonal subspace learning for language model continual learning
Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 10658–10671, 2023.
-
[14]
Progressive lora for multimodal continual instruction tuning
Yahan Yu, Duzhen Zhang, Yong Ren, Xuanle Zhao, Xiuyi Chen, and Chenhui Chu. Progressive lora for multimodal continual instruction tuning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 2779–2796, 2025.
-
[15]
Modalprompt: Towards efficient multimodal continual instruction tuning with dual-modality guided prompt
Fanhu Zeng, Fei Zhu, Haiyang Guo, Xu-Yao Zhang, and Cheng-Lin Liu. Modalprompt: Towards efficient multimodal continual instruction tuning with dual-modality guided prompt. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 12137–12152, 2025.
-
[17]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
-
[18]
Hongbo Zhao, Fei Zhu, Haiyang Guo, Meng Wang, Rundong Wang, Gaofeng Meng, and Zhaoxiang Zhang. Mllm-cl: Continual learning for multimodal large language models. arXiv preprint arXiv:2506.05453, 2025.