Curvature-Guided Mixing for MLLM Adaptation

Jianguo Zhang; Jiaxuan He; Jinglong Yang; Wenjian Huang; Zhan Zhuang

arxiv: 2606.24963 · v1 · pith:O4CBQFQWnew · submitted 2026-06-23 · 💻 cs.CV · cs.LG

Jinglong Yang, Jiaxuan He, Wenjian Huang, Zhan Zhuang, Jianguo Zhang

Pith reviewed 2026-06-26 00:41 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords: model merging · multimodal large language models · catastrophic forgetting · Hessian approximation · parameter mixing · fine-tuning adaptation · curvature guidance · loss landscape

The pith

A Hessian approximation of loss landscapes yields a closed-form soft mixing ratio that blends MLLM parameters by relative task curvatures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that merging pre-trained and fine-tuned multimodal models can be done optimally by deriving a parameter blending ratio directly from second-order curvature information rather than through heuristics. It formulates a joint objective over the two models and applies a Hessian approximation to obtain an analytical expression for the mixing weights that reflect how sharply each task's loss changes with parameter perturbations. Experiments on LLaVA-1.5 and Qwen2.5VL demonstrate that this curvature-guided approach, along with a sparse hard-mixing variant, produces a better balance between downstream task gains and preservation of general capabilities than prior merging techniques.

Core claim

Curvature-Guided Mixing (CGM) formulates a joint optimization objective and uses a second-order (Hessian) approximation of the loss landscapes to analytically derive an optimal, closed-form soft mixing ratio that blends parameters based on their relative task-specific curvatures; a robust hard-mixing variant (CGM†) performs sparse parameter selection guided by a curvature-aware score, and both variants improve the trade-off between task specialization and general knowledge retention over existing methods on LLaVA-1.5 and Qwen2.5VL across multiple downstream tasks.

What carries the argument

The curvature-guided soft mixing ratio, obtained in closed form from the Hessian approximation of the joint loss objective, which sets blend weights for each parameter according to the relative curvatures of the pre-trained and fine-tuned loss surfaces.
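As a rough illustration of what a per-parameter, curvature-based blend looks like, here is a minimal sketch. It is not the paper's implementation: it assumes diagonal curvature estimates and equal task weights, under which a quadratic model gives the blend weight `lam = h_ft / (h_pre + h_ft)`. The names `soft_mix`, `h_pre`, `h_ft`, and the fallback to plain averaging when both curvatures vanish are choices made for this sketch.

```python
from typing import List, Sequence


def soft_mix(
    theta_pre: Sequence[float],
    theta_ft: Sequence[float],
    h_pre: Sequence[float],
    h_ft: Sequence[float],
    eps: float = 1e-12,
) -> List[float]:
    """Blend two parameter vectors element-wise by relative curvature.

    Assumption (not from the source): lam_i = h_ft_i / (h_pre_i + h_ft_i),
    so a parameter with high pre-training curvature stays near its
    pre-trained value.
    """
    mixed = []
    for p, f, hp, hf in zip(theta_pre, theta_ft, h_pre, h_ft):
        total = hp + hf
        # Fall back to a plain average when both curvatures are ~0.
        lam = 0.5 if total <= eps else hf / total
        mixed.append((1.0 - lam) * p + lam * f)
    return mixed


# Equal curvature -> midpoint; dominant pre-training curvature -> near theta_pre.
example = soft_mix([0.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 1.0])
```

With these inputs the first coordinate lands halfway between the two models and the second stays closer to the pre-trained value, because its pre-training curvature is three times larger.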

Load-bearing premise

The joint optimization objective admits an analytical closed-form solution under the Hessian approximation without requiring iterative numerical optimization or post-hoc parameter tuning.
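To make the premise concrete, the sketch below shows how a closed form can fall out of a quadratic model. It is not the paper's derivation, which is not reproduced on this page. It assumes, for illustration only, that both Hessians are diagonal, that the pre-trained and fine-tuned weights are stationary points of their respective losses (so the first-order terms vanish), and that the two losses are weighted equally. The symbols $h_i^{\text{pre}}$, $h_i^{\text{ft}}$ and $\lambda_i$ are names chosen for this sketch.

```latex
\begin{aligned}
J(\theta) &\approx \tfrac{1}{2}\sum_i h_i^{\text{pre}}\,(\theta_i - \theta_i^{\text{pre}})^2
           + \tfrac{1}{2}\sum_i h_i^{\text{ft}}\,(\theta_i - \theta_i^{\text{ft}})^2 + \text{const}, \\
\frac{\partial J}{\partial \theta_i} &= h_i^{\text{pre}}\,(\theta_i - \theta_i^{\text{pre}})
           + h_i^{\text{ft}}\,(\theta_i - \theta_i^{\text{ft}}) = 0, \\
\theta_i^{\star} &= (1 - \lambda_i)\,\theta_i^{\text{pre}} + \lambda_i\,\theta_i^{\text{ft}},
\qquad \lambda_i = \frac{h_i^{\text{ft}}}{h_i^{\text{pre}} + h_i^{\text{ft}}}.
\end{aligned}
```

Under these assumptions no iterative solver is needed, but the result depends on each of them; relaxing the diagonal assumption in particular would couple parameters through off-diagonal terms.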

What would settle it

An experiment in which models merged using the curvature-derived ratio show no improvement, or a clear degradation, in the combined metric of task accuracy and general capability retention compared with simple averaging or other fixed-ratio heuristics would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.24963 by Jianguo Zhang, Jiaxuan He, Jinglong Yang, Wenjian Huang, Zhan Zhuang.

**Figure 1.** Performance comparison of our methods (CGM and CGM†) against baselines for LLaVA fine-tuned on OKVQA. We evaluate general knowledge retention (Pre-Avg: average performance of pre-training tasks), specialization on the new task (Target), and the harmonic mean of both (Hscore) to measure the overall balance.

**Figure 2.** A conceptual illustration of the motivation behind CGM.

**Figure 3.** Radar plots illustrating the performance trade-off between downstream adaptation and general knowledge retention. The “Target Task” axis shows performance on the fine-tuned task, while all other axes measure general pre-trained capabilities. Our methods, CGM and CGM†, demonstrate superior balance by achieving high target-task performance while simultaneously preserving pre-trained knowledge.

**Figure 4.** Hyperparameter sensitivity analysis on the Qwen3B backbone for the LaTeXOCR and Flickr30k tasks. From the adjacent text in the source: the primary role of K is to control the degree of downstream specialization rather than managing forgetting, with the optimal Hscore and target performance typically achieved at a 10% sparsity ratio.

**Figure 5.** Visualization of selection masks (10% sparsity) for different methods.

**Figure 6.** Quantitative comparison of column-wise recovery ratios between CGM† (orange) and Magnitude (Mag, blue) at varying update sparsity levels. The Y-axis represents the fraction of pre-trained parameters kept. Across all sparsity levels, CGM† exhibits a non-uniform, structured selection that consistently targets or protects the same columns, whereas the Magnitude baseline remains uniform and diffuse.
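The Figure 1 caption defines Hscore as the harmonic mean of Pre-Avg and Target. A minimal sketch of that metric follows; the function name `hscore` and the choice to return 0 when either input is non-positive are assumptions of this sketch, not details taken from the paper.

```python
def hscore(pre_avg: float, target: float) -> float:
    """Harmonic mean of general-retention and target-task scores.

    Both inputs are assumed to be on the same scale (e.g. percentages).
    Returning 0.0 for non-positive inputs is a convention of this sketch.
    """
    if pre_avg <= 0.0 or target <= 0.0:
        return 0.0
    return 2.0 * pre_avg * target / (pre_avg + target)


balanced = hscore(60.0, 60.0)   # both scores equal
lopsided = hscore(80.0, 40.0)   # same arithmetic mean, lower harmonic mean
```

The harmonic mean penalizes imbalance: the two calls share an arithmetic mean of 60, yet the lopsided pair scores lower, which is why the metric suits a specialization-versus-retention trade-off.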

Original abstract

Fine-tuning Multimodal Large Language Models (MLLMs) on specialized tasks often leads to catastrophic forgetting of their general capabilities. Existing model merging methods to combat this are often heuristic or use sub-optimal objectives. We propose CurvatureGuided Mixing (CGM), a theoretically grounded framework that merges pre-trained and fine-tuned models. CGM formulates a joint optimization objective and uses a second-order (Hessian) approximation of the loss landscapes to analytically derive an optimal, closed-form "soft mixing" ratio. This ratio intelligently blends parameters based on their relative task-specific curvatures. We also introduce CGM$\dagger$, a robust "hard mixing" variant that performs sparse parameter selection guided by a novel, curvature-aware score. Experiments on LLaVA-1.5 and Qwen2.5VL across multiple downstream tasks show that CGM and CGM$\dagger$ consistently improve the trade-off between task specialization and general knowledge retention over existing methods. Code is available at github.com/zzsyjl/CGM-ECCV-2026.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper claims a Hessian-derived closed-form mixing ratio for MLLM merging that improves the specialization-forgetting trade-off, but the derivation's exactness under the stated approximation is the part that needs verification.

The core new piece is the curvature-guided mixing (CGM) that starts from a joint objective over pre-trained and fine-tuned parameters, approximates the loss with Hessians, and arrives at a closed-form soft-mixing ratio based on relative curvatures. They also give CGM†, a sparse hard-mixing version using a curvature-aware score. Experiments on LLaVA-1.5 and Qwen2.5VL report better retention of general capabilities alongside task gains compared with existing merging baselines, and the code is public.
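The sparse hard-mixing idea can be sketched as a top-K selection. The score used below is hypothetical: under a diagonal quadratic model, moving parameter i from its pre-trained to its fine-tuned value changes the joint loss by roughly `0.5 * (h_pre - h_ft) * delta**2`, so this sketch ranks parameters by the negated quantity. The paper's actual curvature-aware score is not shown on this page, and `hard_mix` and `keep_ratio` are names invented here.

```python
from typing import List, Sequence, Tuple


def hard_mix(
    theta_pre: Sequence[float],
    theta_ft: Sequence[float],
    h_pre: Sequence[float],
    h_ft: Sequence[float],
    keep_ratio: float = 0.10,
) -> Tuple[List[float], List[int]]:
    """Take the fine-tuned value for the top-scoring fraction of parameters.

    All other parameters keep their pre-trained value. The score is an
    illustrative assumption, not the paper's definition.
    """
    n = len(theta_pre)
    k = max(1, round(keep_ratio * n))
    scores = [
        0.5 * (hf - hp) * (f - p) ** 2
        for p, f, hp, hf in zip(theta_pre, theta_ft, h_pre, h_ft)
    ]
    # Indices of the k largest scores.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    selected = set(top)
    merged = [theta_ft[i] if i in selected else theta_pre[i] for i in range(n)]
    return merged, sorted(selected)


pre = [0.0] * 10
ft = [1.0] * 10
curv_pre = [1.0] * 10
curv_ft = [1.0] * 10
curv_ft[3] = 5.0  # only parameter 3 matters much more to the new task
merged, chosen = hard_mix(pre, ft, curv_pre, curv_ft, keep_ratio=0.10)
```

With ten parameters and a 10% ratio, exactly one parameter is swapped in, and it is the one whose fine-tuning curvature dominates.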

The approach is a step past purely heuristic merging, and the second-order framing is a reasonable direction for this problem. The experiments appear to test the practical outcome on standard models and tasks.

The soft spot is exactly the stress-test point: whether the joint objective plus Hessian approximation produces a genuinely parameter-free closed-form solution without diagonal assumptions, ignored cross terms, or later adjustments that would make the ratio no longer strictly analytical. The abstract states it does, but the derivation itself is not shown here, so it is impossible to confirm the algebra holds without extra fitting. If the full methods section reveals post-hoc normalization or iterative steps, the theoretical grounding claim shrinks.
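The concern about ignored cross terms can be checked numerically on a toy case. The sketch below assumes a two-parameter quadratic model with equal task weights, for which the joint minimizer is `(H_pre + H_ft)^{-1} (H_pre θ_pre + H_ft θ_ft)`, and compares it with a blend that uses only the diagonal entries. All matrices and function names are made up for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def solve_2x2(a: Matrix, b: List[float]) -> List[float]:
    # Cramer's rule for a 2x2 linear system a @ x = b.
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0.0:
        raise ValueError("singular matrix")
    return [
        (a[1][1] * b[0] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


def full_merge(t_pre, t_ft, H_pre: Matrix, H_ft: Matrix) -> List[float]:
    # Minimizer of the sum of two quadratics with full Hessians.
    H_sum = [[H_pre[i][j] + H_ft[i][j] for j in range(2)] for i in range(2)]
    rhs = [x + y for x, y in zip(matvec(H_pre, t_pre), matvec(H_ft, t_ft))]
    return solve_2x2(H_sum, rhs)


def diagonal_merge(t_pre, t_ft, H_pre: Matrix, H_ft: Matrix) -> List[float]:
    # Per-parameter blend that ignores off-diagonal (cross) terms.
    out = []
    for i in range(2):
        lam = H_ft[i][i] / (H_pre[i][i] + H_ft[i][i])
        out.append((1.0 - lam) * t_pre[i] + lam * t_ft[i])
    return out


t_pre, t_ft = [0.0, 0.0], [1.0, 1.0]
H_pre = [[2.0, 1.0], [1.0, 2.0]]
H_ft = [[2.0, -1.0], [-1.0, 2.0]]
with_cross = full_merge(t_pre, t_ft, H_pre, H_ft)
without_cross = diagonal_merge(t_pre, t_ft, H_pre, H_ft)
```

Here the full solution and the diagonal blend disagree even though the diagonals are identical, so whether cross terms are dropped is a substantive modeling choice. When both Hessians are diagonal, the two functions return the same result.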

This is for researchers working on efficient adaptation and merging of multimodal models. The work shows clear engagement with the forgetting problem and prior merging literature, so it deserves a serious referee even if the derivation requires tightening.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Curvature-Guided Mixing (CGM) to merge pre-trained and fine-tuned MLLMs, mitigating catastrophic forgetting. It formulates a joint optimization objective over the two models and applies a second-order Hessian approximation of the loss landscapes to analytically derive a closed-form 'soft mixing' ratio that blends parameters according to their relative task-specific curvatures. A sparse 'hard mixing' variant (CGM†) is also introduced using a curvature-aware selection score. Experiments on LLaVA-1.5 and Qwen2.5VL across downstream tasks report improved specialization-retention trade-offs relative to prior merging methods.

Significance. If the claimed analytical derivation is correct and produces a genuinely parameter-free closed-form ratio without iterative solvers or post-hoc adjustments, the work would supply a principled, second-order alternative to heuristic merging techniques. Reproducibility is supported by the linked code repository.

major comments (1)

[Methods / Derivation of CGM] The central claim rests on an analytical derivation of the mixing ratio from the joint objective under Hessian approximation. The manuscript must present the complete derivation (including the joint objective, the precise form of the Hessian approximation, and the algebraic steps to the closed-form ratio) to confirm the absence of cross-parameter terms, iterative numerical optimization, or implicit tuning that would undermine the 'parameter-free' and 'analytically derived' assertions.

minor comments (2)

[Abstract] The abstract contains a typographical error ('CurvatureGuided' should be 'Curvature-Guided').
[Abstract] The repository link in the abstract points to a 2026 conference; confirm the target venue and update if needed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The primary concern centers on ensuring the analytical derivation of the CGM mixing ratio is fully presented to substantiate the parameter-free and closed-form claims. We agree this is essential and will expand the manuscript accordingly.

Referee: [Methods / Derivation of CGM] The central claim rests on an analytical derivation of the mixing ratio from the joint objective under Hessian approximation. The manuscript must present the complete derivation (including the joint objective, the precise form of the Hessian approximation, and the algebraic steps to the closed-form ratio) to confirm the absence of cross-parameter terms, iterative numerical optimization, or implicit tuning that would undermine the 'parameter-free' and 'analytically derived' assertions.

Authors: We agree that a complete, self-contained derivation is required to rigorously support the claims. In the revised manuscript we will add (in Section 3 and/or a dedicated appendix) the full derivation: (1) the joint optimization objective over the pre-trained and fine-tuned parameter sets, (2) the precise second-order Hessian approximation employed (including whether a diagonal or block-diagonal form is used and how curvature is estimated per parameter), and (3) every algebraic step from the approximated objective to the closed-form soft-mixing ratio. This exposition will explicitly show the absence of cross-parameter coupling terms and confirm that the ratio is obtained analytically without iterative solvers or any post-hoc tuning, thereby validating the parameter-free nature of the method. revision: yes
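How curvature is estimated per parameter is not specified in the material on this page. One common stand-in in the merging and continual-learning literature is the diagonal empirical Fisher, the mean of squared per-example gradients. The sketch below assumes the per-example gradients are already available as plain lists; `diagonal_fisher` is a hypothetical helper and is not claimed to be the estimator the authors use.

```python
from typing import List, Sequence


def diagonal_fisher(per_example_grads: Sequence[Sequence[float]]) -> List[float]:
    """Mean of squared per-example gradients, one value per parameter.

    This is the diagonal empirical Fisher, used here only as an
    illustrative proxy for per-parameter curvature.
    """
    if not per_example_grads:
        raise ValueError("need at least one gradient vector")
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


# Two examples, two parameters.
fisher = diagonal_fisher([[1.0, 0.0], [3.0, 2.0]])
```

Each entry is non-negative by construction, which keeps a ratio of the form `h_ft / (h_pre + h_ft)` inside the interval from 0 to 1.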

Circularity Check

0 steps flagged

No circularity: derivation is a standard second-order approximation from stated joint objective.

The paper formulates a joint optimization objective over pre-trained and fine-tuned parameters, then applies a Hessian approximation to derive a closed-form mixing ratio. This is a conventional analytic step in optimization literature and does not reduce to a fitted parameter renamed as prediction, self-definition, or load-bearing self-citation. No equations in the provided abstract or description exhibit the reduction patterns (self-definitional, fitted-input-called-prediction, etc.). The result remains independent of the target mixing ratio by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard second-order Taylor approximation of loss surfaces being sufficiently accurate for deriving mixing ratios; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Loss landscapes admit a useful quadratic approximation via the Hessian matrix for the purpose of deriving mixing ratios.
Invoked to obtain the closed-form soft mixing ratio from the joint optimization objective.
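How far this assumption can be trusted depends on how far the merged weights sit from the point of expansion. The toy check below uses a made-up one-dimensional loss, `cosh(theta) - 1`, chosen only because its minimum and curvature are known. It estimates the curvature by finite differences and compares the quadratic model with the true loss near and far from the minimum. Nothing here comes from the paper's experiments.

```python
import math


def toy_loss(theta: float) -> float:
    # Toy 1-D loss with its minimum at theta = 0 (illustration only).
    return math.cosh(theta) - 1.0


def curvature(f, x: float, step: float = 1e-3) -> float:
    # Central finite-difference estimate of the second derivative f''(x).
    return (f(x + step) - 2.0 * f(x) + f(x - step)) / (step * step)


def quadratic_model(theta: float, h: float) -> float:
    # Second-order Taylor model around the minimum at 0; the gradient term vanishes.
    return 0.5 * h * theta * theta


h0 = curvature(toy_loss, 0.0)
gap_near = abs(toy_loss(0.1) - quadratic_model(0.1, h0))
gap_far = abs(toy_loss(2.0) - quadratic_model(2.0, h0))
```

The gap is tiny close to the minimum and grows substantially farther away, which is the regime where a curvature-derived ratio would be least reliable.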

Reference graph

Works this paper leans on

47 extracted references · 3 linked inside Pith

[1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

[2] de Boer, P.T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A tutorial on the cross-entropy method. Annals of Operations Research 134, 19–67 (2005)

[3] Cao, M., Liu, Y., Liu, Y., Wang, T., Dong, J., Ding, H., Zhang, X., Reid, I., Liang, X.: Continual llava: Continual instruction tuning in large vision-language models. arXiv preprint arXiv:2411.02564 (2024)

[4] Cen, J., Wu, C., Liu, X., Yin, S., Pei, Y., Yang, J., Chen, Q., Duan, N., Zhang, J.: Using left and right brains together: Towards vision and language planning. In: ICML. pp. 5982–6001 (2024)

[5] Chen, C., Zhu, J., Luo, X., Shen, H.T., Song, J., Gao, L.: Coin: A benchmark of continual instruction tuning for multimodel large language models. NeurIPS 37, 57817–57840 (2024)

[6] Chen, H., Yang, Y., Zhong, N., Ma, K.: Hiding images in diffusion models by editing learned score functions. In: CVPR. pp. 18663–18673 (2025)

[7] Frantar, E., Alistarh, D.: SparseGPT: Massive language models can be accurately pruned in one-shot. In: ICML (2023)

[8] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: CVPR. pp. 6904–6913 (2017)

[9] Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: Vizwiz grand challenge: Answering visual questions from blind people. In: CVPR. pp. 3608–3617 (2018)

[10] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: ICML. pp. 2790–2799 (2019)

[11] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. In: ICLR (2022)

[12] Huang, W., Liang, J., Guo, X., Fang, Y., Wan, G., Rong, X., Wen, C., Shi, Z., Li, Q., Zhu, D., Ma, Y., Liang, K., Yang, B., Li, H., Shao, J., Ye, M., Du, B.: Keeping yourself is important in downstream tuning multimodal large language model. arXiv preprint arXiv:2503.04543 (2025)

[13] Huang, W., Liang, J., Shi, Z., Zhu, D., Wan, G., Li, H., Du, B., Tao, D., Ye, M.: Learn from downstream and be yourself in multimodal large language model fine-tuning. In: ICML (2025)

[14] Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: CVPR. pp. 6700–6709 (2019)

[15] Jha, S., Gong, D., Yao, L.: CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models. In: NeurIPS (2024)

[16] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), 3521–3526 (2017)

[17] LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: NeurIPS. vol. 2, pp. 598–605 (1990)

[18] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)

[19] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: EMNLP. pp. 292–305 (2023)

[20] Liang, Y.S., Li, W.J.: Inflora: Interference-free low-rank adaptation for continual learning. In: CVPR. pp. 23638–23647 (2024)

[21] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024)

[22] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: ECCV. pp. 216–233 (2024)

[23] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: NeurIPS. vol. 35, pp. 2507–2521 (2022)

[24] Luo, G., Yang, X., Dou, W., Wang, Z., Liu, J., Dai, J., Qiao, Y., Zhu, X.: Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. In: CVPR (2025)

[25] Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., Zhang, Y.: An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE/ACM Transactions on Audio, Speech, and Language Processing 33, 3776–3786 (2025)

[26] Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: Ok-vqa: A visual question answering benchmark requiring external knowledge. In: CVPR. pp. 3195–3204 (2019)

[27] Martens, J., Grosse, R.: Optimizing neural networks with kronecker-factored approximate curvature. In: ICML. pp. 2408–2417 (2015)

[28] Matena, M., Raffel, C.: Merging models with fisher-weighted averaging. In: NeurIPS (2022)

[29] Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: Infographicvqa. In: WACV. pp. 1697–1706 (2022)

[30] Panigrahi, A., Saunshi, N., Zhao, H., Arora, S.: Task-specific skill localization in fine-tuned language models. In: ICML. pp. 27011–27033 (2023)

[31] Roboflow: Latex-ocr dataset (unsloth version). https://huggingface.co/datasets/unsloth/LaTeX_OCR (2024)

[32] Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: CVPR. pp. 8317–8326 (2019)

[33] Srinivasan, T., Chang, T.Y., Pinto Alva, L., Chochlakis, G., Rostami, M., Thomason, J.: Climb: A continual learning benchmark for vision-and-language tasks. NeurIPS 35, 29440–29453 (2022)

[34] Sun, M., Liu, Z., Bair, A., Kolter, J.Z.: A simple and effective pruning approach for large language models. In: Workshop on Efficient Systems for Foundation Models @ ICML2023 (2023)

[35] Wang, X., Zhuang, Z., Zhang, Y.: Plan: Proactive low-rank allocation for continual learning. In: ICCV. pp. 2909–2918 (2025)

[36] Wu, J., Xiong, Y., Li, X., Xia, Y., Wang, R., Wang, Y., Yu, T., Kim, S., Rossi, R.A., Yao, L., Shang, J., McAuley, J.: Mitigating visual knowledge forgetting in MLLM instruction-tuning via modality-decoupled gradient descent. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 2282–2295 (2025)

[37] Wu, Y., Piao, H., Huang, L., Wang, R., Li, W., Pfister, H., Meng, D., Ma, K., Wei, Y.: Sd-lora: Scalable decoupled low-rank adaptation for class incremental learning. In: ICLR (2025)

[38] Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, 67–78 (2014)

[39] Yu, J., Xiong, H., Zhang, L., Diao, H., Zhuge, Y., Hong, L., Wang, D., Lu, H., He, Y., Chen, L.: Llms can evolve continually on modality for x-modal reasoning. NeurIPS 37, 49834–49858 (2024)

[40] Yu, J., Zhuge, Y., Zhang, L., Hu, P., Wang, D., Lu, H., He, Y.: Boosting continual learning of vision-language models via mixture-of-experts adapters. In: CVPR. pp. 23219–23230 (2024)

[41] Yu, L., Yu, B., Yu, H., Huang, F., Li, Y.: Language models are super mario: Absorbing abilities from homologous models as a free lunch. In: ICML (2024)

[42] Zeng, F., Zhu, F., Guo, H., Zhang, X.Y., Liu, C.L.: Modalprompt: Towards efficient multimodal continual instruction tuning with dual-modality guided prompt. In: EMNLP. pp. 12126–12141 (2025)

[43] Zenke, F., Poole, B., Ganguli, S.: Continual learning through synaptic intelligence. In: ICML. pp. 3987–3995 (2017)

[44] Zhai, Y., Tong, S., Li, X., Cai, M., Qu, Q., Lee, Y.J., Ma, Y.: Investigating the catastrophic forgetting in multimodal large language model fine-tuning. In: Conference on Parsimony and Learning (2023)

[45] Zhang, P., Dong, X., Wang, B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., Duan, H., Zhang, S., Ding, S., et al.: Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023)

[46] Zhu, D., Sun, Z., Li, Z., Shen, T., Yan, K., Ding, S., Kuang, K., Wu, C.: Model tailor: Mitigating catastrophic forgetting in multi-modal large language models. In: ICML (2024)

[47] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., Gao, Z., Cui, E., Wang, X., Cao, Y., Liu, Y., Wei, X., Zhang, H., Wang, H., Xu, W., Li, H., Wang, J., Deng, N., Li, S., He, Y., Jiang, T., Luo, J., Wang, Y., He, C., Shi, B., Zhang, X., Shao, W., He, J., Xiong, Y., Qu, W., Sun, P., Jiao, P., Lv, H., Wu, L., Zhang, ... arXiv preprint arXiv:2504.10479 (2025)
