arxiv: 2606.24963 · v1 · pith:O4CBQFQWnew · submitted 2026-06-23 · 💻 cs.CV · cs.LG

Curvature-Guided Mixing for MLLM Adaptation

Pith reviewed 2026-06-26 00:41 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords model merging · multimodal large language models · catastrophic forgetting · Hessian approximation · parameter mixing · fine-tuning adaptation · curvature guidance · loss landscape

The pith

A Hessian approximation of loss landscapes yields a closed-form soft mixing ratio that blends MLLM parameters by relative task curvatures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that merging pre-trained and fine-tuned multimodal models can be done optimally by deriving a parameter blending ratio directly from second-order curvature information rather than through heuristics. It formulates a joint objective over the two models and applies a Hessian approximation to obtain an analytical expression for the mixing weights that reflect how sharply each task's loss changes with parameter perturbations. Experiments on LLaVA-1.5 and Qwen2.5VL demonstrate that this curvature-guided approach, along with a sparse hard-mixing variant, produces a better balance between downstream task gains and preservation of general capabilities than prior merging techniques.

Core claim

Curvature-Guided Mixing (CGM) formulates a joint optimization objective and uses a second-order (Hessian) approximation of the loss landscapes to analytically derive an optimal, closed-form soft mixing ratio that blends parameters based on their relative task-specific curvatures; a robust hard-mixing variant (CGM†) performs sparse parameter selection guided by a curvature-aware score, and both variants improve the trade-off between task specialization and general knowledge retention over existing methods on LLaVA-1.5 and Qwen2.5VL across multiple downstream tasks.
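
The hard-mixing idea (sparse selection of which fine-tuned updates to keep) can be sketched as follows. This is only an illustration under stated assumptions: the page does not give the paper's curvature-aware score, so `toy_score` is a hypothetical stand-in built from a diagonal quadratic model, and the names `hard_mix`, `toy_score`, and `keep_ratio` are invented here. The 0.10 default merely echoes the 10% sparsity setting mentioned in the figure captions.

```python
import numpy as np

def hard_mix(theta_pre, theta_ft, score, keep_ratio=0.10):
    """Keep fine-tuned values for the top `keep_ratio` fraction of entries
    ranked by `score`; every other entry reverts to its pre-trained value."""
    theta_pre = np.asarray(theta_pre, dtype=float)
    theta_ft = np.asarray(theta_ft, dtype=float)
    flat_score = np.asarray(score, dtype=float).ravel()
    k = max(1, int(round(keep_ratio * flat_score.size)))
    # argpartition places the k largest scores in the last k positions.
    top = np.argpartition(flat_score, -k)[-k:]
    mask = np.zeros(flat_score.size, dtype=bool)
    mask[top] = True
    mask = mask.reshape(theta_ft.shape)
    return np.where(mask, theta_ft, theta_pre), mask

def toy_score(theta_pre, theta_ft, h_pre, h_ft):
    # HYPOTHETICAL score, not the paper's: estimated target-loss increase from
    # reverting an update minus estimated pre-training-loss increase from
    # keeping it, both under a diagonal quadratic model.
    delta_sq = (np.asarray(theta_ft, dtype=float) - np.asarray(theta_pre, dtype=float)) ** 2
    return 0.5 * (np.asarray(h_ft, dtype=float) - np.asarray(h_pre, dtype=float)) * delta_sq
```

Any other per-parameter score can be passed to `hard_mix` unchanged, which is the point of separating the selection step from the scoring step.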

What carries the argument

The curvature-guided soft mixing ratio, obtained in closed form from the Hessian approximation of the joint loss objective, which sets blend weights for each parameter according to the relative curvatures of the pre-trained and fine-tuned loss surfaces.
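
A minimal sketch of what a curvature-weighted soft blend looks like in code, assuming a diagonal curvature estimate per parameter and assuming each model sits near a minimum of its own loss. The paper's exact ratio is not reproduced on this page, so the formula below is the textbook minimizer of two per-parameter quadratics, not a confirmed copy of CGM; `soft_mix`, `h_pre`, `h_ft`, and `eps` are hypothetical names.

```python
import numpy as np

def soft_mix(theta_pre, theta_ft, h_pre, h_ft, eps=1e-12):
    """Per-parameter minimizer of
        0.5*h_pre*(t - theta_pre)**2 + 0.5*h_ft*(t - theta_ft)**2.
    Returns the merged parameters and the mixing ratio toward the
    fine-tuned model."""
    theta_pre = np.asarray(theta_pre, dtype=float)
    theta_ft = np.asarray(theta_ft, dtype=float)
    h_pre = np.asarray(h_pre, dtype=float)
    h_ft = np.asarray(h_ft, dtype=float)
    # Sharper target-task curvature pulls the blend toward theta_ft;
    # eps only guards against division by zero (assumed safeguard).
    lam = h_ft / (h_pre + h_ft + eps)
    merged = (1.0 - lam) * theta_pre + lam * theta_ft
    return merged, lam

merged, lam = soft_mix([0.0, 0.0, 0.0], [1.0, 1.0, 1.0],
                       h_pre=[1.0, 3.0, 0.0], h_ft=[1.0, 1.0, 2.0])
print(lam)
```

In this toy call the second coordinate, where the pre-trained loss is three times sharper, stays closest to the pre-trained value, while the third, where the pre-trained loss is flat, moves fully to the fine-tuned value.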

Load-bearing premise

The joint optimization objective admits an analytical closed-form solution under the Hessian approximation without requiring iterative numerical optimization or post-hoc parameter tuning.
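
To make the premise concrete, here is one way such a closed form can arise. This is a sketch under assumptions that the page does not confirm: both Hessians are taken to be diagonal, each model is assumed to sit at a stationary point of its own loss (so first-order terms vanish), and the two losses are weighted equally. The symbols $h^{\text{pre}}_i$, $h^{\text{ft}}_i$, and $\lambda_i$ are introduced here for illustration.

```latex
\begin{align}
\mathcal{L}_{\text{pre}}(\theta) &\approx \mathcal{L}_{\text{pre}}(\theta^{\text{pre}})
  + \tfrac{1}{2}\sum_i h^{\text{pre}}_i \,(\theta_i - \theta^{\text{pre}}_i)^2, \\
\mathcal{L}_{\text{ft}}(\theta) &\approx \mathcal{L}_{\text{ft}}(\theta^{\text{ft}})
  + \tfrac{1}{2}\sum_i h^{\text{ft}}_i \,(\theta_i - \theta^{\text{ft}}_i)^2, \\
0 &= \frac{\partial}{\partial \theta_i}\bigl[\mathcal{L}_{\text{pre}} + \mathcal{L}_{\text{ft}}\bigr]
  = h^{\text{pre}}_i(\theta_i - \theta^{\text{pre}}_i) + h^{\text{ft}}_i(\theta_i - \theta^{\text{ft}}_i), \\
\theta_i^{\star} &= (1-\lambda_i)\,\theta^{\text{pre}}_i + \lambda_i\,\theta^{\text{ft}}_i,
  \qquad \lambda_i = \frac{h^{\text{ft}}_i}{h^{\text{pre}}_i + h^{\text{ft}}_i}.
\end{align}
```

Under these assumptions the stationarity condition decouples across coordinates, which is why no iterative solver is needed. If any of the three assumptions is relaxed, the closed form changes or disappears.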

What would settle it

An experiment in which models merged using the curvature-derived ratio show no improvement, or a clear degradation, in the combined metric of task accuracy and general capability retention compared with simple averaging or other fixed-ratio heuristics would falsify the central claim.
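
The caption of Figure 1 describes the combined metric, Hscore, as the harmonic mean of Pre-Avg and Target. A small sketch of that computation follows; the function name `h_score` and the numbers in the example are illustrative and are not results from the paper.

```python
def h_score(pre_avg, target):
    """Harmonic mean of general-capability retention (pre_avg) and
    target-task performance (target)."""
    if pre_avg < 0 or target < 0:
        raise ValueError("scores must be non-negative")
    if pre_avg + target == 0:
        return 0.0
    return 2.0 * pre_avg * target / (pre_avg + target)

# Made-up scores: a balanced model versus a lopsided one with the same
# arithmetic mean of 50.
balanced = h_score(50.0, 50.0)
lopsided = h_score(90.0, 10.0)
print(balanced, lopsided)
```

Because the harmonic mean is dominated by the smaller of the two scores, a method that gains on the target task by sacrificing general capability is penalized, which is what makes this metric a reasonable test for the falsification experiment described above.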

Figures

Figures reproduced from arXiv: 2606.24963 by Jianguo Zhang, Jiaxuan He, Jinglong Yang, Wenjian Huang, Zhan Zhuang.

Figure 1: Performance comparison of our methods (CGM and CGM†) against baselines for LLaVA fine-tuned on OKVQA. We evaluate general knowledge retention (Pre-Avg: average performance of pre-training tasks), specialization on the new task (Target), and the harmonic mean of both (Hscore) to measure the overall balance.
Figure 2: A conceptual illustration of the motivation behind CGM.
Figure 3: Radar plots illustrating the performance trade-off between downstream adaptation and general knowledge retention. The “Target Task” axis shows performance on the fine-tuned task, while all other axes measure general pre-trained capabilities. Our methods, CGM and CGM†, demonstrate superior balance by achieving high target-task performance while simultaneously preserving pre-trained knowledge.
Figure 4: Hyperparameter sensitivity analysis on the Qwen3B backbone for the LaTeX-OCR and Flickr30k tasks.
Figure 5: Visualization of selection masks (10% sparsity) for different methods.
Figure 6: Quantitative comparison of column-wise recovery ratios between CGM† (orange) and Magnitude (Mag, blue) at varying update sparsity levels. The Y-axis represents the fraction of pre-trained parameters kept. Across all sparsity levels, CGM† exhibits a non-uniform, structured selection that consistently targets or protects the same columns, whereas the Magnitude baseline remains uniform and diffuse.
Original abstract

Fine-tuning Multimodal Large Language Models (MLLMs) on specialized tasks often leads to catastrophic forgetting of their general capabilities. Existing model merging methods to combat this are often heuristic or use sub-optimal objectives. We propose CurvatureGuided Mixing (CGM), a theoretically grounded framework that merges pre-trained and fine-tuned models. CGM formulates a joint optimization objective and uses a second-order (Hessian) approximation of the loss landscapes to analytically derive an optimal, closed-form "soft mixing" ratio. This ratio intelligently blends parameters based on their relative task-specific curvatures. We also introduce CGM$\dagger$, a robust "hard mixing" variant that performs sparse parameter selection guided by a novel, curvature-aware score. Experiments on LLaVA-1.5 and Qwen2.5VL across multiple downstream tasks show that CGM and CGM$\dagger$ consistently improve the trade-off between task specialization and general knowledge retention over existing methods. Code is available at github.com/zzsyjl/CGM-ECCV-2026.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Curvature-Guided Mixing (CGM) to merge pre-trained and fine-tuned MLLMs, mitigating catastrophic forgetting. It formulates a joint optimization objective over the two models and applies a second-order Hessian approximation of the loss landscapes to analytically derive a closed-form 'soft mixing' ratio that blends parameters according to their relative task-specific curvatures. A sparse 'hard mixing' variant (CGM†) is also introduced using a curvature-aware selection score. Experiments on LLaVA-1.5 and Qwen2.5VL across downstream tasks report improved specialization-retention trade-offs relative to prior merging methods.

Significance. If the claimed analytical derivation is correct and produces a genuinely parameter-free closed-form ratio without iterative solvers or post-hoc adjustments, the work would supply a principled, second-order alternative to heuristic merging techniques. Reproducibility is supported by the linked code repository.

major comments (1)
  1. [Methods / Derivation of CGM] The central claim rests on an analytical derivation of the mixing ratio from the joint objective under Hessian approximation. The manuscript must present the complete derivation (including the joint objective, the precise form of the Hessian approximation, and the algebraic steps to the closed-form ratio) to confirm the absence of cross-parameter terms, iterative numerical optimization, or implicit tuning that would undermine the 'parameter-free' and 'analytically derived' assertions.
minor comments (2)
  1. [Abstract] The abstract contains a typographical error ('CurvatureGuided' should be 'Curvature-Guided').
  2. [Abstract] The repository link in the abstract points to a 2026 conference; confirm the target venue and update if needed.
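
The major comment's concern about cross-parameter terms can be illustrated numerically. The sketch below is a toy two-parameter example with invented Hessians, not anything taken from the paper: it compares the exact minimizer of two quadratics with full Hessians against the per-parameter answer obtained after discarding off-diagonal curvature. `merge_full` and `merge_diag` are hypothetical helper names.

```python
import numpy as np

def merge_full(theta_a, theta_b, H_a, H_b):
    """Minimizer of the sum of two quadratics with full Hessians:
    solves (H_a + H_b) theta = H_a theta_a + H_b theta_b."""
    return np.linalg.solve(H_a + H_b, H_a @ theta_a + H_b @ theta_b)

def merge_diag(theta_a, theta_b, H_a, H_b):
    """Same problem with off-diagonal curvature discarded, so each
    coordinate gets its own scalar mixing ratio."""
    h_a, h_b = np.diag(H_a), np.diag(H_b)
    return (h_a * theta_a + h_b * theta_b) / (h_a + h_b)

theta_pre = np.array([0.0, 0.0])
theta_ft = np.array([1.0, 1.0])
H_pre = np.array([[2.0, 1.0], [1.0, 2.0]])  # coordinates are coupled
H_ft = np.array([[1.0, 0.0], [0.0, 3.0]])   # already diagonal

full = merge_full(theta_pre, theta_ft, H_pre, H_ft)
diag = merge_diag(theta_pre, theta_ft, H_pre, H_ft)
print(full, diag)
```

The two answers differ, so a per-parameter ratio is exact only when the coupling terms are absent or deliberately dropped. That is why the requested derivation needs to state which Hessian approximation is used.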

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The primary concern centers on ensuring the analytical derivation of the CGM mixing ratio is fully presented to substantiate the parameter-free and closed-form claims. We agree this is essential and will expand the manuscript accordingly.

read point-by-point responses
  1. Referee: [Methods / Derivation of CGM] The central claim rests on an analytical derivation of the mixing ratio from the joint objective under Hessian approximation. The manuscript must present the complete derivation (including the joint objective, the precise form of the Hessian approximation, and the algebraic steps to the closed-form ratio) to confirm the absence of cross-parameter terms, iterative numerical optimization, or implicit tuning that would undermine the 'parameter-free' and 'analytically derived' assertions.

    Authors: We agree that a complete, self-contained derivation is required to rigorously support the claims. In the revised manuscript we will add (in Section 3 and/or a dedicated appendix) the full derivation: (1) the joint optimization objective over the pre-trained and fine-tuned parameter sets, (2) the precise second-order Hessian approximation employed (including whether a diagonal or block-diagonal form is used and how curvature is estimated per parameter), and (3) every algebraic step from the approximated objective to the closed-form soft-mixing ratio. This exposition will explicitly show the absence of cross-parameter coupling terms and confirm that the ratio is obtained analytically without iterative solvers or any post-hoc tuning, thereby validating the parameter-free nature of the method.
    revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is a standard second-order approximation from stated joint objective.

full rationale

The paper formulates a joint optimization objective over pre-trained and fine-tuned parameters, then applies a Hessian approximation to derive a closed-form mixing ratio. This is a conventional analytic step in optimization literature and does not reduce to a fitted parameter renamed as prediction, self-definition, or load-bearing self-citation. No equations in the provided abstract or description exhibit the reduction patterns (self-definitional, fitted-input-called-prediction, etc.). The result remains independent of the target mixing ratio by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard second-order Taylor approximation of loss surfaces being sufficiently accurate for deriving mixing ratios; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Loss landscapes admit a useful quadratic approximation via the Hessian matrix for the purpose of deriving mixing ratios.
    Invoked to obtain the closed-form soft mixing ratio from the joint optimization objective.
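
How far a quadratic approximation can be trusted depends on how far one moves from the expansion point, which matters here because a merged model lies away from both the pre-trained and fine-tuned parameters. The toy check below uses a one-dimensional loss chosen for illustration (it is not the paper's loss) and compares it with its second-order Taylor expansion at the minimum.

```python
import numpy as np

def loss(x):
    # Toy 1-D loss with its minimum at x = 0: log(2*cosh(x/2)).
    return np.log(2.0 * np.cosh(x / 2.0))

def quadratic_model(x, x0=0.0, h=0.25):
    # Second-order Taylor expansion around the minimum x0. The gradient
    # term vanishes there, and h = 0.25 is the exact second derivative
    # of `loss` at 0.
    return loss(x0) + 0.5 * h * (x - x0) ** 2

steps = np.array([0.1, 1.0, 3.0])
errors = np.abs(loss(steps) - quadratic_model(steps))
print(errors)
```

The error is negligible for the small step and grows quickly for the larger ones. For the axiom to hold in practice, the distance between the two models would need to stay in the regime where this gap remains small.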

pith-pipeline@v0.9.1-grok · 5722 in / 1166 out tokens · 19819 ms · 2026-06-26T00:41:19.424159+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 3 linked inside Pith

  1. [1]

    arXiv preprint arXiv:2502.13923 (2025)

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  2. [2]

    Annals of Operations Research 134, 19–67 (2005)

    de Boer, P.T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A tutorial on the cross-entropy method. Annals of Operations Research 134, 19–67 (2005)

  3. [3]

    arXiv preprint arXiv:2411.02564 (2024)

    Cao, M., Liu, Y., Liu, Y., Wang, T., Dong, J., Ding, H., Zhang, X., Reid, I., Liang, X.: Continual llava: Continual instruction tuning in large vision-language models. arXiv preprint arXiv:2411.02564 (2024)

  4. [4]

    In: ICML

    Cen, J., Wu, C., Liu, X., Yin, S., Pei, Y., Yang, J., Chen, Q., Duan, N., Zhang, J.: Using left and right brains together: Towards vision and language planning. In: ICML. pp. 5982–6001 (2024)

  5. [5]

    NeurIPS 37, 57817–57840 (2024)

    Chen, C., Zhu, J., Luo, X., Shen, H.T., Song, J., Gao, L.: Coin: A benchmark of continual instruction tuning for multimodel large language models. NeurIPS 37, 57817–57840 (2024)

  6. [6]

    In: CVPR

    Chen, H., Yang, Y., Zhong, N., Ma, K.: Hiding images in diffusion models by editing learned score functions. In: CVPR. pp. 18663–18673 (2025)

  7. [7]

    In: ICML (2023)

    Frantar, E., Alistarh, D.: SparseGPT: Massive language models can be accurately pruned in one-shot. In: ICML (2023)

  8. [8]

    In: CVPR

    Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: CVPR. pp. 6904–6913 (2017)

  9. [9]

    In: CVPR

    Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: Vizwiz grand challenge: Answering visual questions from blind people. In: CVPR. pp. 3608–3617 (2018)

  10. [10]

    In: ICML

    Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: ICML. pp. 2790–2799 (2019)

  11. [11]

    In: ICLR (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. In: ICLR (2022)

  12. [12]

    arXiv preprint arXiv:2503.04543 (2025)

    Huang, W., Liang, J., Guo, X., Fang, Y., Wan, G., Rong, X., Wen, C., Shi, Z., Li, Q., Zhu, D., Ma, Y., Liang, K., Yang, B., Li, H., Shao, J., Ye, M., Du, B.: Keeping yourself is important in downstream tuning multimodal large language model. arXiv preprint arXiv:2503.04543 (2025)

  13. [13]

    In: ICML (2025)

    Huang, W., Liang, J., Shi, Z., Zhu, D., Wan, G., Li, H., Du, B., Tao, D., Ye, M.: Learn from downstream and be yourself in multimodal large language model fine-tuning. In: ICML (2025)

  14. [14]

    In: CVPR

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: CVPR. pp. 6700–6709 (2019)

  15. [15]

    In: NeurIPS (2024)

    Jha, S., Gong, D., Yao, L.: CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models. In: NeurIPS (2024)

  16. [16]

    Proceedings of the National Academy of Sciences 114(13), 3521–3526 (2017)

    Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), 3521–3526 (2017)

  17. [17]

    In: NeurIPS

    LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: NeurIPS. vol. 2, pp. 598–605 (1990)

  18. [18]

    In: ICML (2023)

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)

  19. [19]

    In: EMNLP

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: EMNLP. pp. 292–305 (2023)

  20. [20]

    In: CVPR

    Liang, Y.S., Li, W.J.: Inflora: Interference-free low-rank adaptation for continual learning. In: CVPR. pp. 23638–23647 (2024)

  21. [21]

    In: CVPR (2024)

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024)

  22. [22]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: ECCV. pp. 216–233 (2024)

  23. [23]

    In: NeurIPS

    Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: NeurIPS. vol. 35, pp. 2507–2521 (2022)

  24. [24]

    In: CVPR (2025)

    Luo, G., Yang, X., Dou, W., Wang, Z., Liu, J., Dai, J., Qiao, Y., Zhu, X.: Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. In: CVPR (2025)

  25. [25]

    IEEE/ACM Transactions on Audio, Speech, and Language Processing 33, 3776–3786 (2025)

    Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., Zhang, Y.: An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE/ACM Transactions on Audio, Speech, and Language Processing 33, 3776–3786 (2025)

  26. [26]

    In: CVPR

    Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: Ok-vqa: A visual question answering benchmark requiring external knowledge. In: CVPR. pp. 3195–3204 (2019)

  27. [27]

    In: ICML

    Martens, J., Grosse, R.: Optimizing neural networks with kronecker-factored approximate curvature. In: ICML. pp. 2408–2417 (2015)

  28. [28]

    In: NeurIPS (2022)

    Matena, M., Raffel, C.: Merging models with fisher-weighted averaging. In: NeurIPS (2022)

  29. [29]

    In: WACV

    Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: Infographicvqa. In: WACV. pp. 1697–1706 (2022)

  30. [30]

    In: ICML

    Panigrahi, A., Saunshi, N., Zhao, H., Arora, S.: Task-specific skill localization in fine-tuned language models. In: ICML. pp. 27011–27033 (2023)

  31. [31]

    https://huggingface.co/datasets/unsloth/LaTeX_OCR (2024)

    Roboflow: Latex-ocr dataset (unsloth version). https://huggingface.co/datasets/unsloth/LaTeX_OCR (2024)

  32. [32]

    In: CVPR

    Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: CVPR. pp. 8317–8326 (2019)

  33. [33]

    NeurIPS 35, 29440–29453 (2022)

    Srinivasan, T., Chang, T.Y., Pinto Alva, L., Chochlakis, G., Rostami, M., Thomason, J.: Climb: A continual learning benchmark for vision-and-language tasks. NeurIPS 35, 29440–29453 (2022)

  34. [34]

    In: Workshop on Efficient Systems for Foundation Models @ ICML2023 (2023)

    Sun, M., Liu, Z., Bair, A., Kolter, J.Z.: A simple and effective pruning approach for large language models. In: Workshop on Efficient Systems for Foundation Models @ ICML2023 (2023)

  35. [35]

    In: ICCV

    Wang, X., Zhuang, Z., Zhang, Y.: Plan: Proactive low-rank allocation for continual learning. In: ICCV. pp. 2909–2918 (2025)

  36. [36]

    In: Findings of the Association for Computational Linguistics: EMNLP 2025

    Wu, J., Xiong, Y., Li, X., Xia, Y., Wang, R., Wang, Y., Yu, T., Kim, S., Rossi, R.A., Yao, L., Shang, J., McAuley, J.: Mitigating visual knowledge forgetting in MLLM instruction-tuning via modality-decoupled gradient descent. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 2282–2295 (2025)

  37. [37]

    In: ICLR (2025)

    Wu, Y., Piao, H., Huang, L., Wang, R., Li, W., Pfister, H., Meng, D., Ma, K., Wei, Y.: Sd-lora: Scalable decoupled low-rank adaptation for class incremental learning. In: ICLR (2025)

  38. [38]

    Transactions of the Association for Computational Linguistics 2, 67–78 (2014)

    Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, 67–78 (2014)

  39. [39]

    NeurIPS 37, 49834–49858 (2024)

    Yu, J., Xiong, H., Zhang, L., Diao, H., Zhuge, Y., Hong, L., Wang, D., Lu, H., He, Y., Chen, L.: Llms can evolve continually on modality for x-modal reasoning. NeurIPS 37, 49834–49858 (2024)

  40. [40]

    In: CVPR

    Yu, J., Zhuge, Y., Zhang, L., Hu, P., Wang, D., Lu, H., He, Y.: Boosting continual learning of vision-language models via mixture-of-experts adapters. In: CVPR. pp. 23219–23230 (2024)

  41. [41]

    In: ICML (2024)

    Yu, L., Yu, B., Yu, H., Huang, F., Li, Y.: Language models are super mario: Absorbing abilities from homologous models as a free lunch. In: ICML (2024)

  42. [42]

    In: EMNLP

    Zeng, F., Zhu, F., Guo, H., Zhang, X.Y., Liu, C.L.: Modalprompt: Towards efficient multimodal continual instruction tuning with dual-modality guided prompt. In: EMNLP. pp. 12126–12141 (2025)

  43. [43]

    In: ICML

    Zenke, F., Poole, B., Ganguli, S.: Continual learning through synaptic intelligence. In: ICML. pp. 3987–3995 (2017)

  44. [44]

    In: Conference on Parsimony and Learning (2023)

    Zhai, Y., Tong, S., Li, X., Cai, M., Qu, Q., Lee, Y.J., Ma, Y.: Investigating the catastrophic forgetting in multimodal large language model fine-tuning. In: Conference on Parsimony and Learning (2023)

  45. [45]

    arXiv preprint arXiv:2309.15112 (2023)

    Zhang, P., Dong, X., Wang, B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., Duan, H., Zhang, S., Ding, S., et al.: Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023)

  46. [46]

    In: ICML (2024)

    Zhu, D., Sun, Z., Li, Z., Shen, T., Yan, K., Ding, S., Kuang, K., Wu, C.: Model tailor: Mitigating catastrophic forgetting in multi-modal large language models. In: ICML (2024)

  47. [47]

    arXiv preprint arXiv:2504.10479 (2025)

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., Gao, Z., Cui, E., Wang, X., Cao, Y., Liu, Y., Wei, X., Zhang, H., Wang, H., Xu, W., Li, H., Wang, J., Deng, N., Li, S., He, Y., Jiang, T., Luo, J., Wang, Y., He, C., Shi, B., Zhang, X., Shao, W., He, J., Xiong, Y., Qu, W., Sun, P., Jiao, P., Lv, H., Wu, L., Zhang, ...