
arXiv: 2605.18904 · v1 · pith:HMYJ3J2Gnew · submitted 2026-05-17 · cs.LG · cs.AI · cs.CL

Dynamic Model Merging Made Slim

Pith reviewed 2026-05-20 15:12 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL
keywords: model merging · dynamic merging · parameter efficiency · low-rank modules · multi-task learning · data-free refinement · expert allocation

The pith

DiDi-Merging matches prior dynamic model merging baselines at just 1.24 times the parameters of a single fine-tuned model by optimizing ranks inside low-rank modules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a dynamic merging approach that reuses multiple fine-tuned models in one compact system without retraining or access to original data. It treats the split between parameters shared across tasks and those kept expert-specific as a problem solvable by differentiable optimization of matrix ranks in low-rank modules. A subsequent refinement step then restores task performance without any training data. This produces accuracy comparable to earlier dynamic methods at 1.24 times single-model size and better accuracy at 1.4 times size, while staying well below the storage cost of methods that exceed twice the size. The technique is shown to work for vision, language, and multimodal tasks.

Core claim

DiDi-Merging formulates parameter budgeting as differentiable rank optimization in low-rank modules and introduces a data-free refinement step to recover task fidelity, allowing the merged model to match prior dynamic baselines at only 1.24 times the parameters of one fine-tuned model and surpass them at 1.4 times while remaining far more compact than approaches that require over 2 times storage.

What carries the argument

Differentiable rank optimization inside low-rank modules that allocates capacity between shared and expert parameters.
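
One way to picture differentiable rank optimization is a soft gate over singular values, so that the rank becomes a continuous quantity that a gradient-based optimizer can adjust. The sketch below is a minimal illustration under that assumption, not the paper's implementation. The sigmoid-style gate, the `temperature` parameter, and the function names `soft_rank_mask`, `soft_truncate`, and `hard_truncate` are hypothetical choices made here; the only detail taken from the source is the mention of "SVD with smooth truncation" in the Figure 6 caption.

```python
import numpy as np

def soft_rank_mask(num_singular_values, rank, temperature=1.0):
    """Smooth gate over singular-value indices (hypothetical form).

    Values are close to 1 for indices below `rank` and close to 0 above it.
    `rank` may be any real number, which is what makes it tunable by gradients.
    """
    idx = np.arange(num_singular_values)
    # Sigmoid written via tanh to avoid overflow at small temperatures.
    return 0.5 * (1.0 + np.tanh((rank - idx - 0.5) / (2.0 * temperature)))

def soft_truncate(W, rank, temperature=1.0):
    """Low-rank approximation of W with smoothly gated singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    gate = soft_rank_mask(S.size, rank, temperature)
    return (U * (S * gate)) @ Vt

def hard_truncate(W, rank):
    """Ordinary truncated SVD, kept only for comparison."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 6))
    for r in (2.0, 2.5, 3.0):
        err = np.linalg.norm(W - soft_truncate(W, r, temperature=0.5))
        print(f"rank={r:.1f}  reconstruction error={err:.4f}")
```

At a small temperature the gate approaches a hard cutoff and `soft_truncate` agrees with `hard_truncate`. At larger temperatures the reconstruction error varies smoothly with `rank`, which is the property a rank-allocation objective would need.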

If this is right

  • Matches the accuracy of prior dynamic merging methods at 1.24 times the storage of one fine-tuned model
  • Exceeds those methods at 1.4 times single-model size
  • Uses substantially less total storage than dynamic methods that exceed twice the size
  • Operates without original training data and across vision, language, and multimodal domains
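
To make size multiples such as 1.24 times concrete, the arithmetic below shows how a storage ratio could be computed for a single weight matrix. It is a sketch under stated assumptions: the merged model is assumed to keep one dense `m x n` matrix plus one low-rank factor pair for the shared component and one per task expert. The paper's own accounting (Figure 4) may differ, and the shapes, ranks, and function names are hypothetical.

```python
def low_rank_params(m, n, rank):
    """Parameters in a rank-`rank` factor pair: B is (m x rank), A is (rank x n)."""
    return rank * (m + n)

def storage_ratio(m, n, shared_rank, expert_ranks):
    """Total stored parameters divided by the parameters of one dense matrix."""
    dense = m * n
    extra = low_rank_params(m, n, shared_rank)
    extra += sum(low_rank_params(m, n, r) for r in expert_ranks)
    return (dense + extra) / dense

def max_total_rank(m, n, ratio):
    """Largest summed rank (shared plus all experts) that fits under `ratio`."""
    return int((ratio - 1.0) * m * n / (m + n))

if __name__ == "__main__":
    # Hypothetical layer: 1024 x 1024, shared rank 32, eight experts of rank 8.
    print(storage_ratio(1024, 1024, 32, [8] * 8))  # 1.1875
    # Summed rank available to this layer under a 1.24x budget.
    print(max_total_rank(1024, 1024, 1.24))        # 122
```

Under these assumptions, a 1.24 times budget leaves a 1024 x 1024 layer a summed rank of 122 to divide between the shared component and all experts, which is why the split between them matters.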

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reduced footprint could make merging many more tasks feasible on memory-limited hardware
  • The data-free refinement property opens use in settings where training data must stay private
  • The rank-allocation idea might combine with quantization or pruning for further size cuts

Load-bearing premise

Differentiable rank optimization can find an allocation of shared versus expert parameters such that a later data-free refinement step will restore full task performance.
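
The premise can be illustrated with a small data-free refinement loop. Following the Figure 6 caption, the sketch splits task vectors into a shared average and per-task residuals, truncates each with an SVD, and then refines the low-rank factors against a reconstruction loss on the original task vectors. Everything else is an assumption made for illustration: the joint gradient-descent update, the step-halving rule, fixed ranks, and all function names are hypothetical and are not claimed to be the authors' procedure.

```python
import numpy as np

def truncated_factors(M, rank):
    """Factors (B, A) whose product B @ A is the rank-`rank` truncated SVD of M."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(S[:rank])
    return U[:, :rank] * root, root[:, None] * Vt[:rank]

def reconstruction_loss(shared, experts, task_vectors):
    """Sum over tasks of the squared error of (shared + expert) vs. the task vector."""
    Bs, As = shared
    return sum(np.sum((Bs @ As + Bt @ At - tau) ** 2)
               for (Bt, At), tau in zip(experts, task_vectors))

def refine(task_vectors, shared_rank, expert_rank, steps=200, lr=1e-3):
    """Refine low-rank factors using only the task vectors (no training data)."""
    Ms = np.mean(task_vectors, axis=0)                  # shared part by averaging
    shared = truncated_factors(Ms, shared_rank)
    experts = [truncated_factors(tau - Ms, expert_rank) for tau in task_vectors]
    initial = loss = reconstruction_loss(shared, experts, task_vectors)
    for _ in range(steps):
        Bs, As = shared
        residuals = [Bs @ As + Bt @ At - tau
                     for (Bt, At), tau in zip(experts, task_vectors)]
        total = sum(residuals)                          # shared factors see every task
        new_shared = (Bs - lr * 2 * total @ As.T, As - lr * 2 * Bs.T @ total)
        new_experts = [(Bt - lr * 2 * R @ At.T, At - lr * 2 * Bt.T @ R)
                       for (Bt, At), R in zip(experts, residuals)]
        new_loss = reconstruction_loss(new_shared, new_experts, task_vectors)
        if new_loss <= loss:                            # accept only non-increasing steps
            shared, experts, loss = new_shared, new_experts, new_loss
        else:
            lr *= 0.5                                   # otherwise shrink the step
    return shared, experts, initial, loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tasks = rng.standard_normal((3, 12, 10))            # three synthetic task vectors
    _, _, before, after = refine(tasks, shared_rank=4, expert_rank=2)
    print(f"loss before refinement: {before:.4f}, after: {after:.4f}")
```

The loop only lowers reconstruction error on the task vectors. Whether that also restores task accuracy is exactly what the premise assumes and what the proposed test in the next section would probe.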

What would settle it

A controlled test on held-out tasks where the 1.24-times-size merged model falls measurably below baseline accuracy and the data-free refinement step fails to close the gap.

Figures

Figures reproduced from arXiv: 2605.18904 by Guodong Du, Wanyu Lin.

Figure 1: Comparison of static and dynamic …
Figure 2: Parameter allocation in dynamic model merging. The balance beam denotes the router, …
Figure 3: DiDi-Merging achieves a superior Pareto trade-off between parameter efficiency and accuracy across all dynamic merging baselines
Figure 4: Parameter accounting under three dynamic merging …
Figure 6: The DiDi-Merging pipeline. Left: task vectors $\{\tau_t\}$ are split into a shared component $M_s$ via averaging and per-task expert residuals $\tau_t - M_s$; SVD with smooth truncation produces low-rank $\tilde{M}_s, \tilde{M}_t$, the Find-Optimal-Rank stage learns per-module ranks $r_s, r_1, \dots, r_T$, and LoRA factors $(A, B)$ are refined data-free against an MSE reconstruction loss with respect to the original task vectors. Right: during…
Figure 7: Task-vector similarity (left) vs. optimal …
Original abstract

Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy-efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring >2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DiDi-Merging, a dynamic model merging method that uses differentiable rank allocation in low-rank modules to balance shared and expert parameters more efficiently than existing approaches. It adds a data-free refinement step to recover performance. The key result is that the method matches prior dynamic methods at 1.24 times the parameters of one fine-tuned model and surpasses them at 1.4 times, with applications in vision, language, and multimodal domains.

Significance. Should the empirical claims be substantiated, this would be a notable contribution to the model merging literature: a slimmer dynamic merging strategy that reduces storage requirements while maintaining or improving performance. The differentiable budgeting and the data-free refinement address practical constraints in multi-task learning scenarios.

major comments (2)
  1. Abstract: Performance numbers are stated without any mention of experimental setup, datasets, number of runs, or error bars. This is a load-bearing issue for assessing whether the differentiable rank optimization and data-free refinement deliver the claimed efficiency gains.
  2. Data-free refinement procedure: The refinement step is presented as recovering task fidelity using proxy signals without original data. However, there is no analysis or benchmarks quantifying the fidelity recovery gap or identifying when it fails (e.g., dissimilar tasks), which directly impacts the validity of the accuracy claims at 1.24x and 1.4x parameters.
minor comments (2)
  1. Consider adding a table comparing parameter counts and performance metrics across all baselines for clarity.
  2. Ensure consistent use of symbols for rank variables and allocation parameters throughout the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the presentation of our results.

Point-by-point responses
  1. Referee: Abstract: Performance numbers are stated without any mention of experimental setup, datasets, number of runs, or error bars. This is a load-bearing issue for assessing whether the differentiable rank optimization and data-free refinement deliver the claimed efficiency gains.

    Authors: We agree that the abstract would benefit from additional context on the experimental setup. In the revised manuscript, we have expanded the abstract to note that results are reported on standard vision (e.g., CIFAR-100, ImageNet), language (GLUE), and multimodal (VQAv2) benchmarks, averaged over 3 independent runs with standard deviations, and that the 1.24x and 1.4x parameter budgets are measured relative to a single fine-tuned model. This provides the necessary details to evaluate the efficiency claims without exceeding abstract length limits.
    Revision: yes

  2. Referee: Data-free refinement procedure: The refinement step is presented as recovering task fidelity using proxy signals without original data. However, there is no analysis or benchmarks quantifying the fidelity recovery gap or identifying when it fails (e.g., dissimilar tasks), which directly impacts the validity of the accuracy claims at 1.24x and 1.4x parameters.

    Authors: We acknowledge the value of a more explicit analysis of the data-free refinement's robustness. While the original manuscript shows empirical recovery on the primary task sets, it does not include a dedicated quantification of the fidelity gap or failure cases for dissimilar tasks. In the revision, we have added an appendix section with new benchmarks that measure the performance gap before and after refinement across task similarity levels, including highly dissimilar combinations, along with discussion of proxy signal limitations. This directly supports the validity of the reported accuracy at the stated parameter budgets.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent optimization objective

Full rationale

The paper introduces DiDi-Merging as a new framework that formulates parameter budgeting via differentiable rank optimization inside low-rank modules, followed by a data-free refinement step. These are presented as algorithmic contributions rather than derivations that reduce to prior inputs by construction. Performance numbers (1.24x and 1.4x parameter efficiency) are reported as experimental outcomes on vision, language, and multimodal tasks, not as predictions forced by fitting or self-citation. No load-bearing step equates the claimed result to its own fitted parameters or renames a known pattern; the central premise relies on the proposed optimization and refinement procedure, which is externally falsifiable via accuracy measurements.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on the unstated premise that low-rank modules plus differentiable budgeting can separate shared and task-specific capacity without catastrophic interference, plus the assumption that data-free refinement can restore fidelity.

free parameters (1)
  • rank allocation variables
    Differentiable optimization over ranks in low-rank modules requires parameters that are fitted during the merging process.
axioms (1)
  • domain assumption: Low-rank decomposition can approximate task-specific parameter updates without significant loss of expressivity
    Invoked when formulating parameter budgeting inside low-rank modules.
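
For a single weight matrix, the reconstruction side of this axiom has a standard quantitative form, the Eckart-Young-Mirsky theorem. Writing $\tau$ for one task-vector matrix with singular values $\sigma_1(\tau) \ge \sigma_2(\tau) \ge \dots$ (the symbol $\tau$ follows the Figure 6 caption; applying the theorem here is an editorial illustration, not a step taken from the paper), the smallest achievable rank-$r$ error is set by the discarded singular values:

```latex
\min_{\operatorname{rank}(X) \le r} \lVert \tau - X \rVert_F^2
  \;=\; \sum_{i > r} \sigma_i(\tau)^2
```

The minimum is attained by the truncated SVD. The theorem bounds reconstruction error only; whether a small error also preserves task accuracy is the empirical part of the assumption.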

pith-pipeline@v0.9.0 · 5676 in / 1243 out tokens · 36864 ms · 2026-05-20T15:12:11.843244+00:00 · methodology


