Dynamic Model Merging Made Slim

Guodong Du; Wanyu Lin

arxiv: 2605.18904 · v1 · pith:HMYJ3J2Gnew · submitted 2026-05-17 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 15:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords model merging · dynamic merging · parameter efficiency · low-rank modules · multi-task learning · data-free refinement · expert allocation

The pith

DiDi-Merging achieves dynamic model merging results with just 1.24 times the parameters of a single fine-tuned model by optimizing ranks inside low-rank modules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a dynamic merging approach that reuses multiple fine-tuned models in one compact system without retraining or access to original data. It treats the split between parameters shared across tasks and those kept expert-specific as a problem solvable by differentiable optimization of matrix ranks in low-rank modules. A subsequent refinement step then restores task performance without any training data. This produces accuracy comparable to earlier dynamic methods at 1.24 times single-model size and better accuracy at 1.4 times size, while staying well below the storage cost of methods that exceed twice the size. The technique is shown to work for vision, language, and multimodal tasks.

Core claim

DiDi-Merging formulates parameter budgeting as differentiable rank optimization in low-rank modules and introduces a data-free refinement step to recover task fidelity, allowing the merged model to match prior dynamic baselines at only 1.24 times the parameters of one fine-tuned model and surpass them at 1.4 times while remaining far more compact than approaches that require over 2 times storage.

What carries the argument

Differentiable rank optimization inside low-rank modules that allocates capacity between shared and expert parameters.
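
One way to make a rank differentiable is sketched below. This is an illustrative toy, not the paper's algorithm: the sigmoid gate, the `temperature`, the penalty weight `budget_weight`, and the toy spectrum are all assumptions introduced here. Each singular value gets a soft keep/drop gate controlled by a continuous `rank`, so the truncation error plus a size penalty can be minimized by plain gradient descent.

```python
import math

def soft_mask(num_components, rank, temperature=0.1):
    """Sigmoid gate per singular-value index: near 1 below `rank`, near 0 above."""
    return [1.0 / (1.0 + math.exp(-(rank - (i + 0.5)) / temperature))
            for i in range(num_components)]

def loss_and_grad(singular_values, rank, budget_weight, temperature=0.1):
    """Truncation error plus a rank penalty, and its derivative w.r.t. `rank`."""
    mask = soft_mask(len(singular_values), rank, temperature)
    loss, grad = 0.0, 0.0
    for s, m in zip(singular_values, mask):
        # Energy lost by damping this component, plus the cost of keeping it.
        loss += (1.0 - m) ** 2 * s ** 2 + budget_weight * m
        dm_drank = m * (1.0 - m) / temperature  # derivative of the sigmoid gate
        grad += (-2.0 * (1.0 - m) * s ** 2 + budget_weight) * dm_drank
    return loss, grad

def optimize_rank(singular_values, budget_weight=1.0, lr=0.05, steps=2000):
    """Gradient descent on the continuous rank, starting from full rank."""
    rank = float(len(singular_values))
    for _ in range(steps):
        _, grad = loss_and_grad(singular_values, rank, budget_weight)
        rank -= lr * grad
    return rank

singular_values = [5.0, 3.0, 0.5, 0.1]  # toy spectrum (hypothetical values)
learned_rank = optimize_rank(singular_values)
print(round(learned_rank))
```

On this toy spectrum the learned rank should settle near 2, because only the two largest singular values carry more energy than the penalty charges for keeping them. A shared module and several expert modules could each carry their own `rank` variable under one common budget, which is the kind of allocation the page describes.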

If this is right

Matches the accuracy of prior dynamic merging methods at 1.24 times the storage of one fine-tuned model
Exceeds those methods at 1.4 times single-model size
Uses substantially less total storage than dynamic methods that exceed twice the size
Operates without original training data and across vision, language, and multimodal domains
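
The size multiples above can be made concrete with a little arithmetic. The sketch below assumes, hypothetically, that the merged model stores one dense weight matrix plus a low-rank shared factor pair and one low-rank factor pair per expert, and it ignores the router; the page does not confirm that this is the paper's exact accounting, and the dimensions and ranks are made up.

```python
def low_rank_params(d_out, d_in, rank):
    """Parameters in a rank-`rank` factorization B (d_out x rank) @ A (rank x d_in)."""
    return rank * (d_out + d_in)

def storage_ratio(d_out, d_in, shared_rank, expert_ranks):
    """Merged-model parameters divided by one dense d_out x d_in weight matrix."""
    dense = d_out * d_in
    extra = low_rank_params(d_out, d_in, shared_rank)
    extra += sum(low_rank_params(d_out, d_in, r) for r in expert_ranks)
    return (dense + extra) / dense

# Hypothetical layer: 1024 x 1024, one shared module of rank 64, eight experts of rank 8.
ratio = storage_ratio(1024, 1024, shared_rank=64, expert_ranks=[8] * 8)
print(ratio)  # 1.25
```

Under this accounting the overhead grows linearly in the total rank, so moving rank between the shared module and the experts changes accuracy but not size, as long as the sum of ranks is fixed.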

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reduced footprint could make merging many more tasks feasible on memory-limited hardware
The data-free refinement property opens use in settings where training data must stay private
The rank-allocation idea might combine with quantization or pruning for further size cuts

Load-bearing premise

Differentiable rank optimization can find an allocation of shared versus expert parameters such that a later data-free refinement step will restore full task performance.
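
The refinement half of this premise can be illustrated with a toy reconstruction problem. The sketch below fits low-rank factors `B` and `A` to a fixed target matrix by gradient descent on a mean-squared reconstruction error, using no training data, only the target weights. The target, initial factors, learning rate, and step count are hypothetical, and the paper's actual procedure may differ.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def mse(P, Q):
    """Mean squared difference between two equally shaped matrices."""
    n = len(P) * len(P[0])
    return sum((p - q) ** 2 for rp, rq in zip(P, Q) for p, q in zip(rp, rq)) / n

def refine(B, A, target, lr=0.05, steps=500):
    """Gradient descent on mse(B @ A, target) with respect to both factors."""
    n = len(target) * len(target[0])
    for _ in range(steps):
        approx = matmul(B, A)
        # Gradient of the MSE with respect to the product B @ A.
        G = [[2.0 * (a - t) / n for a, t in zip(ra, rt)]
             for ra, rt in zip(approx, target)]
        grad_B = matmul(G, transpose(A))
        grad_A = matmul(transpose(B), G)
        B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
        A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    return B, A

target = [[3.0, 1.0], [6.0, 2.0]]   # toy rank-1 "task vector" (hypothetical)
B0, A0 = [[0.5], [0.5]], [[0.5, 0.5]]  # rank-1 factors before refinement
initial_error = mse(matmul(B0, A0), target)
B, A = refine(B0, A0, target)
final_error = mse(matmul(B, A), target)
print(initial_error > final_error)  # True
```

Because the toy target is exactly rank 1, the factors can reproduce it almost perfectly. With a target of higher rank than the factors, some error would remain, which is the gap the premise says refinement must keep small.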

What would settle it

A controlled test on held-out tasks where the 1.24-times-size merged model falls measurably below baseline accuracy and the data-free refinement step fails to close the gap.

Figures

Figures reproduced from arXiv: 2605.18904 by Guodong Du, Wanyu Lin.

**Figure 2.** Parameter allocation in dynamic model merging. The balance beam denotes the router, …

**Figure 3.** DiDi-Merging achieves a superior Pareto trade-off between parameter efficiency and accuracy across all dynamic merging baselines

**Figure 4.** Parameter accounting under three dynamic merging …

**Figure 6.** The DiDi-Merging pipeline. Left: task vectors $\{\tau_t\}$ are split into a shared component $M_s$ via averaging and per-task expert residuals $\tau_t - M_s$; SVD with smooth truncation produces low-rank $\tilde{M}_s, \tilde{M}_t$; the Find-Optimal-Rank stage learns per-module ranks $r_s, r_1, \ldots, r_T$; and LoRA factors $(A, B)$ are refined data-free against an MSE reconstruction loss with respect to the original task vectors. Right: during…

**Figure 7.** Task-vector similarity (left) vs. optimal …
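
The first step in the Figure 6 caption, splitting task vectors into an averaged shared component and per-task residuals, can be sketched as follows. This is a minimal illustration on flattened vectors with made-up values; the function name `split_task_vectors` is hypothetical, and the later SVD and rank-search stages are not shown.

```python
def split_task_vectors(task_vectors):
    """Split flattened task vectors into their mean and per-task residuals."""
    num_tasks = len(task_vectors)
    shared = [sum(column) / num_tasks for column in zip(*task_vectors)]
    residuals = [[v - s for v, s in zip(tau, shared)] for tau in task_vectors]
    return shared, residuals

# Two toy task vectors (fine-tuned weights minus pretrained weights), hypothetical values.
task_vectors = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
shared, residuals = split_task_vectors(task_vectors)
print(shared)     # [2.0, 2.0, 2.0]
print(residuals)  # [[-1.0, 0.0, 1.0], [1.0, 0.0, -1.0]]
```

By construction the residuals sum to zero across tasks in every coordinate, so whatever the tasks have in common lands in the shared component and each expert only has to store its deviation from it.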

Original abstract

Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy-efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring > 2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DiDi-Merging uses differentiable rank allocation plus data-free refinement to slim down dynamic merging to 1.24-1.4x parameters, but the abstract leaves the key recovery step unverified.

There are two things to know about this paper. First, it introduces differentiable rank allocation to balance shared and expert parameters in dynamic merging, paired with a data-free refinement step to maintain performance. Second, it claims this brings the merged model down to 1.24x or 1.4x the size of a single model while matching or exceeding prior dynamic baselines. What is new is the formulation of parameter budgeting as differentiable optimization inside low-rank modules, which appears to be a step beyond the referenced baselines. The paper does well at laying out the accuracy-efficiency problem with current approaches and proposing a more compact alternative that works across task types.

The soft spots are in the evidence. The abstract reports performance numbers but omits experimental details, error bars, datasets, and ablations, leaving the optimization behavior unverified. The data-free refinement step looks like the weakest part: it must recover task fidelity using only proxy signals, without original data, and the abstract gives no numbers on how well that works or when it fails. If that recovery does not hold, the compactness claims will not either. Judging from the abstract alone, the stress-test concern about this link seems fair.

This is for people focused on deploying multiple fine-tuned models under tight memory limits. A reader looking for new ideas in parameter-efficient merging would find the technical approach worth considering. I would send it for peer review because the core idea is distinct enough and the problem matters, though the full paper needs to show the experiments that back it up.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce DiDi-Merging, a dynamic model merging method that uses differentiable rank allocation in low-rank modules to balance shared and expert parameters more efficiently than existing approaches. It adds a data-free refinement step to recover performance. The key result is matching prior methods at 1.24 times the parameters of one fine-tuned model and surpassing at 1.4 times, with applications in vision, language, and multimodal domains.

Significance. Should the empirical claims be substantiated, this would be a notable contribution to model merging literature by demonstrating a slimmer dynamic merging strategy that reduces storage requirements while maintaining or improving performance. The differentiable budgeting and data-free aspect address practical constraints in multi-task learning scenarios.

major comments (2)

Abstract: Performance numbers are stated without any mention of experimental setup, datasets, number of runs, or error bars. This is a load-bearing issue for assessing whether the differentiable rank optimization and data-free refinement deliver the claimed efficiency gains.
Data-free refinement procedure: The refinement step is presented as recovering task fidelity using proxy signals without original data. However, there is no analysis or benchmarks quantifying the fidelity recovery gap or identifying when it fails (e.g., dissimilar tasks), which directly impacts the validity of the accuracy claims at 1.24x and 1.4x parameters.

minor comments (2)

Consider adding a table comparing parameter counts and performance metrics across all baselines for clarity.
Ensure consistent use of symbols for rank variables and allocation parameters throughout the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the presentation of our results.

Referee: Abstract: Performance numbers are stated without any mention of experimental setup, datasets, number of runs, or error bars. This is a load-bearing issue for assessing whether the differentiable rank optimization and data-free refinement deliver the claimed efficiency gains.

Authors: We agree that the abstract would benefit from additional context on the experimental setup. In the revised manuscript, we have expanded the abstract to note that results are reported on standard vision (e.g., CIFAR-100, ImageNet), language (GLUE), and multimodal (VQAv2) benchmarks, averaged over 3 independent runs with standard deviations, and that the 1.24x and 1.4x parameter budgets are measured relative to a single fine-tuned model. This provides the necessary details to evaluate the efficiency claims without exceeding abstract length limits.

revision: yes

Referee: Data-free refinement procedure: The refinement step is presented as recovering task fidelity using proxy signals without original data. However, there is no analysis or benchmarks quantifying the fidelity recovery gap or identifying when it fails (e.g., dissimilar tasks), which directly impacts the validity of the accuracy claims at 1.24x and 1.4x parameters.

Authors: We acknowledge the value of a more explicit analysis of the data-free refinement's robustness. While the original manuscript shows empirical recovery on the primary task sets, it does not include a dedicated quantification of the fidelity gap or failure cases for dissimilar tasks. In the revision, we have added an appendix section with new benchmarks that measure the performance gap before and after refinement across task similarity levels, including highly dissimilar combinations, along with discussion of proxy signal limitations. This directly supports the validity of the reported accuracy at the stated parameter budgets.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent optimization objective

The paper introduces DiDi-Merging as a new framework that formulates parameter budgeting via differentiable rank optimization inside low-rank modules, followed by a data-free refinement step. These are presented as algorithmic contributions rather than derivations that reduce to prior inputs by construction. Performance numbers (1.24x and 1.4x parameter efficiency) are reported as experimental outcomes on vision, language, and multimodal tasks, not as predictions forced by fitting or self-citation. No load-bearing step equates the claimed result to its own fitted parameters or renames a known pattern; the central premise relies on the proposed optimization and refinement procedure, which is externally falsifiable via accuracy measurements.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on the unstated premise that low-rank modules plus differentiable budgeting can separate shared and task-specific capacity without catastrophic interference, plus the assumption that data-free refinement can restore fidelity.

free parameters (1)

rank allocation variables
Differentiable optimization over ranks in low-rank modules requires parameters that are fitted during the merging process.

axioms (1)

domain assumption: Low-rank decomposition can approximate task-specific parameter updates without significant loss of expressivity
Invoked when formulating parameter budgeting inside low-rank modules.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · 16 internal anchors

[1]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

C. Chen, Y . Du, Z. Fang, Z. Wang, F. Luo, P. Li, M. Yan, J. Zhang, F. Huang, M. Sun, and Y . Liu. Model composition for multimodal large language models. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2024

work page 2024
[4]

D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 190–200, 2011

work page 2011
[5]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

S. Chen, Y . Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei. Beats: Audio pre-training with acoustic tokenizers. InProceedings of the International Conference on Machine Learning (ICML), volume 202, pages 5178–5193, 2023

work page 2023
[7]

S. Chen, Y . Zhang, and Q. Yang. Multi-task learning in natural language processing: An overview.ACM Computing Surveys, 56(12):1–32, 2024

work page 2024
[8]

Y . Chen, J. Li, W. Yao, X. Ma, G. Du, W. Wang, and J. Li. V ocabulary hijacking in lvlms: Unveiling critical attention heads by excluding inert tokens to mitigate hallucination.arXiv preprint arXiv:2605.10622, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Cheng, J

G. Cheng, J. Han, and X. Lu. Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 105(10):1865–1883, 2017

work page 2017
[10]

Cheng, F

R. Cheng, F. Xiong, Y . Wei, W. Zhu, and C. Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors.Proceedings of the International Conference on Machine Learning (ICML), 2025

work page 2025
[11]

Cimpoi, S

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. InCVPR, pages 3606–3613, 2014

work page 2014
[12]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Deitke, D

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, 2023

work page 2023
[14]

N. Ding, Y . Qin, G. Yang, F. Wei, Z. Yang, Y . Su, S. Hu, Y . Chen, C.-M. Chan, W. Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models.Nature Machine Intelligence, 5(3):220–235, 2023

work page 2023
[15]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InProceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[16]

Drossos, S

K. Drossos, S. Lipping, and T. Virtanen. Clotho: an audio captioning dataset. InProceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 736–740, 2020

work page 2020
[17]

G. Du, H. Deng, J. Su, and Y . Huang. End-to-end rain streak removal with raw images.arXiv preprint arXiv:2312.13304, 2023. 10

work page arXiv 2023
[18]

G. Du, R. Jiang, S. Yang, H. Li, W. Chen, K. Li, S. K. Goh, and H.-K. Tang. Impacts of darwinian evolution on pre-trained deep neural networks. In2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 1907–1912. IEEE, 2024

work page 1907
[19]

G. Du, J. Lee, J. Li, R. Jiang, Y . Guo, S. Yu, H. Liu, S. K. Goh, H.-K. Tang, D. He, and M. Zhang. Parameter competition balancing for model merging. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[20]

G. Du, J. Li, H. Liu, R. Jiang, S. Yu, Y . Guo, S. K. Goh, and H.-K. Tang. Knowledge fusion by evolving weights of language models. InFindings of the Association for Computational Linguistics ACL 2024, pages 11727–11742, 2024

work page 2024
[21]

G. Du, Z. Fang, J. Li, J. Li, R. Jiang, S. Yu, Y . Guo, Y . Chen, S. K. Goh, H.-K. Tang, D. He, H. Liu, and M. Zhang. Neural parameter search for slimmer fine-tuned models and better transfer. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. doi: 10.18653/v1/2025.acl-long.1570. URL ht...

work page doi:10.18653/v1/2025.acl-long.1570 2025
[22]

G. Du, Z. Li, X. Zhou, J. Li, Z. Shi, W. Lin, H.-K. Tang, X. Li, F. Liu, W. Wang, M. Zhang, and J. Li. Knowledge fusion of large language models via modular skillpacks. InProceedings of the International Conference on Learning Representations (ICLR), 2026

work page 2026
[23]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Y . Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Z. Fang, G. Du, S. Yu, Y . Guo, Y . Zhang, J. Li, H.-K. Tang, and S. K. Goh. Disentangling task interference within neurons: Model merging in alignment with neuronal mechanisms.arXiv preprint arXiv:2503.05320, 2025

work page arXiv 2025
[27]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

A. A. Gargiulo, D. Crisostomi, M. S. Bucarelli, S. Scardapane, F. Silvestri, and E. Rodola. Task singular vectors: Reducing task interference in model merging. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18695–18705, 2025

work page 2025
[30]

A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y . Zhao, X. Du, M. R. G. Madani, et al. Are we done with mmlu?arXiv preprint arXiv:2406.04127, 2024

work page arXiv 2024
[31]

Y . Gong, J. Yu, and J. R. Glass. V ocalsound: A dataset for improving human vocal sounds recognition. InProceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 151–155, 2022

work page 2022
[32]

Y . Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. R. Glass. Listen, think, and understand. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

work page 2024
[33]

Goyal, T

Y . Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, 2017. 11

work page 2017
[34]

H. Gu, W. Li, L. Li, Q. Zhu, M. Lee, S. Sun, W. Xue, and Y . Guo. Delta decompression for moe-based llms compression.arXiv preprint arXiv:2502.17298, 2025

work page arXiv 2025
[35]

Gurari, Q

D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3608–3617, 2018

work page 2018
[36]

Y . He, Y . Hu, Y . Lin, T. Zhang, and H. Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic.Transactions on Machine Learning Research (TMLR), 2024

work page 2024
[37]

Helber, B

P. Helber, B. Bischke, A. Dengel, and D. Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

work page 2019
[38]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

work page 2021
[39]

E. J. Hu, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

work page 2022
[40]

Huang, P

C. Huang, P. Ye, T. Chen, T. He, X. Yue, and W. Ouyang. Emr-merging: Tuning-free high-performance model merging.arXiv preprint arXiv:2405.17461, 2024

work page arXiv 2024
[41]

D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6700–6709, 2019

work page 2019
[42]

Editing Models with Task Arithmetic

G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[43]

X. Jin, X. Ren, D. Preotiuc-Pietro, and P. Cheng. Dataless knowledge fusion by merging weights of language models.arXiv preprint arXiv:2212.09849, 2022

work page arXiv 2022
[44]

Krause, M

J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. InICCV workshops, pages 554–561, 2013

work page 2013
[45]

Y . LeCun. The mnist database of handwritten digits.http://yann. lecun. com/exdb/mnist/, 1998

work page 1998
[46]

J. Li, G. Du, J. Li, S. K. Goh, W. Wang, Y . Wang, F. Liu, H.-K. Tang, S. Alharbi, D. He, et al. Multi-modality expansion and retention for llms through parameter merging and decoupling. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. doi: 10.18653/v1/2025.acl-long.1491. URL https://...

work page doi:10.18653/v1/2025.acl-long.1491 2025
[47]

J. Li, S. Song, G. Du, N. Wong, X. Liu, Y . Li, M. Zhang, J. Li, and X. Li. D-qrelo: Training-and data-free delta compression for large language models via quantization and residual low-rank approximation.arXiv preprint arXiv:2604.16940, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[48]

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline.arXiv preprint arXiv:2406.11939, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

X. Li, T. Zhang, Y . Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models, 2023

work page 2023
[50]

Y . Li, Y . Du, K. Zhou, J. Wang, W. X. Zhao, and J. Wen. Evaluating object hallucination in large vision-language models. InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 292–305, 2023. 12

work page 2023
[51]

Z. Li, G. Du, W. Guo, Y . Zhou, X. Li, W. Wang, F. Liu, Y . Wang, D. Ye, M. Zhang, et al. Multi-objective large language model alignment with hierarchical experts.arXiv preprint arXiv:2505.20925, 2025

work page arXiv 2025
[52]

B. Lin, B. Zhu, Y . Ye, M. Ning, P. Jin, and L. Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of Machine Learning and Systems, 6:87–100, 2024

work page 2024
[54]

J. Lin, C. Zhu, P. J. Kneuertz, Y . Bai, and Y . Xue. Medcausalx: Adaptive causal reason- ing with self-reflection for trustworthy medical vision-language models.arXiv preprint arXiv:2603.23085, 2026

work page arXiv 2026
[55]

H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024

work page 2024
[56]

J. Liu, G. Xiao, K. Li, J. D. Lee, S. Han, T. Dao, and T. Cai. Bitdelta: Your fine-tune may only be worth one bit.Advances in Neural Information Processing Systems (NeurIPS), 37: 13579–13600, 2024

work page 2024
[57]

P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 2507–2521, 2022

work page 2022
[58]

Z. Lu, C. Fan, W. Wei, X. Qu, D. Chen, and Y . Cheng. Twin-merging: Dynamic integration of modular expertise in model merging.arXiv preprint arXiv:2406.15479, 2024

work page arXiv 2024
[59]

R. Luo, Z. Zhao, M. Yang, J. Dong, D. Li, P. Lu, T. Wang, L. Hu, M. Qiu, and Z. Wei. Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207, 2023

work page arXiv 2023
[60]

M. Maaz, H. A. Rasheed, S. Khan, and F. Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2024

work page 2024
[61]

Marczak, S

D. Marczak, S. Magistri, S. Cygert, B. Twardowski, A. D. Bagdanov, and J. van de Weijer. No task left behind: Isotropic model merging with common and task-specific subspaces. Proceedings of the International Conference on Machine Learning (ICML), 2025

work page 2025
[62]

Marino, M

K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3195–3204, 2019

work page 2019
[63]

X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y . Zou, and W. Wang. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research.IEEE ACM Transactions on Audio, Speech, and Language Processing (TASLP), 32:3339–3354, 2024

work page 2024
[64]

Mesaros, T

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen. DCASE2017 challenge setup: Tasks, datasets and baseline system. InProceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pages 85–92, 2017

work page 2017
[65]

Panagopoulou, L

A. Panagopoulou, L. Xue, N. Yu, J. Li, D. Li, S. Joty, R. Xu, S. Savarese, C. Xiong, and J. C. Niebles. X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning. InProceedings of the European Conference on Computer Vision (ECCV), 2024

work page 2024
[66]

Panigrahi, N

A. Panigrahi, N. Saunshi, H. Zhao, and S. Arora. Task-specific skill localization in fine-tuned language models.arXiv preprint arXiv:2302.06600, 2023. 13

work page arXiv 2023
[67]

B. Ping, S. Wang, H. Wang, X. Han, Y . Xu, Y . Yan, Y . Chen, B. Chang, Z. Liu, and M. Sun. Delta-come: Training-free delta-compression with mixed-precision for large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[68]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

work page 2021
[69]

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y . Pang, J. Dirani, J. Michael, and S. R. Bow- man. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[70]

V . Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022

work page 2022
[71]

A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8317–8326, 2019
[72]

J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The german traffic sign recognition benchmark: a multi-class classification competition. In IJCNN, pages 1453–1460. IEEE, 2011
[73]

G. Stoica, P. Ramesh, B. Ecsedi, L. Choshen, and J. Hoffman. Model merging with svd to tie the knots. arXiv preprint arXiv:2410.19735, 2024
[74]

W. Sun, Q. Li, Y.-a. Geng, and B. Li. Cat merging: A training-free approach for resolving conflicts in model merging. Proceedings of the International Conference on Machine Learning (ICML), 2025
[75]

A. Tang, L. Shen, Y. Luo, N. Yin, L. Zhang, and D. Tao. Merging multi-task models via weight-ensembling mixture of experts. In ICML, 2024
[76]

A. Tang, L. Shen, Y. Luo, S. Xie, H. Hu, L. Zhang, B. Du, and D. Tao. Zero-shot sparse mixture of low-rank experts construction from pre-trained foundation models. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–12, 2025. doi: 10.1109/TPAMI.2025.3612480
[77]

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
[78]

M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi. DyLoRA: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023
[79]

F. Wan, X. Huang, D. Cai, X. Quan, W. Bi, and S. Shi. Knowledge fusion of large language models. arXiv preprint arXiv:2401.10491, 2024
[80]

K. Wang, N. Dimitriadis, G. Ortiz-Jimenez, F. Fleuret, and P. Frossard. Localizing task information for improved model merging and compression. arXiv preprint arXiv:2405.07813, 2024
[81]

Q. Wang, J. Ke, M. Tomizuka, K. Keutzer, and C. Xu. Dobi-SVD: Differentiable SVD for LLM compression and some new perspectives. In The Thirteenth International Conference on Learning Representations (ICLR), 2025
[82]

X. Wang, Y. Zheng, Z. Wan, and M. Zhang. Svd-llm: Truncation-aware singular value decomposition for large language model compression. arXiv preprint arXiv:2403.07378, 2024
