5% > 100%: Flatness Preference is All You Need for Multimodal Parameter-Efficient Fine-Tuning

Can Lin; Hangjie Yuan; Pengfei Zhang; Tao Feng; Yifan Zhu; Zhonghong Ou; Zixiang Zhao

arxiv: 2606.10488 · v1 · pith:7F7NZD4Jnew · submitted 2026-06-09 · 💻 cs.CV

Yifan Zhu, Can Lin, Hangjie Yuan, Zixiang Zhao, Pengfei Zhang, Tao Feng, Zhonghong Ou

Pith reviewed 2026-06-27 14:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords parameter-efficient fine-tuning · multimodal · flatness preference · generalization · PEFT · optimization · sharp dimensions

The pith

A small fraction of sharp dimensions dominates generalization in multimodal PEFT methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that various parameter-efficient fine-tuning methods for large multimodal models share a flatness preference, where generalization is controlled by only a small fraction of sharp dimensions rather than the full set. A sympathetic reader would care because this points to a simpler path for better adaptation: selectively flattening those critical dimensions instead of optimizing everything. The authors introduce Flatness Preference Optimization (FlatPO) to target and flatten the sharp dimensions. Experiments across multiple PEFT approaches on multimodal tasks show improved generalization from this targeted change.

Core claim

Various PEFT methods exhibit a flatness preference where a small fraction of sharp dimensions dominates the generalization of PEFT. Flatness Preference Optimization (FlatPO) flattens these key sharpness dimensions, leading various PEFTs toward better generalization.

What carries the argument

Flatness preference, the property that a small fraction of sharp dimensions dominates generalization in PEFT, addressed via Flatness Preference Optimization (FlatPO) to selectively flatten them.
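To make the idea of selectively flattening a few sharp dimensions concrete, here is a minimal sketch of one possible reading: a SAM-style two-pass update whose perturbation is restricted to the top fraction of coordinates ranked by gradient magnitude. This is not the authors' FlatPO algorithm, whose details are not given on this page (their code is linked in the abstract). The function names, the use of gradient magnitude as the sharpness score, and the hyperparameters `lr`, `rho`, and `fraction` are all assumptions made for illustration.

```python
import math


def top_fraction_mask(scores, fraction=0.05):
    """Return a 0/1 mask marking the highest-scoring fraction of coordinates."""
    k = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:k])
    return [1.0 if i in chosen else 0.0 for i in range(len(scores))]


def masked_sam_step(w, grad_fn, lr=0.1, rho=0.05, fraction=0.05):
    """One SAM-style step whose perturbation touches only the masked coordinates.

    Hypothetical illustration: sharpness is approximated by |gradient|, which is
    an assumption of this sketch and not a definition taken from the paper.
    """
    g = grad_fn(w)
    mask = top_fraction_mask([abs(x) for x in g], fraction)
    masked = [m * x for m, x in zip(mask, g)]
    norm = math.sqrt(sum(x * x for x in masked)) or 1.0
    # Ascent step of size rho, applied along the selected coordinates only.
    w_adv = [wi + rho * x / norm for wi, x in zip(w, masked)]
    # Descent step using the gradient evaluated at the perturbed point.
    g_adv = grad_fn(w_adv)
    new_w = [wi - lr * gi for wi, gi in zip(w, g_adv)]
    return new_w, mask
```

On a toy quadratic with one high-curvature coordinate and many low-curvature ones, the mask picks out the high-curvature coordinate and the perturbation leaves the rest untouched, which is the behavior the core claim describes in words.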

If this is right

Various existing PEFT methods achieve better generalization when FlatPO flattens their identified sharp dimensions.
A small fraction of dimensions, as few as roughly 5 percent, suffices for better performance than optimizing the full set.
The approach applies across different PEFT techniques on multimodal downstream tasks.
Generalization improves by focusing optimization on flatness of the dominant sharp dimensions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

PEFT techniques could be redesigned to detect and prioritize sharp dimensions during initial adaptation.
The same preference might appear in fine-tuning outside multimodal settings, such as in language or vision-only tasks.
Scaling experiments on larger models would test whether the controlling fraction remains small.

Load-bearing premise

The flatness preference in the sharp dimensions is what causes better generalization, and selectively flattening them improves results across PEFT methods without side effects.

What would settle it

Applying FlatPO to a held-out multimodal dataset and observing no gain or a drop in generalization performance compared to standard PEFT would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.10488 by Can Lin, Hangjie Yuan, Pengfei Zhang, Tao Feng, Yifan Zhu, Zhonghong Ou, Zixiang Zhao.

**Figure 2.** Left shows the gradient distribution of the parameters in the same layer for LoRA and prefix tuning. In LoRA, A is the down-projection matrix and B is the up-projection matrix. The gradient values for each method in this layer are primarily concentrated around zero, indicating the presence of many flat dimensions in the loss landscape. Right illustrates the variation trends of the average gradient across d…

**Figure 4.** Loss landscapes of LoRA, Prefix Tuning, and Prompt Tuning using …

**Figure 5.** Qualitative comparison on VQA tasks with LoRA using different optimizers: Base, SAM, GAM, and our FlatPO. The examples are selected as …

**Figure 6.** Gradient Contribution of LoRA and Prefix Tuning. At each step, we …
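The Figure 2 caption describes gradients that cluster around zero, with only a few large entries. A rough way to quantify that kind of concentration is sketched below: it reports the share of total gradient magnitude held by the top fraction of coordinates, and the share of coordinates whose magnitude falls below a small threshold. The function name, the 5 percent default, and the threshold `eps` are assumptions for illustration; the paper's own measurement procedure is not described on this page.

```python
def gradient_concentration(grads, fraction=0.05, eps=1e-3):
    """Summarize how concentrated a flat list of gradient values is.

    Returns (top_share, near_zero_share):
      top_share       - share of total |gradient| held by the top `fraction` of entries
      near_zero_share - share of entries with |gradient| below `eps`
    """
    if not grads:
        raise ValueError("grads must be non-empty")
    mags = sorted((abs(g) for g in grads), reverse=True)
    k = max(1, round(fraction * len(mags)))
    total = sum(mags)
    top_share = sum(mags[:k]) / total if total > 0 else 0.0
    near_zero_share = sum(1 for m in mags if m < eps) / len(mags)
    return top_share, near_zero_share
```

A top share close to 1 together with a high near-zero share would match the pattern the caption describes; a top share close to `fraction` would indicate gradients spread evenly across coordinates.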

Original abstract

Parameter-Efficient Fine-Tuning (PEFT) methods provide a streamlined and efficient tool for adapting large models to domain-specific multimodal downstream tasks. Although these methods proved their tangible effects in practice, their principal aspects remain under-explored. Therefore we remain curious about the underlying generalization mechanisms in various PEFT methods and how they can be further enhanced. In this paper, we reveal the flatness preference widely present in various PEFTs, where a small fraction of sharp dimensions dominates the generalization of PEFT. This finding suggests an appealing possibility: we may be satisfied with a better generalization by merely attending to this small fraction of sharp dimensions instead of all of them. Furthermore, we propose Flatness Preference Optimization (FlatPO) to flatten these key sharpness dimensions, leading various PEFTs toward better generalization. Extensive experiments demonstrate the effectiveness of our findings and the proposed method. Code is available at https://github.com/Can-Lin/FlatPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The flatness preference observation in multimodal PEFT is interesting, but the causal link to generalization gains rests on controls that the abstract does not confirm.

The main point is that this paper identifies a small fraction of sharp dimensions as the dominant factor in PEFT generalization for multimodal tasks and introduces FlatPO to flatten them.

The new element is the explicit framing of flatness preference as a widespread pattern across PEFT methods plus the targeted optimization procedure. They test the idea on several standard PEFT approaches and report gains, which is a concrete step beyond general flatness literature.

The experiments are described as extensive and the code is public, which helps reproducibility. That counts as a positive.

The soft spot is the missing evidence for causality. The central claim needs ablations showing that flattening random or non-sharp dimensions does not produce comparable improvements and that the identified dimensions remain stable across runs and tasks. The abstract gives no sign these checks were done, so the preference could still be correlational. Without them the practical recommendation to focus only on those dimensions is not yet secured.

This work is aimed at researchers tuning PEFT for vision-language models who want new optimization levers. A reader already familiar with flatness measures in optimization will see the most direct value.

I would send it for peer review. The idea is worth testing properly, and referees can verify whether the controls are present in the full manuscript.

Referee Report

2 major / 0 minor

Summary. The paper claims that PEFT methods for multimodal tasks exhibit a 'flatness preference' in which a small fraction (~5%) of sharp dimensions dominates generalization performance. It proposes Flatness Preference Optimization (FlatPO) to selectively flatten these dimensions, yielding better generalization across various PEFT methods, with the claim supported by extensive experiments.

Significance. If the causal relationship between the identified sharp dimensions and generalization holds after proper controls, the result could simplify PEFT by showing that focusing on a small subset of dimensions suffices, with potential efficiency gains. The empirical observation across multiple PEFT methods is noteworthy, but the absence of ablations for causality reduces the strength of the central claim.

major comments (2)

[Abstract / Experiments] The central claim that a small fraction of sharp dimensions causally dominates PEFT generalization requires controls that are not described. Experiments must demonstrate that flattening random or non-sharp dimensions does not produce comparable gains; without such ablations the observed preference remains correlational rather than causal.
[Experiments] Stability of the identified sharp dimensions across random seeds, tasks, and PEFT methods is not addressed. If the dimensions vary substantially, the proposed FlatPO method would require method-specific retuning, undermining the claim that the preference is a general property.
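The control requested in the first major comment comes down to choosing three equally sized sets of dimensions: the sharpest, the least sharp, and a random draw. The sketch below shows one way to build those sets from a per-dimension sharpness score. The function name, the strategy labels, and the idea of a precomputed score list are hypothetical; how the paper scores sharpness is not stated on this page.

```python
import random


def select_dimensions(scores, fraction=0.05, strategy="sharpest", seed=0):
    """Pick a set of dimension indices for a flattening ablation.

    strategy:
      "sharpest" - highest-scoring dimensions (the targeted condition)
      "flattest" - lowest-scoring dimensions (non-sharp control)
      "random"   - uniformly sampled dimensions (random control)
    All three strategies return the same number of indices.
    """
    n = len(scores)
    k = max(1, round(fraction * n))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    if strategy == "sharpest":
        return set(order[:k])
    if strategy == "flattest":
        return set(order[-k:])
    if strategy == "random":
        return set(random.Random(seed).sample(range(n), k))
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Keeping the set size identical across the three conditions means any difference in downstream performance can be attributed to which dimensions were flattened, not to how many.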

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The concerns about establishing causality and assessing stability are valid and will be addressed through additional experiments in the revised manuscript. Below we respond point by point.

Referee: [Abstract / Experiments] The central claim that a small fraction of sharp dimensions causally dominates PEFT generalization requires controls that are not described. Experiments must demonstrate that flattening random or non-sharp dimensions does not produce comparable gains; without such ablations the observed preference remains correlational rather than causal.

Authors: We agree that explicit controls are required to move from correlation to causation. In the revision we will add ablation experiments that (i) randomly select and flatten an equal number of dimensions and (ii) flatten the least-sharp dimensions, then compare generalization performance against FlatPO on the same multimodal tasks and PEFT backbones. These results will be reported in a new subsection of the Experiments section together with statistical significance tests. revision: yes
Referee: [Experiments] Stability of the identified sharp dimensions across random seeds, tasks, and PEFT methods is not addressed. If the dimensions vary substantially, the proposed FlatPO method would require method-specific retuning, undermining the claim that the preference is a general property.

Authors: We will include a new stability analysis in the revised manuscript. For each PEFT method we will compute the Jaccard overlap of the top-5% sharp dimensions across five random seeds, across three distinct multimodal tasks, and across the different PEFT families. The results will be presented in a dedicated table; if overlap is high we will also report the performance of a single set of dimensions transferred across settings to quantify the practical generality of FlatPO. revision: yes
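The stability analysis promised in the second response relies on the Jaccard overlap between sets of selected dimensions. For two index sets A and B, it is |A ∩ B| / |A ∪ B|. A minimal sketch follows; the helper names and the choice to average over all pairs of runs are assumptions, since the rebuttal does not specify how overlaps across seeds would be aggregated.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two index collections."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(index_sets):
    """Average Jaccard overlap over every unordered pair of index sets."""
    pairs = list(combinations(index_sets, 2))
    if not pairs:
        raise ValueError("need at least two index sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

With five seeds, `mean_pairwise_jaccard` averages over ten pairs. A value near 1 would mean the same dimensions are selected regardless of seed, while a value near the chance level for 5 percent subsets would suggest that the selected set is largely run-specific.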

Circularity Check

0 steps flagged

No circularity: empirical observation and method proposal are independent of fitted inputs

The paper's central contribution is an empirical finding of flatness preference in PEFT methods followed by the proposal of FlatPO to act on identified sharp dimensions. No derivation chain, equations, or self-citation load-bearing steps are present that reduce a claimed prediction or result to its own inputs by construction. The abstract and description frame the work as observational discovery plus a new optimization technique validated by experiments, with no evidence of self-definitional fits, renamed known results, or uniqueness theorems imported from prior author work. This is a standard empirical paper whose claims rest on external validation rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim depends on the empirical existence of a small set of sharp dimensions whose flattening produces the reported gains; no explicit free parameters, axioms, or invented entities are stated in the abstract.


[1] [1]

Parameter-efficient fine-tuning for large models: A comprehensive survey,

Z. Han, C. Gao, J. Liu, J. Zhang, and S. Q. Zhang, “Parameter-efficient fine-tuning for large models: A comprehensive survey,”arXiv preprint arXiv:2403.14608, 2024

Pith/arXiv arXiv 2024

[2] [2]

Parameter-efficient fine-tuning in large language models: a survey of methodologies,

L. Wang, S. Chen, L. Jiang, S. Pan, R. Cai, S. Yang, and F. Yang, “Parameter-efficient fine-tuning in large language models: a survey of methodologies,”Artificial Intelligence Review, vol. 58, no. 8, p. 227, 2025

2025

[3] [3]

Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,

L. Xu, H. Xie, S. J. Qin, X. Tao, and F. L. Wang, “Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

2026

[4] [4]

Lora: Low-rank adaptation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021

Pith/arXiv arXiv 2021

[5] [5]

The power of scale for parameter-efficient prompt tuning,

B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,”arXiv preprint arXiv:2104.08691, 2021

Pith/arXiv arXiv 2021

[6] [6]

Llm-adapters: An adapter family for parameter-efficient fine- tuning of large language models,

Z. Hu, L. Wang, Y . Lan, W. Xu, E.-P. Lim, L. Bing, X. Xu, S. Poria, and R. Lee, “Llm-adapters: An adapter family for parameter-efficient fine- tuning of large language models,” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 5254– 5276. IEEE TRANSACTIONS ON MULTIMEDIA, VOL. XX, NO. XX, XX 2026 10

2023

[7] [7]

Dual modality prompt tuning for vision-language pre-trained model,

Y . Xing, Q. Wu, D. Cheng, S. Zhang, G. Liang, P. Wang, and Y . Zhang, “Dual modality prompt tuning for vision-language pre-trained model,” IEEE Transactions on Multimedia, vol. 26, pp. 2056–2068, 2024

2056

[8] [8]

Unleash the power of vision-language models by visual attention prompt and multimodal interaction,

W. Zhang, L. Wu, Z. Zhang, T. Yu, C. Ma, X. Jin, X. Yang, and W. Zeng, “Unleash the power of vision-language models by visual attention prompt and multimodal interaction,”IEEE Transactions on Multimedia, vol. 27, pp. 2399–2411, 2024

2024

[9] [9]

Unleashing the power of singular values for parameter-efficient fine- tuning of large pre-trained models,

C. Sun, J. Wei, Y . Wu, Y . Shi, S. He, Z. Ma, N. Xie, and Y . Yang, “Unleashing the power of singular values for parameter-efficient fine- tuning of large pre-trained models,”IEEE Transactions on Multimedia, 2026

2026

[10] [10]

Sharpness-aware minimization: General analysis and improved rates,

D. Oikonomou and N. Loizou, “Sharpness-aware minimization: General analysis and improved rates,”arXiv preprint arXiv:2503.02225, 2025

arXiv 2025

[11] [11]

Bi-lora: Efficient sharpness-aware minimization for fine-tuning large-scale models,

Y . Liu, T. Li, Z. Huang, Z. Yang, and X. Huang, “Bi-lora: Efficient sharpness-aware minimization for fine-tuning large-scale models,”arXiv preprint arXiv:2508.19564, 2025

Pith/arXiv arXiv 2025

[12] [12]

Sparse is enough in fine-tuning pre-trained large language model,

W. Song, Z. Li, L. Zhang, H. Zhao, and B. Du, “Sparse is enough in fine-tuning pre-trained large language model,”International Conference on Machine Learning, 2024

2024

[13] [13]

Understanding pre-training and fine-tuning from loss landscape per- spectives,

H. Chen, Y . Dong, Z. Wei, Y . Huang, Y . Zhang, H. Su, and J. Zhu, “Understanding pre-training and fine-tuning from loss landscape per- spectives,”arXiv e-prints, pp. arXiv–2505, 2025

2025

[14] [14]

Towards understanding con- vergence and generalization of adamw,

P. Zhou, X. Xie, Z. Lin, and S. Yan, “Towards understanding con- vergence and generalization of adamw,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024

[15] [15]

Eflat-lora: Efficiently seeking flat minima for better generalization in fine-tuning large language models and beyond,

J. Deng, Q. Zhu, J. Pang, L. Yang, Z. Fu, and B. Zhang, “Eflat-lora: Efficiently seeking flat minima for better generalization in fine-tuning large language models and beyond,”arXiv e-prints, pp. arXiv–2508, 2025

2025

[16] T. Li, Z. He, Y. Li, Y. Wang, L. Shang, and X. Huang, “Flat-lora: Low-rank adaption over a flat loss landscape,” arXiv preprint arXiv:2409.14396, 2024

[17] M. Wang, J. Wang, H. He, Z. Wang, G. Huang, F. Xiong, Z. Li, L. Wu et al., “Improving generalization and convergence by enhancing implicit regularization,” Advances in Neural Information Processing Systems, vol. 37, pp. 118 701–118 744, 2024

[18] G. Lee, W. Jang, J. H. Kim, J. Jung, and S. Kim, “Domain generalization using large pretrained models with mixture-of-adapters,” arXiv preprint arXiv:2310.11031, 2023

[19] L. Chen, H. Huang, and M. Cheng, “Ptp: Boosting stability and performance of prompt tuning with perturbation-based regularizer,” arXiv preprint arXiv:2305.02423, 2023

[20] P. Saha, C. Rajbangshi, R. Goyal, M. Goyal, A. Deo, B. Roy, N. D. Singh, R. Goswami, and A. Das, “Grit–geometry-aware peft with k-fac preconditioning, fisher-guided reprojection, and dynamic rank adaptation,” arXiv preprint arXiv:2601.00231, 2026

[21] Z. Zhou, M. Wang, Y. Mao, B. Li, and J. Yan, “Sharpness-aware minimization efficiently selects flatter minima late in training,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 20 949–20 980

[22] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International conference on machine learning. PMLR, 2019, pp. 2790–2799

[23] Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, and J. Gao, “Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models,” arXiv preprint arXiv:2205.12410, vol. 1, no. 2, p. 4, 2022

[24] H. Lin, J. Cho, A. Zala, and M. Bansal, “Ctrl-adapter: An efficient and versatile framework for adapting diverse controls to any diffusion model,” arXiv preprint arXiv:2404.09967, 2024

[25] M. Li, P. Ye, Y. Huang, L. Zhang, T. Chen, T. He, J. Fan, and W. Ouyang, “Adapter-x: A novel general parameter-efficient fine-tuning framework for vision,” arXiv preprint arXiv:2406.03051, 2024

[26] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190, 2021

[27] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in European Conference on Computer Vision. Springer, 2022, pp. 709–727

[28] Y. Feng, Z. Tian, Y. Zhu, Z. Han, H. Luo, G. Zhang, and M. Song, “Cp-prompt: Composition-based cross-modal prompting for domain-incremental continual learning,” in Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024 - 1 November 2024. ACM, 2024, pp. 2729–2738

[29] Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, and T. Zhao, “Adalora: Adaptive budget allocation for parameter-efficient fine-tuning,” arXiv preprint arXiv:2303.10512, 2023

[30] S. Hayou, N. Ghosh, and B. Yu, “Lora+: Efficient low rank adaptation of large models,” arXiv preprint arXiv:2402.12354, 2024

[31] Q. Liu, X. Wu, X. Zhao, Y. Zhu, D. Xu, F. Tian, and Y. Zheng, “Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications,” arXiv preprint arXiv:2310.18339, 2023

[32] S. Chen, Z. Jie, and L. Ma, “Llava-mole: Sparse mixture of lora experts for mitigating data conflicts in instruction finetuning mllms,” arXiv preprint arXiv:2401.16160, 2024

[33] M. U. Khattak, S. T. Wasim, M. Naseer, S. Khan, M.-H. Yang, and F. S. Khan, “Self-regulating prompts: Foundational model adaptation without forgetting,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 190–15 200

[34] S. Roy and A. Etemad, “Consistency-guided prompt learning for vision-language models,” arXiv preprint arXiv:2306.01195, 2023

[35] A. Bian, W. Li, H. Yuan, M. Wang, Z. Zhao, A. Lu, P. Ji, T. Feng et al., “Make continual learning stronger via c-flat,” Advances in Neural Information Processing Systems, vol. 37, pp. 7608–7630, 2024

[36] Y. Ni, S. Zhang, and P. Koniusz, “Pace: Marrying generalization in parameter-efficient fine-tuning with consistency regularization,” Advances in Neural Information Processing Systems, vol. 37, pp. 61 238–61 266, 2024

[37] Y. Peng, P. Wang, J. Liu, and S. Chen, “Glad: Generalizable tuning for vision-language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 4310–4320

[38] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” arXiv preprint arXiv:2010.01412, 2020

[39] X. Zhang, R. Xu, H. Yu, H. Zou, and P. Cui, “Gradient norm aware minimization seeks first-order flatness and improves generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 247–20 257

[40] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen et al., “Parameter-efficient fine-tuning of large-scale pre-trained language models,” Nature Machine Intelligence, 2023

[41] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022

[42] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

[43] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “Superglue: A stickier benchmark for general-purpose language understanding systems,” Advances in neural information processing systems, vol. 32, 2019

[44] T. Saikh, T. Ghosal, A. Mittal, A. Ekbal, and P. Bhattacharyya, “Scienceqa: A novel resource for question answering on scholarly articles,” International Journal on Digital Libraries, vol. 23, no. 3, pp. 289–301, 2022

[45] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham, “Vizwiz grand challenge: Answering visual questions from blind people,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3608–3617

[46] P. Lu, L. Qiu, J. Chen, T. Xia, Y. Zhao, W. Zhang, Z. Yu, X. Liang, and S.-C. Zhu, “Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning,” arXiv preprint arXiv:2110.13214, 2021

[47] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2641–2649

[48] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “Ok-vqa: A visual question answering benchmark requiring external knowledge,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3195–3204

[49] A. Mishra, S. Shekhar, A. K. Singh, and A. Chakraborty, “Ocr-vqa: Visual question answering by reading text in images,” in 2019 international conference on document analysis and recognition (ICDAR). IEEE, 2019, pp. 947–952

[50] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6904–6913