DoRA: Weight-Decomposed Low-Rank Adaptation
Pith reviewed 2026-05-15 22:22 UTC · model grok-4.3
The pith
DoRA splits pretrained weights into magnitude and direction, then updates only the direction with LoRA, narrowing the gap to full fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DoRA decomposes each pretrained weight matrix into a magnitude scalar and a directional unit vector. The magnitude is kept fixed at its pretrained value while low-rank adaptation matrices are used to update only the direction. The resulting fine-tuned weights are recombined at inference time exactly as in standard LoRA, yet the method recovers a larger fraction of full fine-tuning capacity and exhibits more stable training dynamics.
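The decomposition and recombination described above can be sketched in a few lines of numpy (shapes, names, and initialization are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 32, 4            # layer shape and LoRA rank (illustrative sizes)
W0 = rng.normal(size=(d_out, d_in))   # pretrained weight

# Decompose: per-column magnitude m and unit-norm direction V.
m = np.linalg.norm(W0, axis=0, keepdims=True)  # (1, d_in), kept at its pretrained value
V = W0 / m                                     # unit-norm columns

# Low-rank directional update, as in LoRA: delta = B @ A.
B = np.zeros((d_out, r))              # zero-init so training starts exactly at W0
A = rng.normal(size=(r, d_in))

# Recombine: renormalize the updated direction, then rescale by m.
V_new = V + B @ A
W = m * V_new / np.linalg.norm(V_new, axis=0, keepdims=True)

assert np.allclose(W, W0)             # with B = 0, the adapted layer equals the pretrained one
```

With B initialized to zero, the adapted layer starts exactly at the pretrained weights, the same warm-start property standard LoRA has.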
What carries the argument
Weight decomposition that isolates magnitude (frozen) from direction (updated by LoRA).
If this is right
- DoRA raises accuracy on commonsense reasoning benchmarks compared with LoRA when fine-tuning LLaMA.
- It improves visual instruction following performance for LLaVA without changing inference latency.
- Training curves for DoRA show fewer oscillations than standard LoRA on the same tasks.
- The same decomposition yields gains on image and video-text understanding for VL-BART.
- The method adds no extra parameters or compute once training finishes.
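The last bullet follows from merging: once training ends, magnitude and direction fold back into a single dense matrix of the pretrained shape, so inference is one plain matmul. A minimal numpy sketch (the trained factors are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 16, 32, 4
W0 = rng.normal(size=(d_out, d_in))
m = np.linalg.norm(W0, axis=0, keepdims=True)
V = W0 / m
B = rng.normal(size=(d_out, r)) * 0.01   # "trained" low-rank factors (stand-ins)
A = rng.normal(size=(r, d_in))

# Fold everything into one dense matrix once, after training.
V_new = V + B @ A
W_merged = m * V_new / np.linalg.norm(V_new, axis=0, keepdims=True)

# Inference is a single matmul with a matrix of the original shape.
x = rng.normal(size=(d_in,))
y = W_merged @ x
assert W_merged.shape == W0.shape
```

Because W_merged has the same shape as W0, serving code and latency are unchanged relative to the base model.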
Where Pith is reading between the lines
- The result suggests that directional alignment may be the dominant degree of freedom needed during adaptation while magnitude mainly sets scale.
- DoRA could be combined with other low-rank or prompt-based methods to shrink the remaining gap to full fine-tuning.
- The separation might allow selective magnitude rescaling at later training stages without increasing the LoRA rank.
- Similar decomposition could be tested on convolutional or diffusion models to check whether the same magnitude-direction split helps there.
Load-bearing premise
The performance edge of full fine-tuning over LoRA comes mainly from its freedom to adjust both magnitude and direction, and fixing magnitude while updating direction recovers most of that edge.
What would settle it
An experiment in which low-rank updates are allowed to change both magnitude and direction simultaneously and still fail to match DoRA accuracy, or in which updating magnitude alone while freezing direction closes the gap instead.
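The two ablations described above can be written down directly. A hedged numpy sketch of the parameterizations such an experiment would compare (shapes and initializations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d_out, d_in, r = 16, 32, 4
W0 = rng.normal(size=(d_out, d_in))
m0 = np.linalg.norm(W0, axis=0, keepdims=True)
V0 = W0 / m0
B, A = np.zeros((d_out, r)), rng.normal(size=(r, d_in))
dm = np.zeros_like(m0)                 # trainable magnitude offset

def both_free(B, A, dm):
    """Ablation 1: low-rank directional update with magnitude also trainable."""
    Vn = V0 + B @ A
    return (m0 + dm) * Vn / np.linalg.norm(Vn, axis=0, keepdims=True)

def magnitude_only(dm):
    """Ablation 2: magnitude trained, direction frozen at its pretrained value."""
    return (m0 + dm) * V0

# Both start at the pretrained weights before any training.
assert np.allclose(both_free(B, A, dm), W0)
assert np.allclose(magnitude_only(dm), W0)
```

Comparing final accuracy across these parameterizations, at matched trainable-parameter budgets, is what would isolate the contribution of the magnitude-direction split.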
Original abstract
Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding. Code is available at https://github.com/NVlabs/DoRA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DoRA, a PEFT method that decomposes pre-trained weights into magnitude and direction, applies LoRA updates solely to the direction, and scales the magnitude component. Motivated by an analysis showing FT modifies magnitude more than standard LoRA, DoRA is claimed to increase learning capacity and stability over LoRA while incurring no extra inference cost. Experiments demonstrate consistent gains when fine-tuning LLaMA on commonsense reasoning, LLaVA on visual instruction tuning, and VL-BART on image/video-text understanding tasks.
Significance. If the empirical improvements hold under controlled hyperparameter regimes, DoRA would constitute a simple, practical upgrade to LoRA that narrows the gap to full fine-tuning without runtime overhead. The public code release strengthens verifiability and potential for follow-up work. The decomposition perspective may also inform future PEFT designs, though its explanatory power depends on isolating the magnitude-direction split as causal.
Major comments (2)
- [§3] Weight decomposition analysis: the claim that FT alters magnitude more than LoRA is used to justify freezing/scaling magnitude while updating direction. The comparison does not appear to control for confounds such as optimizer state, effective learning-rate scaling, or total update steps between the FT and LoRA runs; without these controls, the observed magnitude shift may be correlative rather than the primary driver of the FT-LoRA gap.
- [Experimental setup] Hyperparameter details: the manuscript gives limited information on the search ranges and protocol for the magnitude scaling factor. It is unclear whether this factor is tuned globally or per layer, and how many trials were performed; this detail is load-bearing for interpreting whether the reported gains over LoRA reflect the decomposition itself or differences in tuning effort.
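The magnitude-shift comparison at issue can be made concrete with a metric of the following shape (a sketch of one plausible instantiation; the paper's exact definitions may differ):

```python
import numpy as np

def magnitude_direction_shift(W0, W):
    """Mean per-column magnitude change and directional change between
    a pretrained matrix W0 and a fine-tuned matrix W."""
    m0 = np.linalg.norm(W0, axis=0)
    m1 = np.linalg.norm(W, axis=0)
    delta_m = np.mean(np.abs(m1 - m0))            # magnitude shift
    cos = np.sum((W0 / m0) * (W / m1), axis=0)    # per-column cosine similarity
    delta_d = np.mean(1.0 - cos)                  # directional shift
    return delta_m, delta_d

rng = np.random.default_rng(2)
W0 = rng.normal(size=(8, 8))
# Pure rescaling changes magnitude but leaves direction untouched:
dm, dd = magnitude_direction_shift(W0, 2.0 * W0)
assert dd < 1e-9 and dm > 0
```

The referee's concern is that such delta values, computed across uncontrolled FT and LoRA runs, conflate the decomposition effect with differences in the training protocols themselves.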
Minor comments (1)
- [Abstract] The assertion that DoRA enhances training stability would be strengthened by reporting a concrete stability metric (e.g., standard deviation of validation accuracy across random seeds).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below.
Point-by-point responses
- Referee: [§3] Weight decomposition analysis: the claim that FT alters magnitude more than LoRA is used to justify freezing/scaling magnitude while updating direction. The comparison does not appear to control for confounds such as optimizer state, effective learning-rate scaling, or total update steps between the FT and LoRA runs; without these controls, the observed magnitude shift may be correlative rather than the primary driver of the FT-LoRA gap.
Authors: We thank the referee for this observation. Our analysis in §3 compares FT and LoRA under their standard reported training protocols without explicit controls for optimizer state, learning-rate scaling, or update steps. We agree this renders the magnitude-difference observation correlative rather than strictly causal. In the revision we will add a clarifying paragraph acknowledging these potential confounds and will stress that the primary justification for DoRA remains its consistent empirical gains over LoRA across tasks and models. revision: partial
- Referee: [Experimental setup] Hyperparameter details: the manuscript gives limited information on the search ranges and protocol for the magnitude scaling factor. It is unclear whether this factor is tuned globally or per layer, and how many trials were performed; this detail is load-bearing for interpreting whether the reported gains over LoRA reflect the decomposition itself or differences in tuning effort.
Authors: We agree that additional hyperparameter details are required for reproducibility. The magnitude scaling factor was tuned independently per layer via grid search over the range [0.1, 10.0] with multiple trials per configuration. We will expand the experimental-setup section in the revised manuscript to report the exact search ranges, per-layer protocol, and number of trials performed. revision: yes
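The protocol this response describes could look roughly like the following coordinate-wise search (a hypothetical sketch; the validation objective here is a stand-in for an actual fine-tuning run):

```python
import numpy as np

# Hypothetical per-layer grid search for a magnitude scaling factor,
# mirroring the described protocol: range [0.1, 10.0], tuned per layer.
grid = np.geomspace(0.1, 10.0, num=7)   # log-spaced candidate scales

def validation_loss(scales):
    # Stand-in objective; a real run would evaluate the fine-tuned model.
    return float(np.sum((np.asarray(scales) - 1.0) ** 2))

n_layers = 3
best = [1.0] * n_layers
for i in range(n_layers):               # one layer at a time, others held fixed
    losses = []
    for s in grid:
        trial = best.copy()
        trial[i] = s
        losses.append(validation_loss(trial))
    best[i] = float(grid[int(np.argmin(losses))])
```

With the quadratic stand-in objective, the search settles on the grid point closest to 1.0 for every layer; the point of the sketch is only the search structure (per-layer, fixed grid, counted trials) that the revised manuscript would need to report.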
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's derivation proceeds from an empirical weight decomposition analysis of differences between full fine-tuning and LoRA, to a design choice that freezes magnitude and applies low-rank updates only to direction, followed by independent experimental validation on held-out downstream tasks. No step reduces by construction to its own inputs: the performance metrics are measured separately and do not equate to quantities defined by the decomposition itself. No load-bearing self-citations, fitted inputs renamed as predictions, or ansatzes smuggled via prior work appear in the chain.
Axiom & Free-Parameter Ledger
Free parameters (2)
- LoRA rank r
- magnitude scaling factor
Forward citations
Cited by 19 Pith papers
- Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration
  CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
- Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
  A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
  GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
  Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...
- Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
  Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level s...
- COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
  COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
- Sensitivity-Positional Co-Localization in GQA Transformers
  In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU,...
- STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training
  STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.
- ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
  ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.
- Constraint-Driven Warm-Freeze for Efficient Transfer Learning in Photovoltaic Systems
  CDWF achieves 90-99% of full fine-tuning performance with up to 120x fewer trainable parameters by dynamically allocating full trainability to gradient-important blocks and LoRA to others for PV cyberattack transfer learning.
- GAIN: Multiplicative Modulation for Domain Adaptation
  GAIN's multiplicative modulation preserves pretrained weight column spans during sequential domain adaptation, yielding 7-13% better prior-domain perplexity than LoRA across 774M-70B models while matching replay-augme...
- Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures
  Gradient-guided layer selection for LoRA yields 15-28% training speedup with matched downstream results on MMLU, GSM8K, and HumanEval across 14 models from 0.5B to 72B parameters.
- Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters
  PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.
- Deep Reprogramming Distillation for Medical Foundation Models
  DRD introduces a reprogramming module and CKA-based distillation to enable efficient, robust adaptation of medical foundation models to downstream 2D/3D classification and segmentation tasks, outperforming prior PEFT ...
- SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning
  SplitFT adapts cut-layer selection and reduces LoRA rank per client in federated split learning to improve efficiency and performance when fine-tuning LLMs on heterogeneous devices and data.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
  Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.
- LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
  Qwen2.5-3B was continued-pretrained and then fine-tuned with rsLoRA r256 on Sardinian data to reach 28.5 BLEU into the language, outperforming full fine-tuning and other LoRA variants.
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
  A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
- TLoRA+: A Low-Rank Parameter-Efficient Fine-Tuning Method for Large Language Models
  TLoRA+ augments LoRA with a dedicated optimizer to improve fine-tuning performance on GLUE tasks without meaningful added compute.
Reference graph
Works this paper leans on
-
[1]
Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo
-
[2]
The Eleventh International Conference on Learning Representations , year=
Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning , author=. The Eleventh International Conference on Learning Representations , year=
-
[3]
International Conference on Learning Representations , year=
FedPara: Low-rank Hadamard Product for Communication-Efficient Federated Learning , author=. International Conference on Learning Representations , year=
-
[5]
International Conference on Learning Representations , year=
VeRA: Vector-based Random Matrix Adaptation , author=. International Conference on Learning Representations , year=
-
[6]
Proceedings of the 30th International Conference on Neural Information Processing Systems , pages=
Weight normalization: a simple reparameterization to accelerate training of deep neural networks , author=. Proceedings of the 30th International Conference on Neural Information Processing Systems , pages=
-
[7]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[8]
LLM -Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models
Hu, Zhiqiang and Wang, Lei and Lan, Yihuai and Xu, Wanyu and Lim, Ee-Peng and Bing, Lidong and Xu, Xing and Poria, Soujanya and Lee, Roy. LLM -Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023
work page 2023
-
[9]
Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. 2023 , url =
work page 2023
-
[10]
Proceedings of the IEEE International Conference on Computer Vision , pages=
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , author=. Proceedings of the IEEE International Conference on Computer Vision , pages=
-
[11]
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Li, Xiang Lisa and Liang, Percy. Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021
work page 2021
-
[12]
International Conference on Machine Learning , pages=
Parameter-efficient transfer learning for NLP , author=. International Conference on Machine Learning , pages=
-
[13]
International Conference on Learning Representations , year=
Towards a Unified View of Parameter-Efficient Transfer Learning , author=. International Conference on Learning Representations , year=
-
[14]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[15]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[16]
A Corpus for Reasoning about Natural Language Grounded in Photographs
Suhr, Alane and Zhou, Stephanie and Zhang, Ally and Zhang, Iris and Bai, Huajun and Artzi, Yoav. A Corpus for Reasoning about Natural Language Grounded in Photographs. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019
work page 2019
-
[18]
International conference on machine learning , pages=
Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=
-
[19]
Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke. BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020
work page 2020
-
[20]
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation , author=. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1) , year=
-
[21]
TVQA : Localized, Compositional Video Question Answering
Lei, Jie and Yu, Licheng and Bansal, Mohit and Berg, Tamara. TVQA : Localized, Compositional Video Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018
work page 2018
-
[23]
HERO : Hierarchical Encoder for V ideo+ L anguage Omni-representation Pre-training
Li, Linjie and Chen, Yen-Chun and Cheng, Yu and Gan, Zhe and Yu, Licheng and Liu, Jingjing. HERO : Hierarchical Encoder for V ideo+ L anguage Omni-representation Pre-training. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020
work page 2020
-
[24]
European Conference on Computer Vision , pages=
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval , author=. European Conference on Computer Vision , pages=
-
[25]
Proceedings of the AAAI Conference on Artificial Intelligence , year=
Towards automatic learning of procedures from web instructional videos , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=
-
[26]
Thirty-seventh Conference on Neural Information Processing Systems , year=
Visual Instruction Tuning , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
-
[28]
Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages=
Ok-vqa: A visual question answering benchmark requiring external knowledge , author=. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages=
-
[29]
European Conference on Computer Vision , pages=
A-okvqa: A benchmark for visual question answering using world knowledge , author=. European Conference on Computer Vision , pages=
-
[30]
OCR-VQA: Visual Question Answering by Reading Text in Images , year=
Mishra, Anand and Shekhar, Shashank and Singh, Ajeet Kumar and Chakraborty, Anirban , booktitle=. OCR-VQA: Visual Question Answering by Reading Text in Images , year=
-
[31]
Textcaps: a dataset for image captioning with reading comprehension , author=. Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part II 16 , pages=
work page 2020
-
[32]
R efer I t G ame: Referring to Objects in Photographs of Natural Scenes
Kazemzadeh, Sahar and Ordonez, Vicente and Matten, Mark and Berg, Tamara. R efer I t G ame: Referring to Objects in Photographs of Natural Scenes. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ). 2014
work page 2014
-
[33]
International journal of computer vision , pages=
Visual genome: Connecting language and vision using crowdsourced dense image annotations , author=. International journal of computer vision , pages=
-
[34]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Generation and comprehension of unambiguous object descriptions , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[35]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[36]
Advances in Neural Information Processing Systems , pages=
Learn to explain: Multimodal reasoning via thought chains for science question answering , author=. Advances in Neural Information Processing Systems , pages=
-
[37]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[38]
Evaluating Object Hallucination in Large Vision-Language Models
Li, Yifan and Du, Yifan and Zhou, Kun and Wang, Jinpeng and Zhao, Xin and Wen, Ji-Rong. Evaluating Object Hallucination in Large Vision-Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023
work page 2023
-
[40]
Gonzalez and Ion Stoica , booktitle=
Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica , booktitle=. Judging
-
[41]
Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
Qin, Chengwei and Zhang, Aston and Zhang, Zhuosheng and Chen, Jiaao and Yasunaga, Michihiro and Yang, Diyi. Is ChatGPT a General-Purpose Natural Language Processing Task Solver?. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023
work page 2023
-
[42]
International Conference on Machine Learning , pages=
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=
-
[43]
B it F it: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
Ben Zaken, Elad and Goldberg, Yoav and Ravfogel, Shauli. B it F it: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022
work page 2022
-
[44]
Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks
Karimi Mahabadi, Rabeeh and Ruder, Sebastian and Dehghani, Mostafa and Henderson, James. Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Pap...
work page 2021
-
[45]
Advances in Neural Information Processing Systems , year=
Compacter: Efficient Low-Rank Hypercomplex Adapter Layers , author=. Advances in Neural Information Processing Systems , year=
-
[46]
The Power of Scale for Parameter-Efficient Prompt Tuning
Lester, Brian and Al-Rfou, Rami and Constant, Noah. The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021
work page 2021
-
[47]
Residual Prompt Tuning: improving prompt tuning with residual reparameterization
Razdaibiedina, Anastasiia and Mao, Yuning and Khabsa, Madian and Lewis, Mike and Hou, Rui and Ba, Jimmy and Almahairi, Amjad. Residual Prompt Tuning: improving prompt tuning with residual reparameterization. Findings of the Association for Computational Linguistics: ACL 2023. 2023
work page 2023
-
[49]
Thirty-seventh Conference on Neural Information Processing Systems , year=
Controlling Text-to-Image Diffusion by Orthogonal Finetuning , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
-
[50]
Advances in Neural Information Processing Systems , year=
Chain of Thought Prompting Elicits Reasoning in Large Language Models , author=. Advances in Neural Information Processing Systems , year=
- [51]
-
[55]
Orca-Math: Unlocking the potential of SLMs in Grade School Math , author=. 2024 , eprint=
work page 2024
-
[56]
Efficient finetuning of Llama 3 with FSDP QDoRA , author=
-
[57]
QLoRA: Efficient Finetuning of Quantized LLMs , volume =
Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle =. QLoRA: Efficient Finetuning of Quantized LLMs , volume =
-
[59]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[61]
Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Doll \'a r, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[62]
Qlora: Efficient finetuning of quantized llms
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. In Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp.\ 10088--10115. Curran Associates, Inc., 2023
work page 2023
-
[63]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6904--6913, 2017
work page 2017
-
[64]
J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J
Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3608--3617, 2018
work page 2018
-
[65]
Towards a unified view of parameter-efficient transfer learning
He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., and Neubig, G. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2021
work page 2021
-
[66]
Delving deep into rectifiers: Surpassing human-level performance on imagenet classification
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp.\ 1026--1034, 2015
work page 2015
-
[67]
Parameter-efficient transfer learning for nlp
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp.\ 2790--2799, 2019
work page 2019
-
[68]
J., yelong shen, Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W
Hu, E. J., yelong shen, Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lo RA : Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022
work page 2022
-
[69]
LLM -adapters: An adapter family for parameter-efficient fine-tuning of large language models
Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E.-P., Bing, L., Xu, X., Poria, S., and Lee, R. LLM -adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
work page 2023
-
[70]
Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 6700--6709, 2019
work page 2019
-
[71]
Fedpara: Low-rank hadamard product for communication-efficient federated learning
Hyeon-Woo, N., Ye-Bin, M., and Oh, T.-H. Fedpara: Low-rank hadamard product for communication-efficient federated learning. In International Conference on Learning Representations, 2022
work page 2022
-
[72]
Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks
Karimi Mahabadi, R., Ruder, S., Dehghani, M., and Henderson, J. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.\ 565--576, 2021
work page 2021
-
[73]
R efer I t G ame: Referring to objects in photographs of natural scenes
Kazemzadeh, S., Ordonez, V., Matten, M., and Berg, T. R efer I t G ame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , pp.\ 787--798, 2014
work page 2014
-
[74]
Kerem Turgutlu, Jonathan Whitaker, J. H. Efficient finetuning of llama 3 with fsdp qdora. https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html, 2024
work page 2024
-
[75]
J., Blankevoort, T., and Asano, Y
Kopiczko, D. J., Blankevoort, T., and Asano, Y. M. Vera: Vector-based random matrix adaptation. In International Conference on Learning Representations, 2024
work page 2024
-
[76]
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, pp.\ 32--73, 2017
work page 2017
-
[77]
Lei, J., Yu, L., Bansal, M., and Berg, T. TVQA: Localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1369–1379, 2018.
[78]
Lei, J., Yu, L., Berg, T. L., and Bansal, M. TVR: A large-scale dataset for video-subtitle moment retrieval. In European Conference on Computer Vision, pp. 447–463, 2020.
[79]
Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059, 2021.
[80]
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020.
[81]
Li, J., Li, D., Xiong, C., and Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888–12900, 2022.
[82]
Li, L., Chen, Y.-C., Cheng, Y., Gan, Z., Yu, L., and Liu, J. HERO: Hierarchical encoder for Video+Language omni-representation pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2046–2065, 2020.
[83]
Li, L., Lei, J., Gan, Z., Yu, L., Chen, Y.-C., Pillai, R., Cheng, Y., Zhou, L., Wang, X. E., Wang, W. Y., et al. VALUE: A multi-task benchmark for video-and-language understanding evaluation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
[84]
Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597, 2021.
[85]
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305, 2023.
[86]
Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.
[87]
Liu, W., Qiu, Z., Feng, Y., Xiu, Y., Xue, Y., Yu, L., Feng, H., Liu, Z., Heo, J., Peng, S., et al. Parameter-efficient orthogonal finetuning via butterfly factorization. arXiv preprint arXiv:2311.06243, 2023b.
[88]
Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023c.
[89]
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, pp. 2507–2521, 2022.
[90]
Mahabadi, R. K., Henderson, J., and Ruder, S. Compacter: Efficient low-rank hypercomplex adapter layers. In Advances in Neural Information Processing Systems, 2021.
[91]
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A. L., and Murphy, K. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20, 2016.