Recognition: 2 theorem links
· Lean Theorem
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
Pith reviewed 2026-05-15 22:02 UTC · model grok-4.3
The pith
LoRA adapters should be scaled by dividing by the square root of the rank rather than the full rank to stabilize learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proves that the LoRA adapter update must be scaled by 1/sqrt(rank) rather than 1/rank. Under this corrected scaling the magnitude of the gradient signal remains constant as rank increases, so larger-rank adapters learn effectively instead of being suppressed. The only change required is in the scaling constant applied during training; the low-rank decomposition itself and the inference-time computation are untouched.
What carries the argument
The rank-dependent scaling factor applied to the low-rank matrix product inside each LoRA adapter; replacing the conventional divisor of rank with the square root of rank stabilizes the variance of the update.
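A toy numerical check of this mechanism (our own sketch, not an experiment from the paper; it assumes that mid-training both adapter matrices have zero-mean entries whose per-entry scale is independent of the rank): the norm of the scaled adapter output stays roughly flat across ranks under 1/sqrt(rank) and shrinks under 1/rank.

```python
# Toy simulation of the variance mechanism (a sketch, not the paper's
# experiment). Assumption: both adapter matrices have zero-mean entries
# whose per-entry scale does not depend on the rank r (a stand-in for the
# mid-training regime, since at initialization B = 0).
import numpy as np

rng = np.random.default_rng(0)
d = 1024                               # hidden dimension of the adapted layer
x = rng.normal(size=d)                 # one input activation vector

for r in [4, 8, 16, 32, 64, 128]:
    A = rng.normal(scale=d ** -0.5, size=(r, d))  # down-projection
    B = rng.normal(scale=d ** -0.5, size=(d, r))  # up-projection (nonzero, i.e. mid-training)
    out = np.linalg.norm(B @ (A @ x))
    # Conventional LoRA divides by r; rsLoRA divides by sqrt(r).
    print(f"r={r:4d}   1/r scaling: {out / r:7.3f}   1/sqrt(r) scaling: {out / np.sqrt(r):7.3f}")
```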
If this is right
- Higher ranks become practical in LoRA, directly improving fine-tuning quality on the same data.
- Inference latency and memory stay identical because only the training-time scaling constant changes.
- A continuous compute-performance curve appears: extra training FLOPs from larger rank yield measurable gains.
- Existing LoRA implementations require only a one-line change to the scaling factor to adopt the method (a sketch of where that line lives follows this list).
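To make the one-line change concrete, here is a minimal LoRA linear layer (hypothetical class and argument names of our own; a sketch of the idea, not the paper's code or any particular library's implementation):

```python
# Minimal LoRA linear layer; only the `scaling` line differs between
# LoRA and rsLoRA. Hypothetical names; a sketch, not any library's code.
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0,
                 rank_stabilized: bool = True):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # pre-trained weight stays frozen
        # Standard LoRA-style init: A random, B zero, so training starts
        # from the unmodified pre-trained function.
        self.A = nn.Parameter(torch.randn(rank, base.in_features)
                              * base.in_features ** -0.5)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        # The one-line change: rsLoRA divides by sqrt(rank) instead of rank.
        self.scaling = alpha / math.sqrt(rank) if rank_stabilized else alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)
```

In practice, recent versions of the Hugging Face peft library expose the same switch as use_rslora=True on LoraConfig, so adoption can be a one-flag change rather than a code edit.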
Where Pith is reading between the lines
- The same scaling logic may extend to other low-rank adaptation schemes that multiply an update by a rank-dependent constant.
- Models fine-tuned with rsLoRA at moderate ranks could match or exceed the quality of full fine-tuning at lower total training cost.
- The result suggests re-examining scaling factors in related parameter-efficient methods such as adapter layers or prefix tuning.
Load-bearing premise
The optimality of the square-root scaling rests on the assumption that initialization variance and gradient magnitudes behave exactly as they do in the standard LoRA forward and backward passes.
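Spelled out in our own notation (a heuristic back-of-the-envelope sketch, not the paper's formal proof), the premise amounts to a central-limit-style variance count over the rank dimension, with $\gamma_r$ the scaling applied to the adapter $BA$:

```latex
% Heuristic sketch (ours). Assume during training the entries of B and of
% Ax are zero-mean, weakly correlated, and Theta(1) in the rank r.
\[
  (B A x)_i \;=\; \sum_{k=1}^{r} B_{ik}\,(A x)_k
  \quad\Longrightarrow\quad
  \lVert B A x \rVert \in \Theta(\sqrt{r}),
  \qquad
  \lVert \gamma_r\, B A x \rVert \in \Theta\big(\gamma_r \sqrt{r}\big).
\]
% Keeping the scaled adapter output Theta(1) as r grows forces
% gamma_r in Theta(1/sqrt(r)); the conventional gamma_r = alpha/r instead
% shrinks the update like 1/sqrt(r).
```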
What would settle it
Train identical models on the same task with ranks from 4 to 128 using both the original 1/rank scaling and the proposed 1/sqrt(rank) scaling, then compare final validation accuracy; if accuracy under the square-root scaling stops improving (or declines) as rank grows, or fails to beat the 1/rank baseline at higher ranks, the claim is falsified.
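A toy, runnable version of that protocol (our sketch, reusing the LoRALinear class above on a synthetic regression task; the ranks, optimizer, and step count are arbitrary choices, and a real test would use the paper's fine-tuning benchmarks and validation accuracy rather than training loss):

```python
# Toy version of the settling experiment: fit a synthetic linear task with
# adapters of increasing rank under both scaling rules and compare final
# training loss. Reuses the LoRALinear sketch above.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 256, 2048
X = torch.randn(n, d)
W_target = torch.randn(d, d) / math.sqrt(d)   # synthetic "task" to adapt to
Y = X @ W_target.T

def final_loss(rank: int, rank_stabilized: bool, steps: int = 300) -> float:
    layer = LoRALinear(nn.Linear(d, d, bias=False), rank,
                       rank_stabilized=rank_stabilized)
    opt = torch.optim.Adam([layer.A, layer.B], lr=1e-3)
    for _ in range(steps):
        loss = ((layer(X) - Y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

for r in [4, 16, 64, 128]:
    print(f"r={r:4d}   1/r: {final_loss(r, False):.4f}"
          f"   1/sqrt(r): {final_loss(r, True):.4f}")
```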
Original abstract
As large language models (LLMs) have become increasingly compute and memory intensive, parameter-efficient fine-tuning (PEFT) methods are now a common strategy to fine-tune LLMs. A popular PEFT method is Low-Rank Adapters (LoRA), which adds trainable low-rank "adapters" to selected layers. Each adapter consists of a low-rank matrix product, multiplicatively scaled by a rank-dependent factor. This scaling factor, which divides adapters by a factor of the rank, results in slowed learning and stunted performance for LoRA with higher-rank adapters. Consequently, the use of LoRA in practice has generally been limited to very low ranks. In this work, we study the impact of the scaling factor on the learning process and prove that LoRA adapters should be divided by a factor of the square root of the rank. Modifying LoRA with the appropriate scaling factor, which we call the rank-stabilized LoRA (rsLoRA) method, easily provides for a fine-tuning compute/performance trade-off, where larger ranks can be used to trade off increased computational resources during training for better fine-tuning performance, with no change in inference computing cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies the scaling factor applied to LoRA adapters during fine-tuning of LLMs. It claims that the conventional factor of 1/rank slows learning and limits performance at higher ranks, and asserts a proof that the factor should instead be 1/sqrt(rank) to stabilize the learning dynamics. The proposed rsLoRA modification is said to enable a compute-performance trade-off by supporting larger ranks at training time without changing inference cost.
Significance. If the claimed proof and any accompanying experiments hold, the result would be significant for parameter-efficient fine-tuning: it would remove an artificial barrier that has kept LoRA ranks low in practice and would give practitioners a principled way to trade additional training compute for better adaptation quality.
Major comments (2)
- [Abstract] The central claim that 'we ... prove that LoRA adapters should be divided by a factor of the square root of the rank' is unsupported: no derivation, no equations modeling the interaction of the scaling factor with adapter initialization (e.g., the variance of A or B), and no gradient-flow or SGD analysis appear in the manuscript.
- [Abstract] The load-bearing modeling assumptions (initialization variance, continuous-time approximation of discrete updates, pre-trained weight norms) are never stated, so it is impossible to assess whether the derived optimum 1/sqrt(rank) remains valid when those assumptions are relaxed.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. The comments correctly identify that the theoretical justification requires more explicit presentation. We will revise the manuscript to include the full derivation, stated assumptions, and supporting analysis.
Point-by-point responses
- Referee: [Abstract] The central claim that 'we ... prove that LoRA adapters should be divided by a factor of the square root of the rank' is unsupported: no derivation, no equations modeling the interaction of the scaling factor with adapter initialization (e.g., the variance of A or B), and no gradient-flow or SGD analysis appear in the manuscript.
  Authors: We agree that the submitted manuscript does not contain a sufficiently detailed derivation in the main text; the proof was condensed to fit space constraints. In revision we will add a dedicated theoretical section that (i) models the adapter initialization explicitly (A with i.i.d. zero-mean Gaussian entries, as in standard LoRA, and B = 0), (ii) derives how the scaling factor enters the adapter update, and (iii) presents the continuous-time gradient-flow / SGD analysis that yields the 1/sqrt(rank) optimum. The revised abstract will be updated to reflect the expanded treatment. Revision: yes.
- Referee: [Abstract] The load-bearing modeling assumptions (initialization variance, continuous-time approximation of discrete updates, pre-trained weight norms) are never stated, so it is impossible to assess whether the derived optimum 1/sqrt(rank) remains valid when those assumptions are relaxed.
  Authors: We acknowledge the omission. The revised manuscript will open the theoretical section with an explicit list of assumptions (Gaussian initialization variances, a continuous-time limit of SGD, bounded pre-trained weight norms). We will also add a short robustness subsection discussing how the 1/sqrt(rank) result behaves when these assumptions are relaxed, supported by controlled experiments that vary the initialization scale and learning-rate schedule. Revision: yes.
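For context, a one-step version of the promised analysis (our reconstruction under the standard LoRA initialization, with A having rank-independent per-entry variance and B = 0; not text from the paper or the rebuttal): with learning rate $\eta$ and output-gradient $g$, the first SGD step gives

```latex
% Our reconstruction, not the paper's derivation. At initialization B = 0,
% so the first SGD update touches only B:
\[
  \Delta B = -\eta\, \gamma_r\, g\,(A x)^{\top}
  \quad\Longrightarrow\quad
  \gamma_r\,\Delta B\, A x = -\eta\, \gamma_r^{2}\, \lVert A x \rVert^{2} g .
\]
% With the per-entry variance of A independent of r, ||Ax||^2 is Theta(r),
% so the first-step adapter output is Theta(eta * gamma_r^2 * r):
% rank-independent exactly when gamma_r ~ 1/sqrt(r), and suppressed like
% 1/r under the conventional gamma_r = alpha / r.
```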
Circularity Check
No significant circularity in the claimed proof of 1/sqrt(rank) scaling
Full rationale
The paper derives the rank-stabilized scaling via analysis of initialization variance and gradient magnitudes under the LoRA update rule. No quoted equations reduce the 1/sqrt(rank) factor to a post-hoc fit, self-definition, or load-bearing self-citation. The central claim rests on modeling assumptions about learning dynamics rather than re-expressing the input data or prior fitted quantities as the output. This is a standard non-finding for a theoretical derivation paper whose result is not forced by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The learning dynamics of LoRA are governed by a rank-dependent scaling that can be corrected by a sqrt(rank) factor.
Forward citations
Cited by 19 Pith papers
- MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning
  MatryoshkaLoRA inserts a crafted diagonal matrix P into LoRA to learn accurate nested low-rank adapters that support dynamic rank selection with minimal performance drop.
- Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
  MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
- HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA
  HAC provides a parameter-efficient way to move CLIP into hyperbolic geometry, yielding consistent gains on zero-shot VQA benchmarks without any VQA training data overlap.
- DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection
  DifFoundMAD improves differential morphing attack detection by replacing traditional embeddings with those from vision foundation models and applying class-balanced lightweight fine-tuning, cutting high-security error...
- PreFT: Prefill-only finetuning for efficient inference
  Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.
- Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
  Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
- TLoRA: Task-aware Low Rank Adaptation of Large Language Models
  TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
- A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
  Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.
- InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation
  InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
- SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning
  SplitFT adapts cut-layer selection and reduces LoRA rank per client in federated split learning to improve efficiency and performance when fine-tuning LLMs on heterogeneous devices and data.
- Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
  Fine-tuning Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali data enables effective generation where zero-shot fails, with Qwen3-8B performing best overall and Llama-3.1-8B showing the largest gains.
- Can Muon Fine-tune Adam-Pretrained Models?
  Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
- LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
  Qwen2.5-3B was continued-pretrained and then fine-tuned with rsLoRA r256 on Sardinian data to reach 28.5 BLEU into the language, outperforming full fine-tuning and other LoRA variants.
- One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
  A system based on OmniVoice with multi-model ensemble distillation for fine-tuning shows consistent gains in intelligibility metrics while keeping speaker similarity for cross-lingual scientific speech.
- When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden
  Small language models detect the triple burden of PCOS, disordered eating, and body image issues in social media posts at 75.3% exact match accuracy with grounded explanations.
- LLMs and Speech: Integration vs. Combination
  Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.
- Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization
  The paper claims a selective fine-tuning method that identifies and freezes core parameters to mitigate catastrophic forgetting in LLMs while improving domain adaptation, shown in experiments with GPT-J and LLaMA-3.