Recognition: 2 theorem links
· Lean Theorem
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
Pith reviewed 2026-05-15 22:02 UTC · model grok-4.3
The pith
LoRA adapters should be scaled by dividing by the square root of the rank rather than the full rank to stabilize learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proves that the LoRA adapter update must be scaled by 1/sqrt(rank) rather than 1/rank. Under this corrected scaling the magnitude of the gradient signal remains constant as rank increases, so larger-rank adapters learn effectively instead of being suppressed. The only change required is in the scaling constant applied during training; the low-rank decomposition itself and the inference-time computation are untouched.
What carries the argument
The rank-dependent scaling factor applied to the low-rank matrix product inside each LoRA adapter; replacing the conventional divisor of rank with the square root of rank stabilizes the variance of the update.
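A toy numerical check of this mechanism (our own sketch, not an experiment from the paper; it assumes that mid-training both adapter matrices have zero-mean entries whose per-entry scale is independent of the rank): the norm of the scaled adapter output stays roughly flat across ranks under 1/sqrt(rank) and shrinks under 1/rank.

```python
# Toy simulation of the variance mechanism (a sketch, not the paper's
# experiment). Assumption: both adapter matrices have zero-mean entries
# whose per-entry scale does not depend on the rank r (a stand-in for the
# mid-training regime, since at initialization B = 0).
import numpy as np

rng = np.random.default_rng(0)
d = 1024                               # hidden dimension of the adapted layer
x = rng.normal(size=d)                 # one input activation vector

for r in [4, 8, 16, 32, 64, 128]:
    A = rng.normal(scale=d ** -0.5, size=(r, d))  # down-projection
    B = rng.normal(scale=d ** -0.5, size=(d, r))  # up-projection (nonzero, i.e. mid-training)
    out = np.linalg.norm(B @ (A @ x))
    # Conventional LoRA divides by r; rsLoRA divides by sqrt(r).
    print(f"r={r:4d}   1/r scaling: {out / r:7.3f}   1/sqrt(r) scaling: {out / np.sqrt(r):7.3f}")
```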
If this is right
- Higher ranks become practical in LoRA, directly improving fine-tuning quality on the same data.
- Inference latency and memory stay identical because only the training-time scaling constant changes.
- A continuous compute-performance curve appears: extra training FLOPs from larger rank yield measurable gains.
- Existing LoRA implementations require only a one-line change to the scaling factor to adopt the method (a sketch of where that line lives follows this list).
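To make the one-line change concrete, here is a minimal LoRA linear layer (hypothetical class and argument names of our own; a sketch of the idea, not the paper's code or any particular library's implementation):

```python
# Minimal LoRA linear layer; only the `scaling` line differs between
# LoRA and rsLoRA. Hypothetical names; a sketch, not any library's code.
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0,
                 rank_stabilized: bool = True):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # pre-trained weight stays frozen
        # Standard LoRA-style init: A random, B zero, so training starts
        # from the unmodified pre-trained function.
        self.A = nn.Parameter(torch.randn(rank, base.in_features)
                              * base.in_features ** -0.5)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        # The one-line change: rsLoRA divides by sqrt(rank) instead of rank.
        self.scaling = alpha / math.sqrt(rank) if rank_stabilized else alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)
```

In practice, recent versions of the Hugging Face peft library expose the same switch as use_rslora=True on LoraConfig, so adoption can be a one-flag change rather than a code edit.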
Where Pith is reading between the lines
- The same scaling logic may extend to other low-rank adaptation schemes that multiply an update by a rank-dependent constant.
- Models fine-tuned with rsLoRA at moderate ranks could match or exceed the quality of full fine-tuning at lower total training cost.
- The result suggests re-examining scaling factors in related parameter-efficient methods such as adapter layers or prefix tuning.
Load-bearing premise
The optimality of the square-root scaling rests on the assumption that initialization variance and gradient magnitudes behave exactly as they do in the standard LoRA forward and backward passes.
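Spelled out in our own notation (a heuristic back-of-the-envelope sketch, not the paper's formal proof), the premise amounts to a central-limit-style variance count over the rank dimension, with $\gamma_r$ the scaling applied to the adapter $BA$:

```latex
% Heuristic sketch (ours). Assume during training the entries of B and of
% Ax are zero-mean, weakly correlated, and Theta(1) in the rank r.
\[
  (B A x)_i \;=\; \sum_{k=1}^{r} B_{ik}\,(A x)_k
  \quad\Longrightarrow\quad
  \lVert B A x \rVert \in \Theta(\sqrt{r}),
  \qquad
  \lVert \gamma_r\, B A x \rVert \in \Theta\big(\gamma_r \sqrt{r}\big).
\]
% Keeping the scaled adapter output Theta(1) as r grows forces
% gamma_r in Theta(1/sqrt(r)); the conventional gamma_r = alpha/r instead
% shrinks the update like 1/sqrt(r).
```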
What would settle it
Train identical models on the same task with ranks from 4 to 128 using both the original 1/rank scaling and the proposed 1/sqrt(rank) scaling, then compare final validation accuracy; if accuracy under the square-root scaling stops improving (or declines) as rank grows, or fails to beat the 1/rank baseline at higher ranks, the claim is falsified.
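A toy, runnable version of that protocol (our sketch, reusing the LoRALinear class above on a synthetic regression task; the ranks, optimizer, and step count are arbitrary choices, and a real test would use the paper's fine-tuning benchmarks and validation accuracy rather than training loss):

```python
# Toy version of the settling experiment: fit a synthetic linear task with
# adapters of increasing rank under both scaling rules and compare final
# training loss. Reuses the LoRALinear sketch above.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 256, 2048
X = torch.randn(n, d)
W_target = torch.randn(d, d) / math.sqrt(d)   # synthetic "task" to adapt to
Y = X @ W_target.T

def final_loss(rank: int, rank_stabilized: bool, steps: int = 300) -> float:
    layer = LoRALinear(nn.Linear(d, d, bias=False), rank,
                       rank_stabilized=rank_stabilized)
    opt = torch.optim.Adam([layer.A, layer.B], lr=1e-3)
    for _ in range(steps):
        loss = ((layer(X) - Y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

for r in [4, 16, 64, 128]:
    print(f"r={r:4d}   1/r: {final_loss(r, False):.4f}"
          f"   1/sqrt(r): {final_loss(r, True):.4f}")
```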
Original abstract
As large language models (LLMs) have become increasingly compute and memory intensive, parameter-efficient fine-tuning (PEFT) methods are now a common strategy to fine-tune LLMs. A popular PEFT method is Low-Rank Adapters (LoRA), which adds trainable low-rank "adapters" to selected layers. Each adapter consists of a low-rank matrix product, multiplicatively scaled by a rank-dependent factor. This scaling factor, which divides adapters by a factor of the rank, results in slowed learning and stunted performance for LoRA with higher-rank adapters. Consequently, the use of LoRA in practice has generally been limited to very low ranks. In this work, we study the impact of the scaling factor on the learning process and prove that LoRA adapters should be divided by a factor of the square root of the rank. Modifying LoRA with the appropriate scaling factor, which we call the rank-stabilized LoRA (rsLoRA) method, easily provides for a fine-tuning compute/performance trade-off, where larger ranks can be used to trade off increased computational resources during training for better fine-tuning performance, with no change in inference computing cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies the scaling factor applied to LoRA adapters during fine-tuning of LLMs. It claims that the conventional factor of 1/rank slows learning and limits performance at higher ranks, and asserts a proof that the factor should instead be 1/sqrt(rank) to stabilize the learning dynamics. The proposed rsLoRA modification is said to enable a compute-performance trade-off by supporting larger ranks at training time without changing inference cost.
Significance. If the claimed proof and any accompanying experiments hold, the result would be significant for parameter-efficient fine-tuning: it would remove an artificial barrier that has kept LoRA ranks low in practice and would give practitioners a principled way to trade additional training compute for better adaptation quality.
Major comments (2)
- [Abstract] The central claim that 'we ... prove that LoRA adapters should be divided by a factor of the square root of the rank' is unsupported: no derivation, no equations modeling the interaction of the scaling factor with adapter initialization (e.g., the variance of A or B), and no gradient-flow or SGD analysis appear in the manuscript.
- [Abstract] The load-bearing modeling assumptions (initialization variance, continuous-time approximation of discrete updates, pre-trained weight norms) are never stated, so it is impossible to assess whether the derived optimum 1/sqrt(rank) remains valid when those assumptions are relaxed.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. The comments correctly identify that the theoretical justification requires more explicit presentation. We will revise the manuscript to include the full derivation, stated assumptions, and supporting analysis.
Point-by-point responses
- Referee: [Abstract] The central claim that 'we ... prove that LoRA adapters should be divided by a factor of the square root of the rank' is unsupported: no derivation, no equations modeling the interaction of the scaling factor with adapter initialization (e.g., the variance of A or B), and no gradient-flow or SGD analysis appear in the manuscript.
  Authors: We agree that the submitted manuscript does not contain a sufficiently detailed derivation in the main text; the proof was condensed to fit space constraints. In revision we will add a dedicated theoretical section that (i) models the adapter initialization explicitly (A with i.i.d. zero-mean Gaussian entries, as in standard LoRA, and B = 0), (ii) derives how the scaling factor enters the adapter update, and (iii) presents the continuous-time gradient-flow / SGD analysis that yields the 1/sqrt(rank) optimum. The revised abstract will be updated to reflect the expanded treatment. Revision: yes.
- Referee: [Abstract] The load-bearing modeling assumptions (initialization variance, continuous-time approximation of discrete updates, pre-trained weight norms) are never stated, so it is impossible to assess whether the derived optimum 1/sqrt(rank) remains valid when those assumptions are relaxed.
  Authors: We acknowledge the omission. The revised manuscript will open the theoretical section with an explicit list of assumptions (Gaussian initialization variances, a continuous-time limit of SGD, bounded pre-trained weight norms). We will also add a short robustness subsection discussing how the 1/sqrt(rank) result behaves when these assumptions are relaxed, supported by controlled experiments that vary the initialization scale and learning-rate schedule. Revision: yes.
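For context, a one-step version of the promised analysis (our reconstruction under the standard LoRA initialization, with A having rank-independent per-entry variance and B = 0; not text from the paper or the rebuttal): with learning rate $\eta$ and output-gradient $g$, the first SGD step gives

```latex
% Our reconstruction, not the paper's derivation. At initialization B = 0,
% so the first SGD update touches only B:
\[
  \Delta B = -\eta\, \gamma_r\, g\,(A x)^{\top}
  \quad\Longrightarrow\quad
  \gamma_r\,\Delta B\, A x = -\eta\, \gamma_r^{2}\, \lVert A x \rVert^{2} g .
\]
% With the per-entry variance of A independent of r, ||Ax||^2 is Theta(r),
% so the first-step adapter output is Theta(eta * gamma_r^2 * r):
% rank-independent exactly when gamma_r ~ 1/sqrt(r), and suppressed like
% 1/r under the conventional gamma_r = alpha / r.
```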
Circularity Check
No significant circularity in the claimed proof of 1/sqrt(rank) scaling
Full rationale
The paper derives the rank-stabilized scaling via analysis of initialization variance and gradient magnitudes under the LoRA update rule. No quoted equations reduce the 1/sqrt(rank) factor to a post-hoc fit, self-definition, or load-bearing self-citation. The central claim rests on modeling assumptions about learning dynamics rather than re-expressing the input data or prior fitted quantities as the output. This is a standard non-finding for a theoretical derivation paper whose result is not forced by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The learning dynamics of LoRA are governed by a rank-dependent scaling that can be corrected by a sqrt(rank) factor.
Forward citations
Cited by 19 Pith papers
- MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning
  MatryoshkaLoRA inserts a crafted diagonal matrix P into LoRA to learn accurate nested low-rank adapters that support dynamic rank selection with minimal performance drop.
- Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
  MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
- HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA
  HAC provides a parameter-efficient way to move CLIP into hyperbolic geometry, yielding consistent gains on zero-shot VQA benchmarks without any VQA training data overlap.
- DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection
  DifFoundMAD improves differential morphing attack detection by replacing traditional embeddings with those from vision foundation models and applying class-balanced lightweight fine-tuning, cutting high-security error...
- PreFT: Prefill-only finetuning for efficient inference
  Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.
- Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
  Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
- TLoRA: Task-aware Low Rank Adaptation of Large Language Models
  TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
- A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
  Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.
- InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation
  InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
- SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning
  SplitFT adapts cut-layer selection and reduces LoRA rank per client in federated split learning to improve efficiency and performance when fine-tuning LLMs on heterogeneous devices and data.
- Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
  Fine-tuning Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali data enables effective generation where zero-shot fails, with Qwen3-8B performing best overall and Llama-3.1-8B showing the largest gains.
- Can Muon Fine-tune Adam-Pretrained Models?
  Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
- LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
  Qwen2.5-3B was continued-pretrained and then fine-tuned with rsLoRA r256 on Sardinian data to reach 28.5 BLEU into the language, outperforming full fine-tuning and other LoRA variants.
- One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
  A system based on OmniVoice with multi-model ensemble distillation for fine-tuning shows consistent gains in intelligibility metrics while keeping speaker similarity for cross-lingual scientific speech.
- When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden
  Small language models detect the triple burden of PCOS, disordered eating, and body image issues in social media posts at 75.3% exact match accuracy with grounded explanations.
- LLMs and Speech: Integration vs. Combination
  Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.
- Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization
  The paper claims a selective fine-tuning method that identifies and freezes core parameters to mitigate catastrophic forgetting in LLMs while improving domain adaptation, shown in experiments with GPT-J and LLaMA-3.