pith. machine review for the scientific record.

arxiv: 2605.07366 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: no theorem link

Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:15 UTC · model grok-4.3

classification 💻 cs.CL
keywords LoRA · GRPO · rank allocation · reinforcement learning · gradient profiling · fine-tuning · policy optimization · GSM8K

The pith

Proportional LoRA rank allocation under GRPO lowers accuracy by 4.5 points versus uniform allocation on identical budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether gradient-based adaptive LoRA rank allocation, which works well under supervised fine-tuning, transfers to Group Relative Policy Optimization. Experiments on Qwen 2.5 1.5B with GSM8K show proportional allocation reaches only 70.0 percent accuracy while uniform ranks reach 74.5 percent. The authors trace the gap to a flatter gradient landscape under GRPO, where layer importance varies by just 2.17 times, and to a feedback loop in which non-uniform ranks widen that spread to 3.00 times and progressively silence low-rank layers. A reader would care because the result warns against direct reuse of SFT allocation heuristics during RL alignment.

Core claim

Gradient-based proportional rank allocation for LoRA under GRPO reinforcement learning reduces accuracy by 4.5 points relative to uniform allocation on the same parameter budget. The GRPO gradient landscape is flatter than under SFT, with a max-to-min layer importance ratio of only 2.17x, so every layer carries meaningful signal. Non-uniform allocation triggers a gradient amplification effect that widens the importance spread to 3.00x, creating a positive feedback loop in which high-rank layers absorb more gradient while low-rank layers are progressively silenced.
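
To make the tested scheme concrete, the following is a minimal PyTorch sketch of proportional rank allocation from per-layer importance scores under a fixed total rank budget; the function name, the minimum-rank floor, and the rounding repair are illustrative assumptions rather than details taken from the paper.

    import torch

    def proportional_rank_allocation(importance, total_budget, min_rank=1):
        """Distribute a fixed total LoRA rank budget across layers in proportion
        to per-layer importance scores (e.g. mean gradient norm per layer).

        importance:   1-D tensor of non-negative per-layer scores
        total_budget: total rank to distribute (same budget as the uniform baseline)
        min_rank:     floor so no layer is allocated rank 0 (illustrative choice)
        """
        weights = importance / importance.sum()
        ranks = torch.clamp((weights * total_budget).round().long(), min=min_rank)
        # Crude repair of rounding drift: absorb the residual into the most
        # important layer so both schemes land on exactly the same budget.
        ranks[importance.argmax()] += total_budget - int(ranks.sum())
        return ranks

    # Toy scores, not the paper's measurements: 4 layers, total rank budget 32.
    importance = torch.tensor([0.012, 0.031, 0.018, 0.025])
    print(proportional_rank_allocation(importance, total_budget=32))  # tensor([ 4, 12,  7,  9])

The uniform baseline the paper compares against simply gives every layer total_budget // num_layers; the experiment holds the budget fixed and varies only the split.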

What carries the argument

The gradient amplification effect under GRPO, in which non-uniform LoRA ranks increase the max-to-min gradient magnitude ratio from 2.17x to 3.00x and thereby create a positive feedback loop that silences low-rank layers.
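
A minimal sketch, assuming a PyTorch model with LoRA adapters, of how the per-layer gradient magnitudes and the max-to-min importance ratio behind the 2.17x and 3.00x figures could be tracked during GRPO training; the parameter-name pattern used to group weights by layer and the choice to sum gradient norms per layer are assumptions, not the paper's exact profiling procedure.

    import re
    from collections import defaultdict

    def layer_gradient_norms(model, adapter_keyword="lora"):
        """Sum gradient norms of trainable LoRA parameters per transformer layer.

        Layer indices are parsed from parameter names of the form
        '...layers.<idx>. ...'; the pattern is an assumption about the
        base model's naming, not something specified in the paper.
        """
        per_layer = defaultdict(float)
        for name, param in model.named_parameters():
            if param.grad is None or adapter_keyword not in name:
                continue
            match = re.search(r"layers\.(\d+)\.", name)
            if match:
                per_layer[int(match.group(1))] += param.grad.norm().item()
        return per_layer

    def max_min_ratio(per_layer):
        """Max-to-min layer importance ratio (the 2.17x / 3.00x statistic)."""
        values = [v for v in per_layer.values() if v > 0]
        return max(values) / min(values) if values else float("nan")

    # Inside the training loop, after loss.backward() and before optimizer.step():
    #     norms = layer_gradient_norms(model)
    #     ratio = max_min_ratio(norms)  # log this per step, optionally smoothed

Tracking this ratio separately for the uniform and proportional runs is what would expose the widening from 2.17x to 3.00x that the paper reports.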

If this is right

  • Uniform rank allocation avoids the feedback loop and preserves higher accuracy under GRPO.
  • Gradient importance measured at the start of training is not a reliable predictor of the capacity a layer needs during reinforcement learning.
  • Naive transfer of SFT-era proportional LoRA allocation to alignment training should be avoided.
  • All layers carry meaningful gradient signal under GRPO, unlike the highly skewed importance patterns reported for SFT.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The flatter gradient profile may stem from the relative nature of policy optimization rather than absolute token prediction, suggesting a need for allocation strategies tailored to RL objectives.
  • Future work could test whether dynamic rank adjustment during training, rather than static allocation, mitigates the amplification effect.
  • The result may extend to other RL-based alignment methods that rely on relative advantage signals instead of supervised losses.

Load-bearing premise

That the gradient magnitudes measured on Qwen 2.5 1.5B with GSM8K reliably indicate the capacity each layer needs and that the observed performance gap generalizes to other models, tasks, and GRPO implementations.

What would settle it

Repeating the uniform-versus-proportional comparison on a different base model or task under the same GRPO setup and checking whether the 4.5-point gap persists or reverses.

Figures

Figures reproduced from arXiv: 2605.07366 by Yash Ganpat Sawant.

Figure 1
Figure 1. Reward sensitivity map: per-layer gradient magnitude during GRPO training (layer index vs. training step, smoothed gradient norm). Key observation: flat distribution, with a max/min importance ratio of 2.17× (Layer 15 hottest at 4.68%, Layer 26 coldest).
Figure 2
Figure 2. Normalized layer importance under uniform vs. proportional allocation. Non-uniform allocation amplifies the original importance spread from 2.17× to 3.00×.
read the original abstract

Adaptive rank allocation for LoRA, allocating more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT, the max-to-min layer importance ratio is only 2.17x, compared to >10x reported in SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification effect: non-uniform allocation widens the importance spread from 2.17x to 3.00x, creating a positive feedback loop where high-rank layers absorb more gradient while low-rank layers are progressively silenced. Our results suggest that gradient importance does not predict capacity requirements under RL, and that naive transfer of SFT-era rank allocation to alignment training should be avoided.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that gradient-based proportional LoRA rank allocation, which improves efficiency under SFT, fails to transfer to GRPO reinforcement learning. On Qwen 2.5 1.5B with GSM8K, proportional allocation yields 70.0% accuracy versus 74.5% for uniform allocation at identical parameter budgets. The authors attribute the 4.5-point gap to a flatter GRPO gradient landscape (max-to-min ratio 2.17x vs. >10x in SFT literature) where all layers carry signal, plus a positive feedback loop in which non-uniform ranks amplify the importance spread to 3.00x.

Significance. If the reported gradient flatness and amplification effects hold, the work supplies concrete empirical evidence that SFT-era adaptive LoRA heuristics are not reliable under GRPO-style RL, motivating the development of RL-specific rank allocation methods. The measurements of layer-wise gradient ratios and the observed feedback dynamic are potentially useful for practitioners tuning LoRA on alignment tasks.

major comments (2)
  1. [Abstract / Results] Abstract and results section: the central 4.5-point accuracy gap (70.0% vs. 74.5%) is reported without error bars, number of runs, random seeds, or statistical tests. Given that the claim rests on this difference being meaningful and reproducible, the absence of these details makes it impossible to judge whether the gap exceeds run-to-run variance.
  2. [Abstract] Abstract: the claim that gradient-based allocation 'should be avoided' for GRPO is supported only by measurements on a single 1.5B model and a single math task (GSM8K). The reported 2.17x flatness ratio and 3.00x amplification are presented as general properties of GRPO, yet no additional scales, architectures, or GRPO variants are shown; this single-experiment basis is load-bearing for the recommendation against transferring SFT methods.
minor comments (1)
  1. [Abstract] The abstract refers to 'gradient-magnitude profiling' without stating at which training step(s) the ratios are computed or whether they are averaged; a brief clarification would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, with revisions where feasible to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and results section: the central 4.5-point accuracy gap (70.0% vs. 74.5%) is reported without error bars, number of runs, random seeds, or statistical tests. Given that the claim rests on this difference being meaningful and reproducible, the absence of these details makes it impossible to judge whether the gap exceeds run-to-run variance.

    Authors: We agree that reporting the 4.5-point gap without error bars, run counts, seeds, or statistical tests limits the ability to assess its robustness against variance. In the revised manuscript, we will add results from five independent runs using distinct random seeds, include standard-deviation error bars on all relevant accuracy figures and tables, and report a paired t-test p-value assessing the statistical significance of the difference (see the sketch after these responses). These updates will appear in both the abstract and results section. revision: yes

  2. Referee: [Abstract] Abstract: the claim that gradient-based allocation 'should be avoided' for GRPO is supported only by measurements on a single 1.5B model and a single math task (GSM8K). The reported 2.17x flatness ratio and 3.00x amplification are presented as general properties of GRPO, yet no additional scales, architectures, or GRPO variants are shown; this single-experiment basis is load-bearing for the recommendation against transferring SFT methods.

    Authors: We acknowledge that the study uses a single model scale and task, which restricts the generality of the observed flatness ratio and amplification effect. The core contribution is an empirical demonstration that SFT-style gradient allocation fails to transfer under GRPO in this setting. We have revised the abstract to replace the prescriptive phrasing 'should be avoided' with 'may not transfer reliably to GRPO, motivating RL-specific methods,' better aligning the language with the evidence presented. Broader validation would be valuable but is not feasible within current resource constraints. revision: partial

standing simulated objections not resolved
  • Additional experiments across multiple model scales, architectures, and GRPO variants to establish broader properties of GRPO gradient landscapes.
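
The seed-paired comparison promised in response 1 reduces to a short computation; the sketch below assumes one GSM8K accuracy per seed for each allocation scheme and uses SciPy's paired t-test. The function name is illustrative and the inputs are placeholders, not reported results.

    import numpy as np
    from scipy import stats

    def compare_allocations(uniform_acc, proportional_acc):
        """Seed-paired comparison of GSM8K accuracies for the two schemes.

        uniform_acc, proportional_acc: one accuracy per seed, same seeds in
        the same order for both schemes. Placeholder inputs only; the paper
        reports single-run accuracies of 74.5% and 70.0%.
        """
        uniform_acc = np.asarray(uniform_acc, dtype=float)
        proportional_acc = np.asarray(proportional_acc, dtype=float)
        t_stat, p_value = stats.ttest_rel(uniform_acc, proportional_acc)
        return {
            "mean_gap_points": uniform_acc.mean() - proportional_acc.mean(),
            "uniform_std": uniform_acc.std(ddof=1),
            "proportional_std": proportional_acc.std(ddof=1),
            "paired_t": t_stat,
            "p_value": p_value,
        }

    # Usage: pass the per-seed accuracies from the promised five runs, e.g.
    #     compare_allocations(uniform_acc=[...], proportional_acc=[...])

Reporting the mean gap with per-scheme standard deviations alongside the p-value would directly answer whether the 4.5-point difference exceeds run-to-run variance.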

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential fits.

full rationale

The paper conducts direct experiments profiling gradient magnitudes on Qwen 2.5 1.5B under GRPO with GSM8K, then measures accuracy for uniform vs. proportional LoRA rank allocations using identical parameter budgets. Reported values (74.5% uniform vs. 70.0% proportional; gradient ratio 2.17x widening to 3.00x) are observed experimental outcomes, not quantities obtained by fitting parameters to a subset and renaming the fit as a prediction. No equations, ansatzes, uniqueness theorems, or self-citations are invoked to derive the central claims; the work contains no derivation chain that reduces to its own inputs. The findings are falsifiable via replication on other models/tasks and stand as independent empirical evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that gradient magnitude measured during GRPO training is a meaningful indicator of per-layer capacity requirements; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Gradient magnitude during training is a valid proxy for the importance of a layer when deciding LoRA rank allocation.
    The proportional allocation strategy is constructed directly from measured gradient magnitudes.

pith-pipeline@v0.9.0 · 5520 in / 1387 out tokens · 54771 ms · 2026-05-11T02:15:02.901836+00:00 · methodology
