FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation

Ramakrishnan Sathyavageeswaran

arxiv: 2605.16800 · v1 · pith:7P25Z3VUnew · submitted 2026-05-16 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-19 21:14 UTC · model grok-4.3

keywords: LoRA, rank allocation, gradient variance, Fisher information, parameter-efficient fine-tuning, layer importance, transformer adaptation

The pith

A LoRA adapter whose fixed rank budget is reallocated according to per-layer gradient variances, measured in eight calibration backward passes, matches the performance of uniform-rank LoRA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the observation that standard LoRA applies the same rank to every adapted matrix even though layers contribute unequally to a given task. It proposes a short calibration step, run before fine-tuning begins and consisting of eight backward passes, to estimate the gradient variance of each LoRA-B matrix. These variances serve as a proxy for layer informativeness and are used to redistribute the total rank budget proportionally. The outcome is still an ordinary LoRA adapter that needs no extra parameters, no overhead during training itself, and no changes at inference time. Experiments on GLUE with DeBERTa-v3-base (88.6 vs. 88.7) and commonsense reasoning with LLaMA-3-8B (68.5 vs. 68.7) report that the redistributed ranks stay within 0.1-0.2 points of the uniform baseline.

Core claim

An efficient, memory-light approximation to the diagonal of the empirical Fisher Information Matrix, restricted to the LoRA adapter matrices and computed from only eight calibration examples, yields per-layer gradient variances that can be used to allocate ranks proportionally inside a fixed total budget while leaving the adapter format, training procedure, and serving stack unchanged.
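To make the estimate concrete, here is a minimal Python sketch; it is not the author's implementation. It assumes the gradients of the loss with respect to each LoRA-B matrix have already been collected, one dictionary per calibration pass, as nested lists. The function name `efim_layer_scores`, the dictionary layout, and the choice to average over matrix entries to obtain one scalar per matrix are illustrative assumptions. The quantity computed is the mean squared gradient (the uncentered form of the empirical Fisher diagonal); the paper's exact reduction may differ.

```python
from typing import Dict, List, Sequence

Matrix = Sequence[Sequence[float]]


def efim_layer_scores(grads_per_pass: List[Dict[str, Matrix]]) -> Dict[str, float]:
    """Mean squared gradient per LoRA-B matrix over T calibration passes.

    grads_per_pass[t][name] is the gradient at pass t for the LoRA-B matrix
    called `name` (hypothetical layout, chosen for illustration only).
    """
    if not grads_per_pass:
        raise ValueError("need at least one calibration pass")
    num_passes = len(grads_per_pass)
    squared_sums: Dict[str, float] = {}
    entry_counts: Dict[str, int] = {}
    for grads in grads_per_pass:
        for name, grad in grads.items():
            # Accumulate the sum of squared gradient entries for this pass.
            pass_sum = sum(g * g for row in grad for g in row)
            squared_sums[name] = squared_sums.get(name, 0.0) + pass_sum
            entry_counts[name] = sum(len(row) for row in grad)
    # Average over passes (the 1/T factor) and over entries (an assumption),
    # so that every matrix is summarised by a single scalar score.
    return {
        name: squared_sums[name] / (num_passes * entry_counts[name])
        for name in squared_sums
    }
```

Only the adapter gradients are touched here, which is the restriction the core claim relies on; nothing about the frozen base weights needs to be stored.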

What carries the argument

Gradient variance of each LoRA-B matrix, estimated from eight calibration backward passes and used as a proxy for layer informativeness to set per-layer ranks.
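The proportional redistribution can be sketched as follows. This is a hedged illustration rather than the paper's procedure: the function name `allocate_ranks`, the `min_rank` floor, and the largest-remainder rounding used to keep the total exactly equal to the budget are all assumptions that the source does not specify.

```python
from typing import Dict


def allocate_ranks(
    scores: Dict[str, float], total_rank: int, min_rank: int = 1
) -> Dict[str, int]:
    """Split a fixed total rank across layers in proportion to their scores."""
    names = list(scores)
    if total_rank < min_rank * len(names):
        raise ValueError("budget is too small for the minimum rank per layer")
    if any(s < 0 for s in scores.values()):
        raise ValueError("scores must be non-negative")

    # Rank left to distribute after every layer receives its floor (assumption).
    spare = total_rank - min_rank * len(names)
    total_score = sum(scores.values())
    if total_score <= 0:
        # No signal at all: fall back to a uniform split.
        shares = {n: spare / len(names) for n in names}
    else:
        shares = {n: spare * scores[n] / total_score for n in names}

    # Round down first, then hand out the remaining units by largest remainder
    # so that the ranks sum to exactly total_rank.
    ranks = {n: min_rank + int(shares[n]) for n in names}
    leftover = total_rank - sum(ranks.values())
    by_remainder = sorted(
        names, key=lambda n: shares[n] - int(shares[n]), reverse=True
    )
    for n in by_remainder[:leftover]:
        ranks[n] += 1
    return ranks
```

For example, with scores of 1, 1, and 2 for three hypothetical matrices and a total budget of 24, the sketch returns ranks 6, 6, and 12, while equal scores reproduce the uniform allocation of 8 each.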

If this is right

The final adapter remains a standard LoRA object and can be used with any existing LoRA-compatible inference engine.
Higher ranks are consistently assigned to value projections and early-to-middle layers, matching prior observations about transformer roles.
The calibration step reduces memory for the rank-allocation computation by roughly 256 times relative to a full-model Fisher estimate.
Performance stays within 0.2 points of uniform LoRA on both GLUE classification and commonsense reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same variance signal could be tested as a cheap importance measure for deciding which layers to adapt at all rather than only how much rank to give them.
If the eight-pass estimate generalizes across tasks, the method could be run once on a generic calibration set and reused for multiple downstream adaptations of the same base model.
The resulting per-layer rank maps supply an explicit, inspectable record of which parts of the network the adaptation actually used.

Load-bearing premise

Gradient variance observed over eight calibration passes is a stable and task-relevant measure of how much each layer contributes to adaptation.

What would settle it

On a new task, uniform-rank LoRA at the same total parameter count substantially outperforms the variance-based rank allocation.

Figures

Figures reproduced from arXiv: 2605.16800 by Ramakrishnan Sathyavageeswaran.

Original abstract

Low-rank adaptation (LoRA) assigns a uniform rank to every adapted weight matrix - a practical convenience that ignores a fundamental reality: different layers contribute unequally to task adaptation. We address this with a lightweight engineering solution: before fine-tuning begins, run eight calibration backward passes, compute the gradient variance of each LoRA-B matrix as a proxy for layer informativeness, and redistribute the rank budget proportionally. The resulting adapter is a standard LoRA with a per-layer rank pattern - no new parameters, no training overhead, no changes to serving infrastructure. We implement this via an efficient approximation of the empirical Fisher Information Matrix (eFIM) diagonal, restricted to LoRA adapter matrices only, which reduces memory cost by approximately 256x compared to full-model Fisher estimation. On GLUE with DeBERTa-v3-base, FIM-LoRA matches LoRA (88.6 vs. 88.7) at the same parameter budget, and on commonsense reasoning with LLaMA-3-8B reaches 68.5 vs. 68.7 for LoRA. The per-layer rank maps are interpretable: value projections and early-to-middle layers consistently receive higher rank, consistent with established findings on transformer layer roles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FIM-LoRA reallocates a fixed LoRA rank budget using gradient variance from eight calibration passes but delivers only parity with uniform-rank LoRA on the reported tasks.

The key takeaway is that FIM-LoRA reallocates a fixed LoRA rank budget across layers based on gradient variance computed during a short calibration phase, and it matches standard LoRA performance on GLUE and commonsense reasoning tasks. What stands out as new is the use of an efficient eFIM diagonal approximation, limited to the LoRA adapter matrices, to estimate per-layer informativeness from just eight backward passes. This avoids the high memory cost of full-model Fisher estimation while producing a standard LoRA adapter with varying ranks.

The approach has some practical strengths. It requires no changes to training or inference pipelines beyond the initial calibration, and the per-layer rank maps are interpretable, assigning higher ranks to value projections and early-to-middle layers in line with prior observations on transformer components.

That said, the evidence for the method's effectiveness is limited. The reported results show only very close performance to uniform LoRA (88.6 versus 88.7 on GLUE with DeBERTa-v3-base and 68.5 versus 68.7 on commonsense with LLaMA-3-8B) at the same parameter count. Without error bars, statistical tests, or ablations on the number of calibration passes, it is difficult to know whether the variance proxy reliably identifies important layers or whether the allocation ends up similar to uniform because of noisy estimates. The concern that eight passes may be insufficient for a stable ranking is worth checking against the full paper's experiments.

This paper is for researchers and engineers focused on parameter-efficient fine-tuning of large models. Someone looking for small, implementable tweaks to LoRA that might improve adaptation without extra cost would find it relevant. I think it deserves peer review. The idea is concrete and the implementation details appear reproducible, so a referee could help clarify the robustness of the gradient variance proxy and whether it provides consistent gains across more tasks and models.

Referee Report

2 major / 1 minor

Summary. The paper presents FIM-LoRA, a method for allocating ranks in LoRA adapters in a task-informative manner. It uses eight calibration backward passes to compute gradient variances of LoRA-B matrices via an efficient eFIM diagonal approximation, then redistributes the total rank budget proportionally to these variances. The resulting LoRA adapter is claimed to match the performance of standard uniform-rank LoRA on GLUE with DeBERTa-v3-base (88.6 vs 88.7) and commonsense reasoning with LLaMA-3-8B (68.5 vs 68.7), while offering interpretable per-layer rank patterns without additional parameters or overhead.

Significance. Should the central claim hold under rigorous validation, this work provides a lightweight, practical solution for adaptive rank allocation in parameter-efficient fine-tuning. The engineering efficiency of the eFIM approximation (256x memory reduction) and the lack of changes to training or inference pipelines are strengths. The interpretability of the rank allocations, aligning with known roles of transformer layers, is a positive aspect. However, the near-identical performance numbers indicate that the significance is primarily in the proposed allocation strategy rather than empirical superiority.

major comments (2)

[Abstract and experimental results] The reported performance metrics lack error bars, results from multiple random seeds, or any statistical significance testing. Since the key claim is that FIM-LoRA matches LoRA performance at the same parameter budget, the absence of these makes it impossible to assess whether the observed differences (e.g., 88.6 vs. 88.7) reflect true equivalence or are due to random variation.
[Method (calibration phase)] The method depends on gradient variance estimates from only eight calibration backward passes to determine layer informativeness. There is no ablation study varying the number of passes or direct verification that these variances correlate with the layers' actual contribution to task performance. This is critical because with such a small number of samples, the estimates may be dominated by noise rather than signal, potentially undermining the reliability of the rank redistribution.
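The first major comment can be made operational with a small check once per-seed scores exist. The sketch below is an illustrative assumption, not part of the paper or the report: the function names are hypothetical, the per-seed scores must be supplied by whoever reruns the experiments, and the comparison is a crude heuristic (gap versus seed-to-seed standard deviation), not a formal significance test.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores), stdev(scores)


def gap_within_noise(
    method_scores: Sequence[float],
    baseline_scores: Sequence[float],
    num_stdevs: float = 1.0,
) -> bool:
    """Heuristic: is the mean gap no larger than the seed-to-seed spread?"""
    method_mean, method_std = summarize(method_scores)
    baseline_mean, baseline_std = summarize(baseline_scores)
    gap = abs(method_mean - baseline_mean)
    # Compare against the larger of the two spreads (a conservative choice).
    return gap <= num_stdevs * max(method_std, baseline_std)
```

A result of `True` would be consistent with the parity claim but would not prove equivalence; a proper test with more seeds would still be needed.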

minor comments (1)

[Abstract] Clarify the exact basis for the 'approximately 256x' memory cost reduction compared to full-model Fisher estimation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We provide point-by-point responses to the major comments below, and we are committed to incorporating revisions that address the concerns raised to improve the rigor of our presentation.

Referee: [Abstract and experimental results] The reported performance metrics lack error bars, results from multiple random seeds, or any statistical significance testing. Since the key claim is that FIM-LoRA matches LoRA performance at the same parameter budget, the absence of these makes it impossible to assess whether the observed differences (e.g., 88.6 vs. 88.7) reflect true equivalence or are due to random variation.

Authors: We agree that reporting statistical measures would strengthen the equivalence claim. The current manuscript presents results from representative runs without error bars, which is not uncommon in initial reports of PEFT methods. To address this, we will rerun the experiments with multiple random seeds (specifically, 5 seeds) and include mean performance with standard deviations in the revised tables for both GLUE and commonsense reasoning tasks. We will also note that the observed differences fall within the typical variance seen in such fine-tuning experiments.

revision: yes

Referee: [Method (calibration phase)] The method depends on gradient variance estimates from only eight calibration backward passes to determine layer informativeness. There is no ablation study varying the number of passes or direct verification that these variances correlate with the layers' actual contribution to task performance. This is critical because with such a small number of samples, the estimates may be dominated by noise rather than signal, potentially undermining the reliability of the rank redistribution.

Authors: The selection of eight calibration backward passes was intended to minimize the computational overhead of the calibration phase while still capturing meaningful gradient variance information through our efficient eFIM diagonal approximation. We recognize that an ablation on the number of passes and a direct correlation analysis are absent from the initial submission. In the revision, we will add an ablation study showing performance for 4, 8, and 16 passes, demonstrating that 8 provides a good trade-off with stable rank allocations. Regarding direct verification, while we do not perform layer-wise removal experiments, the alignment of the allocated ranks with known transformer layer functionalities (e.g., higher ranks for value projections) and the matching task performance provide indirect support for the proxy's validity. We will expand the discussion section to include these points.

revision: yes
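One way to check whether rank allocations are stable across 4, 8, and 16 passes is to compare the layer ordering produced by each setting. The sketch below computes a Spearman rank correlation between two sets of per-layer scores in plain Python. It is an assumption about how such an ablation could be scored, not something the paper or the rebuttal describes; the function names and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def _average_ranks(values: List[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Spearman correlation between two per-layer score dictionaries."""
    names = sorted(scores_a)
    if names != sorted(scores_b):
        raise ValueError("both score sets must cover the same layers")
    ranks_a = _average_ranks([scores_a[n] for n in names])
    ranks_b = _average_ranks([scores_b[n] for n in names])
    mean_a = sum(ranks_a) / len(ranks_a)
    mean_b = sum(ranks_b) / len(ranks_b)
    # Pearson correlation computed on the ranks.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ranks_a, ranks_b))
    var_a = sum((x - mean_a) ** 2 for x in ranks_a)
    var_b = sum((y - mean_b) ** 2 for y in ranks_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_a * var_b) ** 0.5
```

A correlation close to 1 between the 8-pass and 16-pass scores would suggest that the ordering has settled; a low value would support the referee's concern that eight passes are dominated by noise.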

Circularity Check

0 steps flagged

No significant circularity; rank allocation uses calibration gradients collected independently, before fine-tuning begins

The derivation computes per-layer gradient variance of LoRA-B matrices via an eFIM-diagonal approximation on eight calibration backward passes, then redistributes a fixed total rank budget proportionally before fine-tuning begins. This produces a standard LoRA adapter whose performance is measured empirically after training. The allocation step is not defined in terms of final task metrics, not fitted to target performance numbers, and does not rely on self-citations or prior uniqueness theorems by the same authors. The reported parity results (88.6 vs 88.7 on GLUE, 68.5 vs 68.7 on commonsense) are post-hoc empirical observations rather than quantities forced by construction from the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that gradient variance from a small number of passes is a faithful importance signal, plus the engineering choice of eight passes and a fixed total rank budget.

free parameters (2)

number of calibration backward passes
Fixed at eight as a lightweight choice; directly affects variance estimate stability and therefore the resulting rank map.
total rank budget
Kept identical to the uniform LoRA baseline so that any performance difference is attributed solely to allocation.

axioms (1)

domain assumption: Gradient variance of LoRA-B matrices is a valid proxy for layer informativeness
Invoked to justify proportional redistribution of ranks before any task-specific training occurs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction (8-tick period) · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "run eight calibration backward passes, compute the gradient variance of each LoRA-B matrix as a proxy for layer informativeness, and redistribute the rank budget proportionally... T = 8 throughout unless otherwise specified"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost) · unclear

UNCLEAR: Relation between the paper passage and the cited Recognition theorem.

Paper passage: "the diagonal of the empirical Fisher Information Matrix is the expected squared gradient: $F_{ii} = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{\partial L_t}{\partial \theta_i} \right)^2$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
