arxiv: 2605.01046 · v2 · submitted 2026-05-01 · 💻 cs.LG

Recognition: unknown

Learning in the Fisher Subspace: A Guided Initialization for LoRA Fine-Tuning

Zhi-Quan Feng , Ying-Jia Lin , Hung-Yu Kao

Authors on Pith no claims yet

Pith reviewed 2026-05-09 19:30 UTC · model grok-4.3

classification 💻 cs.LG

keywords LoRAfine-tuningFisher informationinitializationlarge language modelslow-rank adaptationparameter-efficient tuningcurvature

0 comments

The pith

Using Fisher curvature from downstream data to initialize LoRA subspaces improves fine-tuning performance over weight-only methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LoRA adaptation benefits from choosing low-rank update directions based on how they affect model behavior on the specific downstream data, rather than solely on the structure of the pre-trained weights. It proposes using the Fisher information matrix to capture this sensitivity through curvature induced by the target distribution. A sympathetic reader would care because standard initializations can allocate the limited low-rank capacity to irrelevant directions, limiting how well the model adapts. If this data-aware approach works, it means more effective and efficient fine-tuning of large models without increasing the rank or training time.

Core claim

LoRA initialization can be reformulated as identifying parameter directions with high impact on predictions under the downstream data distribution. By leveraging the Fisher information to quantify the curvature of the loss landscape with respect to these directions, the method selects subspaces that align adaptation more closely with the target objective, leading to better downstream performance.

What carries the argument

The Fisher information matrix computed from the downstream data, which measures the sensitivity of model predictions to parameter perturbations and guides the selection of LoRA adaptation directions.

If this is right

LoRA fine-tuning with Fisher-guided initialization achieves higher performance on diverse tasks and modalities compared to existing weight-based initializations.
The approach provides a task-dependent criterion for subspace selection without relying on assumptions about weight geometry alone.
Data-aware sensitivity governs better allocation of adaptation capacity in low-rank updates.
Empirical improvements hold across multiple modalities and tasks, suggesting broad applicability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could imply that similar curvature-based initialization might benefit other parameter-efficient fine-tuning methods like adapters or prefix tuning.
Exploring how to efficiently approximate the Fisher matrix for very large models could extend the practicality of this method.
Connections to natural gradient descent suggest that this initialization might reduce the number of training steps needed for convergence.

Load-bearing premise

The curvature information from the downstream data distribution accurately reflects which parameter directions most strongly influence the model's performance on the target task.

What would settle it

Observing that on a range of standard benchmarks the Fisher-guided LoRA performs similarly or worse than random or SVD-based initialization would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.01046 by Hung-Yu Kao, Ying-Jia Lin, Zhi-Quan Feng.

**Figure 1.** Figure 1: Experiments comparing singular-direction selection and magnitude-scaling strategies for LoRA initialization. For the 32 samples, panels (a) and (c) sort directions by singular values, while (b) and (d) sort them by their Fisher Energy values. Results are obtained on ARC-Challenge and BoolQ using Llama2-7B with rank = 32. The horizontal axis denotes the index of the sorted experiments. Scatter points show t… view at source ↗

**Figure 2.** Figure 2: Overview of the proposed Fisher-Guided LoRA Initialization framework. The three subfigures correspond to its key components: (a) Fisher Factor Computation, where we compute the Fisher information using Kronecker-factored statistics using a minibatch of data; (b) Fisher-Aligned Direction Selection, where we identify Fisher-aligned directions by projecting onto surrogate bases derived from pre-trained weight… view at source ↗

**Figure 3.** Figure 3: Experimental results of varying LoRA ranks on Llama2- 7B. Average accuracy across reasoning tasks is reported. We further investigate the impact of LoRA rank on finetuning performance [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Extra initialization time comparison. We report the total additional initialization time (in seconds) for different LoRA initialization methods using an input length of 512, rank r = 32, and BF16 precision, measured on a single NVIDIA A100 GPU. Three model scales are evaluated: Llama3.2-1B (“1B”), Llama3.2- 3B (“3B”), and Llama3-8B (“8B”). For KaSA, we report the full initialization time, whereas for LoRA-… view at source ↗

**Figure 6.** Figure 6: Full results of the ablation study on the Llama2-7B model. 80 160 320 480 640 Full 74.5 75.0 75.5 76.0 ACC BoolQ 80 160 320 480 640 Full 89.5 90.0 90.5 91.0 ACC PIQA 80 160 320 480 640 Full 82.0 82.5 83.0 ACC SIQA 80 160 320 480 640 Full 95.0 95.5 96.0 ACC HellaS. 80 160 320 480 640 Full 88.0 88.5 89.0 ACC WinoG. 80 160 320 480 640 Full 91.0 91.5 92.0 ACC ARC-e 80 160 320 480 640 Full 79.5 80.0 80.5 ACC AR… view at source ↗

**Figure 7.** Figure 7: Full results of the ablation study on the Llama3-8B model. The complete experimental results show that FILet exhibits robust and stable performance across a wide range of minibatch sizes. In general, increasing the minibatch size yields more accurate estimates of the empirical second-moment statistics, which in turn leads to improved downstream adaptation performance. Nevertheless, FILet remains competitiv… view at source ↗

**Figure 8.** Figure 8: Direction overlap matrices different tasks using Llama2-7B as the base model. From these visualizations, we observe that ”ARC-e” and ”ARC-c” exhibit a notably high degree of direction overlap, which is expected since they are essentially two subsets of the same benchmark. Beyond this pair, most task combinations display relatively low overlap in their selected adaptation directions, indicating that FILet c… view at source ↗

**Figure 9.** Figure 9: Direction overlap matrices different tasks using Llama3-8B as the base model. H. Limitations Compared to SVD-based initialization methods, FILet incurs additional memory overhead during the initialization phase to compute and store empirical second-moment statistics. While this overhead is not significant in most scenarios, it may become a practical challenge when adapting extremely large models or deployi… view at source ↗

read the original abstract

LoRA adapts large language models (LLMs) by restricting updates to low-rank subspaces of pre-trained weights. While this substantially reduces training cost, the effectiveness of adaptation critically depends on which subspace is chosen at initialization: a poor initialization that allocates capacity to task-irrelevant directions can severely hinder downstream performance. Existing initialization strategies primarily rely on the intrinsic properties of pre-trained weights, implicitly assuming that weight geometry alone reflects task relevance. However, such criteria overlook how the model interacts with the downstream data distribution. In this work, we formulate LoRA initialization as identifying the degree of impact of directions in parameter space under the target data distribution. We argue that data-aware sensitivity, rather than weight-only magnitude, should govern the choice of adaptation subspaces. Building on this perspective, we propose a Fisher-guided framework that leverages curvature information induced by downstream data to characterize how parameter perturbations influence model predictions. This perspective yields a principled, task-dependent criterion for selecting LoRA directions that better align adaptation with the target objective. Empirical results across diverse tasks and modalities demonstrate that data-aware initialization consistently and significantly improves downstream performance over existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fisher-guided LoRA init is a clean data-dependent framing but the abstract gives no numbers or controls to judge whether it actually works.

read the letter

The main contribution is treating LoRA subspace choice as a sensitivity problem under the downstream data distribution and using the Fisher information matrix to pick directions. This is distinct from the weight-magnitude or SVD baselines referenced in the abstract, and the motivation that purely intrinsic weight properties can miss task relevance is stated plainly and logically. The paper does a decent job spelling out why curvature information induced by the target data might better align adaptation capacity with the objective. That perspective is coherent and worth considering for anyone tuning large models on specific tasks. The soft spots are clear from the abstract alone. No quantitative results, ablations, or implementation details are supplied, so the claim of consistent significant gains cannot be evaluated. The stress-test concern lands: the Fisher matrix captures likelihood sensitivity, but that does not automatically identify directions whose adjustment reduces fine-tuning loss most, particularly under distribution shift or when the base model is already decent on the target data. Without experiments that isolate the criterion or show it beats generic data-dependent inits, the gains could be explained by other factors. This is aimed at researchers working on parameter-efficient fine-tuning who are open to trying curvature-based heuristics. It deserves peer review if the full paper contains careful controls and reproducible results, because the idea is straightforward and the motivation holds up even if the mapping from Fisher to optimal directions still needs stronger support.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Fisher-guided initialization for LoRA fine-tuning of large models. It computes an approximation to the Fisher information matrix on downstream task data, extracts leading eigenvectors to define task-relevant parameter directions, and initializes the low-rank LoRA factors along those directions rather than using magnitude-based or random criteria derived only from pre-trained weights. The central claim is that this data-aware curvature criterion yields consistently better downstream performance across tasks and modalities.

Significance. If the reported gains are robust and the Fisher directions demonstrably align with loss reduction on the target objective, the method supplies a principled, task-dependent alternative to heuristic LoRA initializations. This could improve sample efficiency and final accuracy in parameter-efficient adaptation of large models while remaining computationally lightweight.

major comments (2)

[§3.2] §3.2, Eq. (7): the claim that the top eigenvectors of the (Monte-Carlo approximated) Fisher matrix identify directions whose perturbations most reduce the fine-tuning loss is not directly tested; the manuscript should add a controlled measurement of loss sensitivity (e.g., directional derivatives or finite-difference loss change) along Fisher vs. random vs. gradient-magnitude directions on held-out target data.
[Table 3] Table 3 (main results): the reported improvements over baselines are presented without per-task standard deviations across random seeds or statistical significance tests; this weakens the assertion of 'consistent and significant' gains, especially given that LoRA performance is known to be sensitive to initialization variance.

minor comments (2)

The distinction between the 'empirical Fisher' and the 'true Fisher' (model predictive distribution) is mentioned only briefly; an explicit equation for the Monte-Carlo estimator used in practice would improve reproducibility.
Figure 2 caption should state the exact number of samples and the random seed used for the Fisher approximation so that the curvature estimate can be replicated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical support and statistical rigor of our claims.

read point-by-point responses

Referee: [§3.2] §3.2, Eq. (7): the claim that the top eigenvectors of the (Monte-Carlo approximated) Fisher matrix identify directions whose perturbations most reduce the fine-tuning loss is not directly tested; the manuscript should add a controlled measurement of loss sensitivity (e.g., directional derivatives or finite-difference loss change) along Fisher vs. random vs. gradient-magnitude directions on held-out target data.

Authors: We agree that a direct empirical verification of loss sensitivity would provide stronger support for the interpretation of Eq. (7). In the revised version we will add a controlled experiment on held-out target data that computes both finite-difference loss changes and directional derivatives along the top Fisher eigenvectors, compared against random directions and gradient-magnitude directions. This addition will directly test whether Fisher directions exhibit greater loss reduction under small perturbations. revision: yes
Referee: [Table 3] Table 3 (main results): the reported improvements over baselines are presented without per-task standard deviations across random seeds or statistical significance tests; this weakens the assertion of 'consistent and significant' gains, especially given that LoRA performance is known to be sensitive to initialization variance.

Authors: We acknowledge that the absence of per-task variability measures and significance testing limits the strength of our claims. We will rerun all experiments with at least five independent random seeds, report per-task standard deviations in the revised Table 3, and include paired statistical significance tests (e.g., Wilcoxon signed-rank or t-tests with appropriate correction) against the strongest baseline. These additions will quantify robustness to initialization variance and substantiate the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; Fisher-guided initialization applies standard curvature without reducing claims to fitted inputs

full rationale

The paper defines LoRA subspace selection via the Fisher information matrix computed on downstream data, using the standard definition E[∇log p(y|x;θ) ∇log p(y|x;θ)^T] to rank parameter directions by sensitivity. No equation or step equates the claimed performance gains to a quantity fitted from the same evaluation data by construction, nor does any self-citation chain justify the core criterion. Empirical results on diverse tasks serve as external validation rather than tautological confirmation. The derivation remains self-contained against the pre-trained weights and target distribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The central assumption that Fisher curvature measures task-relevant sensitivity is treated as a domain assumption rather than derived.

axioms (1)

domain assumption The Fisher information matrix induced by the downstream data distribution characterizes the impact of parameter perturbations on model predictions.
This is the load-bearing premise that justifies selecting LoRA directions according to curvature rather than weight geometry.

pith-pipeline@v0.9.0 · 5498 in / 1207 out tokens · 24546 ms · 2026-05-09T19:30:10.890636+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 9 canonical work pages

[1]

Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo. 2022 , url=

2022
[2]

Fanxu Meng and Zhaohui Wang and Muhan Zhang , booktitle=. Pi. 2024 , url=

2024
[3]

doi: 10.18653/v1/2025.naacl-long.248

Wang, Hanqing and Li, Yixia and Wang, Shuo and Chen, Guanhua and Chen, Yun. M i L o RA : Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.1...

work page doi:10.18653/v1/2025.naacl-long.248 2025
[4]

2025 , eprint=

NEAT: Nonlinear Parameter-efficient Adaptation of Pre-trained Models , author=. 2025 , eprint=

2025
[5]

The Thirteenth International Conference on Learning Representations , year=

Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning , author=. The Thirteenth International Conference on Learning Representations , year=
[6]

Fan Wang and Juyong Jiang and Chansung Park and Sunghun Kim and Jing Tang , booktitle=. Ka. 2025 , url=

2025
[7]

Shihyang Liu and Chienyi Wang and Hongxu Yin and Pavlo Molchanov and Yu-Chiang Frank Wang and Kwang-Ting Cheng and Min-Hung Chen , booktitle=. Do. 2024 , url=

2024
[8]

2025 , eprint=

GoRA: Gradient-driven Adaptive Low Rank Adaptation , author=. 2025 , eprint=

2025
[9]

Chenghao Fan and Zhenyi Lu and Sichen Liu and Chengfeng Gu and Xiaoye Qu and Wei Wei and Yu Cheng , booktitle=. Make Lo. 2025 , url=

2025
[10]

The Twelfth International Conference on Learning Representations , year=

Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning , author=. The Twelfth International Conference on Learning Representations , year=
[11]

UORA : Uniform Orthogonal Reinitialization Adaptation in Parameter Efficient Fine-Tuning of Large Models

Zhang, Xueyan and Zhao, Jinman and Yang, Zhifei and Zhong, Yibo and Guan, Shuhao and Cao, Linbo and Wang, Yining. UORA : Uniform Orthogonal Reinitialization Adaptation in Parameter Efficient Fine-Tuning of Large Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/20...

work page doi:10.18653/v1/2025.acl-long.575 2025
[12]

Dawid Jan Kopiczko and Tijmen Blankevoort and Yuki M Asano , booktitle=. Ve. 2024 , url=

2024
[13]

The Eleventh International Conference on Learning Representations , year=

Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning , author=. The Eleventh International Conference on Learning Representations , year=
[14]

Yibo Yang and Xiaojie Li and Zhongzhu Zhou and Shuaiwen Leon Song and Jianlong Wu and Liqiang Nie and Bernard Ghanem , booktitle=. Cor. 2024 , url=

2024
[15]

Shaowen Wang and Linxi Yu and Jian Li , booktitle=. Lo. 2024 , url=

2024
[16]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Parameter Efficient Fine-tuning via Explained Variance Adaptation , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[17]

2024 , eprint=

SARA: Singular-Value Based Adaptive Low-Rank Adaption , author=. 2024 , eprint=

2024
[18]

The Thirteenth International Conference on Learning Representations , year=

Efficient Learning with Sine-Activated Low-Rank Matrices , author=. The Thirteenth International Conference on Learning Representations , year=
[19]

Juzheng Zhang and Jiacheng You and Ashwinee Panda and Tom Goldstein , booktitle=. Lo. 2025 , url=

2025
[20]

MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning , booktitle =

MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2025 , month=. doi:10.1609/aaai.v39i20.35509 , abstractNote=

work page doi:10.1609/aaai.v39i20.35509 2025
[21]

Chunlin Tian and Zhan Shi and Zhijiang Guo and Li Li and Cheng-zhong Xu , booktitle=. HydraLo. 2024 , url=

2024
[22]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina. B ool Q : Exploring the Surprising Difficulty of Natural Yes/No Questions. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long ...

work page doi:10.18653/v1/n19-1300 2019
[23]

2018 , eprint=

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , author=. 2018 , eprint=

2018
[24]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1260

work page doi:10.18653/v1/d18-1260 2018
[25]

H ella S wag: Can a Machine Really Finish Your Sentence?

Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin. H ella S wag: Can a Machine Really Finish Your Sentence?. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1472

work page doi:10.18653/v1/p19-1472 2019
[26]

2019 , eprint=

PIQA: Reasoning about Physical Commonsense in Natural Language , author=. 2019 , eprint=

2019
[27]

Social IQa: Commonsense Reasoning about Social Interactions

Sap, Maarten and Rashkin, Hannah and Chen, Derek and Le Bras, Ronan and Choi, Yejin. Social IQ a: Commonsense Reasoning about Social Interactions. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1454

work page doi:10.18653/v1/d19-1454 2019
[28]

2019 , eprint=

WinoGrande: An Adversarial Winograd Schema Challenge at Scale , author=. 2019 , eprint=

2019
[29]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

2023
[30]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

2024
[31]

3D Object Representations for Fine-Grained Categorization , year=

Krause, Jonathan and Stark, Michael and Deng, Jia and Fei-Fei, Li , booktitle=. 3D Object Representations for Fine-Grained Categorization , year=
[32]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Cimpoi, Mircea and Maji, Subhransu and Kokkinos, Iasonas and Mohamed, Sammy and Vedaldi, Andrea , title =. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =
[33]

Introducing Eurosat: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification , year=

Helber, Patrick and Bischke, Benjamin and Dengel, Andreas and Borth, Damian , booktitle=. Introducing Eurosat: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification , year=
[34]

2016 , eprint=

Traffic Sign Classification Using Deep Inception Based Convolutional Networks , author=. 2016 , eprint=

2016
[35]

Remote Sensing Image Scene Classification: Benchmark and State of the Art , year=

Cheng, Gong and Han, Junwei and Lu, Xiaoqiang , journal=. Remote Sensing Image Scene Classification: Benchmark and State of the Art , year=
[36]

and Oliva, Aude and Torralba, Antonio , booktitle=

Xiao, Jianxiong and Hays, James and Ehinger, Krista A. and Oliva, Aude and Torralba, Antonio , booktitle=. SUN database: Large-scale scene recognition from abbey to zoo , year=
[37]

Reading digits in natural images with unsupervised feature learning , author=
[38]

International Conference on Learning Representations , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=
[39]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

2021
[40]

2021 , eprint=

Program Synthesis with Large Language Models , author=. 2021 , eprint=

2021
[41]

Gonzalez and Ion Stoica , booktitle=

Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica , booktitle=. Judging. 2023 , url=

2023
[42]

The E 2 E dataset: New challenges for end-to-end generation

Novikova, Jekaterina and Du s ek, Ond r ej and Rieser, Verena. The E 2 E Dataset: New Challenges For End-to-End Generation. Proceedings of the 18th Annual SIG dial Meeting on Discourse and Dialogue. 2017. doi:10.18653/v1/W17-5525

work page doi:10.18653/v1/w17-5525 2017
[43]

ACM Trans

Weyssow, Martin and Kamanda, Aton and Zhou, Xin and Sahraoui, Houari , title =. ACM Trans. Softw. Eng. Methodol. , month = may, keywords =. 2025 , publisher =. doi:10.1145/3736407 , abstract =

work page doi:10.1145/3736407 2025
[44]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023
[45]

2024 , eprint=

Gemma: Open Models Based on Gemini Research and Technology , author=. 2024 , eprint=

2024
[46]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025
[47]

Proceedings of the 32nd International Conference on Machine Learning , pages =

Optimizing Neural Networks with Kronecker-factored Approximate Curvature , author =. Proceedings of the 32nd International Conference on Machine Learning , pages =. 2015 , editor =

2015
[48]

2014 , eprint=

Revisiting Natural Gradient for Deep Networks , author=. 2014 , eprint=

2014