arxiv: 2605.12207 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.CL


Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

Arijit Sehanobish, Charles Lovering


Pith reviewed 2026-05-13 06:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords low-rank adaptation · LoRA · parameter placement · gradient structure · supervised fine-tuning · GRPO · parameter-efficient fine-tuning


The pith

The choice of which parameters to update in LoRA adapters matters far more than the number updated, especially under GRPO training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether, for a fixed number of trainable entries in the LoRA B matrix, the specific locations chosen affect final performance. Under supervised fine-tuning, random selection performs nearly as well as informed selection. Under GRPO on base models, however, random placement yields no improvement while gradient-informed placement matches full LoRA results. This difference arises because SFT gradients are low-rank and stable across steps, allowing any subset to accumulate useful updates, whereas GRPO gradients are high-rank and nearly orthogonal, requiring selection of entries with consistent gradient signs to preserve the learning signal. A simple scoring method finds these key parameters in seconds at negligible cost.

Core claim

Under GRPO, only gradient-informed placement of the k trainable parameters in LoRA's B matrix recovers the accuracy of standard LoRA, while random placement fails to beat the base model; this occurs because GRPO gradients are high-rank and near-orthogonal across steps, so only consistently signed entries retain the update signal, unlike the low-rank stable gradients in SFT.
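The accumulation argument can be illustrated with a minimal NumPy sketch on synthetic data (this is not the paper's experiment). It sums the per-step gradients of a single entry under two hypothetical regimes: a gradient whose sign never changes, and one whose sign flips at random. The step count `T`, the unit magnitudes, and the seed are illustrative assumptions.

```python
import numpy as np

def accumulated_update(grads: np.ndarray) -> float:
    """Magnitude of the summed gradient for one parameter over all steps."""
    return float(abs(grads.sum()))

rng = np.random.default_rng(0)
T = 400  # number of training steps (illustrative assumption)

# Regime 1: the gradient keeps the same sign at every step.
consistent = np.ones(T)

# Regime 2: same magnitude, but the sign is drawn at random at each step.
oscillating = rng.choice([-1.0, 1.0], size=T)

consistent_total = accumulated_update(consistent)    # grows linearly in T
oscillating_total = accumulated_update(oscillating)  # grows roughly like sqrt(T)
```

The consistently signed entry accumulates an update proportional to `T`, while the oscillating entry behaves like a random walk, which is one way to read the claim that only consistently signed entries retain the signal.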

What carries the argument

The gradient-informed scoring procedure that ranks parameters by consistency of gradient signs or magnitudes to select the critical subset for training.
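A minimal sketch of what such a scoring step might look like, assuming per-example gradients of the flattened B entries are already available as rows of a matrix `G`. This page does not give the paper's exact definitions; the figure captions only indicate that one score is Fisher-based and the other rewards directional consistency. The function names and both formulas below are therefore illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fisher_score(G: np.ndarray) -> np.ndarray:
    """Mean squared gradient per entry (a diagonal Fisher-style score; assumed form)."""
    return (G ** 2).mean(axis=0)

def sign_consistency_score(G: np.ndarray) -> np.ndarray:
    """|mean of gradient signs| per entry: 1 if all examples agree, 0 if signs cancel."""
    return np.abs(np.sign(G).mean(axis=0))

def top_k_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Boolean mask marking the k highest-scoring entries."""
    flat = scores.ravel()
    mask = np.zeros(flat.size, dtype=bool)
    mask[np.argpartition(flat, -k)[-k:]] = True
    return mask.reshape(scores.shape)

# Toy gradients: 4 examples x 3 entries of B (flattened).
G = np.array([
    [0.1,  5.0,  0.01],
    [0.1, -5.0,  0.02],
    [0.1,  5.0, -0.01],
    [0.1, -5.0,  0.01],
])

fisher_mask = top_k_mask(fisher_score(G), k=1)
sign_mask = top_k_mask(sign_consistency_score(G), k=1)
```

In this toy case the Fisher-style score picks the middle entry (large but sign-flipping gradients), while the sign-consistency score picks the first entry (small but always positive), which is the distinction the placement argument relies on.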

If this is right

Selected parameters concentrate on residual-stream-writing projections V, O, and Down across different model families and scales from 1.5B to 8B.
Under supervised fine-tuning, any random subset of k parameters achieves comparable performance to informed selection.
The scoring procedure runs in under 10 seconds and costs less than 0.5% of full training.
Gradient structure determines whether placement choice matters: low-rank, stable gradients in SFT versus high-rank, near-orthogonal gradients in GRPO.
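The two gradient regimes contrasted above can be told apart with simple diagnostics. The sketch below is an illustration on synthetic data: it computes the mean pairwise cosine similarity between step gradients and a participation-ratio effective rank. Both diagnostics, the function names, and the synthetic regimes are assumptions made here; the page does not say how the paper measures rank or directional stability.

```python
import numpy as np

def mean_pairwise_cosine(G: np.ndarray) -> float:
    """Average cosine similarity between gradients from different steps (rows of G)."""
    unit = G / np.linalg.norm(G, axis=1, keepdims=True)
    sims = unit @ unit.T
    off_diag = sims[~np.eye(G.shape[0], dtype=bool)]
    return float(off_diag.mean())

def effective_rank(G: np.ndarray) -> float:
    """Participation ratio of squared singular values: about 1 for a rank-1 matrix."""
    s2 = np.linalg.svd(G, compute_uv=False) ** 2
    return float(s2.sum() ** 2 / (s2 ** 2).sum())

rng = np.random.default_rng(0)
steps, dim = 32, 512

# Hypothetical "stable, low-rank" regime: one shared direction plus small noise.
direction = rng.normal(size=dim)
stable = direction + 0.05 * rng.normal(size=(steps, dim))

# Hypothetical "high-rank, near-orthogonal" regime: independent random directions.
scattered = rng.normal(size=(steps, dim))

stable_stats = (mean_pairwise_cosine(stable), effective_rank(stable))
scattered_stats = (mean_pairwise_cosine(scattered), effective_rank(scattered))
```

On real training runs the rows of `G` would be flattened gradients collected at successive optimizer steps; the synthetic arrays here only show what the two extremes look like under these diagnostics.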

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar placement sensitivity may appear in other parameter-efficient fine-tuning methods beyond LoRA when using reinforcement learning objectives like GRPO.
Practitioners could integrate this quick scoring step into standard LoRA workflows to reduce trainable parameters without loss of performance under policy optimization.
Testing the method on larger models or different tasks could reveal whether the concentration on V, O, Down projections holds more broadly.
Future work might explore whether modifying the optimizer or gradient accumulation could make random placement viable under GRPO.

Load-bearing premise

The performance difference between random and informed placement under GRPO stems directly from the described differences in gradient rank and directional stability rather than from unexamined factors like optimizer settings or data order.

What would settle it

Rerun the same GRPO experiments with gradients forced to be more stable or low-rank. If random placement then matches informed placement, the gradient-structure explanation is supported; if random placement still fails, the explanation is falsified.

Figures

Figures reproduced from arXiv: 2605.12207 by Arijit Sehanobish, Charles Lovering.

**Figure 1.** Synthetic validation. (a) Under dense signal, random and informed placement perform similarly: any parameter subset captures the distributed gradient. (b) Under concentrated signal, informed placement at 2% of B nearly matches Full LoRA, while random placement barely improves even at 50%.

**Figure 2.** OOD generalization (Qwen2.5-3B, GRPO). (a, b) AIME 2024 pass@k for models trained on MATH and GSM8K respectively. (c) Cross-task greedy transfer: solid bars = MATH→GSM8K, hatched = GSM8K→MATH; dashed/dotted lines show base-model accuracy. $\hat{F}$ at <1% of adapter parameters matches or exceeds Full LoRA across all settings.

**Figure 3.** $\hat{F}$ knockout sweep (3B/7B × MATH/GSM8K). Accuracy vs. fraction of B entries zeroed by $\hat{F}$ score (circuit, red) or at random (blue). Dashed: full LoRA; dotted: base model. Circuit-ordered knockout degrades faster in all configurations, with the gap scaling with LoRA effect size.

**Figure 4.** Divergence vs. effective update norm (Qwen2.5-1.5B/3B/7B, MATH-500, GRPO). For both rows, the y-axis is 1 − top-25 token overlap, measuring behavioral change via logit lens relative to the base model. Top: each trajectory traces training checkpoints; opacity increases with step. x-axis: $\|B_{nz}\|_2 \cdot \alpha / r$, the scaled norm of non-zero B weights. Bottom: per-layer divergence at the final checkpoint. x-axis: relati…
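The y-axis quantity in this caption, 1 − top-25 token overlap, can be sketched for a single pair of logit vectors as follows. The caption is truncated, and the page does not say how the logit-lens projections are produced or how values are aggregated across positions and layers, so the function name and the single-vector setting are assumptions.

```python
import numpy as np

def topk_divergence(base_logits: np.ndarray, tuned_logits: np.ndarray, k: int = 25) -> float:
    """1 - |top-k(base) ∩ top-k(tuned)| / k for one position's vocabulary logits."""
    base_top = set(np.argsort(base_logits)[-k:].tolist())
    tuned_top = set(np.argsort(tuned_logits)[-k:].tolist())
    return 1.0 - len(base_top & tuned_top) / k

# Toy vocabulary of 100 tokens with strictly increasing logits.
logits = np.arange(100.0)
same = topk_divergence(logits, logits)             # identical top-25 sets
flipped = topk_divergence(logits, logits[::-1])    # disjoint top-25 sets
```

A value of 0 means the adapted model ranks the same 25 tokens highest as the base model; a value of 1 means the two top-25 sets share no tokens.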

**Figure 5.** MATH-trained pass@k and maj@k (Qwen2.5-3B/7B). $\hat{F}$ at 50K reaches 100% pass@64 on AMC (3B) and 95% (7B). On AIME 2025, Full LoRA degrades below base at 3B while $\hat{S}$ reaches 36.7% pass@64; at 7B all methods improve over base. AIME maj@k is near-zero for all methods; AMC maj@64 peaks at 52.5% ($\hat{F}$, 3B) and 65% ($\hat{S}$, 7B).
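For readers unfamiliar with the two metrics in this caption, here is a small sketch. The page does not state which estimator the paper uses; `pass_at_k` below is the commonly used unbiased estimator introduced with the HumanEval benchmark (Chen et al., 2021), and the tie-breaking rule in `maj_at_k` is an assumption.

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples of which c are correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def maj_at_k(answers: list[str], reference: str) -> bool:
    """True if the most frequent answer among the k samples equals the reference.

    Ties are broken by first occurrence (an assumption, via Counter.most_common).
    """
    most_common, _ = Counter(answers).most_common(1)[0]
    return most_common == reference

p = pass_at_k(n=4, c=1, k=1)                 # one correct sample out of four
m = maj_at_k(["4", "4", "5"], reference="4")
```

pass@k rewards a model if any of k samples is correct, whereas maj@k requires the correct answer to win a vote, which is why the two can diverge sharply on hard sets such as AIME.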

**Figure 6.** Cross-task greedy transfer (Qwen2.5-3B/7B). Gaps indicate methods not evaluated at that scale. MATH→GSM8K: circuits match Full LoRA to within 1.1pp. GSM8K→MATH-500: at 3B, Full LoRA degrades to 34.2% (−14.8pp below base) while $\hat{S}$ at 10K improves to 53.8%; at 7B, Full LoRA transfers well (64.8%, +11.2pp).

**Figure 7.** GSM8K-trained pass@k and maj@k (Qwen2.5-3B/7B). At 3B, Full LoRA degrades below base on AMC while $\hat{S}$ at 10K reaches 97.5% pass@64. At 7B, Full LoRA transfers well and dominates on AIME 2024. Why do circuits generalize? Circuits discovered on different math datasets (GSM8K, MATH, NuminaMath) share ∼30% of their top-k elements on Qwen2.5-1.5B, roughly 12× above the 2.4% chance overlap; same-domain pairs shar…

**Figure 8.** Scaling curves visualizing the same comparison. Panels: (a) Qwen3-0.6B, perplexity (lower is better) vs. budget k; (b) Qwen3-0.6B, token accuracy (%) vs. budget k; (c) Qwen2.5-1.5B, 7-bench average (%) vs. budget k. Legend: Base, Full LoRA, B-only, A+B.

**Figure 9.** Circuit discovery is stable. (a) Top-10K overlap with the N=100 reference circuit as a function of discovery examples N. Both $\hat{S}$ and $\hat{F}$ reach 98% overlap at N=50. Random circuits overlap by < 0.1%. (b) $\hat{S}$-circuit overlap under perturbation of A. The circuit degrades smoothly; at 10% perturbation, 92% of elements are unchanged.
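The overlap statistic reported in this caption can be sketched as the fraction of shared indices between two top-k selections. The function names are hypothetical, and the chance baseline assumes two independent, uniformly random size-k subsets of the same index set.

```python
import numpy as np

def topk_indices(scores: np.ndarray, k: int) -> set[int]:
    """Indices of the k largest scores (order within the top-k is irrelevant)."""
    flat = scores.ravel()
    return set(np.argpartition(flat, -k)[-k:].tolist())

def circuit_overlap(scores_a: np.ndarray, scores_b: np.ndarray, k: int) -> float:
    """Fraction of the top-k entries under scores_a that are also top-k under scores_b."""
    return len(topk_indices(scores_a, k) & topk_indices(scores_b, k)) / k

def chance_overlap(k: int, total: int) -> float:
    """Expected overlap fraction of two independent uniform random size-k subsets."""
    return k / total

scores = np.arange(1000.0)
identical = circuit_overlap(scores, scores, k=100)
opposite = circuit_overlap(scores, -scores, k=100)
baseline = chance_overlap(k=100, total=1000)
```

Comparing a measured overlap against `chance_overlap` is what makes a figure such as "98% overlap" meaningful: with a small budget relative to the number of candidate entries, random selections almost never coincide.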

**Figure 10.** Budget sweep under Alpaca SFT (7-benchmark avg). Dashed lines mark base and full-LoRA references. On 1.5B, random placement is flat across budgets (55–56%) due to persistent MMLU collapse, while both circuits scale steadily and match full LoRA by k=50K. On 3B, circuits lead random by 3–4pp at k=10K; all methods converge toward full LoRA by k=100K. |B| = 1,024 parameters. A mask $m \in \{0, 1\}^{64 \times 16}$ selects k …

**Figure 11.** Llama-3.2-3B-Instruct, GSM8K (top 500K). At the larger budget (2% of B), the circuit is more distributed but still favors V/O in the first half of layers. The instruct model shows similar module preferences to base Qwen models.

**Figure 12.** Qwen2.5-VL-7B, MathVista (top 10K). Circuit discovery automatically routes budget to the vision encoder: $\hat{S}$ allocates 73% to vision (vs. 16% for random), and $\hat{F}$ allocates 99%. Within vision, both methods concentrate on early blocks (0–6) and block 16. The few language elements selected by $\hat{S}$ go to V/O in late layers (16–19, 26), the same module preference as LLM-only circuits. Random placement inverts this…

**Figure 13.** Score distributions for Qwen2.5-3B on GSM8K (left) and MATH (right). Both $\hat{S}$ and $\hat{F}$ exhibit heavy-tailed distributions; the top-10K threshold (dashed lines) selects from the extreme tail. $\hat{F}$ scores span a wider dynamic range (∼10 orders of magnitude) than $\hat{S}$ (∼4 orders). Axes: log(score magnitude) vs. density.

**Figure 14.** Score distributions for Qwen2.5-7B on GSM8K (left) and MATH (right). Fisher equals the squared mean gradient plus the gradient variance. An element can have high Fisher but low $\hat{S}$ if its gradients are large but oscillate in sign; high $\hat{S}$ requires directional consistency across examples. Why small N suffices. For ranking stability, what matters is separation between top-ranked and median scores. Empiricall…
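The decomposition stated in this caption is the standard second-moment identity. Writing $g$ for one entry's per-example gradient (the symbol is an assumption; the page gives no notation), it reads:

```latex
\underbrace{\mathbb{E}\!\left[g^{2}\right]}_{\text{Fisher}}
  \;=\; \underbrace{\bigl(\mathbb{E}[g]\bigr)^{2}}_{\text{squared mean gradient}}
  \;+\; \underbrace{\operatorname{Var}(g)}_{\text{gradient variance}}
```

As a worked case, suppose $g$ takes the values $+c$ and $-c$ with equal frequency. Then $\mathbb{E}[g] = 0$ and $\operatorname{Var}(g) = c^{2}$, so the Fisher term equals $c^{2}$ even though the mean gradient vanishes. Such an entry scores high on Fisher while carrying no directionally consistent signal, which matches the caption's remark about gradients that are large but oscillate in sign.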

**Figure 15.** Per-layer circuit analysis for Qwen2.5-3B (36L) and Qwen2.5-7B (28L) on GSM8K with budget k=10K. (a) Signal retention: fraction of each layer’s gradient energy captured by the top 0.1% of elements. Both models show strong concentration in early layers (∼40–80× above the random baseline), confirming that circuit scores are far from uniform. (b) Circuit budget allocation: fraction of the global top-k budget…

**Figure 16.** $\hat{S}$ knockout sweep. $\hat{S}$-ranked entries also degrade accuracy faster than random, but with smaller gaps: the peak is −7.8pp (7B MATH at 75%) vs. −12.2pp for $\hat{F}$.

**Figure 17.** Gradient sign consistency by module type (3B/7B × MATH/GSM8K). Modules ordered by descending consistency; bold labels indicate modules overrepresented in circuits. Down is consistently highest (0.82–0.85). Gate and Up achieve consistency comparable to the selected modules V and O, yet are underrepresented in circuits: consistency is necessary but not sufficient for selection.

**Figure 18.** SVD alignment by module type (3B/7B × MATH/GSM8K). (a) Left singular vector alignment: V and K dominate, identifying them as attention readers whose updates reinforce existing information-selection directions. (b) Spectral concentration: O and Down dominate, identifying them as residual writers whose updates are low-rank. Shaded bands highlight the dominant pair in each panel.

**Figure 19.** Cross-architecture dissociation. Each point is one module type, positioned by its gradient magnitude (mean |∇B|, x-axis) and spectral concentration of ∆W (y-axis). In both Qwen-2.5-3B and Llama-3.2-3B, residual writers (O, Down; red squares) cluster at high spectral ratio while attention readers (V, K; blue circles) receive the strongest gradient signal. The spatial separation of these two functional role…

**Figure 20.** Divergence vs. effective update norm (a: Qwen2.5-1.5B/3B/7B, MATH-500, GRPO; b: Llama-3.2-3B/Llama-3.1-8B, MATH-500, GRPO). For both rows, the y-axis is 1 − top-25 token overlap, measuring behavioral change via logit lens relative to the base model. Top: each trajectory traces training checkpoints; opacity increases with step. x-axis: $\|B_{nz}\|_2 \cdot \alpha / r$, the scaled norm of non-zero B weights. Bottom: per-layer…

**Figure 21.** Divergence vs. effective update norm (a: Qwen2.5-1.5B/3B/7B, MATH-500, GRPO; b: Llama-3.2-3B/Llama-3.1-8B, MATH-500, GRPO). For both rows, the y-axis is 1 − top-25 token overlap, measuring behavioral change via logit lens relative to the base model. Top: each trajectory traces training checkpoints; opacity increases with step. x-axis: $\|B_{nz}\|_2 \cdot \alpha / r$, the scaled norm of non-zero B weights. Bottom: per-layer…

Original abstract

We study the *parameter placement problem*: given a fixed budget of $k$ trainable entries within the B matrix of a LoRA adapter (A frozen), does the choice of which $k$ matter? Under supervised fine-tuning, random and informed subsets achieve comparable performance. Under GRPO on base models, random placement fails to improve over the base model, while gradient-informed placement recovers standard LoRA accuracy. This regime dependence traces to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates; GRPO gradients are high-rank and near-orthogonal across steps, so only elements with consistently signed gradients retain the learning signal. Our scoring procedure identifies these critical parameters in under 10 seconds at less than 0.5% of training cost. Selected parameters concentrate on residual-stream-writing projections (V, O, Down), stable across model families and scales (1.5B–8B).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows that which parameters you adapt in LoRA's B matrix matters under GRPO but not under SFT, because of differences in gradient rank and sign consistency, and gives a fast way to pick them.


The main thing to know is that under supervised fine-tuning, random placement of the k trainable entries in the LoRA B matrix works about as well as any informed choice. Under GRPO on base models, random placement fails to improve over the base model while gradient-informed placement recovers standard LoRA accuracy. The authors trace the difference to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates, while GRPO gradients are high-rank and near-orthogonal across steps, so only parameters with consistently signed gradients retain signal. They also supply a scoring procedure that identifies the critical parameters in under 10 seconds at less than 0.5% of training cost, and note that the selected parameters concentrate on the residual-stream-writing projections V, O, and Down, a pattern stable across 1.5B to 8B models.

Referee Report

2 major / 1 minor

Summary. The paper studies the parameter placement problem in LoRA: with a fixed budget of k trainable entries in the B matrix (A frozen), does the specific choice of which entries matter? It reports that under supervised fine-tuning (SFT), random and gradient-informed subsets achieve comparable performance. Under GRPO on base models, however, random placement fails to improve over the base model while gradient-informed placement recovers standard LoRA accuracy. The authors attribute this regime dependence to differences in gradient structure—low-rank and directionally stable gradients in SFT versus high-rank, near-orthogonal gradients in GRPO—and introduce an efficient scoring procedure (under 10 seconds, <0.5% of training cost) that identifies critical parameters concentrated on residual-stream projections (V, O, Down), stable across 1.5B–8B models.

Significance. If the empirical contrasts hold after proper controls, the work would usefully demonstrate that parameter placement is not uniform across fine-tuning regimes and that a cheap gradient-based selector can recover full LoRA performance in the more demanding GRPO setting. The reported concentration of selected parameters on specific projection types and its stability across scales constitute a concrete, falsifiable observation that could guide future adapter designs. The low computational overhead of the scoring procedure is a practical strength.

major comments (2)

[Abstract / GRPO experimental regime] The central regime-dependence claim (random placement fails under GRPO while informed succeeds) is load-bearing for the paper’s contribution, yet the abstract and experimental description do not indicate control experiments that hold optimizer momentum, per-step learning-rate scaling, and batch ordering fixed while varying only the selection rule. Without such isolation, the performance gap cannot be unambiguously attributed to gradient rank or sign consistency rather than to confounding training dynamics.
[Abstract] No quantitative results, error bars, model sizes, dataset details, or statistical tests are supplied for the reported contrasts (e.g., “recovers standard LoRA accuracy”). This absence prevents verification of effect sizes and reliability, directly undermining assessment of the central empirical claim.

minor comments (1)

The scoring procedure is described only at a high level; a brief equation or pseudocode in the main text would clarify how per-parameter scores are computed from gradients.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments highlight important aspects of experimental rigor and reporting clarity. We address each below, providing clarifications on our controls and committing to revisions where they strengthen the manuscript without altering its core claims.


Referee: [Abstract / GRPO experimental regime] The central regime-dependence claim (random placement fails under GRPO while informed succeeds) is load-bearing for the paper’s contribution, yet the abstract and experimental description do not indicate control experiments that hold optimizer momentum, per-step learning-rate scaling, and batch ordering fixed while varying only the selection rule. Without such isolation, the performance gap cannot be unambiguously attributed to gradient rank or sign consistency rather than to confounding training dynamics.

Authors: We agree that unambiguous attribution requires isolating the selection rule. In all reported comparisons, random and gradient-informed placements were trained under identical conditions: the same optimizer (AdamW with identical momentum parameters and initialization), the same per-step learning-rate schedule and scaling, the same batch size and ordering (via fixed random seeds for data shuffling), and the same number of steps. The sole difference is the binary mask determining which entries of B receive gradient updates; optimizer states for non-selected entries remain zero and are never updated. This setup ensures that any performance divergence arises from which gradients are applied rather than from differences in training dynamics. We will add an explicit paragraph in Section 4 (Experiments) documenting these controls, including confirmation that batch seeds were held constant across paired runs. No new experiments are required for this clarification. revision: partial
Referee: [Abstract] No quantitative results, error bars, model sizes, dataset details, or statistical tests are supplied for the reported contrasts (e.g., “recovers standard LoRA accuracy”). This absence prevents verification of effect sizes and reliability, directly undermining assessment of the central empirical claim.

Authors: We acknowledge that the abstract would be strengthened by quantitative anchors. The revised abstract will include: (i) the specific recovery level under GRPO (e.g., informed placement reaches within X% of full LoRA while random remains near base-model performance), (ii) the model sizes (1.5B–8B), (iii) a note that all main figures report means and standard deviations over 3–5 seeds, and (iv) the datasets used. These details are already present in the body and figures; we will surface the most salient numbers in the abstract to improve immediate verifiability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons stand on direct measurements without reduction to fitted inputs or self-citations


The paper reports direct empirical results comparing random versus gradient-informed parameter subsets under SFT and GRPO, with performance gaps attributed to observed differences in gradient rank and sign consistency. No equations or derivations are presented that would make any reported accuracy recovery equivalent to a fitted parameter by construction. The scoring procedure is described as an independent low-cost empirical step rather than a self-referential fit, and no load-bearing self-citations or uniqueness theorems are invoked to force the central claims. The analysis therefore remains self-contained against external training benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the standard LoRA decomposition and the assumption that gradient statistics computed on a small number of steps are representative of the full training trajectory; no new entities are postulated.

axioms (1)

domain assumption: The LoRA adapter consists of a frozen A matrix and a trainable B matrix whose entries can be selectively activated.
Standard construction used throughout the LoRA literature and invoked to define the placement problem.
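A minimal NumPy sketch of this construction, assuming the usual LoRA parameterization W + (α/r)·B·A with A frozen and a binary mask selecting which entries of B may change. The 64 × 16 shape of B follows the mask dimensions quoted in the Figure 10 caption; everything else (the toy squared-error loss, plain gradient descent instead of AdamW, the random mask, all variable names) is an illustrative assumption rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha, k = 32, 64, 16, 32.0, 50
scale = alpha / r

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # frozen pretrained weight
A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)      # frozen LoRA A
B = np.zeros((d_out, r))                            # LoRA B: 64 x 16 = 1,024 entries

# Binary mask with exactly k trainable entries (chosen at random here).
mask = np.zeros(B.size, dtype=bool)
mask[rng.choice(B.size, size=k, replace=False)] = True
mask = mask.reshape(B.shape)

x = rng.normal(size=d_in)   # toy input
y = rng.normal(size=d_out)  # toy regression target
z = A @ x                   # low-rank projection of the input

def loss(B_cur: np.ndarray) -> float:
    """Squared error of the adapted layer on the single toy example."""
    residual = (W + scale * B_cur @ A) @ x - y
    return 0.5 * float(residual @ residual)

def masked_step(B_cur: np.ndarray, lr: float) -> np.ndarray:
    """One gradient step in which only masked entries of B are updated."""
    residual = (W + scale * B_cur @ A) @ x - y
    grad_B = scale * np.outer(residual, z)  # dL/dB for the loss above
    return B_cur - lr * (mask * grad_B)

lr = 0.5 / (scale ** 2 * float(z @ z))  # small enough to guarantee descent here
loss_before = loss(B)
B_new = masked_step(B, lr)
loss_after = loss(B_new)
```

Swapping the random mask for one produced by a scoring rule changes only which entries are trainable, so the rest of the training loop can stay identical across placement strategies.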

pith-pipeline@v0.9.0 · 5466 in / 1343 out tokens · 86106 ms · 2026-05-13T06:06:12.708753+00:00 · methodology


Reference graph

Works this paper leans on

107 extracted references · 107 canonical work pages · 9 internal anchors

[1]

Adilova and Samariddin Kushmuratov

F.T. Adilova and Samariddin Kushmuratov. OLoRA+: A hybrid approach to parameter-efficient fine-tuning of large language models. In Conference of Mathematics of AI, 2026. URL https://openreview.net/forum?id=c75JefyklT

work page 2026
[2]

Intrinsic dimensionality explains the effectiveness of language model fine-tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language ...

work page doi:10.18653/v1/2021.acl-long.568 2021
[3]

Zhang, Hemanth Saratchandran, Cristian Rodriguez-Opazo, Anton van den Hengel, and Ehsan Abbasnejad

Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Cristian Rodriguez-Opazo, Anton van den Hengel, and Ehsan Abbasnejad. RandloRA: Full rank parameter-efficient fine-tuning of large models. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=Hn5eoTunHN

work page 2025
[4]

LoRA-XS: Low-rank adaptation with extremely small number of parameters, 2024

Klaudia Bałazy, Mohammadreza Banaei, Karl Aberer, and Jacek Tabor. LoRA-XS: Low-rank adaptation with extremely small number of parameters, 2024. URL https://openreview. net/forum?id=l80AgHoRaN

work page 2024
[5]

Eliciting Latent Predictions from Transformers with the Tuned Lens

Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

BitFit: Simple parameter-efficient fine- tuning for transformer-based masked language-models

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient fine- tuning for transformer-based masked language-models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume2: Short Papers), pages 1–9, Dublin, Ireland, May

work page
[7]

2022 , publisher =

Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.1. URL https://aclanthology.org/2022.acl-short.1/

work page doi:10.18653/v1/2022.acl-short.1 2022
[8]

LoRA learns less and forgets less

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John Patrick Cunningham. LoRA learns less and forgets less. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id= aloEru2qCG....

work page 2024
[9]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

work page 2020
[10]

Olora: Orthonormal low-rank adaptation of large language models

Kerim Büyükakyüz. Olora: Orthonormal low-rank adaptation of large language models. arXiv preprint arXiv:2406.01775, 2024

work page arXiv 2024
[11]

Curve circuits

Nick Cammarata, Gabriel Goh, Shan Carter, Chelsea V oss, Ludwig Schubert, and Chris Olah. Curve circuits. Distill, 2021. doi: 10.23915/distill.00024.006. https://distill.pub/2020/circuits/curve-circuits

work page doi:10.23915/distill.00024.006 2021
[12]

The lottery ticket hypothesis for pre-trained bert networks

Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. The lottery ticket hypothesis for pre-trained bert networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15834–15846. Curran Associates, Inc., 2020. URL...

work page 2020
[13]

Boolq: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019. 10

work page 2019
[14]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

Unsloth, 2023

Michael Han Daniel Han and Unsloth team. Unsloth, 2023. URL https://github.com/ unslothai/unsloth

work page 2023
[17]

Qlora: Efficient finetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088– 10115, 2023

work page 2023
[18]

Sparse low-rank adaptation of pre-trained language models

Ning Ding, Xingtai Lv, Qiaosen Wang, Yulin Chen, Bowen Zhou, Zhiyuan Liu, and Maosong Sun. Sparse low-rank adaptation of pre-trained language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4133–4145, Singapore, December 2023. Association for Compu...

work page doi:10.18653/v1/2023.emnlp-main.252 2023
[19]

Quantifying elicitation of latent capabilities in language models

Elizabeth Donoway, Hailey Joren, Arushi Somani, Henry Sleight, Julian Michael, Michael R DeWeese, John Schulman, Ethan Perez, Fabien Roger, and Jan Leike. Quantifying elicitation of latent capabilities in language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[20]

A mathematical framework for transformer circuits

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

work page
[21]

https://transformer-circuits.pub/2021/framework/index.html

work page 2021
[22]

The lottery ticket hypothesis: Finding sparse, trainable neural networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https: //openreview.net/forum?id=rJl-b3RcF7

work page 2019
[23]

Sparsegpt: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International conference on machine learning, pages 10323–10337. PMLR, 2023

work page 2023
[24]

Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page arXiv 2024
[25]

Parameter-efficient fine-tuning with discrete fourier transform

Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, and Jia Li. Parameter-efficient fine-tuning with discrete fourier transform. In International Conference on Machine Learning, pages 14884–14901. PMLR, 2024

work page 2024
[26]

Au- tomatically identifying local and global circuits with linear computation graphs, 2024

Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, and Xipeng Qiu. Au- tomatically identifying local and global circuits with linear computation graphs, 2024. URL https://arxiv.org/abs/2405.13868

work page arXiv 2024
[27]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

work page 2025
[29]

Parameter-efficient transfer learning with diff pruning

Demi Guo, Alexander Rush, and Yoon Kim. Parameter-efficient transfer learning with diff pruning. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th 11 International Joint Conference on Natural Language Processing (V olume1: Long Papers), ...

work page 2021
[30]

When models manipulate manifolds: The geometry of a counting task

Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, and Joshua Batson. When models manipulate manifolds: The geometry of a counting task. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/linebreaks/index.html

work page 2025
[31]

Position-aware automatic circuit discovery, 2025

Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, and Yonatan Belinkov. Position-aware automatic circuit discovery, 2025. URL https://arxiv.org/abs/2502.04577

work page arXiv 2025
[32]

Flora: Low-rank adapters are secretly gradient compressors

Yongchang Hao, Yanshuai Cao, and Lili Mou. Flora: Low-rank adapters are secretly gradient compressors. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=uubBZKM99Y

work page 2024
[33]

Optimal brain surgeon and general network pruning

Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293–299. IEEE, 1993

work page 1993
[34]

Lora+: Efficient low rank adaptation of large models

Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lora+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354, 2024

work page arXiv 2024
[35]

Sensitivity-aware visual parameter-efficient fine-tuning

Haoyu He, Jianfei Cai, Jing Zhang, Dacheng Tao, and Bohan Zhuang. Sensitivity-aware visual parameter-efficient fine-tuning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11825–11835, 2023

work page 2023
[36]

SparseAdapter: An easy approach for improving the parameter-efficiency of adapters

Shwai He, Liang Ding, Daize Dong, Jeremy Zhang, and Dacheng Tao. SparseAdapter: An easy approach for improving the parameter-efficiency of adapters. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2184–2190, Abu Dhabi, United Arab Emirates, December 2022. Associatio...

work page doi:10.18653/v1/2022.findings-emnlp.160 2022
[37]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[38]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

work page 2021
[39]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

work page 2022
[40]

Lora training in the ntk regime has no spurious local minima

Uijeong Jang, Jason D Lee, and Ernest K Ryu. Lora training in the ntk regime has no spurious local minima. In International Conference on Machine Learning, pages 21306–21328. PMLR, 2024

work page 2024
[41]

Aime problem set 2024, 2024

Maxwell Jia. Aime problem set 2024, 2024. URL https://huggingface.co/datasets/Maxwell-Jia/AIME_2024

work page 2024
[42]

MoRA: High-rank updating for parameter-efficient fine-tuning, 2024

Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Songtao Wang, Deqing Wang, and Fuzhen Zhuang. MoRA: High-rank updating for parameter-efficient fine-tuning, 2024. URL https://openreview.net/forum?id=SxOrhLuuVz

work page 2024
[43]

A rank stabilization scaling factor for fine-tuning with LoRA

Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with lora. arXiv preprint arXiv:2312.03732, 2023

work page arXiv 2023
[44]

Scaling sparse feature circuit finding for in-context learning, 2025

Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, and Neel Nanda. Scaling sparse feature circuit finding for in-context learning, 2025. URL https://arxiv.org/abs/2504.13756

work page arXiv 2025
[45]

Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 3...

work page doi:10.1073/pnas.1611835114 2017
[46]

Vera: Vector-based random matrix adaptation

Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. Vera: Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[47]

AtP*: An efficient and scalable method for localizing LLM behaviour to components, 2024

János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda. AtP*: An efficient and scalable method for localizing LLM behaviour to components, 2024. URL https://arxiv.org/abs/2403.00745

work page arXiv 2024
[48]

Bespoke-stratos: The unreasonable effectiveness of reasoning distillation

Bespoke Labs. Bespoke-stratos: The unreasonable effectiveness of reasoning distillation. https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation, 2025. Accessed: 2025-01-22

work page 2025
[49]

Optimal brain damage

Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989

work page 1989
[50]

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. SNIP: single-shot network pruning based on connection sensitivity. CoRR, abs/1810.02340, 2018. URL http://arxiv.org/abs/1810.02340

work page arXiv 2018
[51]

Measuring the intrinsic dimension of objective landscapes

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations,

work page
[52]

URL https://openreview.net/forum?id=ryup8-WCW

work page
[53]

Vb-lora: Extreme parameter efficient fine-tuning with vector banks

Yang Li, Shaobo Han, and Shihao Ji. Vb-lora: Extreme parameter efficient fine-tuning with vector banks. In The 38th Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[54]

Remax: a simple, effective, and efficient reinforcement learning method for aligning large language models

Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. Remax: a simple, effective, and efficient reinforcement learning method for aligning large language models. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

work page 2024
[55]

TruthfulQA: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May

work page
[56]

Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap, volume

Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229

work page doi:10.18653/v1/2022.acl-long.229 2022
[57]

Dora: Weight-decomposed low-rank adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[58]

Understanding r1-zero-like training: A critical perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Conference on Language Modeling (COLM), 2025

work page 2025
[59]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu et al. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024

work page 2024
[60]

Merging models with fisher-weighted averaging

Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022

work page 2022
[61]

Aime problem set 2025, 2025

math ai. Aime problem set 2025, 2025. URL https://huggingface.co/datasets/math-ai/aime25

work page 2025
[62]

Pissa: Principal singular values and singular vectors adaptation of large language models

Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models. Advances in Neural Information Processing Systems, 37:121038–121072, 2024

work page 2024
[63]

Mass-Editing Memory in a Transformer

Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022

work page internal anchor Pith review arXiv 2022
[64]

Learning to reason in 13 parameters

John X Morris, Niloofar Mireshghallah, Mark Ibrahim, and Saeed Mahloujifar. Learning to reason in 13 parameters. arXiv preprint arXiv:2602.04118, 2026

work page arXiv 2026
[65]

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023

work page internal anchor Pith review arXiv 2023
[66]

Amc 2023: American mathematics competitions 2023 dataset

Mathematical Association of America (MAA). Amc 2023: American mathematics competitions 2023 dataset. https://huggingface.co/datasets/math-ai/amc23, 2023. Accessed: 2024-05-05.

work page 2023
[67]

Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning

Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. arXiv preprint arXiv:2403.17919, 2024

work page arXiv 2024
[68]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9

work page 2023
[69]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

work page 2024
[70]

MELoRA: Mini-ensemble low-rank adapters for parameter-efficient fine-tuning

Pengjie Ren, Chengshun Shi, Shiguang Wu, Mengqi Zhang, Zhaochun Ren, Maarten de Rijke, Zhumin Chen, and Jiahuan Pei. MELoRA: Mini-ensemble low-rank adapters for parameter-efficient fine-tuning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long ...

work page doi:10.18653/v1/2024.acl-long.168 2024
[71]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019

work page internal anchor Pith review arXiv 2019
[72]

Movement pruning: Adaptive sparsity by fine-tuning

Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning. Advances in neural information processing systems, 33:20378–20389, 2020

work page 2020
[73]

Efficient rlhf: Reducing the memory usage of ppo

Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, and Yelong Shen. Efficient rlhf: Reducing the memory usage of ppo. arXiv preprint arXiv:2309.00754, 2023

work page arXiv 2023
[74]

Lora without regret

John Schulman and Thinking Machines Lab. Lora without regret. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250929. URL https://thinkingmachines.ai/blog/lora/

work page doi:10.64434/tml.20250929 2025
[75]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[76]

The truth is in there: Improving reasoning in language models with layer-selective rank reduction

Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ozX92bu8VA

work page 2024
[77]

Math-llava: Bootstrapping mathematical reasoning for multimodal large language models, 2024

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models, 2024

work page 2024
[78]

A simple and effective pruning approach for large language models

Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PxoFut3dWW

work page 2024
[80]

URL https://arxiv.org/abs/2006.05467

work page arXiv 2020
[81]

Stanford Alpaca: An instruction-following LLaMA model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

work page 2023
