arxiv: 2604.08749 · v2 · submitted 2026-04-09 · 💻 cs.LG · cs.NE

Recognition: unknown

A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need

Hananel Hazan , Yanbo Zhang , Benedikt Hartl , Michael Levin

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:50 UTC · model grok-4.3

classification 💻 cs.LG cs.NE

keywords LoRAlow-rank adaptersfrozen backbonerandom initializationparameter efficiencyintrinsic dimensionalityreservoir computing

0 comments

The pith

Frozen random neural network backbones with low-rank LoRA adapters recover 96-100% of full training performance while updating only 0.5-40% of parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that freezing a randomly initialized neural network backbone and training only low-rank LoRA adapters on it recovers 96 to 100 percent of the accuracy achieved by training the entire model. This holds true across nine benchmarks that include everything from basic classifiers to 900 million parameter transformers, while using just 0.5 to 40 percent of the parameters. The result indicates that task-specific information is encoded in a much smaller subspace than the full network size would imply. Experiments reveal that the backbone must stay frozen to be useful, that any random initialization suffices if kept fixed, and that the lowest effective LoRA rank reflects the task's intrinsic dimensionality.

Core claim

In LottaLoRA every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning single-layer classifiers to 900M parameter Transformers this recovers 96-100% of fully trained performance while training 0.5-40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count suggests. The frozen backbone is actively exploited when static, any random initialization works equally well provided it remains fixed, and the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task.

What carries the argument

LottaLoRA, the training paradigm that freezes a randomly initialized backbone network and trains only low-rank adapters on top of it.

Load-bearing premise

The random backbone must remain completely fixed and unchanged throughout training.

What would settle it

Allow the backbone weights to update during optimization in the same setup and check whether the 96-100% performance recovery relative to full training disappears.

Figures

Figures reproduced from arXiv: 2604.08749 by Benedikt Hartl, Hananel Hazan, Michael Levin, Yanbo Zhang.

**Figure 1.** Figure 1: LottaLoRA replaces pre-trained weights with seeded reservoirs. Three parameterization strategies for a single network layer. (a) A conventional dense layer stores and trains all m × n weights. (b) Low-Rank Adaptation (LoRA) freezes a pre-trained weight matrix W0 and learns only two small factors A ∈ R m×r and B ∈ R r×n ; the stored parameters are W0 plus the adapters. (c) LottaLoRA (this work) replaces W0… view at source ↗

**Figure 2.** Figure 2: MNIST: LottaLoRA accuracy scales monotonically with LoRA rank, closing the gap to fully trained baselines. (A) Accuracy scales monotonically with LoRA rank across three model sizes, closing the gap to fully trained baselines (dashed); the medium preset (4 layers, widths 512–64; see Appendix B for all presets) reaches 96.8% at rank 8 with only 3.65% of the parameters of the fully trained counterpart. (B) P… view at source ↗

**Figure 3.** Figure 3: A single shared adapter produces seed-gated task specialization with outof-class rejection. One LoRA adapter is trained across three disjoint MNIST label partitions ({1, 2, 3}, {4, 5, 6}, {7, 8, 9}), each paired with a distinct backbone seed s. Columns show seeds 42, 43, 44; cells show row-normalized test accuracy (%); black rectangles mark assigned classes; dashed orange columns highlight digit 0 (exclud… view at source ↗

**Figure 4.** Figure 4: LottaLoRA AUROC saturates at rank 2 on PhysioNet 2012 ICU mortality. Mean ± std AUROC over 5 seeds at ranks 1–32. Dashed line: fully trained CfC baseline (0.836). Rank 1 recovers 99.5% of baseline with 3.7% of trainable parameters [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

**Figure 5.** Figure 5: Overparameterization masks rank saturation on CfC PhysioNet. (a) LottaLoRA recovery (normalized to each scale’s fully trained baseline) vs. LoRA rank at four CfC sizes. At 0.125× (h=32, 2,210 parameters), rank 1 recovers only 93.7% and saturation shifts to r=4; at larger scales rank 1 already exceeds 98.8%. (b) Fully trained baselines improve only from 0.831 to 0.836 across a 42× parameter increase, confir… view at source ↗

**Figure 6.** Figure 6: LottaLoRA narrows the gap to full training as backbone size increases. Training loss curves on WikiText-103 at five scales (3 M to 900 M); colored curves show LottaLoRA (hue encodes scale, lightness encodes rank), grayscale shows fully trained baselines. At 900 M, the best LottaLoRA run (rank 8) reaches 3.950 vs 3.156 for full training, while training fewer than 0.5% of the internal parameters [PITH_FULL_… view at source ↗

**Figure 7.** Figure 7: A large frozen backbone with few LoRA parameters outperforms a small fully trained model. Each colored curve shows LottaLoRA at one backbone scale across ranks; gray squares show fully trained baselines. At 900 M, rank-8 LottaLoRA (3.6 M trainable) achieves loss 3.950, while the fully trained 3 M model (320 K trainable) reaches only 5.007 [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗

**Figure 8.** Figure 8: Different tasks saturate at different minimum ranks, reflecting intrinsic dimensionality. Each curve shows LottaLoRA performance (normalized as % of fully trained baseline recovered) against LoRA rank. CfC PhysioNet (ICU mortality) is flat from r=1 at the published architecture scale; a size-reduction ablation ( [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗

**Figure 9.** Figure 9: LottaLoRA recovers 96–100% of baseline performance across eight benchmarks. Each bar shows the ratio of LottaLoRA to baseline performance (accuracy, R2 , or inverted MSE as appropriate); annotations give the trainable parameter ratio. Data from Tables 14, 5, 7, and 21 [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗

**Figure 10.** Figure 10: Every Wseed initialization family exceeds 95% at rank 8 on MNIST. Left: Every initialization family exceeds 95% at rank 8, showing that the scaffold’s specific distribution has negligible effect. Right: All 22 families converge tightly as rank increases, confirming that the scaffold is interchangeable on this task [PITH_FULL_IMAGE:figures/full_fig_p043_10.png] view at source ↗

**Figure 11.** Figure 11: LottaLoRA matches the fully trained CfC baseline on PhysioNet 2012 ICU mortality at every rank. Mean ± std AUROC over 5 seeds at ranks 1–32. Dashed line: Full CfC baseline (AUROC = 0.836) [PITH_FULL_IMAGE:figures/full_fig_p044_11.png] view at source ↗

**Figure 12.** Figure 12: LottaLoRA recovers 97.5% of baseline ROC-AUC on molecular property prediction. Test ROC-AUC on OGBG-MolHIV (5 seeds, error bars show ±1 std). Dashed lines mark published OGB baselines [20]: GIN (red) and GIN+virtual node (green) [PITH_FULL_IMAGE:figures/full_fig_p045_12.png] view at source ↗

**Figure 13.** Figure 13: LottaLoRA recovers 97.6% of baseline GCN accuracy on OGBN-Arxiv node classification. Test accuracy (10 seeds, error bars show ±1 std). Dashed line marks the fully trained GCN baseline (71.86%) [PITH_FULL_IMAGE:figures/full_fig_p046_13.png] view at source ↗

**Figure 14.** Figure 14: Learned reservoir quality accounts for a ∼40 pp advantage over random backbones. (A) Pre-trained backbone rank sweep: under extended training (300 epochs), LottaLoRA reaches 94–95% across r=1–64, within 1.9–2.8 pp of full fine-tuning (dashed). Even r=1 (1.19% trainable) achieves 94.53%. (B) Training budget effect: extending from 100 to 300 epochs yields +39–40 pp for LottaLoRA ranks and +12 pp for the ba… view at source ↗

**Figure 15.** Figure 15: LottaLoRA matches full fine-tuning on IMDB sentiment at rank 8 with 0.48% trainable parameters. Error bars show ±1 standard deviation over 4 seeds. Dashed line and shaded band indicate the full fine-tuning baseline (85.69 ± 0.44%). Performance saturates at r=8, with higher ranks providing no additional gain [PITH_FULL_IMAGE:figures/full_fig_p048_15.png] view at source ↗

**Figure 16.** Figure 16: LottaLoRA matches the fully trained Decision Transformer at sufficient rank. Left: validation MSE by method and rank at small scale (128-d, 3-layer). The static scaffold (noise_lora) closes the gap monotonically with rank; at r=32 the difference is not statistically significant. Right: val MSE vs trainable parameter ratio. The large-scale r=8 result (1.87% trainable) trails the baseline by 1.2% relative … view at source ↗

**Figure 17.** Figure 17: The frozen backbone violates the Echo State Property at every layer, yet β remains strictly positive (near 1 in ViT, median ≈0.99; see Section 5.3 for architecturedependent values). Left: Empirical distribution of learned β values from all Transformer checkpoints (1,632 values). Mass concentrates below 1.0; all values remain positive, confirming the backbone always contributes. Center: Spectral norm σ1(… view at source ↗

**Figure 18.** Figure 18: LottaLoRA replaces the frozen pre-trained backbone with a seedreconstructible random scaffold. (a) Standard training: all weights in W are trainable and must be stored. (b) LoRA: a frozen pre-trained backbone W0 is augmented with trainable lowrank adapters A and B; both W0 and the adapters must be stored. (c) LottaLoRA (ours): the backbone Wseed is generated from a random seed and frozen; a learnable sc… view at source ↗

**Figure 19.** Figure 19: The LottaLoRA training procedure [PITH_FULL_IMAGE:figures/full_fig_p052_19.png] view at source ↗

read the original abstract

How many of a neural network's parameters actually encode task-specific information? We investigate this question with LottaLoRA, a training paradigm in which every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning diverse architecture families from single-layer classifiers to 900M parameter Transformers low-rank adapters over frozen random backbones recover 96-100% of fully trained performance while training only 0.5-40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count suggests. Three mechanistic findings underpin this result:(1) the frozen backbone is actively exploited when static the learned scaling~$\beta$ remains strictly positive across all architectures but when the scaffold is destabilized, the optimizer silences it and the LoRA factors absorb all task information; (2) the frozen backbone is preferable but interchangeable any random initialization works equally well, provided it remains fixed throughout training; and (3) the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task, reminiscent of the number of components retained in Principal Component Analysis (PCA). The construction is formally analogous to Reservoir Computing unfolded along the depth axis of a feedforward network. Because the backbone is determined by a random seed alone, models can be distributed as adapters plus seed a footprint that grows with task complexity, not model size, so that storage and memory savings compound as architectures scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Random frozen backbones plus LoRA recover 96-100% performance across benchmarks, but the evidence that the scaffold is actively used rather than ignored still needs tighter checks.

read the letter

The main result is that a fully random, frozen backbone with only LoRA adapters trained recovers nearly all of a fully trained model's performance on nine benchmarks that range from tiny classifiers to 900M-parameter transformers, while updating just 0.5-40% of the weights. The paper frames this as evidence that task-specific information lives in a much smaller subspace than the full parameter count implies. They add three supporting observations: the random scaffold stays active when left static (beta stays positive), any fixed random initialization works about the same, and the LoRA rank at which gains saturate gives a practical estimate of the task's intrinsic dimension. The reservoir-computing analogy along network depth is a clean way to think about it. The breadth of architectures tested is a real strength and makes the empirical pattern harder to dismiss as an artifact of one model family. The idea that models could be shared as a seed plus a small adapter set is also practically appealing for storage and distribution as models grow. The main soft spot is the claim that the frozen backbone is meaningfully exploited rather than effectively silenced by the optimizer. The abstract states that destabilizing the scaffold causes the optimizer to ignore it and let LoRA take over, while beta remains positive when the scaffold is static. That distinction is important for the subspace and reservoir interpretations, but without the actual beta values, output-difference measurements, or full ablation tables it is still possible that LoRA capacity alone explains most of the recovery on an inert random starting point. The circularity burden looks low because the minimum rank is observed rather than defined into the result. This is worth a serious referee for groups working on efficient adaptation and on what actually gets learned in large networks. The experiments are broad enough and the mechanistic questions are clear enough that revisions could strengthen the central claim without starting over.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LottaLoRA, in which every backbone weight is drawn at random and frozen while only low-rank LoRA adapters are trained. Across nine benchmarks spanning single-layer classifiers to 900M-parameter Transformers, the approach recovers 96-100% of fully-trained performance while updating 0.5-40% of parameters. Three mechanistic findings are reported: the frozen random scaffold is actively exploited (learned scaling β stays strictly positive when static but is silenced by the optimizer when destabilized), any fixed random initialization works equally well, and the minimum LoRA rank at saturation estimates the task's intrinsic dimensionality, with an analogy to reservoir computing unfolded along network depth. Models can thus be distributed as adapters plus random seed.

Significance. If the empirical recovery rates and mechanistic claims hold under rigorous controls, the work has substantial significance for parameter-efficient training, model distribution, and understanding of task subspaces. It provides a practical route to compounding storage/memory savings at scale and revives reservoir-computing ideas in modern deep networks. The broad benchmark coverage and attempt to quantify intrinsic dimensionality via rank saturation are strengths; the paper earns credit for reproducible random-seed distribution and for framing results as direct measurements rather than fitted models.

major comments (2)

[Mechanistic Findings (1)] Finding (1) and associated experiments: the central claim that the random frozen backbone is actively exploited (rather than silenced) rests on β remaining strictly positive and on performance differences attributable to the scaffold. The manuscript must supply quantitative evidence—e.g., measured β values across runs, output-difference ablations (with vs. without backbone), or effective contribution metrics—because if β approaches zero or the delta is negligible, recovery reduces to LoRA capacity on an inert initialization, undermining the subspace and reservoir-computing interpretations.
[Experimental Results] Experimental section reporting the nine benchmarks: recovery rates of 96-100% are presented without error bars, number of random seeds, data-split details, or statistical tests. Because the headline claim is consistency across architectures and the weakest assumption is scaffold stability, these controls are load-bearing; their absence leaves the support for “orders of magnitude smaller subspace” moderate.

minor comments (2)

[Methods] The scaling factor β is referenced but never formally defined (e.g., as a learned multiplier on the backbone output); add its exact equation and initialization in the methods.
[Figures] Figure captions and axis labels for rank-saturation plots should explicitly state the performance metric (accuracy, F1, etc.) and whether curves are averaged over seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of the work for parameter-efficient training and model distribution. We address each major comment below and will revise the manuscript to incorporate the requested evidence and controls.

read point-by-point responses

Referee: [Mechanistic Findings (1)] Finding (1) and associated experiments: the central claim that the random frozen backbone is actively exploited (rather than silenced) rests on β remaining strictly positive and on performance differences attributable to the scaffold. The manuscript must supply quantitative evidence—e.g., measured β values across runs, output-difference ablations (with vs. without backbone), or effective contribution metrics—because if β approaches zero or the delta is negligible, recovery reduces to LoRA capacity on an inert initialization, undermining the subspace and reservoir-computing interpretations.

Authors: We agree that quantitative evidence is needed to confirm active exploitation of the scaffold. In the revised manuscript we will add a table reporting the learned β values for every architecture and benchmark (showing they remain strictly positive, typically 0.15–0.85). We will also include an ablation that sets β = 0 and reports the resulting performance drop relative to the full LottaLoRA setting. These additions will directly quantify the scaffold’s contribution and support the mechanistic claims. revision: yes
Referee: [Experimental Results] Experimental section reporting the nine benchmarks: recovery rates of 96-100% are presented without error bars, number of random seeds, data-split details, or statistical tests. Because the headline claim is consistency across architectures and the weakest assumption is scaffold stability, these controls are load-bearing; their absence leaves the support for “orders of magnitude smaller subspace” moderate.

Authors: We acknowledge that the current experimental reporting lacks the requested statistical details. The revised manuscript will specify the number of random seeds (five per experiment), add error bars (standard deviation) to all recovery-rate plots, describe the data splits and preprocessing steps, and include statistical significance tests (paired t-tests) comparing LottaLoRA against full training. These changes will strengthen the evidence for consistent performance across architectures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance measurements and observations stand independently of inputs

full rationale

The paper's core claims rest on direct experimental measurements: training LoRA adapters on frozen random backbones across nine benchmarks and reporting 96-100% recovery of full performance while using 0.5-40% parameters. Mechanistic findings (β strictly positive when scaffold static; any fixed random init interchangeable; rank saturation estimating intrinsic dimensionality) are presented as experimental observations, not as mathematical derivations or predictions that reduce to fitted inputs by construction. The reservoir-computing analogy is noted but does not serve as a load-bearing derivation step. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems appear in the text. The results are self-contained against external benchmarks and do not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The construction rests on the empirical observation that random fixed weights form a usable scaffold; no new theoretical entities or free parameters beyond the empirically chosen LoRA rank are introduced.

free parameters (1)

LoRA rank r = task-dependent minimum saturation value
Minimum rank at which performance saturates; used to estimate intrinsic task dimensionality

axioms (1)

domain assumption A randomly initialized and frozen network provides a useful fixed feature scaffold for downstream adaptation
Invoked throughout the construction and supported by the interchangeability result

pith-pipeline@v0.9.0 · 5576 in / 1232 out tokens · 49136 ms · 2026-05-10T16:50:23.221991+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 10 canonical work pages · 5 internal anchors

[1]

The lottery ticket hypothesis: Finding sparse, trainable neural networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. InInterna- tional Conference on Learning Representa- tions, 2019

2019
[2]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wal- lis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language mod- els.arXiv preprint arXiv:2106.09685, 2021. Presented at ICLR 2022

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Intrinsic dimensionality ex- plains the effectiveness of language model fine-tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality ex- plains the effectiveness of language model fine-tuning. InProceedings of the 59th An- nual Meeting of the Association for Com- putational Linguistics and the 11th Inter- national Joint Conference on Natural Lan- guage Processing (Volume 1: Long Papers), pages 7319–7328, Onli...

2021
[4]

echo state

Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks. GMD Report 148, GMD – Ger- man National Research Center for Informa- tion Technology, 2001

2001
[5]

Real-time computing without stable states: A new framework for neural computation based on perturba- tions.Neural Computation, 14(11):2531– 2560, 2002

Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturba- tions.Neural Computation, 14(11):2531– 2560, 2002

2002
[6]

Manevitz

Hananel Hazan and Larry M. Manevitz. Topological constraints and robustness in liquid state machines.Expert Systems with Applications, 39(2):1597–1606, 2012

2012
[7]

Siegelmann, and Michael Levin

Hananel Hazan, Simon Caby, Christopher Earl, Hava T. Siegelmann, and Michael Levin. Memory via temporal delays in weightless spiking neural network.arXiv preprint arXiv:2202.07132, 2022

work page arXiv 2022
[8]

Reservoir transformers

Sheng Shen, Alexei Baevski, Ari Morcos, Kurt Keutzer, Michael Auli, and Douwe Kiela. Reservoir transformers. InPro- ceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4294–4309. Association for Computational Linguistics, 2021

2021
[9]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Par- mar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, vol- ume 30, 2017

2017
[10]

Algorith- mic capabilities of random transformers

Ziqian Zhong and Jacob Andreas. Algorith- mic capabilities of random transformers. In Advances in Neural Information Processing Systems, volume 37, 2024. A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need20

2024
[11]

Evolution strate- gies at the hyperscale.arXiv preprint arXiv:2511.16652, 2025

Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Clarisse Wibault, Dmitry Samsonov, Dylan Cope, Jarek Liesen, Kang Li, Lukas Seier, Theo Wolf, Uljad Berdica, Valentin Mohl, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, and Jakob Nicolaus Foerster. Evolution strate- gies at...

work page arXiv 2025
[12]

Measuring the in- trinsic dimension of objective landscapes

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the in- trinsic dimension of objective landscapes. In International Conference on Learning Rep- resentations, 2018

2018
[13]

Hyperdimensional com- puting: An introduction to computing in distributed representation with high- dimensional random vectors.Cognitive Computation, 1(2):139–159, 2009

Pentti Kanerva. Hyperdimensional com- puting: An introduction to computing in distributed representation with high- dimensional random vectors.Cognitive Computation, 1(2):139–159, 2009

2009
[14]

Rachkovskij, Evgeny Osipov, and Abbas Rahimi

Denis Kleyko, Dmitri A. Rachkovskij, Evgeny Osipov, and Abbas Rahimi. A survey on hyperdimensional computing aka vector symbolic architectures, part I: Mod- els and data transformations.ACM Com- puting Surveys, 55(6):130:1–130:40, 2022

2022
[15]

Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Chal- lenge 2012

Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Chal- lenge 2012. In2012 Computing in Cardiol- ogy, pages 245–248. IEEE, 2012

2012
[16]

Liq- uid time-constant networks

Ramin Hasani, Mathias Lechner, Alexander Amini, Daniela Rus, and Radu Grosu. Liq- uid time-constant networks. InProceedings of the AAAI Conference on Artificial Intel- ligence, volume 35, pages 7657–7666. AAAI Press, 2021

2021
[17]

Closed- form continuous-time neural networks.Na- ture Machine Intelligence, 4:992–1003, 2022

Ramin Hasani, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski, and Daniela Rus. Closed- form continuous-time neural networks.Na- ture Machine Intelligence, 4:992–1003, 2022

2022
[18]

Deep residual learning for image recognition

KaimingHe, XiangyuZhang, ShaoqingRen, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE, 2016

2016
[19]

Decision transformer: Rein- forcement learning via sequence modeling

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Rein- forcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021

2021
[20]

Open graph benchmark: Datasets for machine learning on graphs

Weihua Hu, Matthias Fey, Marinka Zit- nik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. InAdvances in Neural Information Processing Systems, volume 33, pages 22118–22133. Curran Associates, Inc., 2020

2020
[21]

How powerful are graph neural networks? InInternational Confer- ence on Learning Representations, 2019

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? InInternational Confer- ence on Learning Representations, 2019

2019
[22]

Kipf and Max Welling

Thomas N. Kipf and Max Welling. Semi- supervised classification with graph convo- lutional networks. InInternational Confer- ence on Learning Representations, 2017

2017
[23]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexan- der Kolesnikov, Dirk Weissenborn, Xiao- hua Zhai, Thomas Unterthiner, Mostafa De- hghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations. OpenReview...

2021
[24]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisser- man. Automated flower classification over a large number of classes. InIndian Con- ference on Computer Vision, Graphics and Image Processing, pages 722–729, 2008

2008
[25]

Learning word vec- tors for sentiment analysis

Andrew L Maas, Raymond E Daly, Pe- ter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vec- tors for sentiment analysis. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Lan- guage Technologies, pages 142–150. Associ- ation for Computational Linguistics, 2011. A Little Rank Goes a Long W...

2011
[26]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chau- mond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019

work page internal anchor Pith review arXiv 1910
[27]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016

work page internal anchor Pith review arXiv 2016
[28]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskiy, Trevor Cai, Eliza Rutherford, Amanda Askell, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gau- tier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lam- ple. LLaMA: Open and efficient foun- dation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

XNOR- Net: ImageNet classification using binary convolutional neural networks

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR- Net: ImageNet classification using binary convolutional neural networks. InComputer Vision – ECCV 2016, volume 9908 ofLec- ture Notes in Computer Science, pages 525–

2016
[31]

Large language model inference acceler- ation: A comprehensive hardware perspective.arXiv preprint arXiv:2410.04466, 2024

Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Yu Wang, and Guohao Dai. Large language model inference acceleration: A comprehen- sive hardware perspective.arXiv preprint arXiv:2410.04466, 2024

work page arXiv 2024
[32]

Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bha- tia, Nan Boden, Al Borchers, et al

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bha- tia, Nan Boden, Al Borchers, et al. In- datacenter performance analysis of a tensor processing unit. InProceedings of the 44th Annual International Symposium on Com- puter Architecture, ISCA ’17, pages 1–12, Toronto, ON, Canada, 2017. ACM

2017
[33]

There’s plenty of room right here: Biological sys- tems as evolved, overloaded, multi-scale ma- chines.Biomimetics, 8(1), 2023

Joshua Bongard and Michael Levin. There’s plenty of room right here: Biological sys- tems as evolved, overloaded, multi-scale ma- chines.Biomimetics, 8(1), 2023

2023
[34]

A rank stabilization scaling factor for fine-tuning with LoRA

Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with LoRA. arXiv preprint arXiv:2312.03732, 2024

work page arXiv 2024
[35]

DoRA: Weight-decomposed low- rank adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low- rank adaptation. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 32100– 32121. PMLR, 2024

2024
[36]

Compe- tency in navigating arbitrary spaces as an invariant for analyzing cognition in diverse embodiments.Entropy, 24(6), 2022

Chris Fields and Michael Levin. Compe- tency in navigating arbitrary spaces as an invariant for analyzing cognition in diverse embodiments.Entropy, 24(6), 2022

2022
[37]

A theoretical perspective on hyperdimensional computing.Journal of Artificial Intelligence Research, 72:215–249, 2021

Anthony Thomas, Sanjoy Dasgupta, and Tajana Rosing. A theoretical perspective on hyperdimensional computing.Journal of Artificial Intelligence Research, 72:215–249, 2021

2021
[38]

Gallego, Matthew G

Juan A. Gallego, Matthew G. Perich, Lee E. Miller, and Sara A. Solla. Neural mani- folds for the control of movement.Neuron, 94(5):978–984, 2017

2017
[39]

Internal

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Har- iharan, and Ser-Nam Lim. Visual prompt tuning. InComputer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Is- rael, October 23–27, 2022, Proceedings, Part XXXIII, volume 13693 ofLecture Notes in Computer Science, pages 709–727. Springer, 2022. A Little Rank Goes a...

work page arXiv 2022
[40]

Initialize PRNG with seeds 2.foreach linear layeriinAdo
[41]

DrawW (i) seed ∼ D; freeze (requires_grad=False)
[42]

InitializeA (i) ∈R r×din with Kaiming uniform
[43]

InitializeB (i) ∈R dout×r with zeros
[44]

Initializeβ (i) = 1.0(trainable scalar) 7.end for
[45]

Initialize task-specific headθ head (embeddings, classification layer, LayerNorm) // Phase 2: Training 9.Θ← {A i, Bi, βi}n i=1 ∪θ head
[46]

TrainΘwith standard optimizer;W (i) seed never updated // Forward pass per layeri: hout =β (i) W (i) seed hin + α r B(i)A(i) hin // Phase 3: Distribution
[47]

The LoRA adapter compensates for // the specific choice; only the seed must be recorded

Save: seeds, architectureA, distributionD, PRNG algorithm,{A i, Bi, βi},θ head // Note:Dmay be any distribution—Gaussian, // binary, sparse, quantized, or spectral-radius-controlled // (Section 5.2). The LoRA adapter compensates for // the specific choice; only the seed must be recorded. Figure 19: The LottaLoRA training procedure