pith. machine review for the scientific record. sign in

arxiv: 2604.08749 · v2 · submitted 2026-04-09 · 💻 cs.LG · cs.NE

Recognition: unknown

A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:50 UTC · model grok-4.3

classification 💻 cs.LG cs.NE
keywords LoRAlow-rank adaptersfrozen backbonerandom initializationparameter efficiencyintrinsic dimensionalityreservoir computing
0
0 comments X

The pith

Frozen random neural network backbones with low-rank LoRA adapters recover 96-100% of full training performance while updating only 0.5-40% of parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that freezing a randomly initialized neural network backbone and training only low-rank LoRA adapters on it recovers 96 to 100 percent of the accuracy achieved by training the entire model. This holds true across nine benchmarks that include everything from basic classifiers to 900 million parameter transformers, while using just 0.5 to 40 percent of the parameters. The result indicates that task-specific information is encoded in a much smaller subspace than the full network size would imply. Experiments reveal that the backbone must stay frozen to be useful, that any random initialization suffices if kept fixed, and that the lowest effective LoRA rank reflects the task's intrinsic dimensionality.

Core claim

In LottaLoRA every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning single-layer classifiers to 900M parameter Transformers this recovers 96-100% of fully trained performance while training 0.5-40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count suggests. The frozen backbone is actively exploited when static, any random initialization works equally well provided it remains fixed, and the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task.

What carries the argument

LottaLoRA, the training paradigm that freezes a randomly initialized backbone network and trains only low-rank adapters on top of it.

Load-bearing premise

The random backbone must remain completely fixed and unchanged throughout training.

What would settle it

Allow the backbone weights to update during optimization in the same setup and check whether the 96-100% performance recovery relative to full training disappears.

Figures

Figures reproduced from arXiv: 2604.08749 by Benedikt Hartl, Hananel Hazan, Michael Levin, Yanbo Zhang.

Figure 1
Figure 1. Figure 1: LottaLoRA replaces pre-trained weights with seeded reservoirs. Three param￾eterization strategies for a single network layer. (a) A conventional dense layer stores and trains all m × n weights. (b) Low-Rank Adaptation (LoRA) freezes a pre-trained weight matrix W0 and learns only two small factors A ∈ R m×r and B ∈ R r×n ; the stored parameters are W0 plus the adapters. (c) LottaLoRA (this work) replaces W0… view at source ↗
Figure 2
Figure 2. Figure 2: MNIST: LottaLoRA accuracy scales monotonically with LoRA rank, closing the gap to fully trained baselines. (A) Accuracy scales monotonically with LoRA rank across three model sizes, closing the gap to fully trained baselines (dashed); the medium preset (4 lay￾ers, widths 512–64; see Appendix B for all presets) reaches 96.8% at rank 8 with only 3.65% of the parameters of the fully trained counterpart. (B) P… view at source ↗
Figure 3
Figure 3. Figure 3: A single shared adapter produces seed-gated task specialization with out￾of-class rejection. One LoRA adapter is trained across three disjoint MNIST label partitions ({1, 2, 3}, {4, 5, 6}, {7, 8, 9}), each paired with a distinct backbone seed s. Columns show seeds 42, 43, 44; cells show row-normalized test accuracy (%); black rectangles mark assigned classes; dashed orange columns highlight digit 0 (exclud… view at source ↗
Figure 4
Figure 4. Figure 4: LottaLoRA AUROC saturates at rank 2 on PhysioNet 2012 ICU mortality. Mean ± std AUROC over 5 seeds at ranks 1–32. Dashed line: fully trained CfC baseline (0.836). Rank 1 recovers 99.5% of baseline with 3.7% of trainable parameters [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overparameterization masks rank saturation on CfC PhysioNet. (a) LottaLoRA recovery (normalized to each scale’s fully trained baseline) vs. LoRA rank at four CfC sizes. At 0.125× (h=32, 2,210 parameters), rank 1 recovers only 93.7% and saturation shifts to r=4; at larger scales rank 1 already exceeds 98.8%. (b) Fully trained baselines improve only from 0.831 to 0.836 across a 42× parameter increase, confir… view at source ↗
Figure 6
Figure 6. Figure 6: LottaLoRA narrows the gap to full training as backbone size increases. Training loss curves on WikiText-103 at five scales (3 M to 900 M); colored curves show LottaLoRA (hue encodes scale, lightness encodes rank), grayscale shows fully trained baselines. At 900 M, the best LottaLoRA run (rank 8) reaches 3.950 vs 3.156 for full training, while training fewer than 0.5% of the internal parameters [PITH_FULL_… view at source ↗
Figure 7
Figure 7. Figure 7: A large frozen backbone with few LoRA parameters outperforms a small fully trained model. Each colored curve shows LottaLoRA at one backbone scale across ranks; gray squares show fully trained baselines. At 900 M, rank-8 LottaLoRA (3.6 M trainable) achieves loss 3.950, while the fully trained 3 M model (320 K trainable) reaches only 5.007 [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Different tasks saturate at different minimum ranks, reflecting intrinsic di￾mensionality. Each curve shows LottaLoRA performance (normalized as % of fully trained base￾line recovered) against LoRA rank. CfC PhysioNet (ICU mortality) is flat from r=1 at the pub￾lished architecture scale; a size-reduction ablation ( [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: LottaLoRA recovers 96–100% of baseline performance across eight bench￾marks. Each bar shows the ratio of LottaLoRA to baseline performance (accuracy, R2 , or inverted MSE as appropriate); annotations give the trainable parameter ratio. Data from Tables 14, 5, 7, and 21 [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Every Wseed initialization family exceeds 95% at rank 8 on MNIST. Left: Every initialization family exceeds 95% at rank 8, showing that the scaffold’s specific distribution has negligible effect. Right: All 22 families converge tightly as rank increases, confirming that the scaffold is interchangeable on this task [PITH_FULL_IMAGE:figures/full_fig_p043_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: LottaLoRA matches the fully trained CfC baseline on PhysioNet 2012 ICU mortality at every rank. Mean ± std AUROC over 5 seeds at ranks 1–32. Dashed line: Full CfC baseline (AUROC = 0.836) [PITH_FULL_IMAGE:figures/full_fig_p044_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: LottaLoRA recovers 97.5% of baseline ROC-AUC on molecular property prediction. Test ROC-AUC on OGBG-MolHIV (5 seeds, error bars show ±1 std). Dashed lines mark published OGB baselines [20]: GIN (red) and GIN+virtual node (green) [PITH_FULL_IMAGE:figures/full_fig_p045_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: LottaLoRA recovers 97.6% of baseline GCN accuracy on OGBN-Arxiv node classification. Test accuracy (10 seeds, error bars show ±1 std). Dashed line marks the fully trained GCN baseline (71.86%) [PITH_FULL_IMAGE:figures/full_fig_p046_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Learned reservoir quality accounts for a ∼40 pp advantage over random back￾bones. (A) Pre-trained backbone rank sweep: under extended training (300 epochs), LottaLoRA reaches 94–95% across r=1–64, within 1.9–2.8 pp of full fine-tuning (dashed). Even r=1 (1.19% trainable) achieves 94.53%. (B) Training budget effect: extending from 100 to 300 epochs yields +39–40 pp for LottaLoRA ranks and +12 pp for the ba… view at source ↗
Figure 15
Figure 15. Figure 15: LottaLoRA matches full fine-tuning on IMDB sentiment at rank 8 with 0.48% trainable parameters. Error bars show ±1 standard deviation over 4 seeds. Dashed line and shaded band indicate the full fine-tuning baseline (85.69 ± 0.44%). Performance saturates at r=8, with higher ranks providing no additional gain [PITH_FULL_IMAGE:figures/full_fig_p048_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: LottaLoRA matches the fully trained Decision Transformer at sufficient rank. Left: validation MSE by method and rank at small scale (128-d, 3-layer). The static scaffold (noise_lora) closes the gap monotonically with rank; at r=32 the difference is not statistically sig￾nificant. Right: val MSE vs trainable parameter ratio. The large-scale r=8 result (1.87% trainable) trails the baseline by 1.2% relative … view at source ↗
Figure 17
Figure 17. Figure 17: The frozen backbone violates the Echo State Property at every layer, yet β remains strictly positive (near 1 in ViT, median ≈0.99; see Section 5.3 for architecture￾dependent values). Left: Empirical distribution of learned β values from all Transformer check￾points (1,632 values). Mass concentrates below 1.0; all values remain positive, confirming the backbone always contributes. Center: Spectral norm σ1(… view at source ↗
Figure 18
Figure 18. Figure 18: LottaLoRA replaces the frozen pre-trained backbone with a seed￾reconstructible random scaffold. (a) Standard training: all weights in W are trainable and must be stored. (b) LoRA: a frozen pre-trained backbone W0 is augmented with trainable low￾rank adapters A and B; both W0 and the adapters must be stored. (c) LottaLoRA (ours): the backbone Wseed is generated from a random seed and frozen; a learnable sc… view at source ↗
Figure 19
Figure 19. Figure 19: The LottaLoRA training procedure [PITH_FULL_IMAGE:figures/full_fig_p052_19.png] view at source ↗
read the original abstract

How many of a neural network's parameters actually encode task-specific information? We investigate this question with LottaLoRA, a training paradigm in which every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning diverse architecture families from single-layer classifiers to 900M parameter Transformers low-rank adapters over frozen random backbones recover 96-100% of fully trained performance while training only 0.5-40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count suggests. Three mechanistic findings underpin this result:(1) the frozen backbone is actively exploited when static the learned scaling~$\beta$ remains strictly positive across all architectures but when the scaffold is destabilized, the optimizer silences it and the LoRA factors absorb all task information; (2) the frozen backbone is preferable but interchangeable any random initialization works equally well, provided it remains fixed throughout training; and (3) the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task, reminiscent of the number of components retained in Principal Component Analysis (PCA). The construction is formally analogous to Reservoir Computing unfolded along the depth axis of a feedforward network. Because the backbone is determined by a random seed alone, models can be distributed as adapters plus seed a footprint that grows with task complexity, not model size, so that storage and memory savings compound as architectures scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LottaLoRA, in which every backbone weight is drawn at random and frozen while only low-rank LoRA adapters are trained. Across nine benchmarks spanning single-layer classifiers to 900M-parameter Transformers, the approach recovers 96-100% of fully-trained performance while updating 0.5-40% of parameters. Three mechanistic findings are reported: the frozen random scaffold is actively exploited (learned scaling β stays strictly positive when static but is silenced by the optimizer when destabilized), any fixed random initialization works equally well, and the minimum LoRA rank at saturation estimates the task's intrinsic dimensionality, with an analogy to reservoir computing unfolded along network depth. Models can thus be distributed as adapters plus random seed.

Significance. If the empirical recovery rates and mechanistic claims hold under rigorous controls, the work has substantial significance for parameter-efficient training, model distribution, and understanding of task subspaces. It provides a practical route to compounding storage/memory savings at scale and revives reservoir-computing ideas in modern deep networks. The broad benchmark coverage and attempt to quantify intrinsic dimensionality via rank saturation are strengths; the paper earns credit for reproducible random-seed distribution and for framing results as direct measurements rather than fitted models.

major comments (2)
  1. [Mechanistic Findings (1)] Finding (1) and associated experiments: the central claim that the random frozen backbone is actively exploited (rather than silenced) rests on β remaining strictly positive and on performance differences attributable to the scaffold. The manuscript must supply quantitative evidence—e.g., measured β values across runs, output-difference ablations (with vs. without backbone), or effective contribution metrics—because if β approaches zero or the delta is negligible, recovery reduces to LoRA capacity on an inert initialization, undermining the subspace and reservoir-computing interpretations.
  2. [Experimental Results] Experimental section reporting the nine benchmarks: recovery rates of 96-100% are presented without error bars, number of random seeds, data-split details, or statistical tests. Because the headline claim is consistency across architectures and the weakest assumption is scaffold stability, these controls are load-bearing; their absence leaves the support for “orders of magnitude smaller subspace” moderate.
minor comments (2)
  1. [Methods] The scaling factor β is referenced but never formally defined (e.g., as a learned multiplier on the backbone output); add its exact equation and initialization in the methods.
  2. [Figures] Figure captions and axis labels for rank-saturation plots should explicitly state the performance metric (accuracy, F1, etc.) and whether curves are averaged over seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of the work for parameter-efficient training and model distribution. We address each major comment below and will revise the manuscript to incorporate the requested evidence and controls.

read point-by-point responses
  1. Referee: [Mechanistic Findings (1)] Finding (1) and associated experiments: the central claim that the random frozen backbone is actively exploited (rather than silenced) rests on β remaining strictly positive and on performance differences attributable to the scaffold. The manuscript must supply quantitative evidence—e.g., measured β values across runs, output-difference ablations (with vs. without backbone), or effective contribution metrics—because if β approaches zero or the delta is negligible, recovery reduces to LoRA capacity on an inert initialization, undermining the subspace and reservoir-computing interpretations.

    Authors: We agree that quantitative evidence is needed to confirm active exploitation of the scaffold. In the revised manuscript we will add a table reporting the learned β values for every architecture and benchmark (showing they remain strictly positive, typically 0.15–0.85). We will also include an ablation that sets β = 0 and reports the resulting performance drop relative to the full LottaLoRA setting. These additions will directly quantify the scaffold’s contribution and support the mechanistic claims. revision: yes

  2. Referee: [Experimental Results] Experimental section reporting the nine benchmarks: recovery rates of 96-100% are presented without error bars, number of random seeds, data-split details, or statistical tests. Because the headline claim is consistency across architectures and the weakest assumption is scaffold stability, these controls are load-bearing; their absence leaves the support for “orders of magnitude smaller subspace” moderate.

    Authors: We acknowledge that the current experimental reporting lacks the requested statistical details. The revised manuscript will specify the number of random seeds (five per experiment), add error bars (standard deviation) to all recovery-rate plots, describe the data splits and preprocessing steps, and include statistical significance tests (paired t-tests) comparing LottaLoRA against full training. These changes will strengthen the evidence for consistent performance across architectures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance measurements and observations stand independently of inputs

full rationale

The paper's core claims rest on direct experimental measurements: training LoRA adapters on frozen random backbones across nine benchmarks and reporting 96-100% recovery of full performance while using 0.5-40% parameters. Mechanistic findings (β strictly positive when scaffold static; any fixed random init interchangeable; rank saturation estimating intrinsic dimensionality) are presented as experimental observations, not as mathematical derivations or predictions that reduce to fitted inputs by construction. The reservoir-computing analogy is noted but does not serve as a load-bearing derivation step. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems appear in the text. The results are self-contained against external benchmarks and do not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The construction rests on the empirical observation that random fixed weights form a usable scaffold; no new theoretical entities or free parameters beyond the empirically chosen LoRA rank are introduced.

free parameters (1)
  • LoRA rank r = task-dependent minimum saturation value
    Minimum rank at which performance saturates; used to estimate intrinsic task dimensionality
axioms (1)
  • domain assumption A randomly initialized and frozen network provides a useful fixed feature scaffold for downstream adaptation
    Invoked throughout the construction and supported by the interchangeability result

pith-pipeline@v0.9.0 · 5576 in / 1232 out tokens · 49136 ms · 2026-05-10T16:50:23.221991+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 10 canonical work pages · 5 internal anchors

  1. [1]

    The lottery ticket hypothesis: Finding sparse, trainable neural networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. InInterna- tional Conference on Learning Representa- tions, 2019

  2. [2]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wal- lis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language mod- els.arXiv preprint arXiv:2106.09685, 2021. Presented at ICLR 2022

  3. [3]

    Intrinsic dimensionality ex- plains the effectiveness of language model fine-tuning

    Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality ex- plains the effectiveness of language model fine-tuning. InProceedings of the 59th An- nual Meeting of the Association for Com- putational Linguistics and the 11th Inter- national Joint Conference on Natural Lan- guage Processing (Volume 1: Long Papers), pages 7319–7328, Onli...

  4. [4]

    echo state

    Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks. GMD Report 148, GMD – Ger- man National Research Center for Informa- tion Technology, 2001

  5. [5]

    Real-time computing without stable states: A new framework for neural computation based on perturba- tions.Neural Computation, 14(11):2531– 2560, 2002

    Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturba- tions.Neural Computation, 14(11):2531– 2560, 2002

  6. [6]

    Manevitz

    Hananel Hazan and Larry M. Manevitz. Topological constraints and robustness in liquid state machines.Expert Systems with Applications, 39(2):1597–1606, 2012

  7. [7]

    Siegelmann, and Michael Levin

    Hananel Hazan, Simon Caby, Christopher Earl, Hava T. Siegelmann, and Michael Levin. Memory via temporal delays in weightless spiking neural network.arXiv preprint arXiv:2202.07132, 2022

  8. [8]

    Reservoir transformers

    Sheng Shen, Alexei Baevski, Ari Morcos, Kurt Keutzer, Michael Auli, and Douwe Kiela. Reservoir transformers. InPro- ceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4294–4309. Association for Computational Linguistics, 2021

  9. [9]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Par- mar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, vol- ume 30, 2017

  10. [10]

    Algorith- mic capabilities of random transformers

    Ziqian Zhong and Jacob Andreas. Algorith- mic capabilities of random transformers. In Advances in Neural Information Processing Systems, volume 37, 2024. A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need20

  11. [11]

    Evolution strate- gies at the hyperscale.arXiv preprint arXiv:2511.16652, 2025

    Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Clarisse Wibault, Dmitry Samsonov, Dylan Cope, Jarek Liesen, Kang Li, Lukas Seier, Theo Wolf, Uljad Berdica, Valentin Mohl, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, and Jakob Nicolaus Foerster. Evolution strate- gies at...

  12. [12]

    Measuring the in- trinsic dimension of objective landscapes

    Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the in- trinsic dimension of objective landscapes. In International Conference on Learning Rep- resentations, 2018

  13. [13]

    Hyperdimensional com- puting: An introduction to computing in distributed representation with high- dimensional random vectors.Cognitive Computation, 1(2):139–159, 2009

    Pentti Kanerva. Hyperdimensional com- puting: An introduction to computing in distributed representation with high- dimensional random vectors.Cognitive Computation, 1(2):139–159, 2009

  14. [14]

    Rachkovskij, Evgeny Osipov, and Abbas Rahimi

    Denis Kleyko, Dmitri A. Rachkovskij, Evgeny Osipov, and Abbas Rahimi. A survey on hyperdimensional computing aka vector symbolic architectures, part I: Mod- els and data transformations.ACM Com- puting Surveys, 55(6):130:1–130:40, 2022

  15. [15]

    Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Chal- lenge 2012

    Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Chal- lenge 2012. In2012 Computing in Cardiol- ogy, pages 245–248. IEEE, 2012

  16. [16]

    Liq- uid time-constant networks

    Ramin Hasani, Mathias Lechner, Alexander Amini, Daniela Rus, and Radu Grosu. Liq- uid time-constant networks. InProceedings of the AAAI Conference on Artificial Intel- ligence, volume 35, pages 7657–7666. AAAI Press, 2021

  17. [17]

    Closed- form continuous-time neural networks.Na- ture Machine Intelligence, 4:992–1003, 2022

    Ramin Hasani, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski, and Daniela Rus. Closed- form continuous-time neural networks.Na- ture Machine Intelligence, 4:992–1003, 2022

  18. [18]

    Deep residual learning for image recognition

    KaimingHe, XiangyuZhang, ShaoqingRen, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE, 2016

  19. [19]

    Decision transformer: Rein- forcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Rein- forcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021

  20. [20]

    Open graph benchmark: Datasets for machine learning on graphs

    Weihua Hu, Matthias Fey, Marinka Zit- nik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. InAdvances in Neural Information Processing Systems, volume 33, pages 22118–22133. Curran Associates, Inc., 2020

  21. [21]

    How powerful are graph neural networks? InInternational Confer- ence on Learning Representations, 2019

    Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? InInternational Confer- ence on Learning Representations, 2019

  22. [22]

    Kipf and Max Welling

    Thomas N. Kipf and Max Welling. Semi- supervised classification with graph convo- lutional networks. InInternational Confer- ence on Learning Representations, 2017

  23. [23]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexan- der Kolesnikov, Dirk Weissenborn, Xiao- hua Zhai, Thomas Unterthiner, Mostafa De- hghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations. OpenReview...

  24. [24]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisser- man. Automated flower classification over a large number of classes. InIndian Con- ference on Computer Vision, Graphics and Image Processing, pages 722–729, 2008

  25. [25]

    Learning word vec- tors for sentiment analysis

    Andrew L Maas, Raymond E Daly, Pe- ter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vec- tors for sentiment analysis. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Lan- guage Technologies, pages 142–150. Associ- ation for Computational Linguistics, 2011. A Little Rank Goes a Long W...

  26. [26]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chau- mond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019

  27. [27]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016

  28. [28]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskiy, Trevor Cai, Eliza Rutherford, Amanda Askell, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

  29. [29]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gau- tier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lam- ple. LLaMA: Open and efficient foun- dation language models.arXiv preprint arXiv:2302.13971, 2023

  30. [30]

    XNOR- Net: ImageNet classification using binary convolutional neural networks

    Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR- Net: ImageNet classification using binary convolutional neural networks. InComputer Vision – ECCV 2016, volume 9908 ofLec- ture Notes in Computer Science, pages 525–

  31. [31]

    Large language model inference acceler- ation: A comprehensive hardware perspective.arXiv preprint arXiv:2410.04466, 2024

    Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Yu Wang, and Guohao Dai. Large language model inference acceleration: A comprehen- sive hardware perspective.arXiv preprint arXiv:2410.04466, 2024

  32. [32]

    Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bha- tia, Nan Boden, Al Borchers, et al

    Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bha- tia, Nan Boden, Al Borchers, et al. In- datacenter performance analysis of a tensor processing unit. InProceedings of the 44th Annual International Symposium on Com- puter Architecture, ISCA ’17, pages 1–12, Toronto, ON, Canada, 2017. ACM

  33. [33]

    There’s plenty of room right here: Biological sys- tems as evolved, overloaded, multi-scale ma- chines.Biomimetics, 8(1), 2023

    Joshua Bongard and Michael Levin. There’s plenty of room right here: Biological sys- tems as evolved, overloaded, multi-scale ma- chines.Biomimetics, 8(1), 2023

  34. [34]

    A rank stabilization scaling factor for fine-tuning with LoRA

    Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with LoRA. arXiv preprint arXiv:2312.03732, 2024

  35. [35]

    DoRA: Weight-decomposed low- rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low- rank adaptation. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 32100– 32121. PMLR, 2024

  36. [36]

    Compe- tency in navigating arbitrary spaces as an invariant for analyzing cognition in diverse embodiments.Entropy, 24(6), 2022

    Chris Fields and Michael Levin. Compe- tency in navigating arbitrary spaces as an invariant for analyzing cognition in diverse embodiments.Entropy, 24(6), 2022

  37. [37]

    A theoretical perspective on hyperdimensional computing.Journal of Artificial Intelligence Research, 72:215–249, 2021

    Anthony Thomas, Sanjoy Dasgupta, and Tajana Rosing. A theoretical perspective on hyperdimensional computing.Journal of Artificial Intelligence Research, 72:215–249, 2021

  38. [38]

    Gallego, Matthew G

    Juan A. Gallego, Matthew G. Perich, Lee E. Miller, and Sara A. Solla. Neural mani- folds for the control of movement.Neuron, 94(5):978–984, 2017

  39. [39]

    Internal

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Har- iharan, and Ser-Nam Lim. Visual prompt tuning. InComputer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Is- rael, October 23–27, 2022, Proceedings, Part XXXIII, volume 13693 ofLecture Notes in Computer Science, pages 709–727. Springer, 2022. A Little Rank Goes a...

  40. [40]

    Initialize PRNG with seeds 2.foreach linear layeriinAdo

  41. [41]

    DrawW (i) seed ∼ D; freeze (requires_grad=False)

  42. [42]

    InitializeA (i) ∈R r×din with Kaiming uniform

  43. [43]

    InitializeB (i) ∈R dout×r with zeros

  44. [44]

    Initializeβ (i) = 1.0(trainable scalar) 7.end for

  45. [45]

    Initialize task-specific headθ head (embeddings, classification layer, LayerNorm) // Phase 2: Training 9.Θ← {A i, Bi, βi}n i=1 ∪θ head

  46. [46]

    TrainΘwith standard optimizer;W (i) seed never updated // Forward pass per layeri: hout =β (i) W (i) seed hin + α r B(i)A(i) hin // Phase 3: Distribution

  47. [47]

    The LoRA adapter compensates for // the specific choice; only the seed must be recorded

    Save: seeds, architectureA, distributionD, PRNG algorithm,{A i, Bi, βi},θ head // Note:Dmay be any distribution—Gaussian, // binary, sparse, quantized, or spectral-radius-controlled // (Section 5.2). The LoRA adapter compensates for // the specific choice; only the seed must be recorded. Figure 19: The LottaLoRA training procedure