Training Transformers in Cosine Coefficient Space
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-10 20:03 UTC · model grok-4.3
The pith
Transformers reach near-dense performance when trained only on half the 2D DCT coefficients of their weight matrices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We replace each linear layer with one that stores K out of mn two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix through an inverse DCT at every forward pass; the K coefficients are the trainable parameters. A 4-layer, 128-dim transformer trained from scratch on character-level Shakespeare reaches validation loss 1.604 at K = mn/2, against 1.580 for a standard dense baseline. A rank-48 LoRA factorization at the same trainable parameter count reaches only 1.801. The structural advantage of sparse-coefficient over low-rank parameterizations at matched K is qualitative, stemming from rank flexibility in the coefficient subspace.
What carries the argument
Trainable subset of 2D DCT coefficients per weight matrix, with inverse DCT reconstruction to form the full matrix for forward passes.
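As a concrete reading of this parameterization, here is a minimal PyTorch sketch of such a layer. It is not the authors' code: the class name DCTLinear, the random choice of which K coefficients to keep, and the initialization scale are illustrative assumptions; only the structure (K trainable coefficients, inverse 2D DCT reconstruction in every forward pass) follows the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix C of size n x n, so C @ C.T = I."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index
    i = torch.arange(n, dtype=torch.float32).unsqueeze(0)   # sample index
    C = (2.0 / n) ** 0.5 * torch.cos(torch.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= 2.0 ** 0.5                                      # rescale DC row for orthonormality
    return C


class DCTLinear(nn.Module):
    """Linear layer whose only trainable weight-matrix parameters are K 2D DCT coefficients."""

    def __init__(self, in_features: int, out_features: int, K: int):
        super().__init__()
        m, n = out_features, in_features
        self.m, self.n = m, n
        # Fixed transform matrices and coefficient positions (not trained).
        self.register_buffer("Cm", dct_matrix(m))
        self.register_buffer("Cn", dct_matrix(n))
        self.register_buffer("idx", torch.randperm(m * n)[:K])  # which K positions to keep
        # The K kept coefficients are the trainable parameters.
        self.coeff = nn.Parameter(0.02 * torch.randn(K))
        self.bias = nn.Parameter(torch.zeros(m))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scatter the K coefficients into an otherwise-zero m x n grid.
        flat = torch.zeros(self.m * self.n, dtype=x.dtype, device=x.device)
        A = flat.scatter(0, self.idx, self.coeff).view(self.m, self.n)
        # Inverse 2D DCT reconstruction, rebuilt at every forward pass: W = Cm^T A Cn.
        W = self.Cm.t() @ A @ self.Cn
        return F.linear(x, W, self.bias)
```

With K = m*n // 2 this sketch trains half as many weight-matrix parameters as a dense nn.Linear, and the reconstruction costs two small matrix multiplies, which is the separability the bandwidth argument below relies on.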
If this is right
- The performance gap to dense training stays small at half the parameters on this task.
- Low-rank methods like LoRA underperform the coefficient approach at equivalent parameter budgets.
- Orthonormal bases that preserve high-rank capacity maintain low loss, unlike those that force low-rank blocks.
- The fast separable DCT allows fused on-chip reconstruction, converting parameter savings into bandwidth savings (see the sketch after this list).
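The separability claim in the last bullet is easy to check directly. The snippet below is an editorial illustration, not taken from the paper, and assumes scipy.fft as the transform implementation: it verifies that the 2D inverse DCT factorizes into 1D inverse DCTs applied along rows and then columns, which is what makes a two-pass, on-chip reconstruction kernel plausible.

```python
import numpy as np
from scipy.fft import idct, idctn

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 512))                      # coefficient matrix (m x n)

full = idctn(A, type=2, norm="ortho")                    # direct 2D inverse DCT
separable = idct(idct(A, type=2, norm="ortho", axis=1),  # 1D inverse DCT over rows ...
                 type=2, norm="ortho", axis=0)           # ... then over columns
print(np.allclose(full, separable))                      # True: the transform factorizes
```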
Where Pith is reading between the lines
- If the rank flexibility benefit holds at scale, this could enable more parameter-efficient training of large language models without low-rank constraints.
- The approach might integrate with existing optimizers without modification, but testing on longer sequences or different architectures would be needed.
- Other frequency-based transforms could be explored if they offer similar fast reconstruction and rank properties.
Load-bearing premise
The near-parity performance and rank advantage seen in this small-scale experiment will generalize to larger models and tasks without causing optimization instabilities from the repeated reconstructions.
What would settle it
Scaling the model to more layers or training on a larger corpus and checking if the loss difference to the dense baseline exceeds the terminal variation observed in the small run.
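Operationally, that check could be as simple as the sketch below: rerun both configurations over several seeds at the larger scale and ask whether the mean loss gap exceeds the dense baseline's terminal-epoch spread. The function name and inputs are placeholders, not artifacts from the paper.

```python
import numpy as np

def gap_exceeds_terminal_variation(dense_losses, dct_losses):
    """Each argument: terminal-epoch validation losses across independent seeds.

    The spread could equally be measured across the final few epochs of a
    single run, as the small-scale experiment appears to do.
    """
    gap = float(np.mean(dct_losses) - np.mean(dense_losses))
    spread = float(np.std(dense_losses, ddof=1))   # terminal variation of the dense baseline
    return gap, spread, gap > spread
```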
Original abstract
Linear layers hold most of a transformer's parameters. We replace each linear layer with one that stores $K$ out of $mn$ two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix through an inverse DCT at every forward pass; the $K$ coefficients are the trainable parameters. A 4-layer, 128-dim transformer trained from scratch on character-level Shakespeare reaches validation loss $1.604$ at $K = mn/2$, against $1.580$ for a standard dense baseline -- a gap of $+0.024$ at half the trainable parameter count, within the terminal-epoch variation of the dense run. A rank-48 LoRA factorization at the same trainable parameter count reaches only $1.801$ ($+0.221$). The structural advantage of sparse-coefficient over low-rank parameterizations at matched $K$ is qualitative. We identify rank flexibility as the mechanism. A random orthonormal basis matches the DCT within noise at $K = mn/2$, and a compression sweep through $K = mn/10$ and $K = mn/20$ shows that subspaces that can host high-rank matrices keep the loss low, while subspaces that flatten into a low-rank block (zigzag-selection variants) converge onto the observed stable rank \emph{and} the loss line of the rank-48 LoRA reference in lock-step. Among these orthonormal bases, the DCT is preferred because its separable fast transform admits a fused reconstruction kernel: the materialized weight matrix never leaves on-chip memory, so the parameter saving translates into a bandwidth saving as well.
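To make the zigzag-versus-scattered contrast in the abstract tangible, here is a small editorial construction, not the authors' experiment: with roughly the same coefficient budget, a scattered selection reconstructs a typically full-rank matrix, while packing the coefficients into a k x k low-frequency block caps the rank of the reconstruction at k. The stable rank ||W||_F^2 / ||W||_2^2 printed alongside is the quantity the abstract tracks.

```python
import numpy as np
from scipy.fft import idctn

def stable_rank(W):
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))        # ||W||_F^2 / ||W||_2^2

rng = np.random.default_rng(0)
m = n = 128
K = m * n // 10                                        # one of the compression-sweep budgets

# Scattered selection: K coefficient positions spread over the whole grid.
A_scatter = np.zeros((m, n))
A_scatter.flat[rng.choice(m * n, size=K, replace=False)] = rng.standard_normal(K)

# Block selection: roughly the same budget packed into a low-frequency k x k square.
k = int(np.sqrt(K))
A_block = np.zeros((m, n))
A_block[:k, :k] = rng.standard_normal((k, k))

for name, A in [("scattered", A_scatter), ("block", A_block)]:
    W = idctn(A, type=2, norm="ortho")                 # inverse 2D DCT reconstruction
    print(name, np.linalg.matrix_rank(W), round(stable_rank(W), 1))
# The block reconstruction cannot exceed rank k, while the scattered one is
# typically full rank: this is the "rank flexibility" the paper points to.
```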
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes replacing the linear layers in a transformer with a parameterization that stores only K out of mn two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix via inverse DCT at every forward pass. On a 4-layer 128-dim model trained from scratch on character-level Shakespeare, the method reaches validation loss 1.604 at K = mn/2 (half the dense parameter count), compared with 1.580 for a standard dense baseline and 1.801 for a rank-48 LoRA at matched K. The authors identify rank flexibility as the key mechanism, supported by a random-orthonormal basis match at K = mn/2 and a compression sweep showing that only high-rank-capable subspaces avoid the LoRA loss line; they further note that the separable fast DCT admits a fused kernel that converts the parameter saving into a bandwidth saving.
Significance. If the empirical result and the rank-flexibility mechanism hold under scaling, the approach would supply a concrete alternative to low-rank adaptations such as LoRA, with the added practical benefit of on-chip reconstruction that never materializes the full weight matrix off-chip. The direct comparison at matched K and the orthonormal-basis controls already provide a clearer structural diagnosis than most existing parameter-efficient fine-tuning papers. The current evidence, however, is confined to a single tiny model and dataset, so the broader significance remains conditional on further validation.
major comments (2)
- [Experimental results (implicit in abstract and §4)] The experimental description provides no information on the number of independent runs, the hyperparameter search budget, random-seed handling, or the precise train/validation split of the Shakespeare corpus. Because the central claim rests on a loss gap of only +0.024 that is stated to lie inside terminal-epoch variation, these details are load-bearing for assessing whether the result is reproducible or statistically distinguishable from noise.
- [Discussion and conclusion] No scaling experiments are reported beyond the 4-layer 128-dim setting. The repeated IDCT reconstruction at every forward pass could alter gradient magnitudes or introduce numerical instability once widths or depths increase; the assertion that “standard optimizers suffice” therefore rests on an extrapolation whose validity is not tested within the manuscript.
minor comments (1)
- [Abstract] The abstract would be clearer if it stated the model depth and hidden dimension in the first sentence rather than only in the results clause.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. We address the major comments point by point below, and have updated the manuscript to incorporate additional experimental details and to qualify our claims appropriately.
Point-by-point responses
- Referee: [Experimental results (implicit in abstract and §4)] The experimental description provides no information on the number of independent runs, the hyperparameter search budget, random-seed handling, or the precise train/validation split of the Shakespeare corpus. Because the central claim rests on a loss gap of only +0.024 that is stated to lie inside terminal-epoch variation, these details are load-bearing for assessing whether the result is reproducible or statistically distinguishable from noise.
  Authors: We agree that these details are critical for assessing reproducibility and the statistical significance of the small loss gap. In the revised manuscript, we have expanded the experimental section to include the number of independent runs conducted, the random seeds employed, the hyperparameter search budget and methodology, and the precise train/validation split used for the Shakespeare corpus. Additionally, we report the terminal-epoch variation observed in the dense baseline runs to support our statement that the +0.024 difference lies within this variation.
  revision: yes
- Referee: [Discussion and conclusion] No scaling experiments are reported beyond the 4-layer 128-dim setting. The repeated IDCT reconstruction at every forward pass could alter gradient magnitudes or introduce numerical instability once widths or depths increase; the assertion that “standard optimizers suffice” therefore rests on an extrapolation whose validity is not tested within the manuscript.
  Authors: We acknowledge the lack of scaling experiments in the current work. Our study is scoped to a small model to isolate and diagnose the rank-flexibility mechanism through controlled comparisons, including the orthonormal basis controls. We have revised the Discussion and Conclusion to explicitly discuss the potential for changes in gradient behavior or numerical issues with repeated IDCT at larger scales, and we qualify the statement about standard optimizers as applying to the reported experimental regime. We agree that validating these aspects at scale is an important next step but lies beyond the present manuscript.
  revision: partial
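On the gradient-magnitude worry raised in this exchange, one property can be checked without scaling up: because the 2D DCT is orthonormal, the gradient with respect to the full coefficient matrix is exactly the 2D DCT of the gradient with respect to the dense weight, so its Frobenius norm is unchanged, and restricting to K coefficients can only shrink it. The snippet below is an editorial verification of that identity in double precision at small shapes, a bound on gradient norms rather than a proof of large-scale optimization stability.

```python
import torch

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II matrix (double precision for a tight comparison)."""
    k = torch.arange(n, dtype=torch.float64).unsqueeze(1)
    i = torch.arange(n, dtype=torch.float64).unsqueeze(0)
    C = (2.0 / n) ** 0.5 * torch.cos(torch.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= 2.0 ** 0.5
    return C

m, n = 64, 96
Cm, Cn = dct_matrix(m), dct_matrix(n)
x = torch.randn(8, n, dtype=torch.float64)

# Loss through the coefficient parameterization W = Cm^T A Cn.
A = torch.randn(m, n, dtype=torch.float64, requires_grad=True)
(x @ (Cm.t() @ A @ Cn).t()).pow(2).sum().backward()

# The same loss through a dense weight equal to the reconstruction.
W = (Cm.t() @ A @ Cn).detach().clone().requires_grad_(True)
(x @ W.t()).pow(2).sum().backward()

print(torch.allclose(A.grad, Cm @ W.grad @ Cn.t()))   # chain rule through the inverse DCT
print(A.grad.norm().item(), W.grad.norm().item())     # equal Frobenius norms
```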
Circularity Check
No circularity: empirical claims rest on direct training runs and independent sweeps
Full rationale
The paper introduces a DCT-coefficient parameterization for linear layers and supports its claims exclusively through concrete training experiments (4-layer 128-dim transformer on character Shakespeare) and auxiliary sweeps (random orthonormal bases, K=mn/10 and mn/20 compression). Validation losses are measured outcomes, not quantities fitted and then relabeled as predictions. No equations appear that define a quantity in terms of itself, no self-citations are used to justify core premises, and the rank-flexibility mechanism is inferred from observable loss behavior across bases rather than from any tautological reduction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 2D DCT provides a basis in which a subset of coefficients can reconstruct a full matrix via inverse transform with acceptable fidelity for neural network weights (a numerical illustration follows below).
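The ledger's one assumption is easy to ground numerically. The sketch below is editorial, not from the paper, and transforms a stand-in matrix rather than training coefficients directly: with all mn coefficients the inverse 2D DCT reconstructs the matrix exactly, and when coefficients are dropped the reconstruction error equals the Frobenius norm of the dropped coefficients (Parseval), so "acceptable fidelity" reduces to how much coefficient energy the kept set carries.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
W = 0.02 * rng.standard_normal((128, 128))               # stand-in weight matrix

A = dctn(W, type=2, norm="ortho")                         # forward 2D DCT
print(np.allclose(idctn(A, type=2, norm="ortho"), W))     # exact with all mn coefficients

keep = np.abs(A) >= np.quantile(np.abs(A), 0.5)           # keep the larger half (K = mn/2)
W_hat = idctn(np.where(keep, A, 0.0), type=2, norm="ortho")
print(np.isclose(np.linalg.norm(W - W_hat),               # Parseval: reconstruction error
                 np.linalg.norm(A[~keep])))               # equals dropped-coefficient energy
```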
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear: "We replace each linear layer with one that stores K out of mn two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix through an inverse DCT at every forward pass; the K coefficients are the trainable parameters."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear: "The mechanism is rank flexibility. A random orthonormal basis matches the DCT within noise at K = mn/2, and a compression sweep through K = mn/10 and K = mn/20 shows that subspaces that can host high-rank matrices keep the loss low."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear: "A 4-layer, 128-dim transformer trained from scratch on character-level Shakespeare reaches validation loss 1.604 at K = mn/2, against 1.580 for a standard dense baseline."
Reference graph
Works this paper leans on
- [1] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020.
- [2] Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran Rodriguez, Ehsan Abbasnejad, Wray Buntine, and Anton van den Hengel. RandLoRA: Full-rank parameter-efficient fine-tuning of large models. In Proc. Int. Conf. Learning Representations (ICLR), 2025. https://arxiv.org/abs/2502.00987
- [3] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [4] Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks. 2015. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- [5] Reece Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs full fine-tuning: An illusion of equivalence. arXiv preprint arXiv:2410.21228, 2024.