Training Transformers in Cosine Coefficient Space
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-10 20:03 UTC · model grok-4.3
The pith
Transformers reach near-dense performance when trained only on half the 2D DCT coefficients of their weight matrices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We replace each linear layer with one that stores K out of mn two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix through an inverse DCT at every forward pass; the K coefficients are the trainable parameters. A 4-layer, 128-dim transformer trained from scratch on character-level Shakespeare reaches validation loss 1.604 at K = mn/2, against 1.580 for a standard dense baseline. A rank-48 LoRA factorization at the same trainable parameter count reaches only 1.801. The structural advantage of sparse-coefficient over low-rank parameterizations at matched K is qualitative, stemming from rank flexibility in the coefficient subspace.
What carries the argument
Trainable subset of 2D DCT coefficients per weight matrix, with inverse DCT reconstruction to form the full matrix for forward passes.
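As a concrete reading of this parameterization, here is a minimal PyTorch sketch of such a layer. It is not the authors' code: the class name DCTLinear, the random choice of which K coefficients to keep, and the initialization scale are illustrative assumptions; only the structure (K trainable coefficients, inverse 2D DCT reconstruction in every forward pass) follows the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix C of size n x n, so C @ C.T = I."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index
    i = torch.arange(n, dtype=torch.float32).unsqueeze(0)   # sample index
    C = (2.0 / n) ** 0.5 * torch.cos(torch.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= 2.0 ** 0.5                                      # rescale DC row for orthonormality
    return C


class DCTLinear(nn.Module):
    """Linear layer whose only trainable weight-matrix parameters are K 2D DCT coefficients."""

    def __init__(self, in_features: int, out_features: int, K: int):
        super().__init__()
        m, n = out_features, in_features
        self.m, self.n = m, n
        # Fixed transform matrices and coefficient positions (not trained).
        self.register_buffer("Cm", dct_matrix(m))
        self.register_buffer("Cn", dct_matrix(n))
        self.register_buffer("idx", torch.randperm(m * n)[:K])  # which K positions to keep
        # The K kept coefficients are the trainable parameters.
        self.coeff = nn.Parameter(0.02 * torch.randn(K))
        self.bias = nn.Parameter(torch.zeros(m))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scatter the K coefficients into an otherwise-zero m x n grid.
        flat = torch.zeros(self.m * self.n, dtype=x.dtype, device=x.device)
        A = flat.scatter(0, self.idx, self.coeff).view(self.m, self.n)
        # Inverse 2D DCT reconstruction, rebuilt at every forward pass: W = Cm^T A Cn.
        W = self.Cm.t() @ A @ self.Cn
        return F.linear(x, W, self.bias)
```

With K = m*n // 2 this sketch trains half as many weight-matrix parameters as a dense nn.Linear, and the reconstruction costs two small matrix multiplies, which is the separability the bandwidth argument below relies on.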
If this is right
- The performance gap to dense training stays small at half the parameters on this task.
- Low-rank methods like LoRA underperform the coefficient approach at equivalent parameter budgets.
- Orthonormal bases that preserve high-rank capacity maintain low loss, unlike those that force low-rank blocks.
- The fast separable DCT allows fused on-chip reconstruction, converting parameter savings into bandwidth savings (see the sketch after this list).
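The separability claim in the last bullet is easy to check directly. The snippet below is an editorial illustration, not taken from the paper, and assumes scipy.fft as the transform implementation: it verifies that the 2D inverse DCT factorizes into 1D inverse DCTs applied along rows and then columns, which is what makes a two-pass, on-chip reconstruction kernel plausible.

```python
import numpy as np
from scipy.fft import idct, idctn

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 512))                      # coefficient matrix (m x n)

full = idctn(A, type=2, norm="ortho")                    # direct 2D inverse DCT
separable = idct(idct(A, type=2, norm="ortho", axis=1),  # 1D inverse DCT over rows ...
                 type=2, norm="ortho", axis=0)           # ... then over columns
print(np.allclose(full, separable))                      # True: the transform factorizes
```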
Where Pith is reading between the lines
- If the rank flexibility benefit holds at scale, this could enable more parameter-efficient training of large language models without low-rank constraints.
- The approach might integrate with existing optimizers without modification, but testing on longer sequences or different architectures would be needed.
- Other frequency-based transforms could be explored if they offer similar fast reconstruction and rank properties.
Load-bearing premise
The near-parity performance and rank advantage seen in this small-scale experiment will generalize to larger models and tasks without causing optimization instabilities from the repeated reconstructions.
What would settle it
Scaling the model to more layers or training on a larger corpus and checking if the loss difference to the dense baseline exceeds the terminal variation observed in the small run.
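Operationally, that check could be as simple as the sketch below: rerun both configurations over several seeds at the larger scale and ask whether the mean loss gap exceeds the dense baseline's terminal-epoch spread. The function name and inputs are placeholders, not artifacts from the paper.

```python
import numpy as np

def gap_exceeds_terminal_variation(dense_losses, dct_losses):
    """Each argument: terminal-epoch validation losses across independent seeds.

    The spread could equally be measured across the final few epochs of a
    single run, as the small-scale experiment appears to do.
    """
    gap = float(np.mean(dct_losses) - np.mean(dense_losses))
    spread = float(np.std(dense_losses, ddof=1))   # terminal variation of the dense baseline
    return gap, spread, gap > spread
```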
Original abstract
Linear layers hold most of a transformer's parameters. We replace each linear layer with one that stores $K$ out of $mn$ two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix through an inverse DCT at every forward pass; the $K$ coefficients are the trainable parameters. A 4-layer, 128-dim transformer trained from scratch on character-level Shakespeare reaches validation loss $1.604$ at $K = mn/2$, against $1.580$ for a standard dense baseline -- a gap of $+0.024$ at half the trainable parameter count, within the terminal-epoch variation of the dense run. A rank-48 LoRA factorization at the same trainable parameter count reaches only $1.801$ ($+0.221$). The structural advantage of sparse-coefficient over low-rank parameterizations at matched $K$ is qualitative. We identify rank flexibility as the mechanism. A random orthonormal basis matches the DCT within noise at $K = mn/2$, and a compression sweep through $K = mn/10$ and $K = mn/20$ shows that subspaces that can host high-rank matrices keep the loss low, while subspaces that flatten into a low-rank block (zigzag-selection variants) converge onto the observed stable rank \emph{and} the loss line of the rank-48 LoRA reference in lock-step. Among these orthonormal bases, the DCT is preferred because its separable fast transform admits a fused reconstruction kernel: the materialized weight matrix never leaves on-chip memory, so the parameter saving translates into a bandwidth saving as well.
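To make the zigzag-versus-scattered contrast in the abstract tangible, here is a small editorial construction, not the authors' experiment: with roughly the same coefficient budget, a scattered selection reconstructs a typically full-rank matrix, while packing the coefficients into a k x k low-frequency block caps the rank of the reconstruction at k. The stable rank ||W||_F^2 / ||W||_2^2 printed alongside is the quantity the abstract tracks.

```python
import numpy as np
from scipy.fft import idctn

def stable_rank(W):
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))        # ||W||_F^2 / ||W||_2^2

rng = np.random.default_rng(0)
m = n = 128
K = m * n // 10                                        # one of the compression-sweep budgets

# Scattered selection: K coefficient positions spread over the whole grid.
A_scatter = np.zeros((m, n))
A_scatter.flat[rng.choice(m * n, size=K, replace=False)] = rng.standard_normal(K)

# Block selection: roughly the same budget packed into a low-frequency k x k square.
k = int(np.sqrt(K))
A_block = np.zeros((m, n))
A_block[:k, :k] = rng.standard_normal((k, k))

for name, A in [("scattered", A_scatter), ("block", A_block)]:
    W = idctn(A, type=2, norm="ortho")                 # inverse 2D DCT reconstruction
    print(name, np.linalg.matrix_rank(W), round(stable_rank(W), 1))
# The block reconstruction cannot exceed rank k, while the scattered one is
# typically full rank: this is the "rank flexibility" the paper points to.
```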
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes replacing the linear layers in a transformer with a parameterization that stores only K out of mn two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix via inverse DCT at every forward pass. On a 4-layer 128-dim model trained from scratch on character-level Shakespeare, the method reaches validation loss 1.604 at K = mn/2 (half the dense parameter count), compared with 1.580 for a standard dense baseline and 1.801 for a rank-48 LoRA at matched K. The authors identify rank flexibility as the key mechanism, supported by a random-orthonormal basis match at K = mn/2 and a compression sweep showing that only high-rank-capable subspaces avoid the LoRA loss line; they further note that the separable fast DCT admits a fused kernel that converts the parameter saving into a bandwidth saving.
Significance. If the empirical result and the rank-flexibility mechanism hold under scaling, the approach would supply a concrete alternative to low-rank adaptations such as LoRA, with the added practical benefit of on-chip reconstruction that never materializes the full weight matrix off-chip. The direct comparison at matched K and the orthonormal-basis controls already provide a clearer structural diagnosis than most existing parameter-efficient fine-tuning papers. The current evidence, however, is confined to a single tiny model and dataset, so the broader significance remains conditional on further validation.
major comments (2)
- [Experimental results (implicit in abstract and §4)] The experimental description provides no information on the number of independent runs, the hyperparameter search budget, random-seed handling, or the precise train/validation split of the Shakespeare corpus. Because the central claim rests on a loss gap of only +0.024 that is stated to lie inside terminal-epoch variation, these details are load-bearing for assessing whether the result is reproducible or statistically distinguishable from noise.
- [Discussion and conclusion] No scaling experiments are reported beyond the 4-layer 128-dim setting. The repeated IDCT reconstruction at every forward pass could alter gradient magnitudes or introduce numerical instability once widths or depths increase; the assertion that “standard optimizers suffice” therefore rests on an extrapolation whose validity is not tested within the manuscript.
minor comments (1)
- [Abstract] The abstract would be clearer if it stated the model depth and hidden dimension in the first sentence rather than only in the results clause.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. We address the major comments point by point below, and have updated the manuscript to incorporate additional experimental details and to qualify our claims appropriately.
Point-by-point responses
- Referee: [Experimental results (implicit in abstract and §4)] The experimental description provides no information on the number of independent runs, the hyperparameter search budget, random-seed handling, or the precise train/validation split of the Shakespeare corpus. Because the central claim rests on a loss gap of only +0.024 that is stated to lie inside terminal-epoch variation, these details are load-bearing for assessing whether the result is reproducible or statistically distinguishable from noise.
  Authors: We agree that these details are critical for assessing reproducibility and the statistical significance of the small loss gap. In the revised manuscript, we have expanded the experimental section to include the number of independent runs conducted, the random seeds employed, the hyperparameter search budget and methodology, and the precise train/validation split used for the Shakespeare corpus. Additionally, we report the terminal-epoch variation observed in the dense baseline runs to support our statement that the +0.024 difference lies within this variation.
  revision: yes
- Referee: [Discussion and conclusion] No scaling experiments are reported beyond the 4-layer 128-dim setting. The repeated IDCT reconstruction at every forward pass could alter gradient magnitudes or introduce numerical instability once widths or depths increase; the assertion that “standard optimizers suffice” therefore rests on an extrapolation whose validity is not tested within the manuscript.
  Authors: We acknowledge the lack of scaling experiments in the current work. Our study is scoped to a small model to isolate and diagnose the rank-flexibility mechanism through controlled comparisons, including the orthonormal basis controls. We have revised the Discussion and Conclusion to explicitly discuss the potential for changes in gradient behavior or numerical issues with repeated IDCT at larger scales, and we qualify the statement about standard optimizers as applying to the reported experimental regime. We agree that validating these aspects at scale is an important next step but lies beyond the present manuscript.
  revision: partial
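On the gradient-magnitude worry raised in this exchange, one property can be checked without scaling up: because the 2D DCT is orthonormal, the gradient with respect to the full coefficient matrix is exactly the 2D DCT of the gradient with respect to the dense weight, so its Frobenius norm is unchanged, and restricting to K coefficients can only shrink it. The snippet below is an editorial verification of that identity in double precision at small shapes, a bound on gradient norms rather than a proof of large-scale optimization stability.

```python
import torch

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II matrix (double precision for a tight comparison)."""
    k = torch.arange(n, dtype=torch.float64).unsqueeze(1)
    i = torch.arange(n, dtype=torch.float64).unsqueeze(0)
    C = (2.0 / n) ** 0.5 * torch.cos(torch.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= 2.0 ** 0.5
    return C

m, n = 64, 96
Cm, Cn = dct_matrix(m), dct_matrix(n)
x = torch.randn(8, n, dtype=torch.float64)

# Loss through the coefficient parameterization W = Cm^T A Cn.
A = torch.randn(m, n, dtype=torch.float64, requires_grad=True)
(x @ (Cm.t() @ A @ Cn).t()).pow(2).sum().backward()

# The same loss through a dense weight equal to the reconstruction.
W = (Cm.t() @ A @ Cn).detach().clone().requires_grad_(True)
(x @ W.t()).pow(2).sum().backward()

print(torch.allclose(A.grad, Cm @ W.grad @ Cn.t()))   # chain rule through the inverse DCT
print(A.grad.norm().item(), W.grad.norm().item())     # equal Frobenius norms
```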
Circularity Check
No circularity: empirical claims rest on direct training runs and independent sweeps
Full rationale
The paper introduces a DCT-coefficient parameterization for linear layers and supports its claims exclusively through concrete training experiments (4-layer 128-dim transformer on character Shakespeare) and auxiliary sweeps (random orthonormal bases, K=mn/10 and mn/20 compression). Validation losses are measured outcomes, not quantities fitted and then relabeled as predictions. No equations appear that define a quantity in terms of itself, no self-citations are used to justify core premises, and the rank-flexibility mechanism is inferred from observable loss behavior across bases rather than from any tautological reduction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 2D DCT provides a basis in which a subset of coefficients can reconstruct a full matrix via inverse transform with acceptable fidelity for neural network weights (a numerical illustration follows below).
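The ledger's one assumption is easy to ground numerically. The sketch below is editorial, not from the paper, and transforms a stand-in matrix rather than training coefficients directly: with all mn coefficients the inverse 2D DCT reconstructs the matrix exactly, and when coefficients are dropped the reconstruction error equals the Frobenius norm of the dropped coefficients (Parseval), so "acceptable fidelity" reduces to how much coefficient energy the kept set carries.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
W = 0.02 * rng.standard_normal((128, 128))               # stand-in weight matrix

A = dctn(W, type=2, norm="ortho")                         # forward 2D DCT
print(np.allclose(idctn(A, type=2, norm="ortho"), W))     # exact with all mn coefficients

keep = np.abs(A) >= np.quantile(np.abs(A), 0.5)           # keep the larger half (K = mn/2)
W_hat = idctn(np.where(keep, A, 0.0), type=2, norm="ortho")
print(np.isclose(np.linalg.norm(W - W_hat),               # Parseval: reconstruction error
                 np.linalg.norm(A[~keep])))               # equals dropped-coefficient energy
```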
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear: "We replace each linear layer with one that stores K out of mn two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix through an inverse DCT at every forward pass; the K coefficients are the trainable parameters."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear: "The mechanism is rank flexibility. A random orthonormal basis matches the DCT within noise at K = mn/2, and a compression sweep through K = mn/10 and K = mn/20 shows that subspaces that can host high-rank matrices keep the loss low."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear: "A 4-layer, 128-dim transformer trained from scratch on character-level Shakespeare reaches validation loss 1.604 at K = mn/2, against 1.580 for a standard dense baseline."
Reference graph
Works this paper leans on
- [1] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020.
- [2] Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran Rodriguez, Ehsan Abbasnejad, Wray Buntine, and Anton van den Hengel. RandLoRA: Full-rank parameter-efficient fine-tuning of large models. In Proc. Int. Conf. Learning Representations (ICLR), 2025. https://arxiv.org/abs/2502.00987
- [3] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [4] Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks. 2015. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- [5] Reece Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs full fine-tuning: An illusion of equivalence. arXiv preprint arXiv:2410.21228, 2024.