pith. machine review for the scientific record.

arxiv: 2604.09709 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks

Wang Zixian

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision transformers · feed-forward networks · orthogonal complements · quadratic features · image classification · CIFAR-100 · TinyImageNet · bilinear models

The pith

Projecting low-rank quadratic auxiliary branches onto the orthogonal complement of the main representation in vision transformer feed-forward networks improves classification accuracy on CIFAR-100 and TinyImageNet.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Orthogonal Quadratic Complements to ensure that quadratic features added to vision transformer feed-forward layers carry only information absent from the main branch. It builds a low-rank quadratic auxiliary path and projects it orthogonally before combining the signals. Under parameter-matched conditions on CIFAR-100 the method lifts an AFBO baseline from 64.25 to 65.59 percent accuracy, with a low-rank variant preserving most of the gain at lower cost. On TinyImageNet a gated dynamic version reaches 51.88 percent versus the baseline 50.45 percent. Analyses confirm near-zero post-projection overlap together with improved class separation and representation geometry, and the gains hold across gated and ungated variants.

Core claim

Orthogonal Quadratic Complements construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection, so that the auxiliary features contribute only information not already captured by the dominant hidden representation.
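The claim is easy to state concretely. The NumPy sketch below mirrors the construction under illustrative assumptions: the branch parameterization (here q = W_up((Ux) ⊙ (Vx))) and the single-vector projection are simplifications for one sample, not the paper's actual layer code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8          # hidden width and auxiliary rank (illustrative sizes)
x = rng.normal(size=d)

# Main branch: a stand-in for the host FFN's dominant hidden representation.
W_main = rng.normal(size=(d, d)) / np.sqrt(d)
h = np.maximum(W_main @ x, 0.0)

# Low-rank quadratic auxiliary branch: the elementwise product of two
# rank-r projections captures second-order interactions at O(r*d) cost.
U = rng.normal(size=(r, d)) / np.sqrt(d)
V = rng.normal(size=(r, d)) / np.sqrt(d)
W_up = rng.normal(size=(d, r)) / np.sqrt(r)
q = W_up @ ((U @ x) * (V @ x))

# Project q onto the orthogonal complement of the main representation,
# so the injected signal carries no component already present in h.
h_unit = h / np.linalg.norm(h)
q_perp = q - (q @ h_unit) * h_unit

out = h + q_perp
print(abs(q_perp @ h))  # residual overlap, zero up to floating point
```

The projection is the whole mechanism: whatever the quadratic branch computes, only its component outside span(h) reaches the output.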

What carries the argument

The orthogonal projection of the low-rank quadratic auxiliary branch onto the complement of the main hidden representation, which enforces non-redundant information contribution.

Load-bearing premise

That forcing the quadratic auxiliary features to lie in the orthogonal complement of the main representation will guarantee they add genuinely useful new information that improves class separation and geometry without unintended side effects.

What would settle it

Removing the orthogonal projection step and checking whether the reported accuracy gains on CIFAR-100 and TinyImageNet disappear while the measured overlap between auxiliary and main representations rises above near-zero.
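The measurement half of that check is a one-liner. A minimal sketch, assuming the overlap metric is absolute cosine similarity between auxiliary and main features, as in the paper's mechanism analysis:

```python
import numpy as np

def abs_cosine(a, b):
    """Absolute cosine overlap between two feature vectors."""
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
d = 128
h = rng.normal(size=d)            # main-branch representation (stand-in)
q = rng.normal(size=d) + 0.5 * h  # auxiliary feature, partly redundant with h

# Without projection, the auxiliary branch visibly overlaps the main branch...
before = abs_cosine(q, h)

# ...and after projecting onto h's orthogonal complement the overlap vanishes.
h_unit = h / np.linalg.norm(h)
q_perp = q - (q @ h_unit) * h_unit
after = abs_cosine(q_perp, h)

print(before, after)
```

The causal question is then whether the accuracy gain tracks `after` staying near zero, or survives when the projection is removed and `before`-level overlap returns.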

Figures

Figures reproduced from arXiv: 2604.09709 by Wang Zixian.

Figure 1
Figure 1: Architecture evolution. Left: the host FFN (e.g., a standard MLP or AFBO) produces a dominant main branch. Middle: OQC adds a low-rank quadratic auxiliary feature and explicitly projects it onto the orthogonal complement of the main branch. Right: the gated OQC family keeps the orthogonal complement but statically or dynamically modulates its injection into the host branch.
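The gated family in the caption above can be sketched as a scalar gate applied to the already-orthogonalized complement. The sigmoid-of-linear-readout form for the dynamic gate is an assumption; the paper's exact gate is not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d = 32
x = rng.normal(size=d)        # layer input
h = rng.normal(size=d)        # main branch (stand-in)
q_perp = rng.normal(size=d)
q_perp -= (q_perp @ h) / (h @ h) * h   # keep it in h's orthogonal complement

# Static gating: one learned scalar scales the complement for all inputs.
g_static = 0.7
out_static = h + g_static * q_perp

# Dynamic gating: an input-dependent scalar, here a sigmoid of a
# learned linear readout (one plausible realization).
w_gate = rng.normal(size=d) / np.sqrt(d)
g_dyn = sigmoid(w_gate @ x)
out_dyn = h + g_dyn * q_perp

# Gating rescales the complement but never re-introduces overlap with h.
print(abs((out_dyn - h) @ h))
```

Note that the gate commutes with the orthogonality property: scaling a vector in the complement keeps it in the complement, so the design choice only controls how much of the novel signal is injected, never what subspace it lives in.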
Figure 3
Figure 3: Accuracy–efficiency trade-off. (a) Rank sweep: full OQC improves monotonically with rank, while OQC-LR peaks at r=56 (starred). (b) Pareto frontier: OQC-LR at r=56 delivers 65.52% at 7,485 img/s, substantially faster than full OQC (5,655 img/s) while nearly matching its accuracy.
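The speed gap between full OQC and OQC-LR is consistent with simple parameter arithmetic. The count below assumes an auxiliary branch of the form q = W_up((Ux) ⊙ (Vx)), a hidden width of 384, and rank equal to width for the full variant; all three are illustrative assumptions, not the paper's reported sizes.

```python
def aux_params(d, r):
    # U and V are (r, d) each; the up-projection W_up is (d, r).
    return 2 * r * d + d * r

d = 384            # a typical small-ViT width (assumption)
full = aux_params(d, d)    # "full" rank taken as r = d
lr = aux_params(d, 56)     # OQC-LR's starred setting from the sweep above

print(full, lr, full / lr)
```

Since the count is linear in r, dropping from r = 384 to r = 56 cuts auxiliary parameters (and the matching compute) by 384/56 ≈ 6.9x, which is the right order of magnitude for the throughput gap on the Pareto frontier.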
Figure 2
Figure 2: Mechanism visualization. (a) Auxiliary–main absolute cosine overlap before and after complement projection; orthogonalization reduces it to machine-precision zero. (b) Effective rank vs. separation score: full OQC achieves the richest geometry while OQC-dynamic achieves the best class separation. (c) Complement usage: OQC-dynamic lowers average gate activation and introduces genuine input-dependent variation.
Original abstract

Recent bilinear feed-forward replacements for vision transformers can substantially improve accuracy, but they often conflate two effects: stronger second-order interactions and increased redundancy relative to the main branch. We study a complementary design principle in which auxiliary quadratic features contribute only information not already captured by the dominant hidden representation. To this end, we propose Orthogonal Quadratic Complements (OQC), which construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection. We further study an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static and OQC-dynamic). Under a parameter-matched Deep-ViT and CIFAR-100 protocol with a fixed penultimate residual readout, full OQC improves an AFBO baseline from 64.25 +/- 0.22 to 65.59 +/- 0.22, while OQC-LR reaches 65.52 +/- 0.25 with a substantially better speed-accuracy tradeoff. On TinyImageNet, the gated extension OQC-dynamic achieves 51.88 +/- 0.32, improving the baseline (50.45 +/- 0.21) by 1.43 points and outperforming all ungated variants. Mechanism analyses show near-zero post-projection auxiliary-main overlap together with improved representation geometry and class separation. The full family, including both ungated and gated variants, generalizes consistently across both datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Orthogonal Quadratic Complements (OQC) for vision transformer feed-forward networks. It constructs a low-rank quadratic auxiliary branch that is explicitly projected onto the orthogonal complement of the main hidden representation before injection, with the goal of adding only novel information and avoiding redundancy. Variants include an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static, OQC-dynamic). Under a parameter-matched Deep-ViT protocol, full OQC improves an AFBO baseline from 64.25 ± 0.22 to 65.59 ± 0.22 on CIFAR-100; OQC-dynamic reaches 51.88 ± 0.32 (vs. baseline 50.45 ± 0.21) on TinyImageNet. Mechanism analyses report near-zero post-projection overlap together with improved representation geometry and class separation. The family generalizes across both datasets.

Significance. If the orthogonality mechanism is causally responsible for the gains, the work supplies a clean design principle for reducing redundancy when augmenting ViT FFNs with second-order terms. The modest but consistent accuracy deltas, the favorable speed-accuracy tradeoff of OQC-LR, and the explicit mechanism measurements (overlap, geometry) are strengths. The approach could influence future bilinear or quadratic FFN replacements in vision transformers.

major comments (2)
  1. [Experiments] Experiments section: the reported accuracy improvements (1.34 points on CIFAR-100, 1.43 on TinyImageNet) are measured only against the AFBO baseline and other OQC variants. No ablation compares the full OQC (with explicit orthogonal projection) against an otherwise identical quadratic auxiliary branch added without the projection step. This control is required to isolate whether the orthogonality itself, rather than added capacity or quadratic interactions, produces the observed gains in class separation and geometry.
  2. [Mechanism analyses] Mechanism analyses: the paper shows near-zero post-projection auxiliary-main overlap and improved geometry metrics, but these are correlational. Without a direct comparison of representation quality (or downstream accuracy) between the projected and non-projected quadratic branches, the claim that the orthogonal complement step is what “guarantees it contributes only genuinely new information” remains unverified.
minor comments (2)
  1. [Abstract] The abstract and experimental protocol description should explicitly state the total parameter count for the AFBO baseline and each OQC variant to confirm the “parameter-matched” claim.
  2. [Method] Notation for the projection operator and the low-rank factorization (OQC-LR) could be introduced earlier and used consistently in the method and analysis sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We agree that the suggested ablation is necessary to more rigorously isolate the causal contribution of the orthogonal projection and will add the requested experiments in the revised manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the reported accuracy improvements (1.34 points on CIFAR-100, 1.43 on TinyImageNet) are measured only against the AFBO baseline and other OQC variants. No ablation compares the full OQC (with explicit orthogonal projection) against an otherwise identical quadratic auxiliary branch added without the projection step. This control is required to isolate whether the orthogonality itself, rather than added capacity or quadratic interactions, produces the observed gains in class separation and geometry.

    Authors: We agree that a direct ablation isolating the orthogonal projection step is required to substantiate the claim that orthogonality, rather than quadratic capacity alone, drives the gains. In the revised manuscript we will add this control experiment: an otherwise identical low-rank quadratic auxiliary branch injected without the orthogonal complement projection, trained under the exact same parameter-matched Deep-ViT protocol on both CIFAR-100 and TinyImageNet. We will report accuracy, post-injection overlap, and the same geometry/class-separation metrics for both variants so that the incremental benefit of the projection can be quantified directly. revision: yes

  2. Referee: [Mechanism analyses] Mechanism analyses: the paper shows near-zero post-projection auxiliary-main overlap and improved geometry metrics, but these are correlational. Without a direct comparison of representation quality (or downstream accuracy) between the projected and non-projected quadratic branches, the claim that the orthogonal complement step is what “guarantees it contributes only genuinely new information” remains unverified.

    Authors: We acknowledge that the current mechanism results are correlational and that a head-to-head comparison is needed to verify causality. The new ablation described above will directly address this by providing representation-quality and accuracy metrics for the projected versus non-projected branches. These results will be added to the mechanism-analyses section, allowing us to test whether the orthogonal projection is what produces the near-zero overlap and improved geometry. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured against explicit baselines

full rationale

The paper defines OQC by constructing a low-rank quadratic branch and explicitly projecting it onto the orthogonal complement of the main representation before injection. Reported results are direct accuracy measurements (e.g., +1.34 points on CIFAR-100, +1.43 on TinyImageNet) under parameter-matched protocols against named baselines such as AFBO. Mechanism analyses confirm near-zero post-projection overlap, but these are verification of the enforced construction rather than load-bearing derivations. No equations reduce the accuracy deltas to fitted parameters renamed as predictions, no self-citations justify uniqueness theorems, and no ansatz is smuggled via prior work. The derivation chain is self-contained as an architectural proposal validated externally.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that feature-space orthogonality equates to non-redundant information and that low-rank approximations preserve the essential quadratic interactions.

free parameters (1)
  • low-rank dimension
    The rank chosen for the efficient OQC-LR realization is a tunable hyperparameter that controls the auxiliary branch capacity.
axioms (1)
  • domain assumption Orthogonal projection removes all overlapping information between auxiliary quadratic features and the main branch representation
    Invoked when the paper states that the projection ensures the auxiliary branch contributes only information not already captured.
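This axiom earns its "domain assumption" label: orthogonal projection removes linear overlap, which is strictly weaker than removing all shared information. A toy check (not from the paper) shows two signals with near-zero linear overlap that are nonetheless fully dependent:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=100_000)
b = a**2 - 1.0   # fully determined by a, yet linearly uncorrelated with it

# Linear (Pearson) correlation is near zero despite total dependence,
# so zero overlap in the linear sense does not imply novel information.
corr = np.corrcoef(a, b)[0, 1]
print(corr)
```

This is why the mechanism analyses (overlap, geometry, separation) and the referee's requested no-projection ablation matter: orthogonality guarantees the auxiliary signal is linearly non-redundant, and the rest of the claim has to be established empirically.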

pith-pipeline@v0.9.0 · 5548 in / 1301 out tokens · 60734 ms · 2026-05-10T18:57:33.679401+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

6 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

  2. [2]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR), 2021.

  3. [3]

    Training Data-Efficient Image Transformers & Distillation Through Attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training Data-Efficient Image Transformers & Distillation Through Attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 10347–10357, 2021.

  4. [4]

    Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021.

  5. [5]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202, 2020.

  6. [6]

    Asymmetric Factorized Bilinear Operation for Vision Transformer

    Junjie Wu, Qilong Wang, Jiangtao Xie, Pengfei Zhu, and Qinghua Hu. Asymmetric Factorized Bilinear Operation for Vision Transformer. In International Conference on Learning Representations (ICLR), 2025.