pith. machine review for the scientific record.

arxiv: 2604.09709 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks

Wang Zixian

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision transformers · feed-forward networks · orthogonal complements · quadratic features · image classification · CIFAR-100 · TinyImageNet · bilinear models

The pith

Projecting low-rank quadratic auxiliary branches onto the orthogonal complement of the main representation in vision transformer feed-forward networks improves classification accuracy on CIFAR-100 and TinyImageNet.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Orthogonal Quadratic Complements to ensure that quadratic features added to vision transformer feed-forward layers carry only information absent from the main branch. It builds a low-rank quadratic auxiliary path and projects it orthogonally before combining the signals. Under parameter-matched conditions on CIFAR-100 the method lifts an AFBO baseline from 64.25 to 65.59 percent accuracy, with a low-rank variant preserving most of the gain at lower cost. On TinyImageNet a gated dynamic version reaches 51.88 percent versus the baseline 50.45 percent. Analyses confirm near-zero post-projection overlap together with improved class separation and representation geometry, and the gains hold across gated and ungated variants.

Core claim

Orthogonal Quadratic Complements construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection, so that the auxiliary features contribute only information not already captured by the dominant hidden representation.
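The claim is easy to state concretely. The NumPy sketch below mirrors the construction under illustrative assumptions: the branch parameterization (here q = W_up((Ux) ⊙ (Vx))) and the single-vector projection are simplifications for one sample, not the paper's actual layer code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8          # hidden width and auxiliary rank (illustrative sizes)
x = rng.normal(size=d)

# Main branch: a stand-in for the host FFN's dominant hidden representation.
W_main = rng.normal(size=(d, d)) / np.sqrt(d)
h = np.maximum(W_main @ x, 0.0)

# Low-rank quadratic auxiliary branch: the elementwise product of two
# rank-r projections captures second-order interactions at O(r*d) cost.
U = rng.normal(size=(r, d)) / np.sqrt(d)
V = rng.normal(size=(r, d)) / np.sqrt(d)
W_up = rng.normal(size=(d, r)) / np.sqrt(r)
q = W_up @ ((U @ x) * (V @ x))

# Project q onto the orthogonal complement of the main representation,
# so the injected signal carries no component already present in h.
h_unit = h / np.linalg.norm(h)
q_perp = q - (q @ h_unit) * h_unit

out = h + q_perp
print(abs(q_perp @ h))  # residual overlap, zero up to floating point
```

The projection is the whole mechanism: whatever the quadratic branch computes, only its component outside span(h) reaches the output.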

What carries the argument

The orthogonal projection of the low-rank quadratic auxiliary branch onto the complement of the main hidden representation, which enforces non-redundant information contribution.

Load-bearing premise

That forcing the quadratic auxiliary features to lie in the orthogonal complement of the main representation will guarantee they add genuinely useful new information that improves class separation and geometry without unintended side effects.

What would settle it

Removing the orthogonal projection step and checking whether the reported accuracy gains on CIFAR-100 and TinyImageNet disappear while the measured overlap between auxiliary and main representations rises above near-zero.
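The measurement half of that check is a one-liner. A minimal sketch, assuming the overlap metric is absolute cosine similarity between auxiliary and main features, as in the paper's mechanism analysis:

```python
import numpy as np

def abs_cosine(a, b):
    """Absolute cosine overlap between two feature vectors."""
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
d = 128
h = rng.normal(size=d)            # main-branch representation (stand-in)
q = rng.normal(size=d) + 0.5 * h  # auxiliary feature, partly redundant with h

# Without projection, the auxiliary branch visibly overlaps the main branch...
before = abs_cosine(q, h)

# ...and after projecting onto h's orthogonal complement the overlap vanishes.
h_unit = h / np.linalg.norm(h)
q_perp = q - (q @ h_unit) * h_unit
after = abs_cosine(q_perp, h)

print(before, after)
```

The causal question is then whether the accuracy gain tracks `after` staying near zero, or survives when the projection is removed and `before`-level overlap returns.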

Figures

Figures reproduced from arXiv: 2604.09709 by Wang Zixian.

Figure 1
Figure 1: Architecture evolution. Left: the host FFN (e.g., a standard MLP or AFBO) produces a dominant main branch. Middle: OQC adds a low-rank quadratic auxiliary feature and explicitly projects it onto the orthogonal complement of the main branch. Right: the gated OQC family keeps the orthogonal complement but statically or dynamically modulates its injection into the host branch.
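The gated family in the caption above can be sketched as a scalar gate applied to the already-orthogonalized complement. The sigmoid-of-linear-readout form for the dynamic gate is an assumption; the paper's exact gate is not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d = 32
x = rng.normal(size=d)        # layer input
h = rng.normal(size=d)        # main branch (stand-in)
q_perp = rng.normal(size=d)
q_perp -= (q_perp @ h) / (h @ h) * h   # keep it in h's orthogonal complement

# Static gating: one learned scalar scales the complement for all inputs.
g_static = 0.7
out_static = h + g_static * q_perp

# Dynamic gating: an input-dependent scalar, here a sigmoid of a
# learned linear readout (one plausible realization).
w_gate = rng.normal(size=d) / np.sqrt(d)
g_dyn = sigmoid(w_gate @ x)
out_dyn = h + g_dyn * q_perp

# Gating rescales the complement but never re-introduces overlap with h.
print(abs((out_dyn - h) @ h))
```

Note that the gate commutes with the orthogonality property: scaling a vector in the complement keeps it in the complement, so the design choice only controls how much of the novel signal is injected, never what subspace it lives in.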
Figure 3
Figure 3: Accuracy–efficiency trade-off. (a) Rank sweep: full OQC improves monotonically with rank, while OQC-LR peaks at r=56 (starred). (b) Pareto frontier: OQC-LR at r=56 delivers 65.52% at 7,485 img/s, substantially faster than full OQC (5,655 img/s) while nearly matching its accuracy.
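The speed gap between full OQC and OQC-LR is consistent with simple parameter arithmetic. The count below assumes an auxiliary branch of the form q = W_up((Ux) ⊙ (Vx)), a hidden width of 384, and rank equal to width for the full variant; all three are illustrative assumptions, not the paper's reported sizes.

```python
def aux_params(d, r):
    # U and V are (r, d) each; the up-projection W_up is (d, r).
    return 2 * r * d + d * r

d = 384            # a typical small-ViT width (assumption)
full = aux_params(d, d)    # "full" rank taken as r = d
lr = aux_params(d, 56)     # OQC-LR's starred setting from the sweep above

print(full, lr, full / lr)
```

Since the count is linear in r, dropping from r = 384 to r = 56 cuts auxiliary parameters (and the matching compute) by 384/56 ≈ 6.9x, which is the right order of magnitude for the throughput gap on the Pareto frontier.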
Figure 2
Figure 2: Mechanism visualization. (a) Auxiliary–main absolute cosine overlap before and after complement projection; orthogonalization reduces it to machine-precision zero. (b) Effective rank vs. separation score: full OQC achieves the richest geometry while OQC-dynamic achieves the best class separation. (c) Complement usage: OQC-dynamic lowers average gate activation and introduces genuine input-dependent variation.
Original abstract

Recent bilinear feed-forward replacements for vision transformers can substantially improve accuracy, but they often conflate two effects: stronger second-order interactions and increased redundancy relative to the main branch. We study a complementary design principle in which auxiliary quadratic features contribute only information not already captured by the dominant hidden representation. To this end, we propose Orthogonal Quadratic Complements (OQC), which construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection. We further study an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static and OQC-dynamic). Under a parameter-matched Deep-ViT and CIFAR-100 protocol with a fixed penultimate residual readout, full OQC improves an AFBO baseline from 64.25 +/- 0.22 to 65.59 +/- 0.22, while OQC-LR reaches 65.52 +/- 0.25 with a substantially better speed-accuracy tradeoff. On TinyImageNet, the gated extension OQC-dynamic achieves 51.88 +/- 0.32, improving the baseline (50.45 +/- 0.21) by 1.43 points and outperforming all ungated variants. Mechanism analyses show near-zero post-projection auxiliary-main overlap together with improved representation geometry and class separation. The full family, including both ungated and gated variants, generalizes consistently across both datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Orthogonal Quadratic Complements (OQC) for vision transformer feed-forward networks. It constructs a low-rank quadratic auxiliary branch that is explicitly projected onto the orthogonal complement of the main hidden representation before injection, with the goal of adding only novel information and avoiding redundancy. Variants include an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static, OQC-dynamic). Under a parameter-matched Deep-ViT protocol, full OQC improves an AFBO baseline from 64.25 ± 0.22 to 65.59 ± 0.22 on CIFAR-100; OQC-dynamic reaches 51.88 ± 0.32 (vs. baseline 50.45 ± 0.21) on TinyImageNet. Mechanism analyses report near-zero post-projection overlap together with improved representation geometry and class separation. The family generalizes across both datasets.

Significance. If the orthogonality mechanism is causally responsible for the gains, the work supplies a clean design principle for reducing redundancy when augmenting ViT FFNs with second-order terms. The modest but consistent accuracy deltas, the favorable speed-accuracy tradeoff of OQC-LR, and the explicit mechanism measurements (overlap, geometry) are strengths. The approach could influence future bilinear or quadratic FFN replacements in vision transformers.

major comments (2)
  1. [Experiments] Experiments section: the reported accuracy improvements (1.34 points on CIFAR-100, 1.43 on TinyImageNet) are measured only against the AFBO baseline and other OQC variants. No ablation compares the full OQC (with explicit orthogonal projection) against an otherwise identical quadratic auxiliary branch added without the projection step. This control is required to isolate whether the orthogonality itself, rather than added capacity or quadratic interactions, produces the observed gains in class separation and geometry.
  2. [Mechanism analyses] Mechanism analyses: the paper shows near-zero post-projection auxiliary-main overlap and improved geometry metrics, but these are correlational. Without a direct comparison of representation quality (or downstream accuracy) between the projected and non-projected quadratic branches, the claim that the orthogonal complement step is what “guarantees it contributes only genuinely new information” remains unverified.
minor comments (2)
  1. [Abstract] The abstract and experimental protocol description should explicitly state the total parameter count for the AFBO baseline and each OQC variant to confirm the “parameter-matched” claim.
  2. [Method] Notation for the projection operator and the low-rank factorization (OQC-LR) could be introduced earlier and used consistently in the method and analysis sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We agree that the suggested ablation is necessary to more rigorously isolate the causal contribution of the orthogonal projection and will add the requested experiments in the revised manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the reported accuracy improvements (1.34 points on CIFAR-100, 1.43 on TinyImageNet) are measured only against the AFBO baseline and other OQC variants. No ablation compares the full OQC (with explicit orthogonal projection) against an otherwise identical quadratic auxiliary branch added without the projection step. This control is required to isolate whether the orthogonality itself, rather than added capacity or quadratic interactions, produces the observed gains in class separation and geometry.

    Authors: We agree that a direct ablation isolating the orthogonal projection step is required to substantiate the claim that orthogonality, rather than quadratic capacity alone, drives the gains. In the revised manuscript we will add this control experiment: an otherwise identical low-rank quadratic auxiliary branch injected without the orthogonal complement projection, trained under the exact same parameter-matched Deep-ViT protocol on both CIFAR-100 and TinyImageNet. We will report accuracy, post-injection overlap, and the same geometry/class-separation metrics for both variants so that the incremental benefit of the projection can be quantified directly. revision: yes

  2. Referee: [Mechanism analyses] Mechanism analyses: the paper shows near-zero post-projection auxiliary-main overlap and improved geometry metrics, but these are correlational. Without a direct comparison of representation quality (or downstream accuracy) between the projected and non-projected quadratic branches, the claim that the orthogonal complement step is what “guarantees it contributes only genuinely new information” remains unverified.

    Authors: We acknowledge that the current mechanism results are correlational and that a head-to-head comparison is needed to verify causality. The new ablation described above will directly address this by providing representation-quality and accuracy metrics for the projected versus non-projected branches. These results will be added to the mechanism-analyses section, allowing us to test whether the orthogonal projection is what produces the near-zero overlap and improved geometry. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured against explicit baselines

full rationale

The paper defines OQC by constructing a low-rank quadratic branch and explicitly projecting it onto the orthogonal complement of the main representation before injection. Reported results are direct accuracy measurements (e.g., +1.34 points on CIFAR-100, +1.43 on TinyImageNet) under parameter-matched protocols against named baselines such as AFBO. Mechanism analyses confirm near-zero post-projection overlap, but these are verification of the enforced construction rather than load-bearing derivations. No equations reduce the accuracy deltas to fitted parameters renamed as predictions, no self-citations justify uniqueness theorems, and no ansatz is smuggled via prior work. The derivation chain is self-contained as an architectural proposal validated externally.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that feature-space orthogonality equates to non-redundant information and that low-rank approximations preserve the essential quadratic interactions.

free parameters (1)
  • low-rank dimension
    The rank chosen for the efficient OQC-LR realization is a tunable hyperparameter that controls the auxiliary branch capacity.
axioms (1)
  • domain assumption Orthogonal projection removes all overlapping information between auxiliary quadratic features and the main branch representation
    Invoked when the paper states that the projection ensures the auxiliary branch contributes only information not already captured.
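This axiom earns its "domain assumption" label: orthogonal projection removes linear overlap, which is strictly weaker than removing all shared information. A toy check (not from the paper) shows two signals with near-zero linear overlap that are nonetheless fully dependent:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=100_000)
b = a**2 - 1.0   # fully determined by a, yet linearly uncorrelated with it

# Linear (Pearson) correlation is near zero despite total dependence,
# so zero overlap in the linear sense does not imply novel information.
corr = np.corrcoef(a, b)[0, 1]
print(corr)
```

This is why the mechanism analyses (overlap, geometry, separation) and the referee's requested no-projection ablation matter: orthogonality guarantees the auxiliary signal is linearly non-redundant, and the rest of the claim has to be established empirically.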

pith-pipeline@v0.9.0 · 5548 in / 1301 out tokens · 60734 ms · 2026-05-10T18:57:33.679401+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

6 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

  2. [2]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR), 2021.

  3. [3]

    Training Data-Efficient Image Transformers & Distillation Through Attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training Data-Efficient Image Transformers & Distillation Through Attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 10347–10357, 2021.

  4. [4]

    Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021.

  5. [5]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202, 2020.

  6. [6]

    Asymmetric Factorized Bilinear Operation for Vision Transformer

    Junjie Wu, Qilong Wang, Jiangtao Xie, Pengfei Zhu, and Qinghua Hu. Asymmetric Factorized Bilinear Operation for Vision Transformer. In International Conference on Learning Representations (ICLR), 2025.