pith. machine review for the scientific record.

arxiv: 2604.14430 · v1 · submitted 2026-04-15 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Three-Phase Transformer

Mohammad R. Abu Ayyash

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:00 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords transformer architecture · residual stream · language modeling · positional encoding · Givens rotation · self-stabilization · three-phase · Gabriel's horn

The pith

Three-Phase Transformer partitions the residual stream into rotating cyclic channels plus an orthogonal DC horn to cut perplexity and accelerate convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a structural change to decoder-only transformers that splits the hidden vector into N equally sized cyclic channels. Each channel receives its own RMSNorm and a phase-shifted 2D rotation between the attention and feed-forward blocks, and a head-count constraint aligns grouped-query-attention heads with the partition. A one-dimensional DC subspace orthogonal to these channels carries a fixed Gabriel's horn profile encoding absolute position. At 123 million parameters the resulting model records lower perplexity and reaches the same loss in fewer steps than a matched RoPE baseline, while adding fewer than 0.002 percent extra parameters.
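A minimal sketch of the per-channel normalization, in PyTorch (our own rendering, not the paper's code; the one-dimensional DC coordinate is ignored here and d_model is assumed divisible by the channel count):

```python
import torch

def per_channel_rmsnorm(x: torch.Tensor, weight: torch.Tensor,
                        n_channels: int = 3, eps: float = 1e-6) -> torch.Tensor:
    """RMS-normalize each of the N cyclic channels independently.

    x:      (..., d_model) hidden states, d_model divisible by n_channels
    weight: (d_model,) learned gain, as in standard RMSNorm
    """
    *lead, d = x.shape
    ch = x.view(*lead, n_channels, d // n_channels)              # split into channels
    rms = ch.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()  # per-channel 1/RMS
    return (ch * rms).view(*lead, d) * weight                    # rescale, re-flatten
```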

Core claim

The Three-Phase Transformer maintains a self-stabilizing residual stream: attention and feed-forward layers scramble the channels, per-channel normalization and phase-respecting rotations re-impose the structure, and the two forces settle into equilibrium. The partition also carves out an orthogonal DC subspace that accepts a fixed horn-shaped absolute-position signal without interfering with RoPE's relative-position rotations.

What carries the argument

Channel-partitioned residual stream with per-phase RMSNorm, Givens rotations offset by multiples of 2π/N, and orthogonal Gabriel's horn DC injection, which together enforce a cyclic equilibrium that stabilizes geometry across layers.
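A hedged sketch of the rotation step (assumptions ours: channels are contiguous slices and each channel's dimensions are rotated Givens-style in adjacent pairs by that channel's shared angle; the paper publishes no code):

```python
import math
import torch

def phase_givens_rotation(x: torch.Tensor, theta: torch.Tensor,
                          n_channels: int = 3) -> torch.Tensor:
    """Rotate channel i by theta + i * (2*pi/N), as applied between attention and FFN.

    x:     (..., d_model), d_model divisible by 2 * n_channels
    theta: scalar tensor, the learned base rotation angle
    """
    *lead, d = x.shape
    pairs = x.view(*lead, n_channels, d // (2 * n_channels), 2)    # 2D pairs per channel
    angles = theta + torch.arange(n_channels) * (2 * math.pi / n_channels)
    cos, sin = angles.cos().view(-1, 1), angles.sin().view(-1, 1)  # broadcast over pairs
    a, b = pairs[..., 0], pairs[..., 1]
    out = torch.stack((a * cos - b * sin, a * sin + b * cos), dim=-1)
    return out.view(*lead, d)
```

At N=3 the three angles sit 120 degrees apart, which is the balanced three-phase AC analogy the paper invokes.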

If this is right

  • N functions as a parameter-sharing hyperparameter rather than a single fixed optimum, with performance at 123M parameters statistically indistinguishable between N=1 and N=3.
  • The architecture composes orthogonally with existing RoPE, attention, and FFN components without requiring changes to those modules.
  • Rotation-angle drift exhibits a U-shaped profile across depth in 12-layer models, indicating geometry self-stabilization without explicit constraints.
  • The method yields both lower final perplexity and faster step-count convergence on WikiText-103 language modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the channel partition is the dominant mechanism, then similar cyclic partitioning might improve stability in non-language sequence models that already use residual streams.
  • The near-equivalence of N=1 and N=3 at larger scale suggests the benefit may saturate or shift with model width, inviting direct sweeps at billion-parameter sizes.
  • Because the horn injection is parameter-free and orthogonal to RoPE, it could be tested as a lightweight absolute-position add-on in other relative-position architectures.

Load-bearing premise

The observed gains in perplexity and convergence speed are produced by the channel partition, phase rotations, per-phase normalization, and horn injection rather than by small uncontrolled differences in training procedure or random seed.

What would settle it

Train the exact 123 M parameter 3PT configuration and the matched RoPE-only baseline from the same random seeds with identical data order, optimizer states, and learning-rate schedule; if the perplexity gap disappears or reverses, the structural prior does not deliver the claimed benefit.
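A sketch of the seed discipline such a head-to-head run requires (build_model and train are hypothetical stand-ins, as marked; only the seeding utility is concrete):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Pin every RNG that touches parameter init, dropout, and data order."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # trade speed for reproducibility
    torch.backends.cudnn.benchmark = False

# Paired protocol, sketched (build_model / train are hypothetical stand-ins):
#   for seed in (1, 2, 3):
#       for arch in ("3pt", "rope_only"):
#           seed_everything(seed)                       # same init noise in both arms
#           gen = torch.Generator().manual_seed(seed)   # same batch order in both arms
#           ppl[arch, seed] = train(build_model(arch), data_generator=gen)
```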

Figures

Figures reproduced from arXiv: 2604.14430 by Mohammad R. Abu Ayyash.

Figure 0. Three-Phase Transformer (3PT) on WikiText-103 at 123M parameters. Left: validation BPB across 30k steps. …
Figure 1. PPL over 2,000 steps for the four variants of Experiment 1. The embedding-side modification B is ahead at …
Figure 2. Final PPL at step 2,000 for all seven variants. Stacking three-phase on top of RoPE (F, G) produces a 15%+ …
Figure 3. Cumulative stacking of six refinements on top of the stage-2 winner. Align and PhRMS give small wins; …
Figure 4. Marginal effects per flag averaged over the 32 ON/OFF pairs. Only Align is a clean win.
Figure 5. Long-horizon PPL for RoPE-Only vs the three-phase winner ARhdsL. The 13.30% headline gap is visible …
Figure 6. ChannelStructure vs the previous winners. The simpler architecture is structurally superior for 18 of 20 …
Figure 8. Learned vs fixed Gabriel's horn. The model wants a softer head (1.0 → 0.532) and keeps approximately …
Figure 10. Noise floor at 5.5M: 13.85 ± 0.046 PPL across 5 seeds. Seed 42 (the canonical run used in the earlier …
Figure 11. Monotone curve: more independent rotation thetas → better PPL at 5.5M. N=4 vs N=3 …
Figure 13. Placement check on WikiText-103 across parameter scales from 120M to 838M. Colored by training-corpus alignment: WT103-trained models form a lower band; zero-shot GPT-2 checkpoints sit above Three-Phase 123M despite larger parameter counts. Note on comparability: the ThreePhase / OnePhase rows report BPB rather than PPL and use Llama-2 BPE (vocab 32,000), while every other row uses GPT-2 BPE …
Original abstract

We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
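Taking the abstract's r(p) = 1/(p+1) at face value, the horn injection is simple enough to sketch (our assumptions: the last hidden coordinate serves as the DC dimension and the profile is added to it; the paper specifies neither in the text quoted here):

```python
import torch

def horn_dc_injection(x: torch.Tensor, dc_index: int = -1) -> torch.Tensor:
    """Add the fixed Gabriel's horn profile r(p) = 1/(p+1) along one DC coordinate.

    x: (batch, seq_len, d_model) hidden states; p indexes position from 0.
    """
    seq_len = x.size(1)
    r = 1.0 / (torch.arange(seq_len, dtype=x.dtype, device=x.device) + 1.0)
    out = x.clone()
    out[..., dc_index] = out[..., dc_index] + r   # broadcast (seq,) over the batch
    return out
```

The profile itself is parameter-free and decays monotonically, so early positions receive the largest DC offset.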

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers built on SwiGLU + RMSNorm + RoPE + GQA. It partitions the hidden state into N cyclic channels, applies per-channel RMSNorm, per-block 2D Givens rotations offset by theta + i*(2*pi/N), aligns GQA heads to the partition, and injects a fixed Gabriel's horn profile r(p)=1/(p+1) into the resulting one-dimensional DC subspace. The architecture is presented as self-stabilizing by construction. On WikiText-103 at 123M parameters, 3PT (N=3) is reported to yield -7.20% perplexity and 1.93x step-count speedup versus a matched RoPE-Only baseline (+1,536 parameters), with additional characterization of self-stabilization, U-shaped rotation-angle drift, and orthogonal composition with existing components. N is described as a parameter-sharing knob rather than a unique optimum.

Significance. If the perplexity and convergence gains prove robust and causally attributable to the channel partitioning, per-phase rotations, and horn injection rather than incidental factors, the work would supply a lightweight, parameter-efficient structural prior grounded in a conservation-law view of network geometry. The explicit characterization of self-stabilization without external enforcement and the orthogonal composition with RoPE would be notable contributions to the literature on architectural inductive biases.

major comments (3)
  1. [Abstract / N-sweep paragraph] Abstract and N-sweep results: the headline -7.20% perplexity and 1.93x speedup claims at 123M parameters rest on a three-seed sweep in which N=3 and N=1 are reported as statistically indistinguishable. Because the RoPE-Only baseline already differs from N=1 only by the absence of the horn DC injection and because N=1 already incorporates per-channel RMSNorm plus a single-phase rotation, the observed delta cannot yet be confidently attributed to the full channel-partitioned residual stream and per-phase rotations rather than seed variance or the horn component alone.
  2. [Abstract / Experimental results] Abstract and results: no error bars, no statistical tests (e.g., paired t-test across matched runs), and no full ablation tables isolating the channel partition, per-phase rotation, and horn injection are supplied. The baseline-matching procedure (hyperparameter parity, random-seed control, etc.) is also not detailed, rendering the +1,536-parameter overhead and reported gains difficult to interpret.
  3. [Abstract] Abstract: the self-stabilization property is asserted as 'a novel instance of the conservation-law framework' and as arising 'without explicit enforcement,' yet no derivation, fixed-point analysis, or Lyapunov-style argument is provided to show why the combination of cyclic partitioning, phase rotations, and horn injection produces an equilibrium rather than merely being defined to do so.
minor comments (2)
  1. [Architecture description] The precise algebraic definition of the 2D Givens rotation matrix and its composition with the residual stream should be written out explicitly (including the exact placement relative to attention and FFN) rather than described at a high level; a standard form is sketched after these comments.
  2. [Characterization section] The U-shaped depth profile of rotation-angle drift is mentioned but not accompanied by a figure or quantitative table; adding one would improve clarity.
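For reference, the standard 2D Givens rotation with the paper's per-channel phase offsets would read (notation ours; the explicit placement relative to attention and FFN is exactly what the comment asks the authors to supply):

```latex
R(\theta_i) =
\begin{pmatrix}
  \cos\theta_i & -\sin\theta_i \\
  \sin\theta_i & \cos\theta_i
\end{pmatrix},
\qquad
\theta_i = \theta + i\,\frac{2\pi}{N}, \qquad i = 0, 1, \dots, N-1,
```

with R(\theta_i) applied blockwise to the 2D coordinate pairs of channel i between the attention and feed-forward sublayers.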

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with plans for revisions where appropriate to improve clarity, statistical rigor, and attribution of results.

Point-by-point responses
  1. Referee: [Abstract / N-sweep paragraph] Abstract and N-sweep results: the headline -7.20% perplexity and 1.93x speedup claims at 123M parameters rest on a three-seed sweep in which N=3 and N=1 are reported as statistically indistinguishable. Because the RoPE-Only baseline already differs from N=1 only by the absence of the horn DC injection and because N=1 already incorporates per-channel RMSNorm plus a single-phase rotation, the observed delta cannot yet be confidently attributed to the full channel-partitioned residual stream and per-phase rotations rather than seed variance or the horn component alone.

    Authors: We acknowledge that the three-seed sweep shows N=3 and N=1 as statistically indistinguishable at 123M parameters, which weakens attribution of gains specifically to multi-phase partitioning (N>1). The reported improvements are over the RoPE-Only baseline and arise from the combination of per-channel RMSNorm, phase rotation (present even at N=1), and the Gabriel's horn DC injection that carves out the orthogonal subspace. The partitioning enables this structure but does not yield a large incremental benefit at this scale, consistent with the manuscript's statement on indistinguishability. We will revise the abstract and N-sweep paragraph to clarify this attribution, note the limited additional value of N>1 at 123M, and better contextualize the small-scale N-sweep results where trends differ. revision: partial

  2. Referee: [Abstract / Experimental results] Abstract and results: no error bars, no statistical tests (e.g., paired t-test across matched runs), and no full ablation tables isolating the channel partition, per-phase rotation, and horn injection are supplied. The baseline-matching procedure (hyperparameter parity, random-seed control, etc.) is also not detailed, rendering the +1,536-parameter overhead and reported gains difficult to interpret.

    Authors: We agree that the results section would benefit from greater statistical detail and component isolation. In the revised manuscript we will add error bars computed across the three seeds for the primary metrics, include paired t-test results assessing significance of the perplexity and convergence differences, and insert a dedicated ablation table isolating the channel partition, per-phase rotations, and horn injection. We will also expand the experimental setup to fully document the baseline-matching procedure, including hyperparameter parity, seed control, and training configuration details, to support clear interpretation of the small parameter overhead (a sketch of such a paired test follows this exchange). revision: yes

  3. Referee: [Abstract] Abstract: the self-stabilization property is asserted as 'a novel instance of the conservation-law framework' and as arising 'without explicit enforcement,' yet no derivation, fixed-point analysis, or Lyapunov-style argument is provided to show why the combination of cyclic partitioning, phase rotations, and horn injection produces an equilibrium rather than merely being defined to do so.

    Authors: Self-stabilization is characterized empirically via geometric stability metrics, the U-shaped rotation-angle drift across depth, and the lack of divergence without external regularization. We frame the architecture within the conservation-law view because the cyclic channels and orthogonal horn injection re-impose structure after each block by construction. We recognize the absence of a formal derivation. In revision we will add a subsection with a qualitative fixed-point discussion and equilibrium argument based on phase orthogonality and the zero-sum property of the channels, using the three-phase AC analogy. A complete Lyapunov analysis lies outside the current empirical scope but the added discussion will strengthen the claim. revision: partial
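For concreteness, the paired test promised in response 2 would look roughly like this (the perplexity values below are placeholders, not numbers from the paper):

```python
from scipy import stats

# Placeholder per-seed final perplexities; real values would come from the
# matched three-seed runs the authors describe. The test is paired because
# runs share seed, data order, and schedule arm-for-arm.
ppl_rope_only = [14.92, 14.88, 15.01]   # hypothetical
ppl_3pt       = [13.81, 13.90, 13.84]   # hypothetical

t_stat, p_value = stats.ttest_rel(ppl_rope_only, ppl_3pt)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```

With only three seeds per arm the test has two degrees of freedom, so the promised error bars will matter as much as the p-value.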

Circularity Check

0 steps flagged

No significant circularity in claimed architecture or results

Full rationale

The manuscript proposes a structural modification to the Transformer residual stream (channel partitioning into N phases, per-phase RMSNorm, Givens rotations, GQA alignment, and fixed horn DC injection) and reports empirical perplexity and convergence metrics on WikiText-103. No derivation chain is presented that reduces a claimed prediction or first-principles result to its own inputs by construction; the self-stabilization and conservation-law framing are characterizations of the explicitly defined architecture rather than tautological outputs. N is treated as an empirical knob with reported sweeps showing no unique optimum, and the baseline comparison is to a RoPE-only model without the added components. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. Statistical concerns about seed count and N=1 vs N=3 indistinguishability affect evidence strength but do not constitute circular logic in the architecture definition or results reporting.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the introduced channel partition and rotations induce self-stabilization and performance gains; the horn profile is a fixed invented function with no independent evidence supplied.

free parameters (2)
  • N
    Channel count; canonical value 3 but N=1 performs similarly at 123M scale
  • theta
    Base rotation angle for the per-channel Givens rotations
axioms (1)
  • Standard Transformer components (SwiGLU, RMSNorm, RoPE, GQA) behave as previously published [standard math]
    The backbone is taken as given
invented entities (1)
  • Gabriel's horn profile r(p) = 1/(p+1) [no independent evidence]
    purpose: fixed absolute-position DC side-channel injected orthogonally to RoPE
    New fixed function introduced to supply absolute-position information

pith-pipeline@v0.9.0 · 5699 in / 1409 out tokens · 35803 ms · 2026-05-10T13:00:19.955134+00:00 · methodology

