arxiv: 2605.10123 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Complex-Valued Phase-Coherent Transformer

Leona Hioki

Pith reviewed 2026-05-12 03:24 UTC · model grok-4.3

classification 💻 cs.LG

keywords complex-valued transformersphase-coherent attentionattention mechanismslong-range memoryneural network architecturesphase preservation

0 comments

The pith

The Phase-Coherent Transformer replaces softmax token competition with a smooth gate on L2-normalised complex similarities to preserve phase across layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Phase-Coherent Transformer (PCT) as an alternative attention mechanism for complex-valued models. Instead of normalising rows of query-key similarities as in softmax, PCT applies a real-valued smooth gate to L2-normalised complex products so that phase information is retained from one layer to the next. On mid-scale tasks that test long-range memory, hierarchical reasoning, positional retrieval, phase-based superposition, and image classification, this change produces consistent gains over both real-valued softmax Transformers and direct complex-valued softmax baselines when parameter counts are matched. The design is tested by swapping in gates that break the phase-preservation rules, confirming that the specific gate properties drive the observed improvements rather than incidental details.

Core claim

PCT replaces row-normalised token competition with token-non-competing attention by feeding L2-normalised complex query-key similarities through a real-valued, element-independent smooth gate whose output remains bounded and keeps negatively aligned phase components. This structure is applied across multiple layers and yields stronger generalisation on long-range memory, hierarchical reasoning, positional retrieval, phase-based memory, and image classification tasks than either standard softmax attention or complex-valued softmax attention under parameter-fair conditions.

What carries the argument

The Phase-Coherent Transformer (PCT) attention block, which applies a real-valued smooth gate to L2-normalised complex query-key similarities to produce phase-preserving attention weights without row normalisation.

If this is right

Gates that preserve negatively aligned phases maintain strong long-range retrieval performance while gates that delete them cause collapse.
Gates whose outputs become excessively large produce clear degradation on the tested benchmarks.
PCT exhibits no depth-related accuracy collapse across the depth range examined.
PCT remains competitive with strong real-valued baselines even on tasks such as NIAH and LRA-Text that are traditionally difficult for complex-valued networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gate construction could be inserted into other complex-valued sequence models to test whether phase coherence improves generalisation beyond the Transformer architecture.
If phase preservation is the operative factor, then real-valued models that explicitly track phase-like quantities might show analogous gains on the same task suite.
The absence of depth collapse suggests the mechanism may scale to deeper stacks without the need for additional stabilisation techniques.

Load-bearing premise

Row-normalised token competition misaligns with phase-preserving computation, and the smooth gate's ability to retain negatively aligned phases while keeping outputs bounded is the main reason for the performance difference.

What would settle it

Performance on long-range retrieval tasks such as NIAH or LRA-Text drops sharply when the gate is replaced by one that either zeros negatively aligned phase components or produces unbounded outputs.

read the original abstract

Complex-valued Transformers have largely inherited softmax attention from real-valued architectures. However, row-normalised token competition is not necessarily aligned with phase-preserving computation. In this paper, we introduce the Phase-Coherent Transformer (PCT), which applies a real-valued, element-independent, smooth gate to L2-normalised complex query-key similarities. PCT replaces token competition with token-non-competing attention and is designed to preserve phase information across layers. Across mid-scale benchmarks spanning long-range memory, hierarchical long-range reasoning, positional retrieval, phase-based memory and superposition, and image classification, PCT shows strong generalisation across task categories. Under parameter-fair comparison, PCT consistently outperforms both the standard softmax Transformer and its direct complex-valued counterpart. Moreover, even on tasks traditionally considered difficult for complex-valued neural networks, such as NIAH and LRA-Text, PCT remains competitive with Multiscreen, the strongest real-valued NN baseline in our comparison. Experiments introducing gates that deliberately violate the PCT conditions show that the design is not incidental: smooth gates that preserve negatively aligned phase components remain strong, whereas gates that delete such components collapse on long-range retrieval, and gates whose outputs become excessively large suffer clear performance degradation. PCT also shows no depth-related accuracy collapse across the tested depth range. These results support introducing multi-layer phase-coherent structure into attention as a promising design principle for achieving generalisation in complex-valued Transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PCT swaps softmax for a smooth real gate on L2-normalized complex similarities to keep phase info and drop token competition, with decent gains on mid-scale tasks but ablations that leave the causal claim a bit loose.

read the letter

The main takeaway is that this Phase-Coherent Transformer applies a real-valued, element-independent smooth gate to L2-normalized complex query-key similarities. That replaces the usual row-normalized competition with non-competing attention meant to carry phase information through layers in complex-valued models. They show it beats both a standard softmax transformer and a direct complex counterpart on parameter-matched runs across long-range memory, hierarchical reasoning, positional retrieval, phase-based memory, and image classification. It also stays competitive with strong real baselines on tough cases like NIAH and LRA-Text and avoids the depth-related collapse that sometimes hits complex nets. The ablations add some weight: gates that keep negative phase alignments perform well, while ones that strip them out fail on retrieval and ones that produce large outputs degrade overall. That pattern supports the claim that the specific conditions are not incidental. The soft spot is whether those violating gates really isolate phase preservation. The description does not confirm that the bad variants held smoothness, differentiability, and output magnitude fixed while only changing the phase rule. If they altered scaling or transition sharpness at the same time, the drops could come from those side effects instead. The lack of error bars or statistical tests on the wins also makes the size of the advantage harder to judge. This is for people working on complex-valued networks or attention variants that try to respect phase. A reader focused on phase-sensitive tasks or alternatives to softmax would pick up the new rule and the empirical checks. It has enough distinct mechanism and targeted testing to warrant referee time rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Phase-Coherent Transformer (PCT), a complex-valued architecture that replaces standard softmax attention with a real-valued, element-independent, smooth gate applied to L2-normalised complex query-key similarities. This design aims to eliminate token competition and preserve phase information across layers. The central claim is that, under parameter-fair comparisons, PCT consistently outperforms both the standard real-valued softmax Transformer and a direct complex-valued softmax counterpart across mid-scale benchmarks in long-range memory, hierarchical reasoning, positional retrieval, phase-based memory, and image classification. Ablations on deliberately violating gates are presented to show that preserving negatively aligned phases is load-bearing, while the model exhibits no depth-related accuracy collapse.

Significance. If the empirical results and ablation isolation hold under fuller verification, the work provides evidence that phase-coherent, non-competitive attention can serve as a viable design principle for complex-valued Transformers, yielding better generalisation than inherited softmax mechanisms on tasks involving phase superposition and long-range dependencies. The parameter-fair setup and targeted gate ablations are positive features that strengthen the case for the specific architectural choice over generic complex-valued extensions.

major comments (3)

[Ablation studies / Experiments] Ablation studies (as summarised in the abstract and experiments): the description of gates that 'delete negatively aligned phase components' or 'produce excessively large outputs' does not confirm that all other properties (smoothness, element-independence, differentiability, and output magnitude distribution) were held fixed while only the phase-related rule was altered. Without this control, performance collapse could stem from changes in gradient flow or effective scaling rather than loss of phase coherence, weakening the causal link to the PCT conditions.
[Experimental results] Results reporting (throughout experimental sections): no error bars, standard deviations, or statistical significance tests are mentioned for the performance comparisons or ablation tables. This is load-bearing for the claim of 'consistent outperformance' across task categories, as single-run or unreported-variance numbers leave open the possibility that observed gains fall within run-to-run variability.
[Methods / Model definition] Definition of the gate (methods section): the precise functional form of the real-valued smooth gate on L2-normalised complex similarities is not accompanied by an explicit equation showing how negative phase alignment is preserved while bounding outputs; without this, it is difficult to verify that the gate is strictly non-competitive and phase-preserving as claimed.

minor comments (2)

[Abstract / Introduction] The abstract and introduction could more explicitly state the exact parameter counts and layer depths used in the 'parameter-fair' comparisons to allow direct replication.
[Methods] Notation for complex-valued quantities (e.g., how the L2 normalisation interacts with the real gate) would benefit from a short clarifying sentence or diagram to avoid ambiguity for readers unfamiliar with complex attention variants.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below. Where the manuscript requires clarification or additional detail, we will revise accordingly to strengthen the presentation of the ablation controls, experimental reporting, and model definition.

read point-by-point responses

Referee: [Ablation studies / Experiments] Ablation studies (as summarised in the abstract and experiments): the description of gates that 'delete negatively aligned phase components' or 'produce excessively large outputs' does not confirm that all other properties (smoothness, element-independence, differentiability, and output magnitude distribution) were held fixed while only the phase-related rule was altered. Without this control, performance collapse could stem from changes in gradient flow or effective scaling rather than loss of phase coherence, weakening the causal link to the PCT conditions.

Authors: We agree that explicit isolation of the phase-coherence rule is essential for a causal claim. The ablation gates were constructed by modifying only the handling of negative real-part alignments (via a phase-dependent threshold or sign flip) while retaining the same smooth sigmoid-like activation, per-element independence, differentiability, and post-gate normalization to control output magnitude. We will revise the experimental section to include a dedicated paragraph and supplementary table that explicitly lists the fixed properties for each ablation variant, together with the precise modification applied to the phase rule. This will make the controls verifiable and reinforce that the observed collapses are attributable to loss of negative-phase preservation rather than ancillary changes in scaling or gradient flow. revision: yes
Referee: [Experimental results] Results reporting (throughout experimental sections): no error bars, standard deviations, or statistical significance tests are mentioned for the performance comparisons or ablation tables. This is load-bearing for the claim of 'consistent outperformance' across task categories, as single-run or unreported-variance numbers leave open the possibility that observed gains fall within run-to-run variability.

Authors: The referee correctly identifies that the current results are reported from single runs without variance estimates or significance tests. This is a genuine limitation in the experimental presentation. For the revised manuscript we will rerun all primary comparisons and ablations with at least five independent random seeds, report means and standard deviations in every table, and add paired statistical tests (e.g., Wilcoxon or t-tests with Bonferroni correction) for the key outperformance claims. We expect the reported margins to remain significant, but the additional statistics will directly address the concern about run-to-run variability. revision: yes
Referee: [Methods / Model definition] Definition of the gate (methods section): the precise functional form of the real-valued smooth gate on L2-normalised complex similarities is not accompanied by an explicit equation showing how negative phase alignment is preserved while bounding outputs; without this, it is difficult to verify that the gate is strictly non-competitive and phase-preserving as claimed.

Authors: We acknowledge that an explicit equation would improve verifiability. The gate is a real-valued function applied element-wise to the L2-normalised complex similarities that maps negative real-part alignments to non-zero positive weights while keeping the output bounded and independent across tokens. We will insert a new displayed equation in the methods section that defines the gate mathematically, together with a short derivation showing that (i) negative phase alignments receive non-zero weight, (ii) the operation remains element-independent, and (iii) the output magnitude is bounded by construction. This will make the non-competitive and phase-preserving properties directly inspectable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical design validated by independent benchmarks and ablations

full rationale

The paper introduces PCT as a design choice motivated by phase-preservation reasoning and supports its claims through direct experimental comparisons to softmax and complex baselines plus targeted gate ablations on multiple benchmarks. No equations reduce performance metrics to fitted parameters by construction, no self-citations form load-bearing premises, and no ansatz or uniqueness result is imported from prior author work. The derivation chain consists of architectural motivation followed by falsifiable empirical tests, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that phase preservation improves generalization in complex networks and on the empirical observation that the chosen gate satisfies the necessary conditions; no additional free parameters or invented physical entities are introduced beyond the architecture itself.

axioms (1)

domain assumption Row-normalised token competition is not necessarily aligned with phase-preserving computation.
Explicitly stated as motivation for replacing softmax in the opening paragraph.

invented entities (1)

Phase-Coherent Transformer (PCT) no independent evidence
purpose: Attention mechanism that preserves phase information across layers via smooth gating.
New architecture defined and tested in the paper; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.0 · 5536 in / 1340 out tokens · 29877 ms · 2026-05-12T03:24:56.975517+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes
smooth gates that preserve negatively aligned phase components remain strong, whereas gates that delete such components collapse
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Theorem 2: L1 & C2 & C3 & substrate non-expansion ⇒ L2 ... machine-checked in Lean

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · 2 internal anchors

[1]

2018– Chiheb Trabelsi, Olexa Bilaniuk, Ying Zhang, Dmitriy Serdyuk, Sandeep Subramanian, João Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, Christopher J

Complex-valued neural networks • Trabelsi et al. 2018– Chiheb Trabelsi, Olexa Bilaniuk, Ying Zhang, Dmitriy Serdyuk, Sandeep Subramanian, João Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, Christopher J. Pal.Deep Complex Networks. ICLR

work page 2018
[2]

https://arxiv.org/abs/1705.09792 • Eilers&Jiang2023–FlorianEilers, XiaoyiJiang.Building Blocks for a Complex-Valued Transformer Architecture

arXiv:1705.09792. https://arxiv.org/abs/1705.09792 • Eilers&Jiang2023–FlorianEilers, XiaoyiJiang.Building Blocks for a Complex-Valued Transformer Architecture. ICASSP

work page arXiv
[3]

• Hao et al

arXiv:2306.09827.https://arxiv.org/abs/2306.09827 • Defines complex scaled-dot-product attention variants and complex LayerNorm; literature default for complex_softmax. • Hao et al. 2025– Yang Hao et al.Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention. arXiv:2509.19331 (Sep 2025).https://arx...

work page arXiv 2025
[4]

no remarkable difference vs real baselines

HAL:hal-05235749.https://centralesupelec.hal. science/hal-05235749v1 • PyTorch library: complex layers, activations, attention, and complex-valued datasets (PolSAR, MRI). https://github.com/torchcvnn/torchcvnn • lucidrains 2024– Phil Wang.complex-valued-transformer. https://github.com/lucidrains/ 25 complex-valued-transformer •Open report: "no remarkable ...

work page 2024
[5]

arXiv:1706.03762.https://arxiv.org/abs/1706.03762 • Choromanski et al. 2021– Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller.Rethinking Attention with Performers. ICLR

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Rethinking Attention with Performers

arXiv:2009.14794.https://arxiv.org/abs/2009.14794 • Wortsman et al. 2023– Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith. Replacing softmax with ReLU in Vision Transformers. arXiv:2309.08586 (Sep 2023).https://arxiv. org/abs/2309.08586 • Ramapuram et al. 2025– Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, ...

work page internal anchor Pith review arXiv 2009
[7]

Theory, analysis, and best practices for sigmoid self-attention.arXiv preprint arXiv:2409.04431,

arXiv:2409.04431.https: //arxiv.org/abs/2409.04431. Code:https://github.com/apple/ml-sigmoid-attention • Saratchandran et al. 2024– Hemanth Saratchandran, Jianqiao Zheng, Yiping Ji, Wenbo Zhang, Simon Lucey.Rethinking Attention: Polynomial Alternatives to Softmax in Transformers. arXiv:2410.18613 (Oct 2024).https://arxiv.org/abs/2410.18613 • Nakanishi 202...

work page arXiv 2024