Theory, Analysis, and Best Practices for Sigmoid Self-Attention. arXiv preprint arXiv:2409.04431
5 Pith papers cite this work. Polarity classification is still being indexed.
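For context, the cited work's core proposal is to replace softmax normalization in self-attention with an elementwise sigmoid plus a constant bias. The sketch below is a minimal illustration of that idea, assuming single-head, unmasked attention and a bias of roughly -log(n) for sequence length n, as the paper suggests; the function name and shapes are illustrative, not the authors' reference implementation.

    # Minimal sketch of sigmoid attention (arXiv:2409.04431): the softmax over
    # keys is replaced by an elementwise sigmoid with a constant bias, so each
    # query-key score is gated independently rather than normalized across keys.
    import math
    import torch

    def sigmoid_attention(q, k, v, bias=None):
        # q, k, v: (batch, seq_len, d_head)
        n, d = q.shape[-2], q.shape[-1]
        if bias is None:
            bias = -math.log(n)                     # offset suggested in the paper
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)
        weights = torch.sigmoid(scores + bias)      # no normalization across keys
        return weights @ v

    q = torch.randn(2, 16, 64)
    k = torch.randn(2, 16, 64)
    v = torch.randn(2, 16, 64)
    out = sigmoid_attention(q, k, v)                # (2, 16, 64)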
5 representative citing papers
- Complex-Valued Phase-Coherent Transformer
  PCT replaces softmax token competition with a smooth phase-preserving gate on normalized complex similarities, yielding stronger generalization on long-range and phase-sensitive benchmarks than both real and complex Transformers.
- A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
  Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
- The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
  Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.
- Cubit: Token Mixer with Kernel Ridge Regression
  Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
- SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention
  SigGate-GT applies sigmoid gates to attention outputs in graph transformers to reduce over-smoothing, matching prior best on ZINC and setting new SOTA on ogbg-molhiv with gains over GraphGPS. (A minimal gating sketch follows this list.)
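As referenced in the SigGate-GT entry above, the sketch below shows one plausible way to sigmoid-gate attention outputs. The gate parameterization (a learned linear projection of the block input) and the residual placement are assumptions for illustration only, not the paper's exact architecture.

    # Illustrative sigmoid gating of an attention block's output: a learned,
    # per-feature gate in (0, 1) scales the attention update before the
    # residual connection, damping how much each node/token is over-mixed.
    import torch
    import torch.nn as nn

    class GatedAttentionBlock(nn.Module):
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.gate = nn.Linear(dim, dim)   # produces per-feature gate logits

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)
            g = torch.sigmoid(self.gate(x))   # gate values in (0, 1)
            return x + g * attn_out           # gated residual update

    x = torch.randn(8, 32, 128)               # (batch, tokens/nodes, features)
    y = GatedAttentionBlock(128)(x)            # same shape as x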