SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention
Pith reviewed 2026-05-10 06:24 UTC · model grok-4.3
The pith
Learned sigmoid gates on attention outputs let graph transformers selectively suppress uninformative connections and reduce over-smoothing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying per-head learned sigmoid gates to the output of softmax attention inside graph transformers breaks the forced sum-to-one normalization, letting individual heads drive uninformative activations to zero and thereby slowing the progressive collapse of node representations with increasing depth.
What carries the argument
Per-head learned sigmoid gates multiplied element-wise onto the attention output, which selectively scale attended features toward zero without altering the softmax normalization itself.
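The mechanism can be read as: the softmax output still sums to one over attended nodes, and a per-head sigmoid gate then rescales the attended features element-wise. A minimal NumPy sketch follows; the exact gate parameterization (here computed from the query representation with hypothetical weights `Wg`, `bg`) is our assumption, not the paper's stated design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(Q, K, V, Wg, bg):
    """Per-head sigmoid-gated attention (sketch).

    Q, K, V: (heads, n, d); Wg: (heads, d, d); bg: (heads, d).
    The softmax rows still sum to one; the sigmoid gate in (0, 1)
    then scales each attended feature, so a head can push
    uninformative activations toward zero without touching the
    softmax normalization itself."""
    heads, n, d = Q.shape
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)   # (heads, n, n)
    attn = softmax(scores, axis=-1)                  # rows sum to 1
    out = attn @ V                                   # (heads, n, d)
    gate = 1.0 / (1.0 + np.exp(-(Q @ Wg + bg[:, None, :])))  # sigmoid
    return gate * out
```

Driving the gate bias strongly negative silences a head entirely; driving it positive recovers plain softmax attention.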
If this is right
- Over-smoothing, measured by MAD, drops by about 30 percent across 4-to-16-layer models.
- Attention entropy rises and training remains stable across a tenfold range of learning rates.
- New state-of-the-art ROC-AUC of 82.47 percent on ogbg-molhiv and matched best MAE of 0.059 on ZINC.
- Statistically significant gains over the GraphGPS baseline on all five evaluated datasets.
- Parameter overhead remains near 1 percent while delivering these changes.
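The MAD-style over-smoothing measure in the first bullet can be sketched as mean pairwise cosine distance between node representations (one common definition, following Chen et al.; the paper's exact variant may differ):

```python
import numpy as np

def mad(H, eps=1e-12):
    """Mean average distance: mean pairwise cosine distance between
    node representations H of shape (n_nodes, dim).  Higher MAD means
    more diverse representations; over-smoothing drives MAD toward
    zero as depth grows."""
    norms = np.linalg.norm(H, axis=1, keepdims=True) + eps
    Hn = H / norms
    cos = Hn @ Hn.T                       # pairwise cosine similarity
    dist = 1.0 - cos                      # cosine distance
    n = H.shape[0]
    off = ~np.eye(n, dtype=bool)          # exclude self-pairs
    return dist[off].mean()
```

Identical rows give MAD near 0 (fully collapsed); mutually orthogonal rows give MAD of 1.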
Where Pith is reading between the lines
- The same per-head zeroing mechanism could be tested on non-graph transformers to reduce attention sinks.
- Pairing the gates with residual or normalization-based anti-smoothing methods might compound the depth gains.
- The observed stability over wide learning-rate ranges suggests the gates could reduce the cost of hyper-parameter search on new graph tasks.
Load-bearing premise
Performance gains arise specifically because the gates suppress only uninformative signals rather than from incidental effects of extra parameters or different hyper-parameter choices.
What would settle it
Re-train the identical architecture with the learned sigmoid gates replaced by fixed gates of value 1.0 and check whether the measured reduction in over-smoothing and the accuracy gains both disappear.
Original abstract
Graph transformers achieve strong results on molecular and long-range reasoning tasks, yet remain hampered by over-smoothing (the progressive collapse of node representations with depth) and attention entropy degeneration. We observe that these pathologies share a root cause with attention sinks in large language models: softmax attention's sum-to-one constraint forces every node to attend somewhere, even when no informative signal exists. Motivated by recent findings that element-wise sigmoid gating eliminates attention sinks in large language models, we propose SigGate-GT, a graph transformer that applies learned, per-head sigmoid gates to the attention output within the GraphGPS framework. Each gate can suppress activations toward zero, enabling heads to selectively silence uninformative connections. On five standard benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE) and sets new state-of-the-art on ogbg-molhiv (82.47% ROC-AUC), with statistically significant gains over GraphGPS across all five datasets ($p < 0.05$). Ablations show that gating reduces over-smoothing by 30% (mean relative MAD gain across 4-16 layers), increases attention entropy, and stabilizes training across a $10\times$ learning rate range, with about 1% parameter overhead on OGB.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SigGate-GT, a graph transformer built on the GraphGPS framework that applies learned per-head sigmoid gates to attention outputs. This modification is motivated by the observation that softmax attention's sum-to-one constraint contributes to over-smoothing and entropy degeneration (analogous to attention sinks in LLMs). The sigmoid gates allow selective suppression of uninformative signals. On five benchmarks the model matches the prior best MAE of 0.059 on ZINC and achieves new SOTA ROC-AUC of 82.47% on ogbg-molhiv, with statistically significant gains over GraphGPS (p < 0.05) on all datasets. Ablations report a ~30% reduction in over-smoothing (via mean relative MAD across 4-16 layers), increased attention entropy, and training stability over a 10x learning-rate range, at ~1% parameter overhead.
Significance. If the reported gains are causally attributable to the gating mechanism, the work offers a lightweight, architecture-compatible intervention for a well-known limitation of graph transformers. The multi-dataset evaluation, statistical testing, and direct measurement of over-smoothing via MAD provide concrete supporting evidence. The approach could influence subsequent graph transformer designs, particularly in molecular and long-range reasoning tasks, provided the selectivity claim is isolated from capacity or optimization effects.
Major comments (2)
- [Ablations section] Ablations section: The reported 30% MAD reduction and p<0.05 gains are attributed to selective suppression by the per-head sigmoid gates, yet no control experiment is described that holds total parameter count and architecture fixed while removing selectivity (e.g., replacing the sigmoid gates with learnable scalar multipliers per head or with fixed non-zero gates). Without such an isolation, the central causal claim remains vulnerable to the alternative explanation that gains arise from added capacity or altered training dynamics alone.
- [Results section] Results and experimental details: Concrete numbers and p<0.05 significance are stated, but the manuscript provides insufficient protocol information (exact data splits, baseline re-implementations, number of independent runs per result, variance estimates, and multiple-comparison correction) to allow independent verification or to rule out post-hoc selection. This weakens confidence in the SOTA claims on ogbg-molhiv and the cross-dataset significance.
Minor comments (1)
- [Abstract and experimental setup] The abstract states 'about 1% parameter overhead on OGB' but a table or paragraph giving exact parameter counts for SigGate-GT versus GraphGPS (broken down by component) would strengthen the low-overhead claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the claims and reproducibility.
Point-by-point responses
-
Referee: [Ablations section] Ablations section: The reported 30% MAD reduction and p<0.05 gains are attributed to selective suppression by the per-head sigmoid gates, yet no control experiment is described that holds total parameter count and architecture fixed while removing selectivity (e.g., replacing the sigmoid gates with learnable scalar multipliers per head or with fixed non-zero gates). Without such an isolation, the central causal claim remains vulnerable to the alternative explanation that gains arise from added capacity or altered training dynamics alone.
Authors: We agree that a control isolating the selectivity of the sigmoid (its [0,1] bounding for suppression) from added capacity is valuable. The current ablations compare against the unmodified GraphGPS baseline (fewer parameters) and show consistent gains in MAD reduction and entropy. To address the concern directly, we will add a new ablation in the revised manuscript replacing the per-head sigmoid gates with learnable scalar multipliers per head (unbounded, same parameter count). This will demonstrate whether the bounded, suppressive behavior of the sigmoid is necessary for the observed over-smoothing mitigation, holding architecture and capacity fixed. revision: yes
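The control the authors propose can be sketched as a gate-variant switch: the learned sigmoid gate, an unbounded per-head scalar of matched added capacity, and a fixed gate of 1.0 that recovers the ungated baseline. Names and shapes below are ours, not the paper's.

```python
import numpy as np

def make_gate(kind, heads, d, rng):
    """Gate variants for the proposed ablation (sketch):
    - 'sigmoid': learned per-head sigmoid gate, bounded in (0, 1),
                 can selectively suppress activations toward zero.
    - 'scalar':  one learnable, unbounded multiplier per head;
                 extra capacity without selective zeroing.
    - 'fixed':   constant 1.0, i.e. the ungated baseline.
    Returns f(Q, out) -> gated output, for out of shape (heads, n, d)."""
    if kind == "sigmoid":
        Wg = rng.standard_normal((heads, d, d)) * 0.02
        bg = np.zeros((heads, d))
        return lambda Q, out: out * (1.0 / (1.0 + np.exp(-(Q @ Wg + bg[:, None, :]))))
    if kind == "scalar":
        s = np.ones((heads, 1, 1))        # learnable, unbounded
        return lambda Q, out: s * out
    if kind == "fixed":
        return lambda Q, out: out
    raise ValueError(kind)
```

Training all three with identical architecture and seeds, and comparing MAD and accuracy, would isolate whether bounded suppression (not just extra parameters) drives the gains.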
-
Referee: [Results section] Results and experimental details: Concrete numbers and p<0.05 significance are stated, but the manuscript provides insufficient protocol information (exact data splits, baseline re-implementations, number of independent runs per result, variance estimates, and multiple-comparison correction) to allow independent verification or to rule out post-hoc selection. This weakens confidence in the SOTA claims on ogbg-molhiv and the cross-dataset significance.
Authors: We acknowledge that additional protocol details are required for full reproducibility and to support the statistical claims. In the revised manuscript, we will expand the experimental details to specify: the exact data splits and preprocessing steps (following official OGB and ZINC splits); that baselines were re-implemented in the same GraphGPS framework with reported hyperparameters; the number of independent runs (10 runs per model with different random seeds); variance estimates (standard deviations reported alongside means); and the statistical procedure (paired t-tests with p < 0.05, including any multiple-comparison correction such as Bonferroni). We will also release code, configurations, and seeds to enable independent verification. revision: yes
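The stated protocol (10 seeded runs per model, paired t-tests, Bonferroni correction over five datasets) can be sketched as follows. The hard-coded critical value is the standard two-sided t threshold for df = 9 at the adjusted level 0.05/5 = 0.01; everything else here is an illustrative assumption, not the paper's code.

```python
import numpy as np

# Two-sided critical t for df = 9 at alpha = 0.05 / 5 (Bonferroni).
T_CRIT = 3.250

def paired_t_stat(ours, base):
    """Paired t statistic over per-seed scores (10 runs -> df = 9)."""
    d = np.asarray(ours, dtype=float) - np.asarray(base, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

def significant(ours, base):
    """True if the paired difference clears the Bonferroni-adjusted
    threshold for one of the five dataset-level comparisons."""
    return abs(paired_t_stat(ours, base)) > T_CRIT
```

A large, consistent per-seed gain clears the threshold; zero-mean seed noise does not.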
Circularity Check
No circularity: empirical architecture with independent benchmark results
Full rationale
The paper proposes SigGate-GT as a sigmoid-gated attention modification inside the GraphGPS framework and validates it via standard benchmark experiments (ZINC, ogbg-molhiv, etc.) plus ablations on MAD, entropy, and stability. No derivation chain, equations, or uniqueness theorems are invoked that reduce any claimed result to a fitted parameter or self-citation by construction. The reported metrics and statistical gains are externally falsifiable on public datasets and do not rely on self-referential definitions or renamed inputs.
Axiom & Free-Parameter Ledger
Free parameters (1)
- per-head sigmoid gate weights
Axioms (1)
- Domain assumption: softmax attention's sum-to-one constraint forces every node to attend to something even when no informative signal exists, contributing to over-smoothing.
Invented entities (1)
- Sigmoid gate applied to attention output (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Neural Message Passing for Quantum Chemistry
J. Gilmer et al. “Neural Message Passing for Quantum Chemistry”. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. Ed. by D. Precup and Y. W. Teh. Vol. 70. Proceedings of Machine Learning Research. PMLR, 2017, pp. 1263–1272
2017
-
[2]
Semi-Supervised Classification with Graph Convolutional Networks
T. N. Kipf and M. Welling. “Semi-Supervised Classification with Graph Convolutional Networks”. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017
2017
-
[3]
How Powerful are Graph Neural Networks?
K. Xu et al. “How Powerful are Graph Neural Networks?” In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019
2019
-
[4]
On the Bottleneck of Graph Neural Networks and its Practical Implications
U. Alon and E. Yahav. “On the Bottleneck of Graph Neural Networks and its Practical Implications”. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021
2021
-
[5]
Attention is all you need
A. Vaswani et al. “Attention is all you need”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 6000–6010. ISBN: 9781510860964. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2017
-
[6]
Do Transformers Really Perform Badly for Graph Representation?
C. Ying et al. “Do Transformers Really Perform Badly for Graph Representation?” In: Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual. Ed. by M. Ranzato et al. 2021, pp. 28877–28888
2021
-
[7]
Recipe for a General, Powerful, Scalable Graph Transformer
L. Rampásek et al. “Recipe for a General, Powerful, Scalable Graph Transformer”. In: Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. Ed. by S. Koyejo et al. 2022
2022
-
[8]
Graph Inductive Biases in Transformers without Message Passing
L. Ma et al. “Graph Inductive Biases in Transformers without Message Passing”. In: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. Ed. by A. Krause et al. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 23321–23337
2023
-
[9]
Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning
Q. Li, Z. Han, and X. Wu. “Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning”. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-…
2018
-
[10]
Graph Neural Networks Exponentially Lose Expressive Power for Node Classification
K. Oono and T. Suzuki. “Graph Neural Networks Exponentially Lose Expressive Power for Node Classification”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020
2020
-
[11]
Graph Convolutions Enrich the Self-Attention in Transformers!
J. Choi et al. “Graph Convolutions Enrich the Self-Attention in Transformers!” In: Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024. Ed. by A. Globerson et al. 2024
2024
-
[12]
A Survey on Oversmoothing in Graph Neural Networks
T. K. Rusch, M. M. Bronstein, and S. Mishra. “A Survey on Oversmoothing in Graph Neural Networks”. In: arXiv preprint arXiv:2303.10993 (2023)
2023
-
[13]
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
S. Zhai et al. “Stabilizing Transformer Training by Preventing Attention Entropy Collapse”. In: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. Ed. by A. Krause et al. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 40770–40803
2023
-
[14]
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Z. Qiu et al. “Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free”. In: Advances in Neural Information Processing Systems 38 (NeurIPS 2025). Oral; Best Paper Award. 2025
2025
-
[15]
Efficient Streaming Language Models with Attention Sinks
G. Xiao et al. “Efficient Streaming Language Models with Attention Sinks”. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024
2024
-
[16]
GLU Variants Improve Transformer
N. Shazeer. “GLU Variants Improve Transformer”. In: arXiv preprint arXiv:2002.05202 (2020)
2020
-
[17]
Residual Gated Graph ConvNets
X. Bresson and T. Laurent. “Residual Gated Graph ConvNets”. In: arXiv preprint arXiv:1711.07553 (2017)
2017
-
[18]
Gated Graph Sequence Neural Networks
Y. Li et al. “Gated Graph Sequence Neural Networks”. In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. Ed. by Y. Bengio and Y. LeCun. 2016
2016
-
[19]
Rethinking Graph Transformers with Spectral Attention
D. Kreuzer et al. “Rethinking Graph Transformers with Spectral Attention”. In: Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual. Ed. by M. Ranzato et al. 2021, pp. 21618–21629
2021
-
[20]
Exphormer: Sparse Transformers for Graphs
H. Shirzad et al. “Exphormer: Sparse Transformers for Graphs”. In: arXiv preprint arXiv:2303.06147 (2023)
2023
-
[21]
On the Connection Between MPNN and Graph Transformer
C. Cai et al. “On the Connection Between MPNN and Graph Transformer”. In: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. Ed. by A. Krause et al. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 3408–3430
2023
-
[22]
DropEdge: Towards Deep Graph Convolutional Networks on Node Classification
Y. Rong et al. “DropEdge: Towards Deep Graph Convolutional Networks on Node Classification”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020
2020
-
[23]
PairNorm: Tackling Oversmoothing in GNNs
L. Zhao and L. Akoglu. “PairNorm: Tackling Oversmoothing in GNNs”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020
2020
-
[24]
DeeperGCN: All You Need to Train Deeper GCNs
G. Li et al. “DeeperGCN: All You Need to Train Deeper GCNs”. In: arXiv preprint arXiv:2006.07739 (2020)
2020
-
[25]
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
J. Ramapuram et al. “Theory, Analysis, and Best Practices for Sigmoid Self-Attention”. In: arXiv preprint arXiv:2409.04431 (2024)
2024
-
[26]
Layer Normalization
J. L. Ba, J. R. Kiros, and G. E. Hinton. “Layer Normalization”. In: arXiv preprint arXiv:1607.06450 (2016)
2016
-
[27]
Benchmarking graph neural networks
V. P. Dwivedi et al. “Benchmarking graph neural networks”. In: J. Mach. Learn. Res. 24.1 (Jan. 2023). ISSN: 1532-4435
2023
-
[28]
Open Graph Benchmark: Datasets for Machine Learning on Graphs
W. Hu et al. “Open Graph Benchmark: Datasets for Machine Learning on Graphs”. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by H. Larochelle et al. 2020
2020
-
[29]
Long Range Graph Benchmark
V. P. Dwivedi et al. “Long Range Graph Benchmark”. In: Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. Ed. by S. Koyejo et al. 2022
2022
-
[30]
Sign and Basis Invariant Networks for Spectral Graph Representation Learning
D. Lim et al. “Sign and Basis Invariant Networks for Spectral Graph Representation Learning”. In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023
2023
-
[31]
Graph Neural Networks with Learnable Structural and Positional Representations
V. P. Dwivedi et al. “Graph Neural Networks with Learnable Structural and Positional Representations”. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022
2022
-
[32]
Decoupled Weight Decay Regularization
I. Loshchilov and F. Hutter. “Decoupled Weight Decay Regularization”. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019
2019
-
[33]
Principal Neighbourhood Aggregation for Graph Nets
G. Corso et al. “Principal Neighbourhood Aggregation for Graph Nets”. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by H. Larochelle et al. 2020
2020
-
[34]
Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark
J. Tönshoff et al. “Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark”. In: Trans. Mach. Learn. Res. 2024 (2024)
2024
-
[35]
Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View
D. Chen et al. “Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View”. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intel…
2020
Per-layer gate statistics (excerpt)
…show the widest spread and highest fraction of near-zero gates, consistent with aggressive filtering of uninformative node pairs at the representation-building stage; the final layers (8–…) have intermediate behaviour. This pattern is consistent with functional specialization: the network learns where in the depth hierarchy to invest its filtering capacity.

Layer | Mean | Std | % < 0.1 | % > 0.9
1 | 0.71 | 0.13 | 2.4% | 4.1%
2 | 0.66 | 0.16 | 5.3% | 5.8%
3 | 0.62 | 0.18 | 8.1% | 7.2%
4 | 0.54 | 0.22 | 15.0% | 9.4%
5 | 0.51 | 0.23 | 17.8% | 10.1%
6 | 0.53 | 0.22 | 16.1% | 9.8%
7 | 0.55 | 0.21 | 14.2% | 9.6%
8 | 0.58 | 0.1… | … | …
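The per-layer summary above can be reproduced from raw gate values with a helper along these lines (a sketch; the aggregation over heads, nodes, and feature dimensions is our assumption):

```python
import numpy as np

def gate_stats(g):
    """Summary statistics for one layer's gate values g in [0, 1]
    (flattened over heads, nodes, features), matching the table's
    columns: mean, std, fraction < 0.1, fraction > 0.9."""
    g = np.asarray(g, dtype=float).ravel()
    return {
        "mean": g.mean(),
        "std": g.std(),
        "frac_lt_0.1": (g < 0.1).mean(),
        "frac_gt_0.9": (g > 0.9).mean(),
    }
```

A high `frac_lt_0.1` indicates a layer that aggressively zeroes attended features; `frac_gt_0.9` indicates near-ungated behaviour.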