Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077

2020 · arXiv 1912.10077

8 Pith papers cite this work.

representative citing papers

Gating Enables Curvature: A Geometric Expressivity Gap in Attention

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Gated attention enables non-flat and positively curved geometries in the Fisher-Rao manifold of representations that ungated attention cannot achieve.

The Variance Brain Foundation Models Forgot: Third-Order Statistics Predict Cognition Where Billion-Parameter Models Fail

q-bio.NC · 2026-05-29 · unverdicted · novelty 7.0

Third-order co-skewness in fMRI is destroyed by brain foundation model (BFM) pretraining, causing poor cognition prediction; a co-skewness-preserving linear FC outperforms both BFMs and raw FC.

A generative pre-trained transformer with Kerr-soliton attention

physics.optics · 2026-05-22 · unverdicted · novelty 7.0

Kerr-soliton attention realizes transformer attention in physical hardware via Kerr solitons in a resonator, with analytic training and experimental inference showing high-fidelity agreement between hardware and model.

How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models

stat.ML · 2026-05-07 · conditional · novelty 7.0

Attention pooling produces a free-multiplicative-convolution bulk spectrum and two phase transitions for signal recovery; optimal weights are the top eigenvector of the positional correlation matrix R.

Continuous transformations of probability measures and their transport representations

math.FA · 2026-04-17 · unverdicted · novelty 7.0

Lipschitz continuous transformations F of probability measures w.r.t. Wasserstein distance admit continuous transport maps f(·,μ) such that F(μ) = f(·,μ)_# μ.

Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers

cs.LG · 2025-10-27 · unverdicted · novelty 7.0

One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

cs.LG · 2025-08-22 · unverdicted · novelty 6.0

In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.

Progressive Approximation in Deep Residual Networks: Theory and Validation

cs.LG · 2026-04-27 · unverdicted · novelty 5.0

Residual networks admit progressive approximation trajectories with monotonically decreasing error, enabling useful predictions from any depth after a single training run via the LPA principle.

