Are transformers with one layer self-attention using low-rank weight matrices universal approximators? arXiv preprint arXiv:2307.14023

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Proves GD convergence to stationary point neighborhoods for general NN architectures beyond NTK via block-level analysis, analyticity, and local smoothness conditions.

Transformer Approximations from ReLUs

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

A recipe translates ReLU approximations to softmax attention with target-specific economic bounds for multiplication, reciprocal computation, and min/max primitives.

citing papers explorer

Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

cs.LG · 2026-06-22 · unverdicted · none · ref 111

Proves GD convergence to stationary point neighborhoods for general NN architectures beyond NTK via block-level analysis, analyticity, and local smoothness conditions.

Transformer Approximations from ReLUs

cs.LG · 2026-04-27 · unverdicted · none · ref 3

A recipe translates ReLU approximations to softmax attention with target-specific economic bounds for multiplication, reciprocal computation, and min/max primitives.
