Are transformers with one layer self-attention using low-rank weight matrices universal approximators? arXiv preprint arXiv:2307.14023
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
A recipe translates ReLU approximations to softmax attention with target-specific economic bounds for multiplication, reciprocal computation, and min/max primitives.
citing papers explorer
- Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime
  Proves GD convergence to stationary point neighborhoods for general NN architectures beyond NTK via block-level analysis, analyticity, and local smoothness conditions.
- Transformer Approximations from ReLUs
  A recipe translates ReLU approximations to softmax attention with target-specific economic bounds for multiplication, reciprocal computation, and min/max primitives.