arXiv preprint arXiv:2402.14951
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (2)
years: 2026 (2)
representative citing papers
Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer. The construction is obtained by supervised training of a single attention layer that is then applied recurrently, and it comes with convergence and out-of-distribution (OOD) guarantees.
citing papers explorer
- Looped Transformers with Layer Normalization Provably Learn the Power Method
  Looped linear transformers with layer normalization (LN), trained by gradient descent (GD), provably converge to an implementation of the power method for principal component prediction.
- Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
  Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer. The construction is obtained by supervised training of a single attention layer that is then applied recurrently, and it comes with convergence and out-of-distribution (OOD) guarantees.
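
For readers unfamiliar with the two classical algorithms named in the summaries above, the following is a minimal plain-Python sketch of each. It is not taken from either paper and does not reproduce their transformer constructions. The function names (`power_method`, `normalized_gd_logistic`), the step counts, and the step size are illustrative assumptions. The point of the sketch is the shared structure: both procedures repeat one update followed by a normalization, which may be why a single layer applied in a loop is a natural fit.

```python
import math


def matvec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def normalize(v):
    """Rescale a nonzero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def power_method(a, num_steps=100):
    """Approximate the top eigenvector of a symmetric PSD matrix `a`."""
    v = normalize([1.0] * len(a))  # arbitrary nonzero starting vector
    for _ in range(num_steps):
        v = normalize(matvec(a, v))  # multiply, then renormalize
    return v


def normalized_gd_logistic(xs, ys, num_steps=200, step_size=0.1):
    """Normalized gradient descent on the mean logistic loss.

    Labels in `ys` are assumed to be -1 or +1. Each update moves a fixed
    distance `step_size` along the negative gradient direction.
    """
    w = [0.0] * len(xs[0])
    for _ in range(num_steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            margin = y * sum(w_i * x_i for w_i, x_i in zip(w, x))
            # Gradient of log(1 + exp(-margin)) with respect to w is coeff * x.
            coeff = -y / (1.0 + math.exp(margin))
            grad = [g + coeff * x_i / len(xs) for g, x_i in zip(grad, x)]
        grad_norm = math.sqrt(sum(g * g for g in grad))
        if grad_norm == 0.0:
            break  # already at a stationary point
        w = [w_i - step_size * g / grad_norm for w_i, g in zip(w, grad)]
    return w
```

A possible usage, on toy data chosen only for illustration: `power_method([[2.0, 0.0], [0.0, 1.0]])` returns a unit vector close to the first coordinate axis, and `normalized_gd_logistic` on linearly separable points returns a weight vector whose sign agrees with every label.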