A single-head softmax transformer with O(log(1/ε)) blocks and O(√(N/ε)) MLP width implements preconditioned Richardson iteration to achieve ε-accurate Gaussian KRR predictions on length-N prompts under bounded data.
Training dynamics of in-context learning in linear attention
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
fields
cs.LG 3years
2026 3representative citing papers
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
citing papers explorer
-
Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression
A single-head softmax transformer with O(log(1/ε)) blocks and O(√(N/ε)) MLP width implements preconditioned Richardson iteration to achieve ε-accurate Gaussian KRR predictions on length-N prompts under bounded data.
-
Learning to Adapt: In-Context Learning Beyond Stationarity
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
- Understanding LoRA as Knowledge Memory: An Empirical Analysis