pith. sign in

arxiv: 2607.00479 · v1 · pith:B2LQU7P7new · submitted 2026-07-01 · 💻 cs.LG · stat.ML

Ghost in the Kernel: In-Context Learning with Efficient Transformers via Domain Generalization

Pith reviewed 2026-07-02 16:18 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords linear transformersin-context learningdomain generalizationgeneralization boundsefficient attentiontransformer linearizationactivation design
0
0 comments X

The pith

Linear transformers perform in-context learning by mapping context distributions to response functions under domain generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes linear transformers, which reduce attention complexity from quadratic to linear in context length, through the lens of domain generalization with a two-staged sampling process. It frames in-context learning as the model learning a mapping from distributions over contexts to response functions, and derives generalization bounds that hold independently of dimension while exposing a tradeoff in regularity between data distributions and latent features. This view also informs new choices for activations and losses when converting pretrained softmax-based large language models into linear form. A reader would care because the work supplies a concrete theoretical account of how efficient transformers can still adapt to new tasks from context alone, without parameter updates or quadratic costs.

Core claim

Linear transformers perform in-context learning as learning a mapping from context distributions to response functions. A dimension-independent convergence rate is obtained for our generalization analysis, which also exhibits the tradeoff between the regularities of data distributions and latent features. Guided by our theoretical framework, we propose a new perspective on activation and loss design for linearizing pretrained softmax large language models.

What carries the argument

The two-staged sampling process from domain generalization, used to analyze the feature mapping inside linear attention.

If this is right

  • Linear transformers achieve in-context learning with generalization rates independent of dimension.
  • Convergence rates reflect a tradeoff between regularity of the data distributions and regularity of the latent features.
  • Activation and loss functions can be redesigned to convert pretrained softmax transformers into linear versions while preserving in-context capability.
  • The domain-generalization framing supplies a route to theoretical guarantees for other efficient attention variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage sampling lens could be applied to analyze kernel approximations in other linear or sparse attention mechanisms.
  • If the mapping interpretation is accurate, one could test whether real-world context distributions in language tasks exhibit the regularity levels needed for the predicted rates.
  • The tradeoff between distribution and feature regularity suggests a practical knob for choosing regularization strength when training linear transformers on heterogeneous data.

Load-bearing premise

The two-staged sampling process from domain generalization accurately captures the mechanism of in-context learning in linear transformers.

What would settle it

An empirical measurement showing that the generalization error rate of a linear transformer on in-context tasks depends on input dimension, or that the learned mapping fails to align with the response functions predicted by the two-stage model.

Figures

Figures reproduced from arXiv: 2607.00479 by Ding-Xuan Zhou, Peilin Liu.

Figure 1
Figure 1. Figure 1: Fast Eigendecay of Qwen3-8B (Ghost in the Kernel) [PITH_FULL_IMAGE:figures/full_fig_p013_1.png] view at source ↗
read the original abstract

Transformer-based large models have demonstrated remarkable generalization abilities across different tasks by leveraging a context-aware attention module for in-context learning. With richer context, transformers adapt more effectively to the current use case without any parameter updates. However, the quadratic computational and memory complexity with respect to context length significantly slows data processing in softmax transformers. Linear transformers were proposed to address this issue by reducing the complexity to linear dependence on context length, but the design and understanding of the feature mapping in linear attention, from a theoretical viewpoint, remain unclear. In this paper, we investigate the approximation and generalization abilities of linear transformers under a two-staged sampling process from domain generalization. We show that linear transformers perform in-context learning as learning a mapping from context distributions to response functions. A dimension-independent convergence rate is obtained for our generalization analysis, which also exhibits the tradeoff between the regularities of data distributions and latent features. Guided by our theoretical framework, we propose a new perspective on activation and loss design for linearizing pretrained softmax large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript frames linear transformers' in-context learning as learning a mapping from context distributions to response functions under a two-staged sampling process drawn from domain generalization. It derives a dimension-independent convergence rate for the generalization analysis, identifies a regularity tradeoff between data distributions and latent features, and uses the framework to recommend new activation and loss designs for linearizing pretrained softmax LLMs.

Significance. If the two-staged sampling model is valid, the dimension-independent rate and the resulting activation/loss perspective would offer a useful theoretical bridge between domain generalization and efficient transformer design, with potential impact on practical linear attention implementations.

major comments (2)
  1. [Section 3 (two-staged sampling definition) and the statements of the main generalization theorem] The central modeling assumption—that the two-staged sampling process (latent features drawn first, then context conditioned on them) reproduces the distribution of context-response pairs arising in standard ICL—is invoked to obtain both the approximation and generalization bounds as well as the activation/loss recommendations. No justification, comparison to fixed-task ICL sampling, or empirical check is supplied, rendering the derived rates and design guidance dependent on an unverified modeling choice.
  2. [Main generalization theorem (the convergence-rate statement)] The dimension-independent convergence rate is stated to hold under the regularity tradeoff; however, the precise dependence on the regularity parameters of the data distribution versus the latent features is not made explicit in the bound statement, so it is unclear whether the rate remains dimension-free once those parameters are allowed to vary with dimension.
minor comments (1)
  1. [Preliminaries] Notation for the response function and the induced mapping from context distributions is introduced without an explicit comparison table to the standard attention formulation; a short side-by-side would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Section 3 (two-staged sampling definition) and the statements of the main generalization theorem] The central modeling assumption—that the two-staged sampling process (latent features drawn first, then context conditioned on them) reproduces the distribution of context-response pairs arising in standard ICL—is invoked to obtain both the approximation and generalization bounds as well as the activation/loss recommendations. No justification, comparison to fixed-task ICL sampling, or empirical check is supplied, rendering the derived rates and design guidance dependent on an unverified modeling choice.

    Authors: The two-staged sampling is introduced to connect the ICL setting to the domain generalization literature, where a similar hierarchical process models distribution shifts between contexts. We agree that the manuscript would benefit from explicit motivation. In revision we will expand Section 3 with a dedicated paragraph motivating the choice, citing relevant domain-generalization works that employ analogous two-stage sampling, and providing a brief comparison to the fixed-task ICL sampling used in prior transformer analyses. As the contribution is primarily theoretical, we will flag empirical validation of the modeling assumption as future work rather than adding new experiments. revision: partial

  2. Referee: [Main generalization theorem (the convergence-rate statement)] The dimension-independent convergence rate is stated to hold under the regularity tradeoff; however, the precise dependence on the regularity parameters of the data distribution versus the latent features is not made explicit in the bound statement, so it is unclear whether the rate remains dimension-free once those parameters are allowed to vary with dimension.

    Authors: We agree that the dependence should be stated explicitly. The dimension-free rate holds when the regularity parameters (smoothness, boundedness, etc.) of both the data distributions and the latent features remain independent of dimension and satisfy the stated tradeoff. We will revise the main theorem statement to include this explicit condition, thereby clarifying that the convergence rate is dimension-independent precisely under the assumption that these parameters do not scale with dimension. revision: yes

Circularity Check

0 steps flagged

No circularity: modeling choice followed by independent analysis

full rationale

The paper adopts a two-staged sampling process drawn from domain generalization as the framework for analyzing linear transformer ICL, then derives a mapping from context distributions to response functions plus dimension-independent convergence rates under that model. No equations are supplied in the abstract or visible text that reduce the claimed mapping or rates to fitted parameters or self-definitions by construction. No self-citation chains, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation appear in the provided material. The derivation therefore remains self-contained against external benchmarks once the modeling assumption is granted; the skeptic concern targets the realism of the assumption itself rather than any internal reduction of the claimed results to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5706 in / 1223 out tokens · 55468 ms · 2026-07-02T16:18:27.227640+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 6 canonical work pages · 5 internal anchors

  1. [1]

    Akyürek, D

    E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou. What learning algorithm is in- context learning? investigations with linear models. In The Eleventh International Conference on Learning Representations , 2023

  2. [2]

    F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017

  3. [3]

    A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory , 39(3):930–945, 1993

  4. [4]

    Blanchard, G

    G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems , volume 24. Curran Associates, Inc., 2011

  5. [5]

    Blanchard, A

    G. Blanchard, A. A. Deshmukh, U. Dogan, G. Lee, and C. Scott. Domain generalization by marginal transfer learning. Journal of machine learning research , 22(2):1–55, 2021

  6. [6]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020

  7. [7]

    K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations , 2021

  8. [8]

    Christmann and I

    A. Christmann and I. Steinwart. Universal kernels on non-standard input spaces. In Advances in Neural Information Processing Systems , volume 23. Curran Associates, Inc., 2010

  9. [9]

    Cucker and D

    F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint , volume 24. Cambridge University Press, 2007

  10. [10]

    De Ryck, S

    T. De Ryck, S. Lanthaler, and S. Mishra. On the approximation of functions by tanh neural networks. Neural Networks , 143:732–750, 2021

  11. [11]

    De Ryck, A

    T. De Ryck, A. D. Jagtap, and S. Mishra. Error estimates for physics-informed neural networks approximating the navier–stokes equations. IMA Journal of Numerical Analysis , page drac085, 2023

  12. [12]

    Devlin, M.-W

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages 4171–4186, 2019

  13. [13]

    G. E. Fasshauer, F. J. Hickernell, and H. Woźniakowski. On dimension-independent rates of convergence for function approximation with gaussian kernels. SIAM Journal on Numerical Analysis, 50(1):247–271, 2012

  14. [14]

    Furuya, M

    T. Furuya, M. V. de Hoop, and G. Peyré. Transformers are universal in-context learners. In The Thirteenth International Conference on Learning Representations , 2025

  15. [15]

    S. Garg, D. Tsipras, P. S. Liang, and G. Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems , 35: 30583–30598, 2022. 50

  16. [16]

    Gu and T

    A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling , 2024

  17. [17]

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  18. [18]

    Gaussian Error Linear Units (GELUs)

    D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  19. [19]

    C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm, 2019

  20. [20]

    Katharopoulos, A

    A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are rnns: Fast autore- gressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning , pages 5156–5165. PMLR, 2020

  21. [21]

    Y. Korolev. Two-layer neural networks with values in a banach space. SIAM Journal on Mathematical Analysis, 54(6):6358–6389, 2022

  22. [22]

    Ledoux and M

    M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991. ISBN 978-3-642-20211-7 978-3-642-20212-4

  23. [23]

    Liu and D.-X

    P. Liu and D.-X. Zhou. Generalization analysis of transformers in distribution regression. Neural Computation, 37(2):260–293, 2025

  24. [24]

    C. Ma, R. Pathak, and M. J. Wainwright. Optimally tackling covariate shift in rkhs-based nonparametric regression. The Annals of Statistics , 51(2):738–761, 2023

  25. [25]

    A. Maurer. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory , pages 3–17. Springer, 2016

  26. [26]

    Maurer and M

    A. Maurer and M. Pontil. Concentration inequalities under sub-gaussian and sub-exponential conditions. In Advances in Neural Information Processing Systems, volume 34, pages 7588–7597. Curran Associates, Inc., 2021

  27. [27]

    S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. Re- thinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages 11048–11064, 2022

  28. [28]

    N. Mücke. Stochastic gradient descent meets distribution regression. In International Confer- ence on Artificial Intelligence and Statistics , pages 2143–2151. PMLR, 2021

  29. [29]

    Nguyen and N

    M. Nguyen and N. Mücke. Optimal convergence rates for neural operators. arXiv preprint arXiv:2412.17518, 2024

  30. [30]

    Novak and H

    E. Novak and H. Woźniakowski. Tractability of Multivariate Problems. 1: Linear Information . Number 6. European Mathematical Soc, Zürich, 2008. ISBN 978-3-03719-026-5

  31. [31]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 4195–4205, 2023

  32. [32]

    Z. Qin, X. Han, W. Sun, D. Li, L. Kong, N. Barnes, and Y. Zhong. The devil in linear transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7025–7041, 2022. 51

  33. [33]

    Searching for Activation Functions

    P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017

  34. [34]

    C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning . MIT Press, Cambridge, Mass, 2006. ISBN 978-0-262-18253-9

  35. [35]

    A. Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics, volume 4, pages 547–562. University of California Press, 1961

  36. [36]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

  37. [37]

    N. Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 , 2020

  38. [38]

    Z. Shen, A. Hsu, R. Lai, and W. Liao. Understanding in-context learning on structured mani- folds: Bridging attention to kernel methods. arXiv preprint arXiv:2506.10959 , 2025

  39. [39]

    J. W. Siegel. Optimal approximation of zonoids and uniform approximation by shallow neural networks. Constructive Approximation, pages 1–29, 2025

  40. [40]

    J. W. Siegel and J. Xu. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks. Foundations of Computational Mathematics , 24(2):481–537, 2024

  41. [41]

    Song and S

    Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems , volume 32. Curran Associates, Inc., 2019

  42. [42]

    B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517–1561, 2010

  43. [43]

    Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621 , 2023

  44. [44]

    Y.-H. H. Tsai, S. Bai, M. Yamada, L.-P. Morency, and R. Salakhutdinov. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , page...

  45. [45]

    van Erven and P

    T. van Erven and P. Harremoës. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory , 60(7):3797–3820, 2014

  46. [46]

    Vaswani, N

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017

  47. [47]

    M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint . Cambridge University Press, 1 edition, Feb. 2019. ISBN 978-1-108-62777-1 978-1-108-49802-9

  48. [48]

    S. M. Xie, A. Raghunathan, P. Liang, and T. Ma. An explanation of in-context learning as implicit bayesian inference. In International Conference on Learning Representations , 2022. 52

  49. [49]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

  50. [50]

    S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim. Gated linear attention transformers with hardware-efficient training. In Forty-first International Conference on Machine Learning , 2024

  51. [51]

    Yang and D.-X

    Y. Yang and D.-X. Zhou. Optimal rates of approximation by shallow relu$$^k$$neural networks and applications to nonparametric regression. Constructive Approximation, 2024

  52. [52]

    Zaheer, G

    M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems , volume 33, pages 17283–17297. Curran Associates, Inc., 2020

  53. [53]

    Zhang and R

    B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in neural infor- mation processing systems , 32, 2019

  54. [54]

    Zhang, K

    M. Zhang, K. Bhatia, H. Kumbong, and C. Re. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. In The Twelfth International Conference on Learning Representations, 2024

  55. [55]

    Zhang, S

    R. Zhang, S. Frei, and P. L. Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research , 25(49):1–55, 2024

  56. [56]

    D.-X. Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics , 220(1):456–463, 2008. 53