pith. machine review for the scientific record.

arxiv: 2605.05176 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.NA · math.NA

Recognition: unknown

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

Alexander Hsu, Rongjie Lai, Wenjing Liao, Zhaiming Shen

Pith reviewed 2026-05-08 16:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.NA · math.NA
keywords in-context learning · transformers · nonlinear regression · attention mechanism · generalization bounds · polynomial features · spline bases · featurization

The pith

Transformers can use attention to construct nonlinear features like polynomials for in-context regression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs explicit transformer networks where attention interactions produce nonlinear feature maps such as polynomial or spline bases. These bases span wide function classes and support analysis of in-context nonlinear regression without weight updates. Finite-sample generalization error bounds are derived that scale with context length and training set size. This framework extends theoretical understanding of in-context learning beyond linear models by showing how transformers can internally generate the needed nonlinear representations.
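
To make the featurizer mechanism concrete, here is a minimal numerical sketch, not the paper's actual construction (whose weights, scalings, and multi-token handling we do not reproduce): a single-token linear-attention block whose query-key score is bilinear in the input, so one block squares the feature coordinate and stacked blocks double the degree.

```python
import numpy as np

def poly_attention_step(h):
    """One single-token linear-attention block (illustrative weights):
    h = [u, 1] -> [u**2, 1]. The score q @ k is bilinear in the token,
    which is the interaction the review credits with featurization."""
    W_Q = np.array([[1.0], [0.0]])  # query reads the feature coordinate u
    W_K = np.array([[1.0], [0.0]])  # key reads the same coordinate
    W_V = np.array([[0.0], [1.0]])  # value reads the constant coordinate
    q, k, v = h @ W_Q, h @ W_K, h @ W_V
    score = (q @ k).item()          # bilinear query-key interaction: u * u
    return np.array([score * v.item(), 1.0])

x = 1.7
h = np.array([x, 1.0])
for depth in (1, 2, 3):
    h = poly_attention_step(h)
    print(depth, h[0], x ** (2 ** depth))  # exact x**2, x**4, x**8
```

Depth doubling the reachable degree in this toy is at least consistent with the width-depth quasi-equivalence the paper's Figure 2 points to.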

Core claim

Through the interaction mechanism in attention, we explicitly construct transformer networks to realize nonlinear features, such as polynomial or spline bases, which span a wide class of functions. Based on this construction, we establish a framework to analyze end-to-end in-context nonlinear regression with the constructed features. Our theory provides finite-sample generalization error bounds in terms of context length and training set size.

What carries the argument

Attention mechanism used as a featurizer to explicitly realize nonlinear bases such as polynomials or splines via query-key-value interactions.

Load-bearing premise

Attention weights and interactions can be set exactly to realize the desired nonlinear bases without approximation error that would invalidate the generalization analysis.

What would settle it

A calculation showing that the explicit attention construction fails to output the target polynomial or spline features on a simple test input, or that measured generalization error exceeds the stated bounds for sufficiently large context length.
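
A hedged version of that settling calculation, under the idealized linear-attention reading of the construction (our stand-in weights, not the paper's exact ones); a softmax-based variant would be the more demanding test:

```python
import numpy as np

def attention_square(u):
    """Single-token linear-attention block (illustrative stand-in):
    query = key = u, value = 1, so the output score is exactly u * u."""
    q, k, v = u, u, 1.0
    return (q * k) * v

rng = np.random.default_rng(0)
xs = rng.uniform(-2.0, 2.0, size=100_000)
features = attention_square(attention_square(xs))    # two blocks -> x**4
err = np.max(np.abs(features - xs ** 4))
print(f"max |attention feature - x^4| = {err:.3e}")  # float rounding only
# Any deviation beyond floating-point rounding on some input would be the
# failing calculation the claim asks for; under this idealized linear
# reading there is none, so a counterexample must target the softmax
# version or the input-scaling assumptions instead.
```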

Figures

Figures reproduced from arXiv: 2605.05176 by Alexander Hsu, Rongjie Lai, Wenjing Liao, Zhaiming Shen.

Figure 1. Scaling results for different models trained and tested on synthetic regression tasks of degree d = 4 polynomials. Models use sum-based multi-head attention with n/8 heads (panel (b) thus uses a fixed 16 heads). Results are averaged over 3 seeds; error bars represent ±1 sd. view at source ↗
Figure 2. Our constructive theory also suggests a quasi-equivalence between width and depth. view at source ↗
Figure 2. Scaling results on polynomial regression tasks using 4 attention heads per block (except for the final linear block in the theory model). (a) Test loss vs. context length (n) with fixed L = 32000. (b) Test loss vs. training set size (L) with fixed n = 128. view at source ↗
Figure 3. Scaling results on polynomial regression tasks using 16 blocks with 1 attention head each. view at source ↗
Figure 4. Scaling results on polynomial regression tasks using a single attention head per block. (a) Test loss vs. context length (n) with fixed L = 32000. (b) Test loss vs. training set size (L) with fixed n = 128. view at source ↗
Figure 5. Scaling results on polynomial regression tasks with no feedforward component, i.e. an attention-only transformer. view at source ↗
Figure 6. Scaling results for regression of piecewise linear functions (linear splines), with the domain split by 5 equally spaced knots. All models used two blocks and no feedforward components. We used n/8 attention heads per block, except the theory model, which uses a one-headed linear attention block after an n/8-headed ReLU attention block. view at source ↗
read the original abstract

Pre-trained transformers are able to learn from examples provided as part of the prompt without any weight updates, a remarkable ability known as in-context learning (ICL). Despite its demonstrated efficacy across various domains, the theoretical understanding of ICL is still developing. Whereas most existing theory has focused on linear models, we study ICL in the nonlinear regression setting. Through the interaction mechanism in attention, we explicitly construct transformer networks to realize nonlinear features, such as polynomial or spline bases, which span a wide class of functions. Based on this construction, we establish a framework to analyze end-to-end in-context nonlinear regression with the constructed features. Our theory provides finite-sample generalization error bounds in terms of context length and training set size. We numerically validate the theory on synthetic regression tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that transformers can be explicitly constructed via attention interactions to realize nonlinear feature maps such as polynomial or spline bases. Using this construction, the authors develop a framework for end-to-end in-context nonlinear regression and derive finite-sample generalization error bounds depending on context length and training set size, with numerical validation on synthetic regression tasks.

Significance. If the construction achieves exact nonlinear features with zero approximation error and the bounds are derived without hidden dependencies, the work would meaningfully extend ICL theory beyond linear models by providing a concrete featurization mechanism and quantifiable finite-sample guarantees. The explicit construction and numerical experiments are strengths that could guide future analyses of nonlinear ICL.

major comments (2)
  1. [§3] §3 (Construction): The explicit realization of exact polynomial/spline bases via query-key interactions and softmax assumes specific weight settings and input scalings that produce zero approximation error. The finite-sample bounds in §4 do not include additional terms for residual error from the continuous attention mechanism or input distribution mismatches, making the headline bounds conditional on an idealized regime not fully justified in the derivation (a toy calculation of this softmax residual follows these major comments).
  2. [§4] §4 (Generalization Bounds): The bounds are expressed in terms of context length and training-set size, but the derivation does not clarify whether these quantities enter independently or through constants tied to the constructed features. This leaves open the possibility of circularity, where the bound tightness depends on the very features whose construction is being analyzed.
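
To illustrate major comment 1, a toy calculation (ours, not the paper's construction): recovering x² through softmax attention weights via a first-order expansion of the sigmoid leaves a residual that decays like β² but is nonzero at every finite inverse temperature β.

```python
import numpy as np

def softmax_square(x, beta):
    """Two tokens with scores [beta * x**2, 0]: the attention weight on
    token 1 is sigmoid(beta * x**2); inverting the first-order expansion
    sigmoid(b) ~ 1/2 + b/4 estimates x**2 with an O(beta**2) residual."""
    w = np.exp(beta * x**2) / (np.exp(beta * x**2) + 1.0)
    return (w - 0.5) * 4.0 / beta

xs = np.linspace(-1.0, 1.0, 1001)
for beta in (1.0, 0.1, 0.01):
    resid = np.max(np.abs(softmax_square(xs, beta) - xs**2))
    print(f"beta={beta:5.2f}  max residual = {resid:.2e}")
# Residual ~ beta**2 * x**6 / 12: exactness needs a limit, an exact
# cancellation, or a residual term carried explicitly into the bounds.
```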
minor comments (2)
  1. [Abstract] The abstract states that bounds depend on context length and training set size but does not indicate the functional form (e.g., rates or logarithmic factors); adding this detail would improve precision.
  2. [Experiments] The numerical validation section lacks reported error bars, the number of independent trials, and quantitative metrics comparing empirical error to the theoretical bound; including these would make the experiments more reproducible and convincing.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and insightful comments on our work. We address each major comment point by point below, providing clarifications on the construction and bounds while committing to revisions that improve the manuscript's rigor and transparency without altering the core contributions.

read point-by-point responses
  1. Referee: [§3] §3 (Construction): The explicit realization of exact polynomial/spline bases via query-key interactions and softmax assumes specific weight settings and input scalings that produce zero approximation error. The finite-sample bounds in §4 do not include additional terms for residual error from the continuous attention mechanism or input distribution mismatches, making the headline bounds conditional on an idealized regime not fully justified in the derivation.

    Authors: We appreciate the referee's observation. Section 3 presents an explicit construction where specific choices of query/key weights, value projections, and input scalings (e.g., via appropriate normalization) allow the softmax attention to exactly recover polynomial or spline bases with zero approximation error. This is a deliberate theoretical device to isolate the featurization power of attention, analogous to how linear ICL analyses assume exact linear heads. The bounds in §4 are derived precisely under this zero-error regime, so no residual terms appear by construction. We agree that the manuscript would benefit from greater explicitness here. We will revise §3 to include a dedicated remark stating the exact conditions for zero error and add a short discussion in §4 noting that the bounds are conditional on this idealized attention realization (with a forward reference to potential approximation errors in non-constructed settings). This constitutes a partial revision focused on clarity. revision: partial

  2. Referee: [§4] §4 (Generalization Bounds): The bounds are expressed in terms of context length and training-set size, but the derivation does not clarify whether these quantities enter independently or through constants tied to the constructed features. This leaves open the possibility of circularity, where the bound tightness depends on the very features whose construction is being analyzed.

    Authors: We thank the referee for raising this potential circularity concern. The derivation in §4 treats the nonlinear feature map as fixed once constructed in §3; the context length n (in-context examples) and training set size m enter the finite-sample bounds independently via standard tools such as uniform convergence over the function class spanned by the fixed features (e.g., via covering numbers or Rademacher complexity that scale with n and m). The constants in the bounds depend on intrinsic properties of the constructed features (dimension, boundedness), but these are determined solely by the transformer weights and are independent of the ICL data or the regression task being solved. There is thus no circular dependence on the features being “analyzed” by the ICL process itself. We will revise the theorem statements and proof sketches in §4 to explicitly separate the fixed feature class from the sample-size terms, adding a clarifying sentence after each bound to remove any ambiguity. revision: yes
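
For orientation, the generic shape of the uniform-convergence tool invoked here, a standard Rademacher-complexity bound for losses in [0, 1]; the paper's actual theorem, its rates in context length n and training set size, and its constants are specific to the constructed feature class:

```latex
% Standard two-sided Rademacher bound (not the paper's theorem):
% with probability at least 1 - \delta over m training samples,
\[
  \sup_{f \in \mathcal{F}}
    \bigl|\, R(f) - \widehat{R}_m(f) \,\bigr|
  \;\le\;
  2\, \mathfrak{R}_m(\ell \circ \mathcal{F})
  \;+\;
  \sqrt{\frac{\log(2/\delta)}{2m}} .
\]
% Once the feature map is fixed by the construction, \mathfrak{R}_m
% depends only on that fixed class (dimension, boundedness), which is
% the separation the rebuttal relies on to rule out circularity.
```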

Circularity Check

0 steps flagged

No significant circularity; explicit construction and bounds are independently derived

full rationale

The paper explicitly constructs transformer networks via attention interactions to realize exact nonlinear features (polynomials or splines) that span a function class, then uses this construction as the basis for a separate framework analyzing end-to-end ICL nonlinear regression. Finite-sample generalization bounds are derived in terms of external parameters (context length and training set size), which enter as independent variables rather than being fitted to or defined by the realized features. No self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation chain appears; the derivation remains self-contained against the stated construction and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that attention can be configured to realize exact nonlinear bases; this is treated as a domain assumption rather than derived from first principles or external data.

axioms (1)
  • domain assumption — Attention mechanisms can be configured to realize polynomial or spline bases exactly.
    Invoked in the abstract as the foundation for the transformer construction and subsequent bounds.

pith-pipeline@v0.9.0 · 5437 in / 1195 out tokens · 39955 ms · 2026-05-08T16:23:07.414953+00:00 · methodology

discussion (0)

