
arxiv: 2211.15661 · v3 · pith:PFZHD3TF · submitted 2022-11-28 · 💻 cs.LG · cs.CL

What learning algorithm is in-context learning? Investigations with linear models

Pith reviewed 2026-05-17 13:34 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords in-context learning · transformers · linear regression · gradient descent · ridge regression · Bayesian estimation · implicit algorithms

The pith

Transformers implement gradient descent and ridge regression implicitly when doing in-context learning on linear tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether in-context learning works by having transformers run familiar estimation procedures on the examples shown in the prompt. Using linear regression as a test case, the authors first construct transformers that carry out gradient descent or closed-form ridge regression. They then train transformers on sequences of input-output pairs and find that the resulting predictors closely track those produced by gradient descent, ridge regression, or exact least squares, with the choice of algorithm shifting according to network depth and noise level. For sufficiently wide and deep models the behavior further approaches that of a Bayesian estimator. Late layers of the trained networks appear to store weight vectors and second-moment matrices in a non-linear fashion.
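The three reference procedures named above can be written down in a few lines. The following is a minimal sketch, not the authors' released code: the task generator, the function names (`make_task`, `gd_weights`, `ridge_weights`, `ols_weights`), and the default step size and penalty are all assumptions chosen for illustration.

```python
import numpy as np

def make_task(n=20, d=5, noise=0.1, seed=0):
    """Sample one synthetic linear task: n in-context pairs (x, y) in d dimensions."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w + noise * rng.normal(size=n)
    return X, y, w

def gd_weights(X, y, lr=0.01, steps=1):
    """Gradient descent on 0.5 * ||Xw - y||^2, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y)
        w = w - lr * grad
    return w

def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge regression: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ols_weights(X, y):
    """Exact least squares (minimum-norm solution if X is rank deficient)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

X, y, w_true = make_task()
x_query = np.ones(X.shape[1])  # an arbitrary query point
for name, w_hat in [
    ("gd (1 step)", gd_weights(X, y)),
    ("ridge", ridge_weights(X, y)),
    ("least squares", ols_weights(X, y)),
]:
    print(name, float(x_query @ w_hat))
```

Comparing a trained in-context learner with these baselines would mean feeding it the same `(X, y)` pairs as a prompt and checking which of the three predictions its output at `x_query` is closest to.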

Core claim

Trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths.

What carries the argument

Implicit linear models encoded in transformer activations that are updated by new labeled examples appearing in the context.

If this is right

  • Transformers can be explicitly constructed to run gradient descent or closed-form ridge regression on linear models.
  • Trained in-context models reproduce the outputs of gradient descent, ridge regression, and exact least squares on held-out points.
  • The effective algorithm changes with network depth and with the noise level in the training examples.
  • Late layers of trained transformers non-linearly encode weight vectors and moment matrices.
  • Sufficiently wide and deep models converge to Bayesian estimators.
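The ridge and Bayesian items in this list are closely related in the textbook Gaussian setting. As a sketch under assumptions not stated on this page (an isotropic Gaussian prior with variance $\tau^2$ on the weights and Gaussian label noise with variance $\sigma^2$), the posterior mean has the same form as the ridge estimator:

```latex
\begin{aligned}
y &= Xw + \varepsilon, \qquad w \sim \mathcal{N}(0, \tau^2 I), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I) \\
\mathbb{E}[w \mid X, y] &= \left(X^\top X + \tfrac{\sigma^2}{\tau^2} I\right)^{-1} X^\top y \\
\hat{w}_{\mathrm{ridge}}(\lambda) &= \left(X^\top X + \lambda I\right)^{-1} X^\top y
\end{aligned}
```

Under these assumptions the two coincide when $\lambda = \sigma^2 / \tau^2$, and as $\sigma \to 0$ with $X^\top X$ invertible both reduce to exact least squares. This is one way to read the reported dependence on noise level.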

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the same algorithmic alignment appears outside the linear setting, in-context learning may amount to rediscovery of classical estimators rather than invention of new ones.
  • Prompt design could be guided by choosing examples that steer an implicit ridge or least-squares procedure toward a desired bias-variance trade-off.
  • Measuring whether late-layer activations continue to track weight vectors on non-linear tasks would test the scope of the linear-model analogy.
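The last item above describes a probing experiment. A minimal sketch of what such a probe could look like follows. The "activations" `H` here are a hypothetical stand-in (a fixed non-linear function of each task's weight vector plus noise), not activations from any trained transformer, and the cubic feature expansion is just one arbitrary choice of non-linear probe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks, d, hidden = 500, 3, 16

# Hypothetical stand-in for late-layer activations: each task's weight
# vector is pushed through a fixed random projection and a tanh.
W_true = rng.normal(size=(n_tasks, d))
A = rng.normal(size=(d, hidden)) / np.sqrt(d)
H = np.tanh(W_true @ A) + 0.01 * rng.normal(size=(n_tasks, hidden))

def with_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_probe(features, targets):
    """Least-squares probe mapping features to the target weight vectors."""
    coef, *_ = np.linalg.lstsq(with_bias(features), targets, rcond=None)
    return coef

def probe_mse(features, targets, coef):
    """Mean squared error of a fitted probe."""
    return float(np.mean((with_bias(features) @ coef - targets) ** 2))

train, test = slice(0, 400), slice(400, None)
linear_feats = H                       # linear probe reads H directly
cubic_feats = np.hstack([H, H ** 3])   # simple non-linear probe features

lin_coef = fit_probe(linear_feats[train], W_true[train])
cub_coef = fit_probe(cubic_feats[train], W_true[train])

print("linear probe held-out MSE:", probe_mse(linear_feats[test], W_true[test], lin_coef))
print("cubic probe held-out MSE: ", probe_mse(cubic_feats[test], W_true[test], cub_coef))
```

If a non-linear probe recovers the weight vector markedly better than a linear one on held-out tasks, that would be consistent with a non-linear encoding. With real activations, the same comparison would need controls for probe capacity.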

Load-bearing premise

Results obtained on linear regression will carry over to the non-linear tasks that dominate real language-model in-context learning.

What would settle it

A controlled experiment in which trained transformers produce predictions on linear tasks that deviate consistently from every standard regression algorithm even after training converges.
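One way to make "deviate consistently from every standard regression algorithm" measurable is to average the squared difference between a learner's prediction and each reference predictor's prediction over many sampled tasks. The sketch below assumes that metric. The `learner` function is a placeholder (here simply ridge with a fixed penalty) standing in for a trained transformer, and all names and constants are illustrative.

```python
import numpy as np

def ridge_predict(X, y, x_query, lam):
    """Prediction at x_query from closed-form ridge; lam = 0 gives least squares."""
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return float(x_query @ w)

def mean_sq_disagreement(learner, reference, tasks):
    """Average squared gap between two predictors over a list of tasks."""
    diffs = [(learner(X, y, xq) - reference(X, y, xq)) ** 2 for X, y, xq in tasks]
    return float(np.mean(diffs))

rng = np.random.default_rng(0)
tasks = []
for _ in range(200):
    w = rng.normal(size=4)                     # unit-variance prior on weights
    X = rng.normal(size=(10, 4))
    y = X @ w + 0.5 * rng.normal(size=10)      # noise variance 0.25
    tasks.append((X, y, rng.normal(size=4)))

# Placeholder for a trained in-context learner (an assumption, not a real model).
def learner(X, y, xq):
    return ridge_predict(X, y, xq, lam=0.25)

references = {
    "least squares": lambda X, y, xq: ridge_predict(X, y, xq, lam=0.0),
    "ridge lam=0.25": lambda X, y, xq: ridge_predict(X, y, xq, lam=0.25),
    "ridge lam=5": lambda X, y, xq: ridge_predict(X, y, xq, lam=5.0),
}
scores = {name: mean_sq_disagreement(learner, ref, tasks) for name, ref in references.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

A learner that stayed far from every reference under this kind of score, after training had converged, would be the counter-evidence described above.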

Original abstract

Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples $(x, f(x))$ presented in the input without further parameter updates. We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context. Using linear regression as a prototypical problem, we offer three sources of evidence for this hypothesis. First, we prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. Second, we show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths. Third, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners' late layers non-linearly encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms, and that (at least in the linear case) learners may rediscover standard estimation algorithms. Code and reference implementations are released at https://github.com/ekinakyurek/google-research/blob/master/incontext.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript investigates the hypothesis that transformer-based in-context learners implement standard learning algorithms (gradient descent, ridge regression, least-squares) implicitly for linear regression by encoding and updating smaller models in their activations. It offers three lines of evidence: explicit constructions proving transformers can realize GD and closed-form ridge regression; experiments demonstrating that trained in-context learners produce predictors that closely match those of GD, ridge regression, and exact least-squares (with transitions across regimes of depth and noise, and convergence to Bayesian estimators at large width/depth); and preliminary evidence that late layers non-linearly encode weight vectors and moment matrices.

Significance. If the central results hold, the work supplies a concrete algorithmic account of in-context learning in the linear setting and demonstrates that trained transformers can rediscover classical estimators. Notable strengths include the parameter-free constructions, quantitative predictor matches backed by released code, and the focus on falsifiable comparisons rather than post-hoc fitting. These elements make the linear-case findings verifiable and extensible.

major comments (2)
  1. [§3] §3 (Constructions): the explicit constructions establish that transformers are capable of implementing GD and ridge regression, but the link to trained models rests on output matching rather than a demonstration that the learned weights realize the same internal update rules; this gap is load-bearing for the stronger claim that learners 'implement' the algorithms implicitly.
  2. [§4.2–4.3] §4.2–4.3 (Predictor matching and regime transitions): while experiments report close quantitative agreement with GD/ridge/least-squares and transitions with depth/noise, the manuscript does not provide a theoretical account of the selection mechanism; without it, the observed transitions remain descriptive and could be consistent with other implicit algorithms.
minor comments (3)
  1. [Notation] Notation for implicit model states and moment matrices should be introduced with a single consolidated table to reduce cross-reference burden.
  2. [Figures] Figure captions for the encoding plots (late-layer activations) should explicitly state the dimensionality and normalization used for the weight-vector and moment-matrix visualizations.
  3. [Abstract] The abstract's phrasing 'converging to Bayesian estimators' should be qualified with the precise prior and the scaling regime (width/depth) under which the convergence is observed.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and the recommendation of minor revision. We address the two major comments below, clarifying the scope of our claims and noting where we will revise the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Constructions): the explicit constructions establish that transformers are capable of implementing GD and ridge regression, but the link to trained models rests on output matching rather than a demonstration that the learned weights realize the same internal update rules; this gap is load-bearing for the stronger claim that learners 'implement' the algorithms implicitly.

    Authors: We agree that the constructions show architectural capacity rather than proving that the learned weights exactly replicate the internal update rules of GD or ridge regression. Our stronger claim is supported by the combination of (i) the existence proofs, (ii) the close quantitative predictor matches across many regimes, and (iii) the preliminary representational evidence that late-layer activations encode weight vectors and moment matrices. We do not claim to have performed a full mechanistic interpretability analysis of the trained weights. We will revise the abstract, §3, and the discussion to moderate the phrasing from “implement … implicitly” to “can implement … and trained models produce equivalent predictors,” and we will add a short paragraph noting the distinction between capacity, behavioral equivalence, and internal mechanism.

    revision: partial

  2. Referee: [§4.2–4.3] §4.2–4.3 (Predictor matching and regime transitions): while experiments report close quantitative agreement with GD/ridge/least-squares and transitions with depth/noise, the manuscript does not provide a theoretical account of the selection mechanism; without it, the observed transitions remain descriptive and could be consistent with other implicit algorithms.

    Authors: We accept this observation. The paper is primarily empirical: it documents the quantitative agreement, the systematic transitions with depth and noise, and the convergence to Bayesian estimators at large width/depth. No theoretical derivation of the selection mechanism that chooses among GD, ridge, or least-squares in different regimes is supplied. We will add a concise limitations paragraph in the discussion that acknowledges this gap and lists it as an open direction for future theoretical work.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivations rely on explicit constructions and external algorithm comparisons

Full rationale

The paper's central results consist of a proof by construction showing that transformers can implement gradient descent and closed-form ridge regression on linear models, followed by empirical matching of trained in-context learners to the predictors produced by these standard algorithms plus exact least-squares regression. All comparisons are to externally defined, well-known methods whose definitions and implementations do not depend on quantities fitted or defined inside this work. No load-bearing premise reduces to a self-citation, a fitted parameter renamed as a prediction, or a redefinition of inputs; the linear-regression setting is treated explicitly as a prototypical case rather than smuggled in as a universal claim. The derivation chain is therefore self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the hypothesis that transformers encode and update implicit models, plus standard assumptions of linear regression as a representative task; no free parameters are fitted to support the central claims, and no new entities are postulated.

axioms (1)
  • domain assumption: Transformers can encode smaller models in their activations and update these implicit models as new examples appear in the context.
    This is the core hypothesis tested via constructions and experiments.

pith-pipeline@v0.9.0 · 5539 in / 1118 out tokens · 38314 ms · 2026-05-17T13:34:58.805779+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Statistical Cost of Adaptation in Multi-Source Transfer Learning

    math.ST 2026-05 unverdicted novelty 8.0

    Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

  2. Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

    cs.LG 2026-05 unverdicted novelty 7.0

    Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.

  3. Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

    cs.CR 2026-05 conditional novelty 7.0

    A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.

  4. Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

    cs.LG 2026-05 conditional novelty 7.0

    Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...

  5. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

    cs.LG 2026-05 unverdicted novelty 7.0

    Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

  6. Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-mo...

  7. Meta-Harness: End-to-End Optimization of Model Harnesses

    cs.AI 2026-03 unverdicted novelty 7.0

    Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...

  8. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  9. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  10. Spectral Transformer Neural Processes

    cs.LG 2026-05 unverdicted novelty 6.0

    STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.

  11. Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

    cs.LG 2026-05 unverdicted novelty 6.0

    Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.

  12. Learning to Adapt: In-Context Learning Beyond Stationarity

    cs.LG 2026-04 unverdicted novelty 6.0

    Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.

  13. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  14. One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Non-linear transformers enable cross-domain generalization in in-context RL by representing value functions from different domains with shared weights inside a shared RKHS.

  15. When Context Sticks: Studying Interference in In-Context Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.

  16. Online In-Context Distillation for Low-Resource Vision Language Models

    cs.CV 2025-10 unverdicted novelty 5.0

    Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.

  17. High-Dimensional Statistics: Reflections on Progress and Open Problems

    math.ST 2026-05 unverdicted novelty 2.0

    A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.
