A Theoretical Interpretation of In-Context Learning via Probabilistic Modeling

Huaze Tang; Shao-Lun Huang; Zhenyu Liu

arXiv: 2606.28926 · v1 · pith:D7E2POMMnew · submitted 2026-06-27 · 💻 cs.IT · cs.LG · math.IT


Pith reviewed 2026-06-30 08:21 UTC · model grok-4.3


keywords in-context learning · probabilistic modeling · exponential families · parameter estimation · large language models · theoretical analysis · prompt design


The pith

In-context learning is modeled as estimating parameters of a probability distribution from prompt demonstrations to answer queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a probabilistic model that treats the demonstrations provided in a prompt as data for estimating the parameters of an underlying distribution, which is then used to generate the answer to a new query. It derives explicit performance expressions for this process under general parametric distributions and under the special case of exponential families. The derivations show how ICL accuracy depends on the number of demonstrations, the sensitivity of the chosen distribution to its parameters, and the similarity between the demonstrations and the query. A reader would care because the model supplies concrete, testable formulas that link prompt design choices to expected performance without requiring internal access to the language model.
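As a minimal sketch of the estimate-then-predict loop described above, the toy example below assumes the demonstrations are scalar draws from a unit-variance Gaussian with an unknown mean. The Gaussian choice and the names `estimate_parameter` and `predict_response` are illustrative assumptions, not the paper's construction.

```python
import random
import statistics


def estimate_parameter(demonstrations):
    """Maximum-likelihood estimate of the mean of a unit-variance Gaussian."""
    return statistics.fmean(demonstrations)


def predict_response(theta_hat):
    """Point prediction for the query: the mean of the fitted distribution."""
    return theta_hat


rng = random.Random(0)
true_theta = 2.0  # hypothetical ground-truth parameter (assumption)
demos = [rng.gauss(true_theta, 1.0) for _ in range(8)]  # the "prompt"

theta_hat = estimate_parameter(demos)
prediction = predict_response(theta_hat)
print(f"estimate from {len(demos)} demonstrations: {theta_hat:.3f}")
```

The only information the predictor sees is the list of demonstrations, which mirrors the premise that parameters are estimated from the prompt alone.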

Core claim

This work presents a probabilistic model for ICL and derives the performance of ICL for both general parametric distributions and exponential families. Based on the derived results, the work explains the impact of multiple factors such as the number of demonstrations, the sensitivity of the probabilistic model to the variation of its parameters, as well as the similarity between the demonstrations and the query on the performance of ICL.

What carries the argument

The probabilistic model for ICL in which parameters are estimated from the demonstrations in the prompt alone and then used to predict the response to the query.

If this is right

ICL performance increases with additional demonstrations because the parameter estimates become more accurate.
Models whose likelihood is highly sensitive to parameter changes will exhibit larger swings in ICL accuracy across different prompts.
Higher similarity between demonstrations and the query improves ICL performance under the derived expressions.
For distributions in the exponential family the performance admits a closed-form expression in terms of the estimated parameters.
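
For orientation, the standard textbook facts behind these statements can be written out for a regular exponential family in natural parameterization. The symbols below (h, T, A, I) are the usual ones and are an assumption about notation; the paper's own performance expressions are not visible here and are not reproduced.

```latex
% Exponential family with natural parameter \theta and sufficient statistic T
p(y;\theta) = h(y)\,\exp\!\big(\theta^{\top} T(y) - A(\theta)\big)

% Maximum-likelihood estimate from n i.i.d. demonstrations y_1,\dots,y_n
% (moment matching on the sufficient statistic)
\nabla A(\hat\theta_n) = \frac{1}{n}\sum_{i=1}^{n} T(y_i)

% Fisher information and large-n behaviour of the estimate
I(\theta) = \nabla^{2} A(\theta), \qquad
\sqrt{n}\,\big(\hat\theta_n - \theta\big) \xrightarrow{\;d\;}
\mathcal{N}\!\big(0,\; I(\theta)^{-1}\big)
```

Read this way, more demonstrations shrink the estimation error at rate 1/n, and the Fisher information I(θ) is one natural measure of how sensitive the likelihood is to its parameters.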

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the model is accurate, then prompt engineering can be guided by choosing demonstrations that both improve parameter estimation and maximize similarity to the query.
The same framework could be used to predict how ICL behaves when the number of demonstrations is very large or when the query distribution differs markedly from the demonstration distribution.
Synthetic experiments that generate data from known parametric families would provide a direct test of the performance formulas without relying on real language-model outputs.
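
One way such a synthetic experiment might look is sketched below, under assumptions chosen for tractability rather than taken from the paper: demonstrations are drawn from N(theta_d, 1), the query follows N(theta_q, 1), and performance is measured by the KL divergence from the query distribution to the fitted one. For this toy setup the expected loss is (gap^2 + 1/n) / 2 with gap = theta_q - theta_d, which separates the demonstration-query mismatch from the effect of the number of demonstrations.

```python
import random
import statistics


def closed_form_excess_loss(n, gap):
    """Expected KL(N(theta_q, 1) || N(theta_hat, 1)) for the toy Gaussian setup.

    theta_hat is the mean of n draws from N(theta_d, 1); gap = theta_q - theta_d.
    """
    return 0.5 * (gap ** 2 + 1.0 / n)


def simulated_excess_loss(n, gap, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    theta_d = 0.0
    theta_q = theta_d + gap
    losses = []
    for _ in range(trials):
        # Fit the mean from n synthetic demonstrations.
        theta_hat = statistics.fmean(rng.gauss(theta_d, 1.0) for _ in range(n))
        # KL divergence between two unit-variance Gaussians.
        losses.append(0.5 * (theta_q - theta_hat) ** 2)
    return statistics.fmean(losses)


for n in (1, 4, 16):
    formula = closed_form_excess_loss(n, gap=0.5)
    simulated = simulated_excess_loss(n, gap=0.5)
    print(f"n={n:2d}  formula={formula:.4f}  simulated={simulated:.4f}")
```

In this toy model the gap term does not shrink as n grows, so adding demonstrations cannot compensate for demonstrations that are dissimilar to the query.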

Load-bearing premise

The semantic information processing inside large language models during in-context learning can be faithfully captured by a standard probabilistic model whose parameters are estimated from the prompt demonstrations alone.

What would settle it

Measuring ICL accuracy on tasks whose underlying distribution belongs to a known exponential family and finding that the observed accuracy deviates systematically from the closed-form performance expressions derived in the model.
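
A simple way to make "deviates systematically" operational is to test whether the mean residual between observed and predicted accuracy differs from zero. The sketch below uses a plain z-statistic on the residuals; the function name `systematic_deviation_z` and the numbers in the usage lines are placeholders for illustration, not measurements or values from the paper.

```python
import math
import statistics


def systematic_deviation_z(observed, predicted):
    """z-statistic of the mean residual (observed - predicted).

    A large absolute value suggests a systematic gap rather than noise.
    """
    residuals = [o - p for o, p in zip(observed, predicted)]
    if len(residuals) < 2:
        raise ValueError("need at least two (observed, predicted) pairs")
    std_err = statistics.stdev(residuals) / math.sqrt(len(residuals))
    if std_err == 0.0:
        raise ValueError("identical residuals: standard error is zero")
    return statistics.fmean(residuals) / std_err


# Placeholder values only (assumption): one entry per synthetic task.
predicted = [0.60, 0.70, 0.80, 0.90]  # from a closed-form expression
observed = [0.58, 0.66, 0.75, 0.83]   # from running the experiment
print(f"z = {systematic_deviation_z(observed, predicted):.2f}")
```

With only a handful of tasks a t-distribution would be the more careful reference for the statistic; the z form is kept here for brevity.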

Figures

Figures reproduced from arXiv: 2606.28926 by Huaze Tang, Shao-Lun Huang, Zhenyu Liu.

Original abstract

In-context learning (ICL) is an emerging paradigm that employs the semantic information inherent in large language models (LLMs) for generating answers to user queries. While the remarkable performance of ICL has been widely known, a general modeling and a rigorous theoretical analysis of this paradigm are still lacking. This work presents a probabilistic model for ICL and derives the performance of ICL for both general parametric distributions and exponential families. Based on the derived results, the work explains the impact of multiple factors such as the number of demonstrations, the sensitivity of the probabilistic model to the variation of its parameters, as well as the similarity between the demonstrations and the query on the performance of ICL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives closed-form performance expressions for ICL under a probabilistic model, with the exponential-family case being the cleanest new piece.


This paper sets up a probabilistic model for in-context learning in which the demonstrations are used to estimate the parameters of a distribution, and then derives the performance for general parametric distributions and for exponential families. It uses those results to explain the effects of the number of demonstrations, the model's sensitivity to its parameters, and the similarity between demonstrations and query.

The new part is the explicit performance expressions for the exponential family case, which give a clean way to see the scaling. That is useful because it turns the usual empirical observations into something you can calculate directly from the model parameters. The paper also handles general parametric distributions, so the results are not confined to exponential families.

The soft spot is the modeling assumption. It treats the LLM's handling of semantic information as equivalent to this parametric estimation from the prompt. That is a strong simplification, and the paper does not provide evidence that it matches actual LLM internals or outputs. If the goal is interpretation, this is fine as a starting point, but it limits how far the results can be taken as an explanation of real systems.

The math looks standard and the stress-test found no circularity or inconsistency, so the derivations probably hold up inside the framework.

This is for people doing theoretical work on prompting and ICL in information theory or machine learning. A reader who wants analytic tools for understanding ICL factors will get something from it. It deserves a serious referee to check the derivations and discuss the modeling choices.

Referee Report

0 major / 0 minor

Summary. The paper presents a probabilistic model for in-context learning (ICL) and derives the performance of ICL for both general parametric distributions and exponential families. Based on the derived results, it explains the impact of multiple factors such as the number of demonstrations, the sensitivity of the probabilistic model to the variation of its parameters, as well as the similarity between the demonstrations and the query on the performance of ICL.

Significance. If the derivations are correct, the work supplies a theoretical framework for analyzing ICL performance via probabilistic modeling. The treatment of exponential families is a potential strength for obtaining analytical insights. The paper positions the model as a theoretical interpretation rather than a claim of exact mechanistic fidelity to LLM internals; this framing means the core modeling assumption does not undermine the internal validity of the derived expressions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their concise summary of the manuscript and for noting the potential value of the probabilistic framework and the exponential-family analysis. The recommendation is marked 'uncertain,' yet the report contains no enumerated major comments. We therefore provide no point-by-point responses and stand ready to address any specific concerns the referee may wish to raise.

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces a probabilistic model for in-context learning and derives performance expressions for general parametric distributions and exponential families. No equations or steps are visible in the provided abstract or description that reduce any claimed prediction or first-principles result to its inputs by construction, self-definition, or load-bearing self-citation. The framework treats the model as an interpretive tool rather than claiming exact equivalence to LLM internals, and the derivations rest on standard probabilistic assumptions without evident renaming of known results or smuggling of ansatzes via citation. The central claims therefore remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5643 in / 889 out tokens · 30274 ms · 2026-06-30T08:21:34.942286+00:00 · methodology


