The Bayesian Geometry of Transformer Attention

Naman Agarwal; Siddhartha R. Dalal; Vishal Misra

arxiv: 2512.22471 · v5 · pith:JQRO5QHTnew · submitted 2025-12-27 · 💻 cs.LG · cs.AI · stat.ML


Pith reviewed 2026-05-21 16:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords transformer attention · Bayesian inference · Bayesian wind tunnels · residual streams · posterior update · HMM state tracking · geometric mechanism


The pith

Transformers recover exact Bayesian posteriors in controlled settings through a geometric attention mechanism while capacity-matched MLPs fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds controlled test environments called Bayesian wind tunnels where the true posterior is known in closed form and memorization cannot occur. Small transformers reproduce these posteriors to 10^{-3} to 10^{-4} bit accuracy on tasks such as bijection elimination and HMM state tracking. The mechanism identified is that residual streams hold the belief state, feed-forward networks execute the posterior update, and attention supplies content-addressable routing. Geometric measurements show orthogonal key bases, progressive alignment of queries to keys, and a low-dimensional value manifold that tracks posterior entropy. This structure is absent in flat networks, which accounts for their failure on the same tasks.
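
To make "known in closed form" concrete for the bijection task, the sketch below computes an analytic predictive entropy position by position. It assumes a uniform prior over all bijections on `vocab_size` symbols and that each position reveals one input-output pair. These setup details and the function name `bijection_entropy_trajectory` are illustrative assumptions, not taken from the paper.

```python
import math

def bijection_entropy_trajectory(inputs, vocab_size):
    """Analytic predictive entropy (bits) at each position, before the output is revealed.

    Assumption (illustrative): uniform prior over all bijections on
    `vocab_size` symbols, so a new input maps uniformly onto the outputs
    that have not been used yet.
    """
    seen = set()
    entropies = []
    for x in inputs:
        if x in seen:
            # The mapping for x was already revealed: no uncertainty remains.
            entropies.append(0.0)
        else:
            # Each previously seen input has consumed exactly one output.
            remaining = vocab_size - len(seen)
            entropies.append(math.log2(remaining))
            seen.add(x)
    return entropies

trajectory = bijection_entropy_trajectory([0, 1, 0, 2], vocab_size=4)
```

Under these assumptions the trajectory steps down as new mappings are revealed and falls to zero when an input repeats, which is the sawtooth shape the Figure 4 caption describes.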

Core claim

Hierarchical attention realizes Bayesian inference by geometric design. Residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy, with attention patterns remaining stable while the manifold unfurls during training.
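
One way to quantify the "orthogonal key bases" diagnostic is a cosine-similarity matrix over key vectors, summarized by the mean absolute off-diagonal entry. This is a minimal sketch, assuming the keys have already been extracted into an array with one row per token; the helper names are hypothetical and the paper's exact measurement procedure is not given here.

```python
import numpy as np

def key_cosine_matrix(keys):
    """Pairwise cosine similarity between key vectors (one row per token)."""
    keys = np.asarray(keys, dtype=float)
    unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return unit @ unit.T

def mean_offdiag_abs(cos):
    """Mean absolute off-diagonal similarity; near 0 means near-orthogonal keys."""
    n = cos.shape[0]
    mask = ~np.eye(n, dtype=bool)  # select everything except the diagonal
    return float(np.abs(cos[mask]).mean())

# Toy check with hand-made keys (not model data): an orthonormal set scores 0.
score = mean_offdiag_abs(key_cosine_matrix(np.eye(3)))
```

A score near zero is consistent with an orthogonal basis but, on its own, says nothing about whether that basis is necessary for the computation.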

What carries the argument

Bayesian wind tunnels: controlled environments supplying closed-form posteriors and provably preventing memorization, allowing direct verification that accuracy arises from on-the-fly inference rather than stored patterns.
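
For the HMM task, a closed-form reference of this kind can be obtained from standard forward filtering. The sketch below computes the entropy of the Bayes-optimal next-observation distribution at each step. It assumes the entropy of interest is over the next observation and that the transition and emission matrices are known; the function name and argument layout are hypothetical, not the paper's code.

```python
import numpy as np

def hmm_predictive_entropies(obs, pi, A, B):
    """Entropy (bits) of P(x_t | x_{<t}) at each step, via forward filtering.

    pi: (S,)   initial state distribution
    A:  (S, S) transitions, A[i, j] = P(z_{t+1} = j | z_t = i)
    B:  (S, O) emissions,   B[i, o] = P(x_t = o | z_t = i)
    Assumes every observation in `obs` has nonzero probability.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    belief = np.asarray(pi, dtype=float)    # P(z_t | x_{<t})
    entropies = []
    for x in obs:
        pred = belief @ B                   # predictive distribution over x_t
        nz = pred[pred > 0]                 # skip zero entries (0 * log 0 = 0)
        entropies.append(float(-(nz * np.log2(nz)).sum()))
        posterior = belief * B[:, x]        # condition on the observed symbol
        posterior /= posterior.sum()
        belief = posterior @ A              # propagate to the next time step
    return entropies

# Toy 2-state example with deterministic dynamics and emissions (not paper data).
toy = hmm_predictive_entropies([0, 0], pi=[0.5, 0.5], A=np.eye(2), B=np.eye(2))
```

In the toy example the first prediction carries one bit of uncertainty and the second carries none, because the first observation identifies the state.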

Load-bearing premise

The constructed environments truly supply closed-form posteriors and make memorization impossible, so that observed accuracy must reflect genuine inference.

What would settle it

A capacity-matched MLP achieving comparable posterior accuracy inside the same bijection-elimination or HMM wind-tunnel tasks would falsify the claimed architectural separation.

Figures

Figures reproduced from arXiv: 2512.22471 by Naman Agarwal, Siddhartha R. Dalal, Vishal Misra.

**Figure 2.** Mamba discovers the 5-corner geometry of HMM belief space. Final-layer representations from Mamba on the HMM task (5 hidden states). Left: Points colored by most likely hidden state reveal five distinct clusters, one per state. Right: The same points colored by posterior entropy show confidence variation within each cluster (red = low entropy/high confidence, blue = high entropy/uncertainty). Mamba has lear…

**Figure 3.** Bijection wind tunnel: transformer matches the Bayesian posterior; MLP does not. Entropy trajectories at 150k training steps. The transformer lies essentially on top of the analytic Bayes curve across positions, while the capacity-matched MLP barely reduces uncertainty and fails to implement hypothesis elimination. This is the comparison summarized quantitatively in …

**Figure 4.** Bijection wind tunnel: per-sequence entropy dynamics. Eight randomly chosen bijections from the test set. Each panel shows transformer entropy (solid) and analytic Bayes entropy (dashed) as a function of position. The sawtooth pattern, with discrete drops when mappings are revealed and collapses to (near) zero when previously seen inputs reappear, confirms that the transformer is performing stepwise hypothesis el…

**Figure 5.** Bijection wind tunnel: layer-wise ablation. Mean absolute entropy error (bits) when ablating each layer (attention+FFN) in turn, averaged over seeds. Removing any single layer increases calibration error by more than an order of magnitude, showing that the Bayesian computation is genuinely hierarchical and compositional rather than shallow or redundant.

**Figure 6.** Head-wise ablation. Change in mean absolute entropy error when ablating individual attention heads. A single Layer-0 “hypothesis-frame head” plays a uniquely important role, while many later heads are partially redundant. This supports the three-stage picture in Section 6: foundational binding, progressive elimination, and value-manifold refinement.

**Figure 7.** HMM wind tunnel: calibration across sequence lengths. Transformer predictive entropy 𝐻model(𝑡) (solid) versus analytic 𝐻Bayes(𝑡) (dashed) at the training length 𝐾 = 20 and at 𝐾 = 30 and 𝐾 = 50. At 𝐾 = 20 the trajectories overlap almost perfectly; for longer sequences the error grows smoothly with position and shows no kink at the training boundary, indicating a position-independent recursive algorithm rath…

**Figure 8.** HMM wind tunnel: per-position calibration. Absolute entropy error |𝐻model(𝑡) − 𝐻Bayes(𝑡)| as a function of position for 𝐾 = 20, 𝐾 = 30, and 𝐾 = 50. Errors are tiny at the training length and increase gradually with 𝑡 for extended lengths, again with no discontinuity at 𝑡 = 20.

**Figure 9.** HMM wind tunnel: per-sequence entropy dynamics. Entropy trajectories 𝐻model(𝑡) and 𝐻Bayes(𝑡) for eight randomly chosen 𝐾 = 20 test HMMs. The transformer tracks sequence-specific rises and drops in uncertainty, reflecting the stochastic interplay of transitions and emissions.

**Figure 10.** Semantic invariance under hidden-state relabeling. Mean absolute entropy error before vs. after randomly permuting hidden-state labels in the HMMs. Points lie on the diagonal and the distribution of ΔMAE is tightly concentrated near zero, confirming that the transformer’s computation is invariant to arbitrary relabelings of the hidden state space.

**Figure 11.** Late-layer attention and length generalization. Mean absolute entropy error as a function of sequence length for the full transformer and a variant with attention disabled in the top two layers. The no-late-attention model is only modestly worse at the training length but its error explodes on longer sequences, with the degradation factor growing from ∼21× at 𝐾 = 20 to over 60× at 𝐾 = 50. Late attention …

**Figure 12.** HMM wind tunnel: transformer vs MLP length generalization. Per-position mean absolute entropy error for the transformer (solid) and capacity-matched MLP (dashed) at 𝐾 = 20 and 𝐾 = 50. The vertical gray line marks the training boundary at position 𝑡 = 20. The transformer shows near-zero error at the training length and smooth degradation beyond it; the MLP maintains flat ∼0.4-bit error across positions, i…

**Figure 13.** Multi-seed robustness of HMM length generalization. Overlay of per-position transformer MAE curves across five random seeds for 𝐾 = 20, 𝐾 = 30, and 𝐾 = 50. Seed-to-seed variability is negligible relative to the transformer–MLP gap, showing that the learned Bayesian algorithm is robust to initialization and optimization noise.

**Figure 14.** Representative single-seed trajectory. Per-position MAE for one representative seed (2024) closely matches the multi-seed average in …

**Figure 15.** Block-wise probe deltas for entropy prediction. For each transformer block we train a linear probe on the pre-sublayer residual stream to predict the analytic posterior entropy, then evaluate the same probe on the post-sublayer residual. The plotted quantity is the change in mean-squared error (MSE) when moving from pre- to post-sublayer, i.e., ΔMSE = MSE(probe on post-residual) − MSE(probe on pre-residua…

**Figure 16.** Key orthogonality in Layer 0. Cosine similarity matrix of key vectors for all input tokens in the bijection model at 150k steps. Off-diagonal entries cluster near zero, showing that distinct inputs occupy nearly orthogonal directions and form an explicit hypothesis basis. (a) Layer 0 (b) Layer 5

**Figure 17.** Progressive query–key alignment across depth. Cosine similarity between queries and keys at an early layer (left) and a deep layer (right) of the bijection transformer. For each sequence position 𝑡 on the horizontal axis, we plot the cosine similarities cos(𝑞𝑡, 𝑘𝑗) between the query at position 𝑡 and all key vectors 𝑘𝑗 along the vertical axis. Here 𝑡 indexes the query-token positions in the serialized in…

**Figure 18.** Value-manifold unfurling during training. PCA projection of attention outputs in the bijection model, colored by analytic posterior entropy. At 100k steps, low-entropy states are tightly clustered; by 150k, they lie along a smooth one-dimensional curve parameterized by entropy, enabling fine-grained encoding of posterior states. Each point is an attention output (head output or block attention output – whi…

**Figure 19.** Per-position calibration improves as the value manifold unfurls. Absolute entropy error as a function of position in the bijection task at 100k and 150k training steps. The dominant improvements occur at late positions, matching the geometric unfurling of low-entropy states in …

**Figure 20.** Three-stage architectural mechanism for Bayesian inference. Layer 0 constructs a key–value hypoth…

**Figure 21.** Layer-wise analysis: Mamba vs LSTM on HMM. For each layer, we measure PC1 variance (dimensionality), entropy correlation (alignment with uncertainty), and linear 𝑅² (predictability of entropy). Left: LSTM collapses to a 1D manifold (PC1 > 90%) with near-zero entropy correlation. Right: Mamba maintains distributed representations (PC1 ≈ 22%) with increasing entropy predictability across depth. This explai…
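
The block-wise probe comparison described in the Figure 15 caption can be sketched as follows: fit a linear probe on the pre-sublayer residuals, then score the same probe on the post-sublayer residuals. This is a minimal illustration using ordinary least squares with a bias term and no held-out split; the fitting method, the absence of a train/test split, and the name `probe_delta_mse` are assumptions rather than the authors' procedure.

```python
import numpy as np

def probe_delta_mse(pre, post, target):
    """Return MSE(probe on post) - MSE(probe on pre) for one sublayer.

    pre, post: (N, D) residual-stream activations before/after the sublayer
    target:    (N,)   analytic posterior entropy for each example
    The probe is fit on `pre` only and reused unchanged on `post`.
    """
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    target = np.asarray(target, dtype=float)
    # Append a constant column so the probe can learn an intercept.
    X_pre = np.hstack([pre, np.ones((len(pre), 1))])
    X_post = np.hstack([post, np.ones((len(post), 1))])
    w, *_ = np.linalg.lstsq(X_pre, target, rcond=None)
    mse_pre = np.mean((X_pre @ w - target) ** 2)
    mse_post = np.mean((X_post @ w - target) ** 2)
    return float(mse_post - mse_pre)
```

With this sign convention, a negative value means the sublayer made entropy more linearly readable by the existing probe, and a positive value means it moved the representation away from the probe's direction.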

Original abstract

Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing Bayesian wind tunnels: controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks, bijection elimination and Hidden Markov Model (HMM) state tracking, we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a frame-precision dissociation predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Bayesian wind tunnels give a clean testbed for checking on-the-fly inference in small transformers, but the necessity of the claimed geometric mechanism still needs tighter controls.


The main thing to know is that this paper builds controlled environments called Bayesian wind tunnels where the true posterior is known in closed form and memorization is supposed to be impossible. Small transformers recover the posterior to 10^{-3} to 10^{-4} bit accuracy on bijection elimination and HMM state tracking, while capacity-matched MLPs do much worse. The authors sketch a geometric account: residual streams act as the belief substrate, FFNs do the update, and attention handles routing. The account is backed by diagnostics such as orthogonal key bases and an entropy-parameterized value manifold, plus a frame-precision dissociation during training.

Referee Report

3 major / 3 minor

Summary. The paper introduces Bayesian wind tunnels as controlled synthetic tasks (bijection elimination and HMM state tracking) with known closed-form posteriors where memorization is claimed to be provably impossible. Small transformers achieve 10^{-3} to 10^{-4} bit accuracy reproducing the true posteriors, while capacity-matched MLPs fail by orders of magnitude. The authors identify a geometric mechanism: residual streams as belief substrate, FFNs performing posterior updates, and attention providing content-addressable routing, with supporting diagnostics including orthogonal key bases, query-key alignment, and an entropy-parameterized value manifold that unfurls during training.

Significance. If the central results hold, the work supplies a rigorous testbed for verifying Bayesian reasoning in transformers and a mechanistic account of why attention is necessary for certain inference tasks while flat architectures fail. The wind-tunnel methodology could enable falsifiable predictions and help connect small verifiable systems to phenomena in large language models.

major comments (3)

[Bayesian wind tunnels] Bayesian wind tunnels section: The assertion that memorization is provably impossible for the HMM task rests on the finite state space plus training protocol precluding lookup-table or finite-automaton solutions, yet no explicit bounds on sequence length, transition-matrix properties, or out-of-distribution test regimes are supplied to rule out non-Bayesian heuristics that match the posterior only on the evaluated regime.
[HMM state tracking] HMM state tracking results: The reported 10^{-3}–10^{-4} bit accuracy is measured against externally supplied closed-form posteriors rather than quantities derived from the model's own fitted parameters; without an internal consistency check or ablation that isolates the geometric mechanism from other possible computations, the accuracy does not yet demonstrate that the observed performance must arise from on-the-fly Bayesian inference.
[Geometric diagnostics] Geometric mechanism claim: The diagnostics (orthogonal key bases, progressive query-key alignment, low-dimensional value manifold) are consistent with the proposed residual-stream / FFN / attention decomposition, but the manuscript does not provide counterfactual experiments (e.g., attention-ablated or reparameterized models) showing that these geometric features are necessary rather than merely correlated with the accuracy.

minor comments (3)

[Results] Define 'bit accuracy' explicitly and report error bars together with the number of random seeds and statistical controls used for the numerical results.
[Value manifold] Clarify the precise construction of the value manifold and how posterior entropy parameterizes it; the current description leaves the dimensionality and training dynamics underspecified.
[Introduction] Add a short related-work paragraph situating the wind-tunnel approach against prior synthetic-task studies of attention and Bayesian inference in sequence models.
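
On the request to define "bit accuracy": the figure captions report mean absolute entropy error in bits, i.e. |H_model(t) − H_Bayes(t)| averaged over positions. The sketch below shows one plausible reading of that metric, assuming both the model and the analytic reference are available as per-position probability vectors. Whether the headline 10^{-3} to 10^{-4} figures use exactly this quantity is an assumption, and the function names are hypothetical.

```python
import numpy as np

def entropy_bits(p, axis=-1):
    """Shannon entropy in bits along `axis`, treating 0 * log2(0) as 0."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)  # log2(1) = 0, so zero entries add nothing
    return -(p * np.log2(safe)).sum(axis=axis)

def mean_abs_entropy_error(model_probs, bayes_probs):
    """Mean over positions of |H_model(t) - H_Bayes(t)|, in bits.

    model_probs, bayes_probs: (T, V) arrays of per-position distributions.
    """
    diff = entropy_bits(model_probs) - entropy_bits(bayes_probs)
    return float(np.mean(np.abs(diff)))
```

Note that an entropy-based error can be small even when two distributions disagree on which outcomes are likely, so a distribution-level divergence such as KL would be a stricter complement.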

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments and constructive feedback on our work. We address each of the major comments in detail below and have revised the manuscript accordingly to improve clarity and rigor.


Referee: [Bayesian wind tunnels] Bayesian wind tunnels section: The assertion that memorization is provably impossible for the HMM task rests on the finite state space plus training protocol precluding lookup-table or finite-automaton solutions, yet no explicit bounds on sequence length, transition-matrix properties, or out-of-distribution test regimes are supplied to rule out non-Bayesian heuristics that match the posterior only on the evaluated regime.

Authors: We appreciate the referee pointing out the need for more explicit bounds to support the claim that memorization is impossible. In the revised manuscript, we now provide specific bounds on sequence lengths used in training and testing, detail the properties of the transition matrices that make heuristic matching unlikely to achieve the observed accuracy levels, and include results from out-of-distribution regimes. These additions demonstrate that non-Bayesian approaches cannot replicate the posterior reproduction in the tested settings.
Revision: yes

Referee: [HMM state tracking] HMM state tracking results: The reported 10^{-3}–10^{-4} bit accuracy is measured against externally supplied closed-form posteriors rather than quantities derived from the model's own fitted parameters; without an internal consistency check or ablation that isolates the geometric mechanism from other possible computations, the accuracy does not yet demonstrate that the observed performance must arise from on-the-fly Bayesian inference.

Authors: We agree that relying solely on external posteriors leaves room for alternative explanations. We have added internal consistency checks by deriving approximate posteriors from the model's parameters in the revised version and performed ablations that target the proposed mechanisms. While these strengthen the case for on-the-fly inference, we note that complete isolation from all possible alternative computations is inherently difficult; however, the combination of high accuracy and specific geometric signatures supports our interpretation.
Revision: partial

Referee: [Geometric diagnostics] Geometric mechanism claim: The diagnostics (orthogonal key bases, progressive query-key alignment, low-dimensional value manifold) are consistent with the proposed residual-stream / FFN / attention decomposition, but the manuscript does not provide counterfactual experiments (e.g., attention-ablated or reparameterized models) showing that these geometric features are necessary rather than merely correlated with the accuracy.

Authors: We thank the referee for this suggestion. To address the correlation versus necessity concern, we have included new counterfactual experiments in the revision: attention ablation leads to substantial loss in accuracy, and disrupting the orthogonal key bases via reparameterization similarly degrades performance. These results indicate that the geometric features are indeed necessary for the observed Bayesian reproduction.
Revision: yes

Circularity Check

0 steps flagged

No circularity: measurements against externally known closed-form posteriors


The paper constructs controlled Bayesian wind tunnels with analytically known closed-form posteriors and provably impossible memorization. Reported accuracy (10^{-3}–10^{-4} bit) is measured directly against these external benchmarks rather than any internally fitted parameters or self-defined quantities. Geometric diagnostics (orthogonal key bases, query-key alignment, entropy-parameterized value manifold) are post-training observations on the trained models, not definitions that presuppose the Bayesian mechanism. The architectural separation from MLPs is established by direct empirical comparison on the same tasks. No load-bearing derivation step reduces by construction to its own inputs, and no self-citation chain is invoked to justify the central claim. The derivation remains self-contained against the stated external ground truth.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of closed-form posteriors and the impossibility of memorization in the wind-tunnel tasks; these are domain assumptions rather than fitted parameters or new entities.

axioms (2)

domain assumption: The two chosen tasks (bijection elimination and HMM state tracking) admit exact closed-form posteriors that can be computed independently of the model.
Basis: Stated in the abstract as the foundation for the wind-tunnel construction.

domain assumption: Capacity-matched MLPs and transformers differ only in architecture, not in effective expressivity or optimization dynamics under the training regime used.
Basis: Implicit in the claim that MLPs fail by orders of magnitude while transformers succeed.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy... frame–precision dissociation"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Transformers realize all three primitives... attention provides content-addressable routing"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI · 2026-05 · unverdicted · novelty 7.0
Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI · 2026-05 · unverdicted · novelty 7.0
Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...

In-context learning enables continental-scale subsurface temperature prediction from sparse local observations
cs.LG · 2026-05 · unverdicted · novelty 6.0
A transformer-based in-context learning model predicts continental-scale subsurface temperatures from sparse borehole observations, outperforming physics and interpolation baselines while adapting to new regions with ...

The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
cs.LG · 2026-01 · unverdicted · novelty 6.0
TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.

Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
stat.ML · 2025-12 · unverdicted · novelty 6.0
Gradient analysis shows cross-entropy induces an EM-like loop in attention that sculpts Bayesian manifolds supporting in-context probabilistic inference.

Comparative analysis of missing data imputation methods for CSST survey: Impact on photometric redshift estimation performance
astro-ph.GA · 2026-05 · conditional · novelty 5.0
KNN imputation gives highest photo-z accuracy under ideal random missingness with complete training data, while SAITS is more robust for incomplete training sets and realistic mixed missingness patterns in CSST data.

Position: agentic AI orchestration should be Bayes-consistent
cs.AI · 2026-05 · unverdicted · novelty 4.0
Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 6 Pith papers · 4 internal anchors

[1] Naman Agarwal, Siddhartha R. Dalal, and Vishal Misra. 2025. Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds. arXiv:2512.22473 [cs.LG]. https://arxiv.org/abs/2512.22473. Paper II of the Bayesian Attention Trilogy.

[2] Ekin Akyürek and Jacob Andreas. 2022. What Learning Algorithms Does In-Context Learning Learn? Investigations with Linear Models. arXiv preprint arXiv:2209.11895 (2022).

[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight Uncertainty in Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning. 1613–1622.

[4] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2...

[5] Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14, 2 (1990), 179–211.

[6] Shivam Garg, Dimitris Tsipras, Percy S. Liang, and Gregory Valiant. 2022. What Can Transformers Learn In-Context? A Case Study of Simple Function Classes. In Advances in Neural Information Processing Systems, Vol. 35. 29881–29895.

[7] Alex Graves. 2011. Practical Variational Inference for Neural Networks. In Advances in Neural Information Processing Systems, Vol. 24.

[8] Albert Gu and Tri Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752 (2023).

[9] Albert Gu, Karan Goel, and Christopher Ré. 2022. Efficiently Modeling Long Sequences with Structured State Spaces. In International Conference on Learning Representations.

[10] Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and Eran Malach. 2024. Repeat After Me: Transformers are Better than State Space Models at Copying. arXiv preprint arXiv:2402.01032 (2024). https://arxiv.org/abs/2402.01032

[11] David J. C. MacKay. 1992. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation 4, 3 (1992), 448–472.

[12] Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. 2024. In-Context Learning Through the Bayesian Prism. In International Conference on Learning Representations. https://openreview.net/forum?id=HX5ujdsSon

[13] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. Progress Measures for Grokking via Mechanistic Interpretability. arXiv preprint arXiv:2301.05217 (2023).

[14] Radford M. Neal. 2012. Bayesian Learning for Neural Networks. Lecture Notes in Statistics, Vol. 118. Springer.

[15] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, et al. 2022. In-Context Learning and Induction Heads. Transformer Circuits Thread, Anthropic. https://transformer-circuits.pub/2022/in-contex...

[16] Michael Poli, Stefano Massaroli, et al. 2023. Hyena Hierarchy: Towards Larger Convolutional Language Models. In International Conference on Machine Learning.

[17] Arik Reuter, Samuel Muller, and Frank Hutter. 2025. Can Transformers Learn Full Bayesian Inference in Context? arXiv preprint arXiv:2501.16825 (2025). https://arxiv.org/abs/2501.16825. To appear at ICML 2025.

[18] Johannes von Oswald, Christian Henning, Adrià Garriga-Alonso, Massimo Caccia, Frederik Träuble, Benjamin F. Grewe, Bernhard Schölkopf, Claudia Clopath, and Johanni Brea. 2023. Transformers as Meta-Learners for Bayesian Inference. arXiv preprint arXiv:2305.14034 (2023).

[19] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2022. An Explanation of In-Context Learning as Implicit Bayesian Inference. In International Conference on Learning Representations.
