Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Pith reviewed 2026-05-21 15:59 UTC · model grok-4.3
The pith
Cross-entropy gradients sculpt Bayesian manifolds in transformer attention heads through advantage-based routing and responsibility-weighted updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our core result is an advantage-based routing law for attention scores, ∂L/∂s_ij = α_ij (b_ij − E_α_i [b]), coupled with a responsibility-weighted update for values, Δv_j = −η ∑_i α_ij u_i. These equations induce a positive feedback loop in which routing and content specialize together and behave like a two-timescale EM procedure that sculpts the low-dimensional manifolds implementing Bayesian inference.
What carries the argument
The advantage-based routing law for attention scores (∂L/∂s_ij equals the attention weight α_ij times the advantage of b_ij = u_i^T v_j over its attention-weighted mean E_α_i [b]), combined with the responsibility-weighted value update Δv_j = −η ∑_i α_ij u_i. The two jointly drive coupled specialization of routing and content.
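The routing law can be recovered from the chain rule in a few lines. The sketch below is not the paper's own derivation (the referee report notes that the intermediate steps are omitted there). It assumes a single head whose output is o_i = ∑_j α_ij v_j with α_i = softmax(s_i), and it reads the upstream gradient u_i as ∂L/∂o_i; the symbol o_i and that reading of u_i are assumptions made for illustration.

```latex
% Assumed setup: one attention head, scores s_{ij} treated as free variables.
\[
o_i = \sum_j \alpha_{ij} v_j, \qquad
\alpha_{ij} = \frac{e^{s_{ij}}}{\sum_k e^{s_{ik}}}, \qquad
u_i := \frac{\partial L}{\partial o_i}.
\]
% Step 1: gradient with respect to the attention weights, and the softmax Jacobian.
\[
\frac{\partial L}{\partial \alpha_{ij}} = u_i^\top v_j = b_{ij}, \qquad
\frac{\partial \alpha_{ik}}{\partial s_{ij}} = \alpha_{ik}\bigl(\delta_{kj} - \alpha_{ij}\bigr).
\]
% Step 2: chain rule through the softmax gives the advantage form.
\[
\frac{\partial L}{\partial s_{ij}}
= \sum_k b_{ik}\,\alpha_{ik}\bigl(\delta_{kj} - \alpha_{ij}\bigr)
= \alpha_{ij}\Bigl(b_{ij} - \sum_k \alpha_{ik} b_{ik}\Bigr)
= \alpha_{ij}\bigl(b_{ij} - \mathbb{E}_{\alpha_i}[b]\bigr).
\]
% Step 3: gradient with respect to a value vector, then one gradient-descent step.
\[
\frac{\partial L}{\partial v_j} = \sum_i \alpha_{ij} u_i
\quad\Longrightarrow\quad
\Delta v_j = -\eta \sum_i \alpha_{ij} u_i.
\]
```

Under these assumptions both expressions are exact gradients of the loss with respect to s_ij and v_j; "first-order" then refers to the gradient step itself rather than to an approximation of the gradient.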
If this is right
- The same gradient flow that reduces cross-entropy loss simultaneously builds the internal geometry needed for in-context probabilistic reasoning.
- Attention weights perform an E-step via soft responsibilities while value vectors perform an M-step via responsibility-weighted prototype shifts.
- Queries and keys jointly adjust the hypothesis frame within which the routing and updates occur.
- In tasks such as the sticky Markov chain, the learned dynamics match those of a closed-form EM-style procedure.
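The E-step and M-step reading of the two update rules can be made concrete with a small numerical sketch. The code below is illustrative only and is not taken from the paper: the function names, the toy inputs, and the choice to treat scores as free parameters (rather than query-key products) are assumptions. It computes soft responsibilities, the score gradient from the routing law, and the responsibility-weighted value update.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(scores):
    """E-step analogue: soft responsibilities alpha_ij, one softmax per query i."""
    return [softmax(row) for row in scores]

def score_gradient(alpha, u, v):
    """Routing law: dL/ds_ij = alpha_ij * (b_ij - E_{alpha_i}[b]), with b_ij = u_i . v_j."""
    grads = []
    for alpha_i, u_i in zip(alpha, u):
        b_i = [dot(u_i, v_j) for v_j in v]
        baseline = dot(alpha_i, b_i)  # attention-weighted mean of b_i
        grads.append([a * (b - baseline) for a, b in zip(alpha_i, b_i)])
    return grads

def value_update(alpha, u, eta):
    """M-step analogue: delta v_j = -eta * sum_i alpha_ij * u_i."""
    n_values, dim = len(alpha[0]), len(u[0])
    return [[-eta * sum(alpha[i][j] * u[i][k] for i in range(len(u)))
             for k in range(dim)]
            for j in range(n_values)]

# Hypothetical toy inputs: 2 query positions, 3 values, dimension 2.
scores = [[0.0, 1.0, -1.0], [2.0, 0.0, 0.5]]
u = [[0.3, -0.2], [-0.1, 0.4]]            # assumed upstream gradients u_i
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # value vectors v_j

alpha = attention_weights(scores)
grad_s = score_gradient(alpha, u, v)
delta_v = value_update(alpha, u, eta=0.1)
```

Two sanity properties follow directly from the formulas: each row of `grad_s` sums to zero, because the advantage is centered under α_i, and the value updates summed over j equal −η ∑_i u_i, because each row of α sums to one.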
Where Pith is reading between the lines
- If the mechanism scales, it offers one route by which transformers acquire probabilistic reasoning without direct supervision on Bayesian tasks.
- Attention variants could be designed that explicitly amplify the advantage signal to speed formation of the required manifolds.
- Analogous first-order dynamics may operate in other transformer components such as feed-forward layers.
Load-bearing premise
The first-order gradient analysis and the identification of the resulting structures as the specific Bayesian manifolds both hold when moving from controlled simulations to the large-scale models referenced in the work.
What would settle it
Run the sticky Markov-chain simulation with the advantage term removed from the attention gradient while still minimizing cross-entropy; check whether the low-dimensional manifolds required for accurate Bayesian inference still appear.
Original abstract
Transformers empirically perform precise probabilistic reasoning in carefully constructed ``Bayesian wind tunnels'' and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based routing law} for attention scores, \[ \frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}_{\alpha_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \] coupled with a \emph{responsibility-weighted update} for values, \[ \Delta v_j = -\eta\sum_i \alpha_{ij} u_i, \] where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper provides a first-order gradient analysis of cross-entropy loss in a transformer attention head, deriving an advantage-based routing law ∂L/∂s_ij = α_ij (b_ij − E_α_i [b]) with b_ij := u_i^T v_j and a responsibility-weighted value update Δv_j = −η ∑_i α_ij u_i. It argues that the resulting positive feedback loop between routing and content specialization behaves like a two-timescale EM procedure (attention as E-step, values as M-step) that sculpts the low-dimensional Bayesian manifolds identified in a companion work, thereby explaining how gradient training produces internal geometry for in-context probabilistic reasoning. This is illustrated via controlled simulations on a sticky Markov-chain task comparing the dynamics to a closed-form EM-style update.
Significance. If the central derivations and the quantitative link to the companion manifolds hold, the work offers a concrete mechanistic bridge between gradient flow and the emergence of Bayesian inference structures inside attention, with potential implications for understanding and designing probabilistic reasoning in transformers. The comparison of SGD dynamics to an explicit EM baseline in the simulations is a positive step toward reproducibility and falsifiability.
major comments (2)
- [§4] §4 (sticky Markov-chain simulations): the central claim that the induced dynamics 'sculpt the low-dimensional manifolds identified in our companion work' rests on qualitative observation of specialization; no explicit quantitative metric (Hausdorff distance, effective dimension, or KL divergence to the companion posterior geometry) is reported to establish that the attractor is the same manifold rather than another low-dimensional structure consistent with cross-entropy minimization.
- [§3] §3 (first-order analysis): while the local expressions for ∂L/∂s_ij and Δv_j follow directly from the softmax and cross-entropy assumptions, the manuscript does not supply the full intermediate derivation steps, error bounds, or verification that higher-order terms remain negligible under the controlled simulation conditions; this weakens the assertion that the dynamics generalize to the large-scale models referenced in the abstract.
minor comments (2)
- Notation: the definition of the advantage term b_ij is introduced without an explicit statement of the upstream gradient u_i's dependence on the loss; a short clarifying sentence would improve readability.
- Figure clarity: the simulation plots comparing SGD and EM trajectories would benefit from an additional panel showing the evolution of effective dimension or posterior alignment metric.
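The three metrics named in the first major comment are straightforward to compute on finite samples. The following is a minimal sketch, not the authors' evaluation code. It assumes the manifolds are represented as finite point sets, that "effective dimension" means the participation ratio (tr C)² / tr(C²) of the sample covariance C, and that distributions are given as discrete probability vectors; the paper or its companion may define these quantities differently.

```python
import math

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two finite point sets (Euclidean)."""
    def directed(xs, ys):
        return max(min(math.dist(x, y) for y in ys) for x in xs)
    return max(directed(points_a, points_b), directed(points_b, points_a))

def participation_ratio(points):
    """Effective dimension as (tr C)^2 / tr(C^2) for the sample covariance C.

    Assumes the points are not all identical (otherwise tr(C^2) is zero).
    """
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - mean[k] for k in range(dim)] for p in points]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(dim)]
           for a in range(dim)]
    trace = sum(cov[a][a] for a in range(dim))
    # For a symmetric matrix, tr(C^2) equals the sum of squared entries.
    trace_sq = sum(cov[a][b] ** 2 for a in range(dim) for b in range(dim))
    return trace ** 2 / trace_sq

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For example, `participation_ratio` returns 1 for points lying on a line and 2 for an isotropic planar cloud, so a reported value can be read as a rough count of the directions that carry variance.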
Simulated Author's Rebuttal
We thank the referee for the constructive report and the positive assessment of the work's significance. We respond point by point to the major comments and indicate the revisions we will implement.
-
Referee: [§4] §4 (sticky Markov-chain simulations): the central claim that the induced dynamics 'sculpt the low-dimensional manifolds identified in our companion work' rests on qualitative observation of specialization; no explicit quantitative metric (Hausdorff distance, effective dimension, or KL divergence to the companion posterior geometry) is reported to establish that the attractor is the same manifold rather than another low-dimensional structure consistent with cross-entropy minimization.
Authors: We agree that the current presentation relies on qualitative observation and that this leaves the identification of the attractor open to alternative interpretations. In the revised manuscript we will add quantitative metrics in §4, including the Hausdorff distance between the simulated trajectories and the manifolds reported in the companion work, as well as the effective dimension of the learned attention and value representations. We will also report the KL divergence between the empirical distribution of attention weights under SGD and the posterior geometry obtained from the closed-form EM baseline. These additions will be accompanied by a brief discussion of any residual discrepancies. revision: yes
-
Referee: [§3] §3 (first-order analysis): while the local expressions for ∂L/∂s_ij and Δv_j follow directly from the softmax and cross-entropy assumptions, the manuscript does not supply the full intermediate derivation steps, error bounds, or verification that higher-order terms remain negligible under the controlled simulation conditions; this weakens the assertion that the dynamics generalize to the large-scale models referenced in the abstract.
Authors: We accept that the intermediate steps were omitted for conciseness. The revised version will include a dedicated appendix that presents the complete derivation of the advantage-based routing law and the responsibility-weighted value update, beginning from the cross-entropy loss and the softmax attention definition. We will also add a short verification subsection in the simulations that compares the first-order predictions against numerically computed full gradients under the same controlled conditions, thereby confirming that higher-order contributions remain small. Finally, we will revise the abstract and discussion to clarify that the analysis supplies a local mechanistic account whose direct extrapolation to very large models remains an empirical question for future work. revision: yes
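A check of the kind the authors promise can be sketched on a toy model by comparing the closed-form score gradient with a finite-difference gradient of the full loss. The setup below is hypothetical and is not the paper's simulation: it assumes a single head with free score parameters, a linear readout matrix `readout`, and a cross-entropy loss on integer targets. It tests only that the routing-law expression matches the numerically computed gradient in this toy; it does not address higher-order effects over many training steps.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def forward(s_i, values, readout):
    """Return (alpha_i, class probabilities) for one query position."""
    alpha = softmax(s_i)
    dim = len(values[0])
    o = [sum(a * v[k] for a, v in zip(alpha, values)) for k in range(dim)]
    logits = [sum(w[k] * o[k] for k in range(dim)) for w in readout]
    return alpha, softmax(logits)

def loss(scores, values, readout, targets):
    """Summed cross-entropy over query positions."""
    return -sum(math.log(forward(s_i, values, readout)[1][y])
                for s_i, y in zip(scores, targets))

def analytic_score_grad(scores, values, readout, targets):
    """Routing law: alpha_ij * (b_ij - E_{alpha_i}[b]) with b_ij = u_i . v_j."""
    dim = len(values[0])
    grads = []
    for s_i, y in zip(scores, targets):
        alpha, probs = forward(s_i, values, readout)
        err = [p - (1.0 if c == y else 0.0) for c, p in enumerate(probs)]
        # Upstream gradient at the head output: u_i = readout^T (probs - onehot).
        u = [sum(readout[c][k] * err[c] for c in range(len(readout)))
             for k in range(dim)]
        b = [sum(u[k] * v[k] for k in range(dim)) for v in values]
        baseline = sum(a * x for a, x in zip(alpha, b))
        grads.append([a * (x - baseline) for a, x in zip(alpha, b)])
    return grads

def numeric_score_grad(scores, values, readout, targets, eps=1e-6):
    """Central finite differences of the full loss with respect to each score."""
    grads = []
    for i, row in enumerate(scores):
        grad_row = []
        for j in range(len(row)):
            plus = [r[:] for r in scores]
            minus = [r[:] for r in scores]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad_row.append((loss(plus, values, readout, targets)
                             - loss(minus, values, readout, targets)) / (2 * eps))
        grads.append(grad_row)
    return grads

# Hypothetical toy instance: 2 queries, 3 values of dimension 2, 2 classes.
scores = [[0.2, -0.5, 1.0], [1.5, 0.0, -0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
readout = [[1.0, -1.0], [0.5, 2.0]]
targets = [0, 1]

analytic = analytic_score_grad(scores, values, readout, targets)
numeric = numeric_score_grad(scores, values, readout, targets)
max_gap = max(abs(a - n) for ra, rn in zip(analytic, numeric)
              for a, n in zip(ra, rn))
```

In this toy the two gradients should agree up to finite-difference error, so `max_gap` is expected to be very small. In a full transformer the scores depend on queries and keys, so the same comparison would need to be run through those parameters as well.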
Circularity Check
Manifold-sculpting claim depends on self-citation to companion work without independent geometric verification
specific steps
-
self citation load bearing
[Abstract]
"Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference."
The demonstration that the derived dynamics sculpt the specific Bayesian manifolds identified in the companion work is justified solely by citation to that prior identification (by the same author group) rather than by an independent re-derivation or quantitative match performed inside the present manuscript.
full rationale
The first-order gradient derivations for the advantage-based routing law ∂L/∂s_ij = α_ij (b_ij − E_α_i [b]) and the responsibility-weighted value update Δv_j = −η ∑_i α_ij u_i follow directly from the stated cross-entropy loss and softmax attention definitions in §3, supplying independent analytic content. The paper further interprets the resulting positive-feedback specialization as a two-timescale EM procedure on the basis of those same equations. However, the central unification claim—that these dynamics sculpt the specific low-dimensional Bayesian manifolds—is supported only by qualitative simulation observations plus explicit reference to the identification performed in the companion paper by overlapping authors. No quantitative metric confirming geometric equivalence (e.g., Hausdorff distance, effective dimension, or posterior KL) is supplied here, so the most ambitious interpretation of the result reduces to that self-citation for its load-bearing step.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The provided first-order analysis is complete for the attention head under cross-entropy.
invented entities (1)
-
Bayesian manifolds
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
advantage-based routing law for attention scores, ∂L/∂s_ij = α_ij (b_ij − E_α_i [b]), coupled with a responsibility-weighted update for values, Δv_j = −η ∑_i α_ij u_i. These equations induce a positive feedback loop... two-timescale EM procedure
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
The Bayesian Geometry of Transformer Attention
Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.
-
Geometric Scaling of Bayesian Inference in LLMs
Large language models preserve a geometric substrate in value representations that correlates with uncertainty and matches patterns from small models performing exact Bayesian inference.
Reference graph
Works this paper leans on
-
[1]
The Bayesian Geometry of Transformer Attention
Naman Agarwal, Siddhartha R. Dalal, and Vishal Misra. 2025. The Bayesian Geometry of Transformer Attention. arXiv:2512.22471 [cs.LG] https://arxiv.org/abs/2512.22471 Paper I of the Bayesian Attention Trilogy
-
[2]
Jimmy Ba, Murat A Erdogdu, Marzyeh Ghassemi, Taiji Suzuki, Denny Wu, and Tianzong Zhang. 2022. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. Advances in Neural Information Processing Systems 35 (2022).
-
[3]
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2...
-
[4]
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, et al. 2022. In-Context Learning and Induction Heads. Transformer Circuits Thread, Anthropic. https://transformer- circuits.pub/2022/in-contex...
-
[5]
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. 2018. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research 19, 70 (2018), 1–57.
-
[6]
Johannes von Oswald, Christian Henning, Adrià Garriga-Alonso, Massimo Caccia, Frederik Träuble, Benjamin F. Grewe, Bernhard Schölkopf, Claudia Clopath, and Johanni Brea. 2023. Transformers as Meta-Learners for Bayesian Inference. arXiv preprint arXiv:2305.14034 (2023).
-
[7]
Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2022. An Explanation of In-Context Learning as Implicit Bayesian Inference. In International Conference on Learning Representations.