pith. machine review for the scientific record.

arxiv: 2605.12412 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

Atticus Geiger, Daniel Wurgaft, Ekdeep Singh Lubana, Eric Bigelow, Jack Merullo, Owen Lewis, Raphaël Sarfati, Thomas McGrath

Pith reviewed 2026-05-13 05:10 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords in-context learning · belief updating · conceptual space · language models · trajectories · manifolds · story understanding · geometric representations

The pith

Large language models update beliefs by tracing trajectories through a low-dimensional conceptual belief space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language models change their beliefs while processing stories, framing these changes as movement along paths in a structured geometric space rather than scattered updates. It shows that these paths lie on low-dimensional manifolds visible both in what the model says and in its internal activations. Simple linear methods can read out the current position on the path and forecast where it will go next. Direct changes to those activations can push the path in predictable directions based on the space's layout. If this account holds, in-context learning becomes a form of navigation whose rules can be read from the geometry itself.
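
This summary does not say which "simple linear methods" are meant, so the following is a minimal sketch of what a linear readout and one-step forecast could look like. The hidden states, belief scores, shapes, and ridge estimator are all illustrative stand-ins, not the paper's pipeline.

```python
# Sketch of a linear belief readout and a one-step forecast.
# H[t] stands in for the hidden state at sentence t; Y[t, c] stands in
# for the behaviorally elicited belief score for concept c. Both are
# synthetic: hidden states follow a slow random walk so consecutive
# sentences carry correlated information, and beliefs are a planted
# linear function of the state.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
T, d, C = 400, 256, 6                   # sentences, hidden size, concepts

H = np.cumsum(0.1 * rng.normal(size=(T, d)), axis=0)  # drifting states
W_true = rng.normal(size=(d, C))
Y = H @ W_true + 0.1 * rng.normal(size=(T, C))        # belief scores

# 1) Read out the *current* belief state from the hidden state.
H_tr, H_te, Y_tr, Y_te = train_test_split(H, Y, test_size=0.25, random_state=0)
probe = Ridge(alpha=1.0).fit(H_tr, Y_tr)
print("readout R^2:", round(probe.score(H_te, Y_te), 3))

# 2) Forecast the *next* belief state: regress state_t -> belief_{t+1},
#    training on early sentences and scoring on later ones.
split = 300
forecaster = Ridge(alpha=1.0).fit(H[:split - 1], Y[1:split])
print("forecast R^2:", round(forecaster.score(H[split:-1], Y[split + 1:]), 3))
```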

Core claim

Large language models assign beliefs over a low-dimensional geometric space, a conceptual belief space, and in-context learning corresponds to a trajectory through this space as beliefs are updated over time. In story understanding, used here as a natural setting for dynamic belief updating, belief updates are well-described as trajectories on low-dimensional, structured manifolds. This structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior. Interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space.
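
To make the manifold claim concrete, here is a hedged sketch of the kind of check it implies: run PCA over per-sentence belief vectors and look for a sharp drop in explained variance. The planted 2-D latent path and all names are invented for illustration; the paper's actual constructions of the behavior manifold M_y and activation manifold M_z are not specified in this summary.

```python
# Sketch: project per-sentence belief vectors onto principal components
# and read off a trajectory. The 2-D latent path planted below is an
# illustrative stand-in for a story's belief dynamics.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
T, C = 60, 6                            # sentences in one story, concepts

t = np.linspace(0, 2 * np.pi, T)
latent = np.stack([np.cos(t), np.sin(2 * t)], axis=1)      # hidden 2-D path
Y = latent @ rng.normal(size=(2, C)) + 0.05 * rng.normal(size=(T, C))

pca = PCA(n_components=4).fit(Y)
trajectory = pca.transform(Y)           # the story's path in belief space
print("variance explained:", np.round(pca.explained_variance_ratio_, 3))
# A sharp drop after the first two components is what "trajectories on
# low-dimensional manifolds" would look like in this toy example.
```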

What carries the argument

Conceptual belief space: the low-dimensional geometric manifold in which LLMs represent beliefs and along which in-context learning moves as a trajectory.

If this is right

  • Belief changes during reading can be tracked and visualized as continuous paths rather than discrete flips.
  • Linear probes applied to hidden states can forecast how a model will interpret later parts of a story.
  • Targeted edits to representations can steer belief paths toward or away from specific conclusions in a geometry-governed way (a generic steering sketch follows this list).
  • The same geometric description links observable outputs to the underlying representational changes.
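
One generic recipe for the representation edits described above (not necessarily the paper's method) is additive activation steering via a forward hook. The toy linear "block", the concept direction, and the strength alpha below are all illustrative assumptions.

```python
# Generic additive steering sketch (not necessarily the paper's method):
# add a scaled concept direction to a layer's output via a forward hook.
# The toy linear "block", the concept direction, and alpha are stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
block = nn.Linear(d, d)                          # stand-in for a model layer
concept_dir = torch.randn(d)
concept_dir = concept_dir / concept_dir.norm()   # unit "sadness" direction
alpha = 4.0                                      # steering strength (free parameter)

def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer output.
    return output + alpha * concept_dir

handle = block.register_forward_hook(steer)
x = torch.randn(1, d)
steered = block(x)
handle.remove()
unsteered = block(x)

# The edit moves the representation by exactly alpha along the direction.
print("shift along concept direction:",
      ((steered - unsteered) @ concept_dir).item())
```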

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same trajectory view could be tested on non-narrative tasks such as multi-step reasoning or dialogue to see whether belief space remains low-dimensional.
  • If the geometry is stable, it might support methods that monitor and correct drifting beliefs in deployed systems without retraining.
  • The framework invites comparison between model trajectories and human belief updating when people read the same stories.

Load-bearing premise

The low-dimensional structure and linear decodability reflect an intrinsic geometric organization of beliefs rather than an artifact of the particular stories, models, or measurement methods chosen.

What would settle it

If editing the identified directions in the model's internal representations fails to shift subsequent story judgments in the directions predicted by the geometry, or if the low-dimensional manifolds disappear under new story sets or different models.
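
The manifold-stability half of this test has a direct operationalization: fit the belief subspace separately on two disjoint story sets and measure the principal angles between them. The synthetic data and the choice of principal angles as the comparison metric are assumptions of this sketch, not the paper's protocol.

```python
# Sketch of the stability check: fit a k-dimensional belief subspace on
# two disjoint story sets and compare them with principal angles.
# Synthetic data; in the stable case planted here, both sets share one
# latent basis, so the angles should be near zero.
import numpy as np
from scipy.linalg import subspace_angles
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
d, k = 64, 3                            # ambient dim, tested manifold dim
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]   # shared latent basis

def story_set(n_points):
    # Points near the same k-dim subspace: the "geometry is real" case.
    return rng.normal(size=(n_points, k)) @ basis.T \
        + 0.05 * rng.normal(size=(n_points, d))

U = PCA(n_components=k).fit(story_set(500)).components_.T  # set A subspace
V = PCA(n_components=k).fit(story_set(500)).components_.T  # set B subspace

print("principal angles (deg):", np.degrees(subspace_angles(U, V)).round(2))
# Angles near 90 degrees on real data would support the failure mode
# described above: a manifold that does not survive a new story set.
```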

Figures

Figures reproduced from arXiv:2605.12412 by Atticus Geiger, Daniel Wurgaft, Ekdeep Singh Lubana, Eric Bigelow, Jack Merullo, Owen Lewis, Raphaël Sarfati, Thomas McGrath. Truncated captions end with […].

Figure 1: Conceptual belief trajectories. Model beliefs qualitatively follow reasonable patterns for a story: here, happiness drops when the protagonist falls into a hole, sadness increases when he discovers a group of scared creatures trapped further down, and both reverse at the end of the story when the protagonist rescues the creatures (full story text in App. A). […]

Figure 2: Theory of belief update. We extend the conceptual space framework of Gärdenfors [2000] to account for belief and uncertainty; axes represent two dimensions in a conceptual space. (Left) Concepts are defined as convex sub-spaces, (Middle) belief states are probability distributions over a conceptual space, and (Right) belief trajectories are sequences of belief states, e.g., in the mind of a reader as they p[…]

Figure 3: Three conceptual domains. (Top) Belief manifolds M_y across three domains, with the trajectory plotted for the same story as in Fig. 1.

Figure 4: Distance reveals domain structure. (Left) Distance matrices for pairs of concepts d_M(c, c′) on the behavior manifold M_y and the activation manifold M_z, computed between concept centroids. For Emotions and Genres, a common hierarchical structure is shared across M_y and M_z, and concept centroid distances d_M(c, c′) are highly correlated between the two (r = .92, p < .001 for Emotions; r = .89, p < .001 f[…]

Figure 5: Steering along manifolds. When we steer the story shown in Figs. 1 and 3, the story's trajectory in belief space shifts towards the parts of the manifold that correspond to a particular concept. Each panel shows the effect of steering the trajectory b_t towards a particular concept c, along with the manifold M_y or M_z, each point colored by the value y_{t,c} for that concept. Results for all concepts list[…]

Figure 6: Steering entanglement follows manifold geometry. Steering for a target concept c (like sadness) often increases belief in related but unintended concepts c′ (like anger). (Left) Steering effect, or change in y_{t,c}, when steering for a particular target concept c (y-axes on heatmaps) and measuring belief in query concept c′; Emotions domain on top, Genres on the bottom. (Middle) Steering effec[…]

Figure 7: Full text for the story in Figs. 1, 3, and 5, split into numbered sentences.

Figure 8: Steering effects for each individual concept.

Figure 9: Steering effects for each individual concept.

Figure 10: Distance matrix results […]

Figure 11: Steering entanglement and prediction […]

Figure 12: Linear probe accuracy for each layer, across each domain.

Figure 13: Using probe weights to steer at layer 8 only, the steering effect disappears after a […]

Figure 14: Using probe weights to steer at layers 7-14, the steering effect persists until the […]

Figure 15: Correlations in y_{t,c} for each pair of concepts in the Emotions domain. A subset of 2000 random data points is shown.

Figure 16: Correlations in y_{t,c} for each pair of concepts in the Genres domain. A subset of 2000 random data points is shown.

Figure 17: Correlations in y_{t,c} for each pair of concepts in the Arbitrary domain. A subset of 2000 random data points is shown.

Figure 18: (Left) Probe predictions ŷ_{t,c} are initially miscalibrated with respect to ground truth y_{t,c}. (Right) After applying isotonic regression, predictions are more calibrated and less systematically biased. (A minimal calibration sketch follows this list.)

Figure 19: Activation steering alters model belief dynamics, depending on […] (panels: steering with the happiness vector vs. the sadness vector).

Figure 20: Belief dynamics in y_{t,c} for two additional stories (axes: activation PCA 1 and 2; points colored by emotion: disgust, sadness, happiness, anger, surprise, fear).

Figure 21: PCA of activations for max-activating examples, colored by story progress, i.e., […]
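
Figures 15 and 18 describe correcting systematically biased probe predictions with isotonic regression; the sketch below shows that step in miniature on synthetic predictions, with the bias model and noise level invented for illustration.

```python
# Miniature of the calibration step in Figs. 15/18: remap biased probe
# predictions onto ground truth with isotonic regression. The bias model
# (shrunk slope plus noise) is invented for illustration.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
y_true = rng.uniform(0.0, 1.0, size=2000)               # ground truth y_{t,c}
y_pred = 0.5 + 0.3 * (y_true - 0.5) + 0.02 * rng.normal(size=2000)

iso = IsotonicRegression(out_of_bounds="clip").fit(y_pred, y_true)
y_cal = iso.predict(y_pred)

print("MSE before calibration:", round(float(np.mean((y_pred - y_true) ** 2)), 4))
print("MSE after calibration: ", round(float(np.mean((y_cal - y_true) ** 2)), 4))
```
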
Original abstract

Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geometric space - a conceptual belief space - and that in-context learning corresponds to a trajectory through this space as beliefs are updated over time. Using story understanding as a natural setting for dynamic belief updating, we combine behavioral and representational analyses to study these trajectories. We find that (1) belief updates are well-described as trajectories on low-dimensional, structured manifolds; (2) this structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior; and (3) interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space. Together, our results provide a geometric account of belief dynamics in LLMs, grounding Bayesian interpretations of in-context learning in structured conceptual representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs update beliefs during in-context learning by traversing trajectories on low-dimensional, structured manifolds in a conceptual belief space. Using story comprehension tasks to induce dynamic belief updates, it combines behavioral analyses, representational geometry from model activations, linear probes for decoding, and causal interventions to show that these trajectories are consistent across behavior and internals, predictable via simple linear methods, and steerable in ways aligned with the space's geometry.

Significance. If the central claims hold after addressing methodological concerns, this provides a geometric grounding for Bayesian interpretations of in-context learning, linking observable behavior to internal representations with causal evidence. The integration of behavioral, representational, and interventional methods is a strength, as is the attempt to make predictions from the geometry itself. It could inform more interpretable models of LLM belief dynamics if the low-dimensional structure proves intrinsic rather than stimulus-specific.

major comments (3)
  1. [§3.2] §3.2 (Dimensionality reduction): The dimensionality of the conceptual belief space is selected post-hoc based on variance explained in the activations from the fixed story set. This directly bears on the central claim of an intrinsic low-dimensional manifold; without pre-specification, cross-validation across held-out story collections, or testing on varied narrative axes, the recovered structure risks being an artifact of the low-rank input distribution rather than a property of the model's hypothesis space. (A cross-validation sketch follows this report.)
  2. [§5] §5 (Interventions): The causal interventions on representations are reported to steer belief trajectories with geometrically predictable effects, but the section lacks controls such as intervention magnitude matching, sham perturbations, or comparisons to directions orthogonal to the conceptual space. This is load-bearing for the claim that effects follow from the geometry rather than generic activation changes.
  3. [§4.3] §4.3 (Linear probes): The probes decode the conceptual space from activations to predict behavior, yet the space itself is derived from the same activations used for both probing and intervention. This circularity risk (noted in the stress-test) undermines independence; the paper should report performance on activations from a separate model or task to verify the structure is not analysis-defined.
minor comments (2)
  1. [Figure 3] Figure 3: The manifold visualization axes are not labeled with respect to the principal components or conceptual dimensions; clarify what each axis represents to aid interpretability.
  2. [Abstract] The abstract and introduction use 'parameter-free' for the geometric account, but the dimensionality choice introduces a free parameter; revise for precision.
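
Major comment 1 asks for cross-validated dimensionality selection. A minimal version, assuming synthetic activations with a planted 4-dimensional subspace, is to pick the dimensionality where held-out reconstruction error flattens. Everything here is a stand-in for the paper's data.

```python
# Sketch of cross-validated dimensionality selection: fit PCA on
# training stories, score reconstruction on held-out stories, and pick
# the k where the error curve flattens. A 4-dimensional subspace is
# planted in the synthetic activations.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
d, k_true = 128, 4
basis = np.linalg.qr(rng.normal(size=(d, k_true)))[0]

def activations(n):
    return 3.0 * rng.normal(size=(n, k_true)) @ basis.T \
        + 0.1 * rng.normal(size=(n, d))

X_train, X_test = activations(600), activations(200)

for k in range(1, 9):
    pca = PCA(n_components=k).fit(X_train)   # fit on training stories only
    X_hat = pca.inverse_transform(pca.transform(X_test))
    print(f"k={k}: held-out reconstruction MSE {np.mean((X_test - X_hat) ** 2):.4f}")
# The error should stop improving at k=4 here; on real activations, a
# flattening well short of the story count is the non-artifact signature.
```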

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which highlight important methodological considerations for strengthening our claims about the structure of conceptual belief spaces in LLMs. We address each major comment point by point below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Dimensionality reduction): The dimensionality of the conceptual belief space is selected post-hoc based on variance explained in the activations from the fixed story set. This directly bears on the central claim of an intrinsic low-dimensional manifold; without pre-specification, cross-validation across held-out story collections, or testing on varied narrative axes, the recovered structure risks being an artifact of the low-rank input distribution rather than a property of the model's hypothesis space.

    Authors: We selected the dimensionality using the standard approach of identifying the elbow in the variance explained curve from PCA applied to the model activations. This is not entirely post-hoc as it follows established practices in analyzing representational geometry. Nevertheless, to directly address the concern about potential artifacts from the fixed story set, we will incorporate cross-validation in the revised version: specifically, we will partition the stories into training and held-out sets, derive the dimensionality and principal components from the training set, and then evaluate the consistency of the low-dimensional trajectories and structure on the held-out stories. We will also extend the analysis to include stories varying along additional narrative dimensions to test generalizability beyond the original set. revision: partial

  2. Referee: [§5] §5 (Interventions): The causal interventions on representations are reported to steer belief trajectories with geometrically predictable effects, but the section lacks controls such as intervention magnitude matching, sham perturbations, or comparisons to directions orthogonal to the conceptual space. This is load-bearing for the claim that effects follow from the geometry rather than generic activation changes.

    Authors: We concur that additional controls are essential to substantiate that the observed steering effects arise from the geometry of the conceptual space rather than nonspecific activation perturbations. Accordingly, we will revise §5 to include the following: sham interventions using random vectors in the activation space with magnitudes matched to the conceptual interventions; explicit reporting of magnitude matching across all conditions; and interventions along directions orthogonal to the primary conceptual axes, with comparisons of their effects on belief trajectories. These controls will demonstrate the specificity of the geometric predictions (a control sketch follows these responses). revision: yes

  3. Referee: [§4.3] §4.3 (Linear probes): The probes decode the conceptual space from activations to predict behavior, yet the space itself is derived from the same activations used for both probing and intervention. This circularity risk (noted in the stress-test) undermines independence; the paper should report performance on activations from a separate model or task to verify the structure is not analysis-defined.

    Authors: The manuscript includes a stress-test to partially address independence by applying the probes to different story collections. However, we recognize the value of further validation using separate models or tasks. In the revision, we will add results from linear probes trained and tested on activations from a distinct model variant (such as a different size or family) performing analogous story comprehension tasks. This will help confirm that the decoded conceptual structure is not solely an artifact of the analysis on the primary model. We note that while full separation is ideal, the core claims are supported by the convergence of behavioral, representational, and interventional evidence. revision: partial
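
A minimal sketch of the controls promised in response 2, under the assumption that belief measurement can be approximated by a linear readout: compare steering along the concept direction against a magnitude-matched random direction and a direction orthogonalized against the concept. All quantities are synthetic.

```python
# Sketch of steering controls: concept direction vs. magnitude-matched
# random direction vs. direction orthogonal to the concept. The linear
# "readout" stands in for downstream belief measurement.
import numpy as np

rng = np.random.default_rng(5)
d = 128
concept = rng.normal(size=d); concept /= np.linalg.norm(concept)
readout = concept + 0.1 * rng.normal(size=d)   # belief readout, ~aligned

def effect(direction, alpha=2.0, n=200):
    H = rng.normal(size=(n, d))                # baseline activations
    shift = (H + alpha * direction) @ readout - H @ readout
    return float(np.mean(shift))               # mean change in belief score

random_dir = rng.normal(size=d); random_dir /= np.linalg.norm(random_dir)
orth = rng.normal(size=d)
orth -= (orth @ concept) * concept             # remove concept component
orth /= np.linalg.norm(orth)

print("concept steering effect:   ", round(effect(concept), 3))
print("matched random-dir effect: ", round(effect(random_dir), 3))
print("orthogonal-dir effect:     ", round(effect(orth), 3))
# Geometry-specific steering predicts a large first number and ~0 others.
```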

Circularity Check

0 steps flagged

No significant circularity in empirical analysis of belief trajectories

Full rationale

The paper reports experimental results from inducing belief updates via stories in LLMs, followed by dimensionality reduction on activations to identify manifolds, linear probes to decode behavior, and targeted interventions to test causal effects. These steps rely on data-driven measurements and statistical methods applied to model outputs and internal states rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation that reduces the central claims to the inputs by construction. The geometric account is presented as an empirical finding supported by the observed consistency across behavior, representations, and interventions, without equations or derivations that equate outputs to inputs tautologically.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that LLM beliefs occupy a low-dimensional geometric space whose structure can be read out linearly and manipulated causally; this space is introduced by the paper rather than derived from prior theory.

free parameters (1)
  • dimensionality of conceptual belief space
    The paper states that trajectories lie on low-dimensional manifolds but does not specify how the exact dimensionality is chosen or validated.
axioms (1)
  • domain assumption: In-context learning can be viewed as Bayesian inference over a latent hypothesis space
    Stated in the opening sentence of the abstract as the starting point for the geometric proposal.
invented entities (1)
  • conceptual belief space (no independent evidence)
    purpose: To provide a geometric substrate in which belief updates appear as trajectories
    Newly proposed construct whose existence and structure are supported only by the analyses in this paper.

pith-pipeline@v0.9.0 · 5511 in / 1496 out tokens · 60628 ms · 2026-05-13T05:10:16.029400+00:00 · methodology


