pith. machine review for the scientific record.

arxiv: 2605.12412 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

Atticus Geiger, Daniel Wurgaft, Ekdeep Singh Lubana, Eric Bigelow, Jack Merullo, Owen Lewis, Raphaël Sarfati, Thomas McGrath

Pith reviewed 2026-05-13 05:10 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords in-context learning · belief updating · conceptual space · language models · trajectories · manifolds · story understanding · geometric representations

The pith

Large language models update beliefs by tracing trajectories through a low-dimensional conceptual belief space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language models change their beliefs while processing stories, framing these changes as movement along paths in a structured geometric space rather than scattered updates. It shows that these paths lie on low-dimensional manifolds visible both in what the model says and in its internal activations. Simple linear methods can read out the current position on the path and forecast where it will go next. Direct changes to those activations can push the path in predictable directions based on the space's layout. If this account holds, in-context learning becomes a form of navigation whose rules can be read from the geometry itself.
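
This summary does not say which "simple linear methods" are meant, so the following is a minimal sketch of what a linear readout and one-step forecast could look like. The hidden states, belief scores, shapes, and ridge estimator are all illustrative stand-ins, not the paper's pipeline.

```python
# Sketch of a linear belief readout and a one-step forecast.
# H[t] stands in for the hidden state at sentence t; Y[t, c] stands in
# for the behaviorally elicited belief score for concept c. Both are
# synthetic: hidden states follow a slow random walk so consecutive
# sentences carry correlated information, and beliefs are a planted
# linear function of the state.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
T, d, C = 400, 256, 6                   # sentences, hidden size, concepts

H = np.cumsum(0.1 * rng.normal(size=(T, d)), axis=0)  # drifting states
W_true = rng.normal(size=(d, C))
Y = H @ W_true + 0.1 * rng.normal(size=(T, C))        # belief scores

# 1) Read out the *current* belief state from the hidden state.
H_tr, H_te, Y_tr, Y_te = train_test_split(H, Y, test_size=0.25, random_state=0)
probe = Ridge(alpha=1.0).fit(H_tr, Y_tr)
print("readout R^2:", round(probe.score(H_te, Y_te), 3))

# 2) Forecast the *next* belief state: regress state_t -> belief_{t+1},
#    training on early sentences and scoring on later ones.
split = 300
forecaster = Ridge(alpha=1.0).fit(H[:split - 1], Y[1:split])
print("forecast R^2:", round(forecaster.score(H[split:-1], Y[split + 1:]), 3))
```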

Core claim

Large language models assign beliefs over a low-dimensional geometric space, a conceptual belief space, and in-context learning corresponds to a trajectory through this space as beliefs are updated over time. In story understanding, used here as a natural setting for dynamic belief updating, belief updates are well-described as trajectories on low-dimensional, structured manifolds. This structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior. Interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space.
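
To make the manifold claim concrete, here is a hedged sketch of the kind of check it implies: run PCA over per-sentence belief vectors and look for a sharp drop in explained variance. The planted 2-D latent path and all names are invented for illustration; the paper's actual constructions of the behavior manifold M_y and activation manifold M_z are not specified in this summary.

```python
# Sketch: project per-sentence belief vectors onto principal components
# and read off a trajectory. The 2-D latent path planted below is an
# illustrative stand-in for a story's belief dynamics.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
T, C = 60, 6                            # sentences in one story, concepts

t = np.linspace(0, 2 * np.pi, T)
latent = np.stack([np.cos(t), np.sin(2 * t)], axis=1)      # hidden 2-D path
Y = latent @ rng.normal(size=(2, C)) + 0.05 * rng.normal(size=(T, C))

pca = PCA(n_components=4).fit(Y)
trajectory = pca.transform(Y)           # the story's path in belief space
print("variance explained:", np.round(pca.explained_variance_ratio_, 3))
# A sharp drop after the first two components is what "trajectories on
# low-dimensional manifolds" would look like in this toy example.
```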

What carries the argument

Conceptual belief space: the low-dimensional geometric manifold in which LLMs represent beliefs and along which in-context learning moves as a trajectory.

If this is right

  • Belief changes during reading can be tracked and visualized as continuous paths rather than discrete flips.
  • Linear probes applied to hidden states can forecast how a model will interpret later parts of a story.
  • Targeted edits to representations can steer belief paths toward or away from specific conclusions in a geometry-governed way (a generic steering sketch follows this list).
  • The same geometric description links observable outputs to the underlying representational changes.
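
One generic recipe for the representation edits described above (not necessarily the paper's method) is additive activation steering via a forward hook. The toy linear "block", the concept direction, and the strength alpha below are all illustrative assumptions.

```python
# Generic additive steering sketch (not necessarily the paper's method):
# add a scaled concept direction to a layer's output via a forward hook.
# The toy linear "block", the concept direction, and alpha are stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
block = nn.Linear(d, d)                          # stand-in for a model layer
concept_dir = torch.randn(d)
concept_dir = concept_dir / concept_dir.norm()   # unit "sadness" direction
alpha = 4.0                                      # steering strength (free parameter)

def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer output.
    return output + alpha * concept_dir

handle = block.register_forward_hook(steer)
x = torch.randn(1, d)
steered = block(x)
handle.remove()
unsteered = block(x)

# The edit moves the representation by exactly alpha along the direction.
print("shift along concept direction:",
      ((steered - unsteered) @ concept_dir).item())
```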

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same trajectory view could be tested on non-narrative tasks such as multi-step reasoning or dialogue to see whether belief space remains low-dimensional.
  • If the geometry is stable, it might support methods that monitor and correct drifting beliefs in deployed systems without retraining.
  • The framework invites comparison between model trajectories and human belief updating when people read the same stories.

Load-bearing premise

The low-dimensional structure and linear decodability reflect an intrinsic geometric organization of beliefs rather than an artifact of the particular stories, models, or measurement methods chosen.

What would settle it

If editing the identified directions in the model's internal representations fails to shift subsequent story judgments in the directions predicted by the geometry, or if the low-dimensional manifolds disappear under new story sets or different models.
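
The manifold-stability half of this test has a direct operationalization: fit the belief subspace separately on two disjoint story sets and measure the principal angles between them. The synthetic data and the choice of principal angles as the comparison metric are assumptions of this sketch, not the paper's protocol.

```python
# Sketch of the stability check: fit a k-dimensional belief subspace on
# two disjoint story sets and compare them with principal angles.
# Synthetic data; in the stable case planted here, both sets share one
# latent basis, so the angles should be near zero.
import numpy as np
from scipy.linalg import subspace_angles
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
d, k = 64, 3                            # ambient dim, tested manifold dim
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]   # shared latent basis

def story_set(n_points):
    # Points near the same k-dim subspace: the "geometry is real" case.
    return rng.normal(size=(n_points, k)) @ basis.T \
        + 0.05 * rng.normal(size=(n_points, d))

U = PCA(n_components=k).fit(story_set(500)).components_.T  # set A subspace
V = PCA(n_components=k).fit(story_set(500)).components_.T  # set B subspace

print("principal angles (deg):", np.degrees(subspace_angles(U, V)).round(2))
# Angles near 90 degrees on real data would support the failure mode
# described above: a manifold that does not survive a new story set.
```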

Figures

Figures reproduced from arXiv:2605.12412 by Atticus Geiger, Daniel Wurgaft, Ekdeep Singh Lubana, Eric Bigelow, Jack Merullo, Owen Lewis, Raphaël Sarfati, Thomas McGrath. Truncated captions end with […].

Figure 1: Conceptual belief trajectories. Model beliefs qualitatively follow reasonable patterns for a story: here, happiness drops when the protagonist falls into a hole, sadness increases when he discovers a group of scared creatures trapped further down, and both reverse at the end of the story when the protagonist rescues the creatures (full story text in App. A). […]

Figure 2: Theory of belief update. We extend the conceptual space framework of Gärdenfors [2000] to account for belief and uncertainty; axes represent two dimensions in a conceptual space. (Left) Concepts are defined as convex sub-spaces, (Middle) belief states are probability distributions over a conceptual space, and (Right) belief trajectories are sequences of belief states, e.g., in the mind of a reader as they p[…]

Figure 3: Three conceptual domains. (Top) Belief manifolds M_y across three domains, with the trajectory plotted for the same story as in Fig. 1.

Figure 4: Distance reveals domain structure. (Left) Distance matrices for pairs of concepts d_M(c, c′) on the behavior manifold M_y and the activation manifold M_z, computed between concept centroids. For Emotions and Genres, a common hierarchical structure is shared across M_y and M_z, and concept centroid distances d_M(c, c′) are highly correlated between the two (r = .92, p < .001 for Emotions; r = .89, p < .001 f[…]

Figure 5: Steering along manifolds. When we steer the story shown in Figs. 1 and 3, the story's trajectory in belief space shifts towards the parts of the manifold that correspond to a particular concept. Each panel shows the effect of steering the trajectory b_t towards a particular concept c, along with the manifold M_y or M_z, each point colored by the value y_{t,c} for that concept. Results for all concepts list[…]

Figure 6: Steering entanglement follows manifold geometry. Steering for a target concept c (like sadness) often increases belief in related but unintended concepts c′ (like anger). (Left) Steering effect, or change in y_{t,c}, when steering for a particular target concept c (y-axes on heatmaps) and measuring belief in query concept c′; Emotions domain on top, Genres on the bottom. (Middle) Steering effec[…]

Figure 7: Full text for the story in Figs. 1, 3, and 5, split into numbered sentences.

Figure 8: Steering effects for each individual concept.

Figure 9: Steering effects for each individual concept.

Figure 10: Distance matrix results […]

Figure 11: Steering entanglement and prediction […]

Figure 12: Linear probe accuracy for each layer, across each domain.

Figure 13: Using probe weights to steer at layer 8 only, the steering effect disappears after a […]

Figure 14: Using probe weights to steer at layers 7-14, the steering effect persists until the […]

Figure 15: Correlations in y_{t,c} for each pair of concepts in the Emotions domain. A subset of 2000 random data points is shown.

Figure 16: Correlations in y_{t,c} for each pair of concepts in the Genres domain. A subset of 2000 random data points is shown.

Figure 17: Correlations in y_{t,c} for each pair of concepts in the Arbitrary domain. A subset of 2000 random data points is shown.

Figure 18: (Left) Probe predictions ŷ_{t,c} are initially miscalibrated with respect to ground truth y_{t,c}. (Right) After applying isotonic regression, predictions are more calibrated and less systematically biased. (A minimal calibration sketch follows this list.)

Figure 19: Activation steering alters model belief dynamics, depending on […] (panels: steering with the happiness vector vs. the sadness vector).

Figure 20: Belief dynamics in y_{t,c} for two additional stories (axes: activation PCA 1 and 2; points colored by emotion: disgust, sadness, happiness, anger, surprise, fear).

Figure 21: PCA of activations for max-activating examples, colored by story progress, i.e., […]
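
Figures 15 and 18 describe correcting systematically biased probe predictions with isotonic regression; the sketch below shows that step in miniature on synthetic predictions, with the bias model and noise level invented for illustration.

```python
# Miniature of the calibration step in Figs. 15/18: remap biased probe
# predictions onto ground truth with isotonic regression. The bias model
# (shrunk slope plus noise) is invented for illustration.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
y_true = rng.uniform(0.0, 1.0, size=2000)               # ground truth y_{t,c}
y_pred = 0.5 + 0.3 * (y_true - 0.5) + 0.02 * rng.normal(size=2000)

iso = IsotonicRegression(out_of_bounds="clip").fit(y_pred, y_true)
y_cal = iso.predict(y_pred)

print("MSE before calibration:", round(float(np.mean((y_pred - y_true) ** 2)), 4))
print("MSE after calibration: ", round(float(np.mean((y_cal - y_true) ** 2)), 4))
```
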
Original abstract

Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geometric space - a conceptual belief space - and that in-context learning corresponds to a trajectory through this space as beliefs are updated over time. Using story understanding as a natural setting for dynamic belief updating, we combine behavioral and representational analyses to study these trajectories. We find that (1) belief updates are well-described as trajectories on low-dimensional, structured manifolds; (2) this structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior; and (3) interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space. Together, our results provide a geometric account of belief dynamics in LLMs, grounding Bayesian interpretations of in-context learning in structured conceptual representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs update beliefs during in-context learning by traversing trajectories on low-dimensional, structured manifolds in a conceptual belief space. Using story comprehension tasks to induce dynamic belief updates, it combines behavioral analyses, representational geometry from model activations, linear probes for decoding, and causal interventions to show that these trajectories are consistent across behavior and internals, predictable via simple linear methods, and steerable in ways aligned with the space's geometry.

Significance. If the central claims hold after addressing methodological concerns, this provides a geometric grounding for Bayesian interpretations of in-context learning, linking observable behavior to internal representations with causal evidence. The integration of behavioral, representational, and interventional methods is a strength, as is the attempt to make predictions from the geometry itself. It could inform more interpretable models of LLM belief dynamics if the low-dimensional structure proves intrinsic rather than stimulus-specific.

major comments (3)
  1. [§3.2] §3.2 (Dimensionality reduction): The dimensionality of the conceptual belief space is selected post-hoc based on variance explained in the activations from the fixed story set. This directly bears on the central claim of an intrinsic low-dimensional manifold; without pre-specification, cross-validation across held-out story collections, or testing on varied narrative axes, the recovered structure risks being an artifact of the low-rank input distribution rather than a property of the model's hypothesis space. (A cross-validation sketch follows this report.)
  2. [§5] §5 (Interventions): The causal interventions on representations are reported to steer belief trajectories with geometrically predictable effects, but the section lacks controls such as intervention magnitude matching, sham perturbations, or comparisons to directions orthogonal to the conceptual space. This is load-bearing for the claim that effects follow from the geometry rather than generic activation changes.
  3. [§4.3] §4.3 (Linear probes): The probes decode the conceptual space from activations to predict behavior, yet the space itself is derived from the same activations used for both probing and intervention. This circularity risk (noted in the stress-test) undermines independence; the paper should report performance on activations from a separate model or task to verify the structure is not analysis-defined.
minor comments (2)
  1. [Figure 3] Figure 3: The manifold visualization axes are not labeled with respect to the principal components or conceptual dimensions; clarify what each axis represents to aid interpretability.
  2. [Abstract] The abstract and introduction use 'parameter-free' for the geometric account, but the dimensionality choice introduces a free parameter; revise for precision.
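
Major comment 1 asks for cross-validated dimensionality selection. A minimal version, assuming synthetic activations with a planted 4-dimensional subspace, is to pick the dimensionality where held-out reconstruction error flattens. Everything here is a stand-in for the paper's data.

```python
# Sketch of cross-validated dimensionality selection: fit PCA on
# training stories, score reconstruction on held-out stories, and pick
# the k where the error curve flattens. A 4-dimensional subspace is
# planted in the synthetic activations.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
d, k_true = 128, 4
basis = np.linalg.qr(rng.normal(size=(d, k_true)))[0]

def activations(n):
    return 3.0 * rng.normal(size=(n, k_true)) @ basis.T \
        + 0.1 * rng.normal(size=(n, d))

X_train, X_test = activations(600), activations(200)

for k in range(1, 9):
    pca = PCA(n_components=k).fit(X_train)   # fit on training stories only
    X_hat = pca.inverse_transform(pca.transform(X_test))
    print(f"k={k}: held-out reconstruction MSE {np.mean((X_test - X_hat) ** 2):.4f}")
# The error should stop improving at k=4 here; on real activations, a
# flattening well short of the story count is the non-artifact signature.
```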

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which highlight important methodological considerations for strengthening our claims about the structure of conceptual belief spaces in LLMs. We address each major comment point by point below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Dimensionality reduction): The dimensionality of the conceptual belief space is selected post-hoc based on variance explained in the activations from the fixed story set. This directly bears on the central claim of an intrinsic low-dimensional manifold; without pre-specification, cross-validation across held-out story collections, or testing on varied narrative axes, the recovered structure risks being an artifact of the low-rank input distribution rather than a property of the model's hypothesis space.

    Authors: We selected the dimensionality using the standard approach of identifying the elbow in the variance explained curve from PCA applied to the model activations. This is not entirely post-hoc as it follows established practices in analyzing representational geometry. Nevertheless, to directly address the concern about potential artifacts from the fixed story set, we will incorporate cross-validation in the revised version: specifically, we will partition the stories into training and held-out sets, derive the dimensionality and principal components from the training set, and then evaluate the consistency of the low-dimensional trajectories and structure on the held-out stories. We will also extend the analysis to include stories varying along additional narrative dimensions to test generalizability beyond the original set. revision: partial

  2. Referee: [§5] §5 (Interventions): The causal interventions on representations are reported to steer belief trajectories with geometrically predictable effects, but the section lacks controls such as intervention magnitude matching, sham perturbations, or comparisons to directions orthogonal to the conceptual space. This is load-bearing for the claim that effects follow from the geometry rather than generic activation changes.

    Authors: We concur that additional controls are essential to substantiate that the observed steering effects arise from the geometry of the conceptual space rather than nonspecific activation perturbations. Accordingly, we will revise §5 to include the following: sham interventions using random vectors in the activation space with magnitudes matched to the conceptual interventions; explicit reporting of magnitude matching across all conditions; and interventions along directions orthogonal to the primary conceptual axes, with comparisons of their effects on belief trajectories. These controls will demonstrate the specificity of the geometric predictions (a control sketch follows these responses). revision: yes

  3. Referee: [§4.3] §4.3 (Linear probes): The probes decode the conceptual space from activations to predict behavior, yet the space itself is derived from the same activations used for both probing and intervention. This circularity risk (noted in the stress-test) undermines independence; the paper should report performance on activations from a separate model or task to verify the structure is not analysis-defined.

    Authors: The manuscript includes a stress-test to partially address independence by applying the probes to different story collections. However, we recognize the value of further validation using separate models or tasks. In the revision, we will add results from linear probes trained and tested on activations from a distinct model variant (such as a different size or family) performing analogous story comprehension tasks. This will help confirm that the decoded conceptual structure is not solely an artifact of the analysis on the primary model. We note that while full separation is ideal, the core claims are supported by the convergence of behavioral, representational, and interventional evidence. revision: partial
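
A minimal sketch of the controls promised in response 2, under the assumption that belief measurement can be approximated by a linear readout: compare steering along the concept direction against a magnitude-matched random direction and a direction orthogonalized against the concept. All quantities are synthetic.

```python
# Sketch of steering controls: concept direction vs. magnitude-matched
# random direction vs. direction orthogonal to the concept. The linear
# "readout" stands in for downstream belief measurement.
import numpy as np

rng = np.random.default_rng(5)
d = 128
concept = rng.normal(size=d); concept /= np.linalg.norm(concept)
readout = concept + 0.1 * rng.normal(size=d)   # belief readout, ~aligned

def effect(direction, alpha=2.0, n=200):
    H = rng.normal(size=(n, d))                # baseline activations
    shift = (H + alpha * direction) @ readout - H @ readout
    return float(np.mean(shift))               # mean change in belief score

random_dir = rng.normal(size=d); random_dir /= np.linalg.norm(random_dir)
orth = rng.normal(size=d)
orth -= (orth @ concept) * concept             # remove concept component
orth /= np.linalg.norm(orth)

print("concept steering effect:   ", round(effect(concept), 3))
print("matched random-dir effect: ", round(effect(random_dir), 3))
print("orthogonal-dir effect:     ", round(effect(orth), 3))
# Geometry-specific steering predicts a large first number and ~0 others.
```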

Circularity Check

0 steps flagged

No significant circularity in empirical analysis of belief trajectories

Full rationale

The paper reports experimental results from inducing belief updates via stories in LLMs, followed by dimensionality reduction on activations to identify manifolds, linear probes to decode behavior, and targeted interventions to test causal effects. These steps rely on data-driven measurements and statistical methods applied to model outputs and internal states rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation that reduces the central claims to the inputs by construction. The geometric account is presented as an empirical finding supported by the observed consistency across behavior, representations, and interventions, without equations or derivations that equate outputs to inputs tautologically.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that LLM beliefs occupy a low-dimensional geometric space whose structure can be read out linearly and manipulated causally; this space is introduced by the paper rather than derived from prior theory.

free parameters (1)
  • dimensionality of conceptual belief space
    The paper states that trajectories lie on low-dimensional manifolds but does not specify how the exact dimensionality is chosen or validated.
axioms (1)
  • domain assumption: In-context learning can be viewed as Bayesian inference over a latent hypothesis space
    Stated in the opening sentence of the abstract as the starting point for the geometric proposal.
invented entities (1)
  • conceptual belief space (no independent evidence)
    purpose: To provide a geometric substrate in which belief updates appear as trajectories
    Newly proposed construct whose existence and structure are supported only by the analyses in this paper.

pith-pipeline@v0.9.0 · 5511 in / 1496 out tokens · 60628 ms · 2026-05-13T05:10:16.029400+00:00 · methodology


