pith. machine review for the scientific record.

arXiv: 2605.03780 · v1 · submitted 2026-05-05 · 💻 cs.LG · cs.CL · stat.ML

Recognition: unknown

Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:31 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · stat.ML
keywords task vectors · transformers · in-context learning · task inference · representation geometry · OOD generalization · Bayesian retrieval · extrapolative learning

The pith

Transformers implement in-distribution task retrieval through convex combinations of task vectors and out-of-distribution adaptation through representations in a nearly orthogonal subspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that task-vector geometry in transformers, shaped by the training distribution, supports two inference modes that coexist in one model. In-distribution behavior follows Bayesian task retrieval by taking convex combinations of learned task vectors. Out-of-distribution behavior instead proceeds by extrapolative task learning in a subspace orthogonal to the main task-vector directions. A sympathetic reader would care because the work supplies a geometric account of how training data determines whether a model recalls familiar tasks or invents solutions for new ones.

Core claim

By training small transformers from scratch on latent-task sequence distributions, the authors show that two inference modes coexist within a single model. In-distribution behavior is governed by Bayesian task retrieval implemented internally through convex combinations of learned task vectors. OOD behavior arises through extrapolative task learning whose representations occupy a subspace nearly orthogonal to the task-vector subspace. The results link task-vector geometry, training distributions, and generalization behaviors.
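
Read as stated, the claim admits a compact formalization. The sketch below is inferred from the figure captions later on this page (h_t the hidden state at context position t, θ_k the learned task vector for task k, Θ the matrix stacking them, α_{t,k} the Bayesian posterior over tasks); it is a plausible rendering of the paper's interpolation model (Eq. 4) and projection R², not a quotation of them.

    % Schematic only: notation inferred from the figure captions.
    % ID mode (Bayesian task retrieval): hidden states are convex
    % combinations of task vectors, with weights tracking the posterior.
    h_t \approx \sum_{k=1}^{K} \beta_{t,k}\,\theta_k,
    \qquad \beta_{t,k} \ge 0,\quad \sum_{k}\beta_{t,k} = 1,
    \qquad \beta_{t,k} \approx \alpha_{t,k}.
    % OOD mode (extrapolative task learning): hidden states stay nearly
    % orthogonal to the task-vector subspace, so the projection R^2 is small.
    R^2(h_t) = \frac{\lVert P_{\Theta} h_t \rVert^2}{\lVert h_t \rVert^2} \approx 0,
    \qquad P_{\Theta} = \Theta^{\top}(\Theta\Theta^{\top})^{-1}\Theta.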

What carries the argument

Task vectors: the task-specific directions extracted from middle-layer representations. Their convex combinations realize Bayesian retrieval for seen tasks, while their orthogonal complement enables extrapolative learning for novel tasks.
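
A minimal NumPy/SciPy sketch of that machinery, assuming averaging-based extraction (task vectors as per-task means of middle-layer states, as the Figure 16 caption suggests) and a simplex-constrained least-squares fit for the coefficients; function names and shapes are illustrative, not the paper's code.

    import numpy as np
    from scipy.optimize import minimize

    def extract_task_vectors(states_by_task):
        """Averaging-based task vectors: the mean middle-layer hidden state
        per seen task. states_by_task: list of (n_prompts, d) arrays.
        Returns theta of shape (K, d)."""
        return np.stack([s.mean(axis=0) for s in states_by_task])

    def simplex_coefficients(h_t, theta):
        """Fit h_t ~ sum_k beta_k * theta_k with beta on the probability
        simplex (beta_k >= 0, sum_k beta_k = 1): the interpolation model
        whose coefficients the ID claim says track the Bayesian posterior."""
        K = theta.shape[0]
        objective = lambda b: np.sum((h_t - b @ theta) ** 2)
        res = minimize(objective, np.full(K, 1.0 / K),
                       bounds=[(0.0, 1.0)] * K,
                       constraints=({'type': 'eq',
                                     'fun': lambda b: b.sum() - 1.0},),
                       method='SLSQP')
        return res.x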

If this is right

  • Task-vector geometry is shaped by the training distribution.
  • In-distribution inference operates via convex combinations of task vectors.
  • Out-of-distribution generalization relies on representations in a nearly orthogonal subspace (a measurement sketch follows this list).
  • The two modes can coexist inside one model without requiring separate components.
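
A hedged sketch of the measurement behind the third bullet, the projection R² that Figures 6 and 7 report; building the basis by QR is an assumption, not the paper's stated procedure.

    def projection_r2(h, theta):
        """Fraction of ||h||^2 lying in the span of the task vectors:
        R^2 = ||Q Q^T h||^2 / ||h||^2, with Q an orthonormal basis of the
        task-vector subspace. Values near 0 mean near-orthogonality."""
        Q, _ = np.linalg.qr(theta.T)  # theta: (K, d); Q: (d, K)
        proj = Q @ (Q.T @ h)
        return float(np.sum(proj ** 2) / np.sum(h ** 2))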

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the orthogonal-subspace mechanism holds in large models, training objectives could be designed to enlarge separation between the subspaces and thereby improve OOD performance.
  • Targeted interventions that perturb only the task-vector subspace would be expected to impair in-distribution behavior while leaving OOD capabilities largely intact (sketched after this list).
  • Extending the synthetic distributions to richer task structures could test whether natural-language training data produces analogous orthogonal geometries in frontier models.
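
The intervention in the second bullet could look like the projection-removal used for Figures 18 and 19: ablate one subspace, keep the other, and re-measure ID and OOD loss. The exact form below is an assumption consistent with those captions, not the authors' procedure.

    def suppress_subspace(h, basis):
        """Remove the component of h lying in span(rows of basis):
        h' = (I - Q Q^T) h. Suppressing the task-vector subspace should
        impair ID behavior; suppressing its complement should impair OOD."""
        Q, _ = np.linalg.qr(basis.T)
        return h - Q @ (Q.T @ h)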

Load-bearing premise

The assumption that internal representations and inference modes found in small transformers trained on synthetic latent-task sequences transfer to the behavior of large language models trained on natural language data.

What would settle it

A measurement in a large language model in which out-of-distribution task representations lie inside the same subspace as in-distribution task vectors or in which convex combinations of task vectors fail to predict in-distribution behavior.
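
Concretely, such a check could reuse the sketches above: a high mean OOD projection R² onto the ID task-vector subspace, or a large ID reconstruction error from the fitted convex combination, would each cut against the claimed geometry. A hypothetical harness:

    def settling_measurement(theta, id_states, ood_states):
        """Returns (mean OOD projection R^2, mean ID relative reconstruction
        error); a large value of either would count against the dual-mode
        geometry. Reuses projection_r2 and simplex_coefficients above."""
        ood_r2 = np.mean([projection_r2(h, theta) for h in ood_states])
        id_err = np.mean([np.linalg.norm(h - simplex_coefficients(h, theta) @ theta)
                          / np.linalg.norm(h) for h in id_states])
        return ood_r2, id_err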

Figures

Figures reproduced from arXiv: 2605.03780 by Haolin Yang, Hao Yan, Yiqiao Zhong.

Figure 1. Connections between data distribution and representation geometry. This diagram illustrates the synthetic rolling-biased-dice experiment. Each sequence is generated from an unobserved latent z controlling the outcome distribution. A transformer trained on mixture data learns internal task vectors that encode these latents. Two near-orthogonal subspaces largely determine a model's inference mode and generalization.
Figure 2. Finite-context interpolation approximately holds. R² of the interpolation model (Eq. 4) across context positions and layers for E1, E2, E3. Values are close to 1 except in late layers and early positions.
Figure 3. Bayesian posterior alignment. Simplex-projected coefficients β_{t,k} (markers, 10th–90th percentile error bars) vs. ground-truth posterior α_{t,k} (dashed lines, shaded bands) for E1, E2, E3.
Figure 4. Causal intervention via the task-vector simplex. Substituting β_{t,k} with randomly drawn α*_{t,k} steers model outputs to align with the corresponding mixture predictions, confirming causal effects.
Figure 5. Phase transition between two inference modes. Panels (left to right): E1, E2, E3. Blue (negative) indicates that the model is closer to Bayesian task retrieval M1, red (positive) indicates closer to extrapolative task learning M2, and white indicates no clear preference. Higher task diversity generally promotes M2.
Figure 6. Projection R² of OOD hidden states onto the major task subspace. Panels: E1, E2, E3. As task diversity increases, projection R² becomes smaller, supporting the near-orthogonal representation hypothesis.
Figure 7. Task-subspace trajectories for E3 (left) and Qwen2.5-7B (right). Task vectors corresponding to three tasks form a triangle, and hidden states are projected onto this task-vector subspace. ID prompts: the hidden state h_t converges toward the true task vertex as t grows. OOD prompts: hidden states remain near-orthogonal to the task subspace (small R²) throughout the trajectory.
Figure 8. Planted Dyck language (E4) exhibits long-context dependence and prefix memorization. Left: residual variance ratio of h_t after conditioning on z and the last 3 planted characters at each Dyck position. Unlike E1–E3, z and a local window leave substantial residual variance, with spikes at positions requiring long-range bracket matching. Right: 2D projection of final-layer hidden states at Dyck prefix length l = 7.
Figure 9. Latent task and last token explain most variance in hidden states at large t. Residual variance ratio SS_within/SS_total = Var(h_t | z, s_t) / Var(h_t) as a function of context position for E1 (left), E2 (middle), and E3 (right). Each curve corresponds to one layer (post-MLP). The ratio decreases as context grows (the model gradually infers the latent) and is smaller in later layers (depth aids task inference).
Figure 10. Additive separability of task and token effects. All three panels show the interaction proportion η²_interaction: the fraction of explained variance due to the task–token interaction term. Left (E1): two-way ANOVA on discrete tokens. Middle (E2): ANCOVA on continuous covariates, η²_interaction = (R²_full − R²_additive)/R²_full. Right (E3): two-way ANOVA on discrete tokens. Small values confirm that the task and token effects are approximately additive.
Figure 11. OLS probe R² decomposition for experiment E3. Left: marginal R² of each feature group alone. Right: partial R² (Eq. 17) after controlling for the other two groups. Results are shown layer by layer (post-MLP, layers 0–5) at context positions t ∈ [170, 190), with per-position mean subtraction applied to h_t before fitting.
Figure 12. Posterior alignment at later layers: simplex-projected coefficients.
Figure 13. Affine projection coefficients β^aff_{t,k} (markers with error bars) vs. Bayesian posterior α_{t,k} (dashed lines with shaded percentile bands). No simplex projection is applied; coefficients may be negative or exceed 1. Left: Dice (E1, layer 3). Middle: linear regression (E2, layer 9). Right: latent Markov (E3, layer 3).
Figure 14. Simplex steering outperforms the mode-output baseline across the entire task simplex. Each panel pair shows error (KL divergence or RMSE) under simplex steering (Steered) and the mode-output baseline (Mode task k*) for E1 (left), E2 (middle), and E3 (right). Simplex steering achieves uniformly low error throughout the simplex interior, whereas the mode baseline (which selects the pure-task output for the mode task k*) does not.
Figure 15. Training dynamics of in-distribution and out-of-distribution loss. Panels (top to bottom): E1, E2, E3. Each panel compares ID and OOD performance over training. Strong OOD performance in the later training regime at high task diversity is consistent with the emergence of the extrapolative inference mode M2 (cf. Sec. 6).
Figure 16. Affine projection trajectories onto the task-vector simplex for E1–E3. Each panel plots the affine OLS coordinates (β^aff_{t,1}, β^aff_{t,2}, β^aff_{t,3}) (sum-to-one, no nonnegativity constraint) of the batch-mean-centered hidden state at an intermediate layer onto the barycentric frame defined by the three averaging-based task vectors θ̂_1, θ̂_2, θ̂_3 (simplex vertices, ⋆). Solid lines (blue shades) trace each trajectory.
Figure 17. Posterior alignment analog in Qwen2.5-7B (layer 20), ID prompts. Each panel shows one of the K = 6 ID tasks (columns). Lines show the mean affine coefficient β_{t,k} for each task k as a function of shot position t. For each ID task, β_{t,k} for the correct task k (matching column color) rises toward 1 as demonstrations accumulate, while coefficients for other tasks remain near 0, analogous to Bayesian posterior alignment.
Figure 18. Per-layer orthogonal-subspace intervention across all three experiments. Each bar shows ΔL_mode/g_mode × 100% when V̂_opt is suppressed at that layer; error bars are the interquartile range over evaluation batches; the shaded band and horizontal dashed line mark the random same-rank baseline and the 100% reference, respectively. (a) E1 (Dice): OOD loss is disrupted substantially at all layers (≈53–155%).
Figure 19. Per-layer task-subspace suppression across all three experiments. Each bar shows ΔL_mode/g_mode × 100% when the task subspace col(Θ̂) is suppressed at that layer. This is the complementary intervention to the orthogonal-subspace suppression of Figure 18.
Figure 20. Per-feature R² predicting V̂_opt^⊤ h across all three experiments. Solid lines show individual feature contributions; the dashed line shows the combined R² from regressing on all features jointly (see Eq. 23 for the CLR definition used in E1 and E3). (a) E1 (Dice): the current token dominates early (R² ≈ 0.81 at layer 0) and decays, while unigram CLR rises from ≈0.35 to ≈0.88 by layer 5; the combined R² remains high.
Figure 21. Causal comparison of V̂_opt vs. the filtered subspace V_filt across all three experiments. Dark bars show the full V̂_opt intervention; light bars show the filtered subspace derived from the combined observable features. (a) E1 (Dice): filtered bars are nearly indistinguishable from V̂_opt at every layer (≈75–135%), confirming that the causally active content is linearly accessible from token statistics.
Figure 22. Hidden-state geometry for the planted Dyck language (E4). Two-dimensional projection of final-layer hidden states at prefix length l = 7 (yielding 35 distinct Dyck prefixes). Points form well-separated clusters, with both clusters and induced decision regions colored by the number of unmatched left parentheses. The clear separation across all prefix classes indicates that the representation preserves prefix identity.
Original abstract

Transformers are effective at inferring the latent task from context via two inference modes: recognizing a task seen during training, and adapting to a novel one. Recent interpretability studies have identified from middle-layer representations task-specific directions, or task vectors, that steer model behavior. However, a lack of rigorous foundations hinders connecting internal representations to external model behavior: existing work fails to explain how task-vector geometry is shaped by the training distribution, and what geometry enables out-of-distribution (OOD) generalization. In this paper, we study these questions in a controlled synthetic setting by training small transformers from scratch on latent-task sequence distributions, which allows a principled mathematical characterization. We show that two inference modes can coexist within a single model. In-distribution behavior is governed by Bayesian task retrieval, implemented internally through convex combinations of learned task vectors. OOD behavior, by contrast, arises through extrapolative task learning, whose representations occupy a subspace nearly orthogonal to the task-vector subspace. Taken together, our results suggest that task-vector geometry, training distributions, and generalization behaviors are closely related.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that two inference modes can coexist in a single transformer: in-distribution behavior is governed by Bayesian task retrieval implemented as convex combinations of learned task vectors, while out-of-distribution behavior arises from extrapolative task learning whose representations occupy a subspace nearly orthogonal to the task-vector subspace. These findings are obtained via controlled experiments training small transformers from scratch on synthetic latent-task sequence distributions, which enables a mathematical characterization relating task-vector geometry to the training distribution and generalization.

Significance. The synthetic controlled setting and mathematical characterization are clear strengths, allowing rigorous study of how internal geometry shapes ID vs. OOD inference without confounding factors from natural data. If the reported convex-combination and near-orthogonality properties prove robust, the work supplies a concrete geometric mechanism that could explain dual inference modes more broadly. The absence of any transfer experiments to large pretrained models or natural-language distributions, however, keeps the suggested generality speculative.

major comments (2)
  1. [Abstract and Conclusion] Abstract and final paragraph: the suggestion that 'task-vector geometry, training distributions, and generalization behaviors are closely related' in transformers generally is not supported by evidence, as all results are confined to small models trained on synthetic latent-task distributions; no experiments test whether the same convex/orthogonal structure appears in large language models or natural data.
  2. [Methods and Results (mathematical characterization)] The central geometric claims rest on task vectors extracted from the same models whose inference modes they are used to explain; it is unclear whether the vectors are defined independently of the ID/OOD measurements or whether the reported orthogonality and convex combinations are derived from first principles rather than observed post-hoc.
minor comments (2)
  1. [Throughout] Provide explicit equations for the claimed mathematical characterization of the convex combinations and the orthogonality metric, along with error bars or confidence intervals on the reported 'nearly orthogonal' angles.
  2. [Experimental Setup] Clarify the precise definition of the synthetic latent-task sequence distributions and the procedure for extracting task vectors from middle-layer representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of our controlled synthetic setting and mathematical characterization. We address each major comment below and will revise the manuscript accordingly to tighten the scope of our claims while clarifying the theoretical foundations.

Point-by-point responses
  1. Referee: [Abstract and Conclusion] Abstract and final paragraph: the suggestion that 'task-vector geometry, training distributions, and generalization behaviors are closely related' in transformers generally is not supported by evidence, as all results are confined to small models trained on synthetic latent-task distributions; no experiments test whether the same convex/orthogonal structure appears in large language models or natural data.

    Authors: We agree that all empirical results and the mathematical characterization are confined to small transformers trained from scratch on synthetic latent-task sequence distributions. The abstract and conclusion use phrasing that could be read as implying broader applicability to transformers in general. We will revise both the abstract and the final paragraph to explicitly restrict the scope to the synthetic controlled setting, stating that the work provides a rigorous geometric mechanism in this environment rather than claiming direct evidence for large language models or natural data. No transfer experiments will be added, as they fall outside the current paper's focus on principled mathematical characterization. revision: yes

  2. Referee: [Methods and Results (mathematical characterization)] The central geometric claims rest on task vectors extracted from the same models whose inference modes they are used to explain; it is unclear whether the vectors are defined independently of the ID/OOD measurements or whether the reported orthogonality and convex combinations are derived from first principles rather than observed post-hoc.

    Authors: The task vectors are extracted from the trained models, but the key geometric properties are not observed post-hoc. Our mathematical characterization derives the emergence of a task-vector subspace, the convex-combination behavior for in-distribution inference, and the near-orthogonality of extrapolative representations for out-of-distribution tasks directly from the structure of the training distribution and the transformer architecture. The ID/OOD behavioral measurements serve to validate these analytically predicted properties rather than to define them. We will add a new subsection in the Methods that separates the first-principles derivation from the subsequent empirical extraction and validation steps, making this independence explicit. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper trains small transformers from scratch on synthetic latent-task sequence distributions and empirically characterizes the geometry of task vectors extracted from middle-layer representations. In-distribution behavior is shown to align with convex combinations of these vectors, while OOD behavior occupies a nearly orthogonal subspace. This is an observational analysis within a controlled generative process, not a reduction where predictions or modes are defined in terms of themselves or forced by fitting the same quantities used to measure them. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the abstract or described setup. The synthetic setting is self-contained, and the dual-mode claim follows from direct measurement rather than tautological renaming or construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of identifiable task vectors in middle-layer activations and on the assumption that the synthetic training distribution induces the reported convex and orthogonal geometries.

free parameters (1)
  • task vector directions
    Learned from data; their number and exact placement are fitted during training.
axioms (1)
  • domain assumption: Middle-layer activations contain linearly extractable task-specific directions.
    Invoked to interpret the internal representations as task vectors (a probe-style check is sketched below).
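
One rough way to stress this axiom: if a linear probe on middle-layer states recovers task identity well above chance, task-specific directions are at least linearly accessible. A probe-style check assuming scikit-learn and labeled ID prompts, illustrative only:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def linear_extractability(H, task_labels):
        """H: (n, d) middle-layer hidden states; task_labels: (n,) task ids.
        High cross-validated probe accuracy is consistent with the axiom;
        chance-level accuracy would undercut it."""
        probe = LogisticRegression(max_iter=1000)
        return cross_val_score(probe, H, task_labels, cv=5).mean()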

pith-pipeline@v0.9.0 · 5487 in / 1275 out tokens · 47934 ms · 2026-05-07T16:31:06.779990+00:00 · methodology

