In-context Learning and Induction Heads
Pith reviewed 2026-05-11 03:44 UTC · model grok-4.3
The pith
Induction heads implement the core copying algorithm behind in-context learning in transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Induction heads are attention heads that implement a simple algorithm to complete token sequences of the form [A][B] ... [A] -> [B]. The authors present six complementary lines of evidence that these heads may constitute the mechanism for the majority of all in-context learning in large transformer models, and that they develop at precisely the point in training where a sudden, sharp increase in in-context learning ability appears.
What carries the argument
Induction heads: attention heads that match the current token against an earlier occurrence in the context and copy the token that followed that occurrence.
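The rule itself is simple enough to state as code. A minimal sketch in Python, purely illustrative: the function name and the list-of-tokens representation are ours, not the paper's.

```python
# Minimal sketch of the induction rule [A][B] ... [A] -> [B].
# Illustrative only: real induction heads implement a soft version of
# this lookup inside attention, not an explicit scan over tokens.

def induction_prediction(tokens):
    """Predict the next token by prefix matching and copying.

    Find the most recent earlier occurrence of the current (last)
    token and return the token that followed it; None if no match.
    """
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan right to left
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# "The cat sat . The cat" -> the rule predicts "sat".
print(induction_prediction(["The", "cat", "sat", ".", "The", "cat"]))
```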
If this is right
- Induction heads emerge at the same moment training loss shows a sharp improvement on later tokens.
- In small attention-only models, directly ablating induction heads reduces in-context learning performance.
- The timing correlation between head formation and performance gains holds across model sizes.
- The mechanism appears general enough to explain in-context learning in transformers of any scale.
Where Pith is reading between the lines
- If induction heads are the primary driver, then interventions that speed their formation could shorten the training needed for strong few-shot behavior.
- The copying rule might also explain why transformers handle many different in-context tasks without task-specific fine-tuning.
- Checking whether non-attention architectures develop analogous copying circuits would test how specific this mechanism is to transformers.
Load-bearing premise
The formation of induction heads directly causes the observed jump in in-context learning rather than both changes arising together from some other training dynamic.
What would settle it
Train a transformer in which induction heads never form yet a sharp increase in in-context learning still appears at the same training step.
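The causal side of that test is the kind of head ablation already reported for small attention-only models. A hedged sketch of such an intervention in PyTorch, assuming a model whose attention module returns a [batch, seq, n_heads * d_head] tensor before the output projection; the module path and layout are hypothetical and vary by implementation.

```python
# Hedged sketch: knock out one attention head with a forward hook and
# compare in-context learning scores with and without the intervention.
# The module path below is hypothetical; shapes are assumptions.
import torch

def make_head_ablation_hook(head_idx: int, n_heads: int):
    def hook(module, inputs, output):
        # Zero the pre-projection slice belonging to one head.
        out = output.clone()
        d_head = out.shape[-1] // n_heads
        out[..., head_idx * d_head:(head_idx + 1) * d_head] = 0.0
        return out  # returning a tensor replaces the module output
    return hook

# Usage (hypothetical module path):
# handle = model.layers[5].attn.register_forward_hook(
#     make_head_ablation_hook(head_idx=7, n_heads=12))
# ... evaluate the in-context learning score, then handle.remove() ...
```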
Original abstract
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper hypothesizes that 'induction heads' (attention heads implementing a simple [A][B]...[A] -> [B] completion algorithm) are the primary mechanistic source of in-context learning in transformers, defined as the decrease in loss at increasing token indices. It reports that these heads emerge at the same training point as a sharp loss bump signaling increased in-context ability, presenting six lines of evidence: strong causal interventions (ablations/patching) for small attention-only models and correlational/timing-based evidence for larger models containing MLPs.
Significance. If the causal link holds, the work would supply a concrete mechanistic account of in-context learning, a core capability of large language models. The strong, reproducible causal interventions in small attention-only models constitute a clear strength, as do the multiple complementary observational measures (timing correlations, head activation patterns) that could guide future targeted experiments. The paper thereby advances mechanistic interpretability by linking a specific circuit to a broad behavioral phenomenon.
major comments (2)
- Abstract: The claim that induction heads 'might constitute the mechanism for the majority of all in-context learning' in large transformer models rests on correlational evidence only; the text states that the six lines of evidence for models with MLPs are 'preliminary and indirect' and 'correlational,' with no ablation, patching, or causal intervention results reported to show that disabling induction heads specifically impairs the observed in-context loss reduction.
- Description of the six lines of evidence (larger models): These lines rely on coincidence of induction-head emergence with the training loss bump and on observational metrics such as head activation timing; they do not include controls that would distinguish whether both phenomena are parallel downstream effects of an earlier training dynamic (e.g., a phase transition in optimization or representation geometry), leaving the causal inference untested for models containing MLPs.
minor comments (1)
- Abstract: Quantitative details on the magnitude of the loss bump, the fraction of heads identified as induction heads, and any error controls or statistical tests for the six lines of evidence would improve clarity and allow readers to assess the strength of the correlational results.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the value of the causal interventions in small models as well as the potential of the observational measures to guide future work. We agree that the distinction between causal and correlational evidence must be drawn more sharply in the abstract and discussion, and we will revise the manuscript to address both major comments.
Point-by-point responses
- Referee: Abstract: The claim that induction heads 'might constitute the mechanism for the majority of all in-context learning' in large transformer models rests on correlational evidence only; the text states that the six lines of evidence for models with MLPs are 'preliminary and indirect' and 'correlational,' with no ablation, patching, or causal intervention results reported to show that disabling induction heads specifically impairs the observed in-context loss reduction.
Authors: We accept the point. While the body of the paper already describes the evidence for models with MLPs as preliminary, indirect, and correlational, the abstract phrasing risks implying stronger support than exists. We will revise the abstract to state explicitly that the hypothesis for large models rests on correlational evidence from the six lines, without causal interventions such as ablation or patching, and to moderate the language concerning induction heads as the mechanism for the majority of in-context learning. revision: yes
- Referee: Description of the six lines of evidence (larger models): These lines rely on coincidence of induction-head emergence with the training loss bump and on observational metrics such as head activation timing; they do not include controls that would distinguish whether both phenomena are parallel downstream effects of an earlier training dynamic (e.g., a phase transition in optimization or representation geometry), leaving the causal inference untested for models containing MLPs.
Authors: The referee correctly notes that the six lines are observational and lack controls that could rule out alternative accounts in which induction-head emergence and the loss bump are both downstream of an earlier training dynamic. We do not claim to have performed such controls. In revision we will add an explicit limitations paragraph in the discussion that acknowledges this gap, lists possible alternative explanations (including phase transitions in optimization or representation geometry), and clarifies that the lines of evidence are intended to be suggestive and to motivate targeted causal experiments rather than to demonstrate causality. revision: yes
Circularity Check
No significant circularity; hypothesis rests on timing correlations and interventions rather than definitional reduction
full rationale
The paper defines induction heads via their observable attention pattern on token sequences and presents empirical evidence (simultaneous emergence with loss bump, six lines of correlational evidence for large models, and causal ablations for small attention-only models) that they contribute to in-context learning. No step reduces a claimed prediction or result to a fitted parameter or self-citation by construction; the central hypothesis is explicitly labeled preliminary and indirect, with the link to decreasing loss at later token indices argued via external observations rather than tautological redefinition. The derivation chain is self-contained against the provided benchmarks.
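The "observable attention pattern" referenced above is typically measured on a block of random tokens repeated twice: an induction head attends from each token in the second repeat to the position just after that token's first occurrence. A minimal sketch of that diagnostic, with the attention layout [query, key] and the function name as our assumptions.

```python
# Sketch of a prefix-matching score for one attention head, evaluated
# on a random token block repeated twice (length 2 * block_len).
# Layout assumption: attn[q, k] is the attention weight from query
# position q to key position k for a single head.
import torch

def prefix_matching_score(attn: torch.Tensor, block_len: int) -> float:
    """Average attention mass on the induction target positions."""
    score = 0.0
    for q in range(block_len, 2 * block_len):
        # Query q repeats the token at q - block_len; the induction
        # target is the position just after that first occurrence.
        target = q - block_len + 1
        score += attn[q, target].item()
    return score / block_len
```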
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Induction heads implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]
- domain assumption A sharp increase in in-context learning ability is visible as a bump in the training loss curve
Forward citations
Cited by 60 Pith papers
- WriteSAE: Sparse Autoencoders for Recurrent State. WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
- WriteSAE: Sparse Autoencoders for Recurrent State. WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
- Slot Machines: How LLMs Keep Track of Multiple Entities. LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
- Layerwise Dynamics for In-Context Classification in Transformers. Enforcing feature- and label-permutation equivariance in transformers for in-context classification yields an identifiable emergent update rule driven by mixed feature-label Gram matrices that amplifies class separation.
- The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry. Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
- KAN: Kolmogorov-Arnold Networks. KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
- Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers. Text embeddings in MM-DiTs contain a detectable omission signal for missing concepts, and amplifying it via OSI reduces concept omission in generated images on FLUX.1-Dev and SD3.5-Medium.
- Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers. The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
- Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining. Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
- Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition. Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
- From Mechanistic to Compositional Interpretability. Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaran...
- Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions. Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
- Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations. LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-mo...
- Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers. In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.
- How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models. A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
- Cell-Based Representation of Relational Binding in Language Models. Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
- Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs. A Merkle-committed SAE feature-trace protocol detects model substitutions in hosted LLMs at a stable threshold where parallel-probe baselines fail, including against adaptive LoRA attackers.
- HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads. HeadRank improves decoding-free passage reranking by preference-aligning attention heads to increase discriminability in middle-context documents, outperforming baselines on 14 benchmarks with only 211 training queries.
- Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs. The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
- Screening Is Enough. Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
- Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory. Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing f...
- Jamba: A Hybrid Transformer-Mamba Language Model. Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
- Steering Language Models With Activation Engineering. Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
- Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology. Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.
- Fusion-fission forecasts when AI will shift to undesirable behavior. A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world data.
- When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction. Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
- Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance. SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
- Instructions Shape Production of Language, not Processing. Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
- Interpretability Can Be Actionable. Interpretability research should be judged by actionability, the degree to which its insights support concrete decisions and interventions, rather than explanatory power alone.
- Architecture, Not Scale: Circuit Localization in Large Language Models. Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
- The Propagation Field: A Geometric Substrate Theory of Deep Learning. Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting i...
- Belief or Circuitry? Causal Evidence for In-Context Graph Learning. Causal evidence from representation analysis and interventions shows LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.
- Priming: Hybrid State Space Models From Pre-trained Transformers. Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
- Large Vision-Language Models Get Lost in Attention. In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
- Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize. Transformers show a sharp, task-specific critical window for weight decay application that determines reasoning versus memorization, with middle placement optimal and boundaries as narrow as 100 steps.
- What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis. In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 7...
- What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis. Circuit analysis reveals that routing circuits for agent memory emerge at 0.6B parameters while content circuits emerge at 4B, with a shared grounding hub and an unsupervised diagnostic achieving 76.2% accuracy for lo...
- Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency. A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.
- Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection. Refusal in LLMs leaves a detectable upstream trajectory that SALO exploits to raise jailbreak detection from near zero to over 90 percent even under forced-decoding attacks.
- When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models. LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
- Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing. TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.
- BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments. BVI-Mamba enhances low-light and underwater videos by combining feature alignment with a UNet architecture built from Visual State Space blocks, claiming better quality and efficiency than prior Transformer or convolu...
- Omission Constraints Decay While Commission Constraints Persist in Long-Context LLM Agents. Omission constraints in LLM agents decay with conversation length while commission constraints remain stable, creating an invisible security failure.
- The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference. FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.
- Weight Patching: Toward Source-Level Mechanistic Localization in LLMs. Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a ...
- Parcae: Scaling Laws For Stable Looped Language Models. Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention. LoopGuard detects attention collapse loops during LLM decoding and prunes repetitive KV cache tail spans under fixed budget, cutting loop incidence by over 90 percentage points on the new LoopBench benchmark.
- Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning. Including copying tasks in training enables transformers to learn letter-string analogies, improving generalization to new alphabets with a 3-layer model outperforming some frontier models.
- The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning. LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between d...
- In-Place Test-Time Training. In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
- Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement. MTP induces representational contractivity for coherent world models in LLMs but causes illegal latent shortcuts; LSE-MTP anchors to true trajectories to reduce hallucinations and improve consistency.
- Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents. Persistent memory is necessary and sufficient for LLM poker agents to reach ToM levels 3-5 and use strategic deception, while agents without memory stay at level 0.
- Automated Attention Pattern Discovery at Scale in Large Language Models. AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.
- SnapKV: LLM Knows What You are Looking for Before Generation. SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...
- Instructions Shape Production of Language, not Processing. Instructions primarily shape the production stage of language models rather than the processing stage, with task-specific information and causal effects stronger in output tokens than input tokens.
- HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory. HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
- Negative Before Positive: Asymmetric Valence Processing in Large Language Models. Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
- When Context Sticks: Studying Interference in In-Context Learning. In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.
Reference graph
Works this paper leans on
- [1] Language Models are Few-Shot Learners. arXiv:2005.14165.
- [2] Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
- [3] Towards a Human-like Open-Domain Chatbot. arXiv:2001.09977.
- [4] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv:2201.02177.
- [5] Scaling Laws for Neural Language Models. arXiv:2001.08361.
- [6] arXiv:2012.15832.
- [7] A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861.
- [8] A Multiscale Visualization of Attention in the Transformer Model. arXiv:1906.05714.
- [9] Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv:1905.09418.
- [10] What Does BERT Look At? An Analysis of BERT's Attention. arXiv:1906.04341.
- [11] Do Attention Heads in BERT Track Syntactic Dependencies? arXiv:1911.12246.
- [12] Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth. arXiv:2103.03404.
- [13] What Context Features Can Transformer Language Models Use? arXiv:2106.08367.
- [14] An Explanation of In-context Learning as Implicit Bayesian Inference. arXiv:2111.02080.
- [15] Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? arXiv:2202.12837.
- [16] Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003.
- [17] Qualitatively Characterizing Neural Network Optimization Problems. arXiv:1412.6544.
- [18] Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes. arXiv:2104.11044.
- [19] Zoom In: An Introduction to Circuits. Distill. DOI: 10.23915/distill.00024.001.
- [20] Similarity of Neural Network Representations Revisited. arXiv:1905.00414.
- [21] High-Low Frequency Detectors. Distill. DOI: 10.23915/distill.00024.005.
- [22] Multimodal Neurons in Artificial Neural Networks. Distill. DOI: 10.23915/distill.00030.
- [23] Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473.
- [24] Listen, Attend and Spell. arXiv:1508.01211.
- [25] Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., et al. (2022). In-context Learning and Induction Heads.