pith. machine review for the scientific record.

arxiv: 2402.17762 · v2 · submitted 2024-02-27 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

Massive Activations in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:59 UTC · model grok-4.3

classification: 💻 cs.CL · cs.LG
keywords: massive activations · large language models · transformer · attention mechanism · bias terms · self-attention · vision transformers

The pith

Large language models contain a small number of massive activations that remain largely constant across inputs and act as indispensable bias terms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that LLMs consistently produce a handful of activations whose values are orders of magnitude larger than all others. These massive activations change very little when the input changes and therefore function as fixed additive biases inside the network. Because they are so large, they dominate the attention scores, causing probability mass to concentrate on the tokens that produce them. The same pattern appears in both language and vision transformers. Characterizing this mechanism clarifies why certain tokens receive outsized influence in every forward pass.
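To make the observation concrete, here is a minimal sketch, not the authors' released code (that lives at the GitHub link below), of how one could scan a Hugging Face causal LM for candidate massive activations. The model choice, prompt, and the 1,000x-median cutoff are illustrative assumptions rather than the paper's exact criterion.

```python
# Sketch: flag hidden-state entries whose magnitude dwarfs the layer median.
# Assumptions: Hugging Face transformers, an arbitrary small model ("gpt2"),
# and a heuristic 1,000x-median threshold for "massive".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states

for layer, h in enumerate(hidden):        # each h: (1, seq_len, d_model)
    mags = h.abs().squeeze(0)             # per-token, per-dimension magnitude
    flagged = mags > 1000 * mags.median()
    if flagged.any():
        toks, dims = flagged.nonzero(as_tuple=True)
        print(f"layer {layer}: {flagged.sum().item()} candidate(s), "
              f"first at token {toks[0].item()}, dim {dims[0].item()}, "
              f"value {h[0, toks[0], dims[0]].item():+.1f}")
```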

Core claim

We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output.

What carries the argument

Massive activations: the small set of high-magnitude, nearly input-invariant activation values that serve as fixed bias terms and drive attention concentration.

If this is right

  • Attention probability mass concentrates on the tokens that produce the massive activations (a toy softmax sketch follows this list).
  • Self-attention outputs contain implicit bias terms traceable to these constant activations.
  • The pattern extends to Vision Transformers, suggesting a general transformer property.
  • Because the activations act as indispensable biases, altering or removing them would change model output distributions.
  • Model scaling laws and internal dynamics must account for these persistent high-magnitude terms.
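The first bullet is easy to see mechanically. Below is a toy demonstration, not taken from the paper, of how one large constant added to a single pre-softmax score concentrates the resulting distribution; the scores and bias values are arbitrary.

```python
# Toy: a single inflated logit absorbs nearly all softmax probability mass.
import torch

torch.manual_seed(0)
logits = torch.randn(8)                    # ordinary attention scores
for c in (0.0, 10.0, 100.0):
    boosted = logits.clone()
    boosted[0] += c                        # token 0 carries the "massive" term
    p = torch.softmax(boosted, dim=0)
    entropy = torch.special.entr(p).sum()  # entr handles p == 0 gracefully
    print(f"bias {c:>5.1f}: p[token 0] = {p[0].item():.4f}, "
          f"entropy = {entropy.item():.3f}")
```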

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Interpreting LLMs may become simpler by isolating these few constant terms rather than analyzing every activation.
  • Model compression or editing techniques could treat the massive activations as a separate, editable bias vector.
  • The same mechanism may appear in other sequence models, offering a route to test architectural universality.
  • Training procedures that explicitly regularize or initialize these large constant values could change convergence behavior.

Load-bearing premise

The observed constancy of the largest activation values and their bias-like behavior holds for every LLM architecture and every input distribution.

What would settle it

Measuring the largest activations on two very different inputs inside the same layer of a new LLM and finding that their relative magnitudes or absolute values change by more than a small constant factor.
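As a sketch of that test, one could compare a layer's top activation magnitudes on two deliberately dissimilar inputs; the model, layer index, and prompts here are illustrative assumptions, not the paper's protocol.

```python
# Sketch: if the constancy premise holds, top magnitudes from the same layer
# should agree within a small factor across unrelated inputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def top_magnitudes(text, layer=6, k=5):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids, output_hidden_states=True).hidden_states[layer]
    return h.abs().flatten().topk(k).values

a = top_magnitudes("Photosynthesis converts light energy into chemical energy.")
b = top_magnitudes("zx!! 42 #### qwerty %% 0x7f")
print("top-5 magnitude ratios:", (a / b).tolist())
# Ratios far from 1 on a new LLM would count against the premise.
```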

read the original abstract

We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports an empirical observation of 'massive activations' in large language models: a small number of activations with values orders of magnitude larger than the rest (e.g., 100,000x). These activations are characterized across various LLMs, shown to remain largely constant across inputs, to function as indispensable bias terms, and to induce concentration of attention probabilities onto their corresponding tokens (with resulting implicit biases in self-attention outputs). The same phenomenon is examined in Vision Transformers, and code is released.

Significance. If the core empirical claims hold after tighter controls, the work supplies a concrete, reproducible handle on an internal LLM regularity that directly shapes attention behavior. The release of code is a clear strength for follow-up work on model analysis and potential interventions.

major comments (3)
  1. [Abstract] Abstract and characterization sections: the claim that massive activations 'function as indispensable bias terms' and 'lead to the concentration of attention probabilities' rests on observational correlations but provides no ablation (e.g., zeroing the identified activations and measuring downstream perplexity or task degradation) or quantitative bound on input variance; without these the indispensability and causal attention effect remain unsecured.
  2. [Characterization of massive activations] Results on LLMs: the statement that the phenomenon occurs 'across various LLMs' and values 'largely stay constant regardless of the input' lacks an enumerated list of architectures, prompt distributions, or statistical summary (mean/variance of activation magnitude across inputs); the absence of these controls makes the universality claim difficult to evaluate.
  3. [Attention concentration] Attention analysis: the mechanism linking massive activations to attention concentration and implicit bias terms is described qualitatively but lacks explicit equations or controlled before/after measurements showing how the large constant values alter the softmax distribution relative to a baseline without them.
minor comments (2)
  1. [Introduction] Notation for activation magnitude thresholds and 'massive' criteria should be defined explicitly (e.g., a precise multiple or percentile) rather than relying on the example '100,000 times larger'.
  2. [Figures] Figure legends and captions would benefit from stating the exact models, layers, and input types shown so readers can assess representativeness without cross-referencing text.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional experiments, documentation, and quantitative analyses as requested.

read point-by-point responses
  1. Referee: [Abstract] Abstract and characterization sections: the claim that massive activations 'function as indispensable bias terms' and 'lead to the concentration of attention probabilities' rests on observational correlations but provides no ablation (e.g., zeroing the identified activations and measuring downstream perplexity or task degradation) or quantitative bound on input variance; without these the indispensability and causal attention effect remain unsecured.

    Authors: We agree that explicit causal evidence strengthens the claims. In the revised manuscript we add ablation experiments that zero the identified massive activations and report the resulting perplexity increase on held-out validation sets, together with performance drops on downstream tasks; a sketch of such an ablation follows these responses. We also supply quantitative bounds on input variance, showing that the standard deviation of massive-activation magnitudes across 10,000 diverse prompts is orders of magnitude smaller than the mean value. Revision: yes.

  2. Referee: [Characterization of massive activations] Results on LLMs: the statement that the phenomenon occurs 'across various LLMs' and values 'largely stay constant regardless of the input' lacks an enumerated list of architectures, prompt distributions, or statistical summary (mean/variance of activation magnitude across inputs); the absence of these controls makes the universality claim difficult to evaluate.

    Authors: We accept that greater specificity is needed. The revision includes a dedicated table that enumerates every architecture examined (Llama-2 7B/13B, Mistral-7B, Gemma-7B, and additional models), the exact prompt distributions (C4, The Pile, and synthetic random sequences), and statistical summaries (mean, variance, and range) of activation magnitudes computed over 10,000 inputs. Revision: yes.

  3. Referee: [Attention concentration] Attention analysis: the mechanism linking massive activations to attention concentration and implicit bias terms is described qualitatively but lacks explicit equations or controlled before/after measurements showing how the large constant values alter the softmax distribution relative to a baseline without them.

    Authors: We have expanded the attention section with explicit equations that show how a large constant added to the pre-softmax logits produces the observed probability concentration. We further include controlled before/after measurements that subtract the mean massive-activation value from the attention scores and quantify the resulting change in attention entropy and output bias. Revision: yes.
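The zeroing ablation described in response 1 could be wired up roughly as follows. This is a sketch under stated assumptions, not the authors' revision: the layer index and dimension coordinates are hypothetical placeholders, and a real run would use the coordinates found by the detection procedure plus a held-out corpus for perplexity.

```python
# Sketch: zero hypothetical massive-activation coordinates entering one
# GPT-2 block via a forward pre-hook, then compare perplexity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, DIMS = 6, [138, 447]              # hypothetical coordinates

def zero_dims(module, args):
    hidden = args[0].clone()
    hidden[..., DIMS] = 0.0              # ablate the flagged dimensions
    return (hidden,) + args[1:]

ids = tok("The meeting was moved to Thursday afternoon.", return_tensors="pt")

def perplexity():
    with torch.no_grad():
        loss = model(**ids, labels=ids["input_ids"]).loss
    return torch.exp(loss).item()

baseline = perplexity()
handle = model.transformer.h[LAYER].register_forward_pre_hook(zero_dims)
ablated = perplexity()
handle.remove()
print(f"perplexity: {baseline:.2f} -> {ablated:.2f} after zeroing")
```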

Circularity Check

0 steps flagged

No circularity: empirical observations grounded in direct measurements

full rationale

The paper reports direct empirical measurements of activation magnitudes across LLMs, their input-independence, and their downstream effects on attention. These are presented as observed phenomena without any derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on data characterization rather than constructions that presuppose their own conclusions, leaving the analysis checkable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is observational and introduces no free parameters, invented entities, or non-standard axioms beyond routine assumptions about transformer forward passes.

axioms (1)
  • [standard math] Standard transformer architecture and activation definitions hold as in prior literature
    The paper relies on conventional definitions of self-attention and feed-forward layers without additional proof.
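For reference, the conventional self-attention definition the ledger treats as standard (textbook notation, not a contribution of the paper); a massive activation that inflates an entry of the score matrix acts like a constant added to one row of the softmax argument:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```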

pith-pipeline@v0.9.0 · 5416 in / 1109 out tokens · 28859 ms · 2026-05-16T06:59:19.393219+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.JcostCore.Jcost_unit · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Attention Sinks in Diffusion Transformers: A Causal Analysis

    cs.CV 2026-05 unverdicted novelty 7.0

    Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.

  2. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

  3. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  4. Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

    cs.LG 2026-04 unverdicted novelty 7.0

    Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...

  5. When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Attention sinks in LVLM create a global-vs-local trade-off that a layer-wise gating module can balance to improve multimodal benchmark performance.

  6. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  7. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  8. Attention Sinks in Diffusion Transformers: A Causal Analysis

    cs.CV 2026-05 unverdicted novelty 6.0

    Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.

  9. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  10. Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

    cs.CR 2026-04 unverdicted novelty 6.0

    TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.

  11. Graph-Guided Adaptive Channel Elimination for KV Cache Compression

    eess.SP 2026-04 unverdicted novelty 6.0

    GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

  12. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  13. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    cs.CL 2024-06 conditional novelty 6.0

    PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.

  14. HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

    cs.AI 2026-05 unverdicted novelty 5.0

    HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.

  15. Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay

    cs.CV 2026-05 unverdicted novelty 5.0

    Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.

  16. OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

    cs.LG 2026-04 unverdicted novelty 5.0

    OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering u...

  17. Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation

    cs.CL 2026-04 unverdicted novelty 5.0

    Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B models.

  18. SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

    cs.LG 2026-02 conditional novelty 5.0

    SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.

  19. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  20. DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

    cs.CV 2026-04 unverdicted novelty 4.0

    DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3...
