pith. machine review for the scientific record.

arxiv: 2402.17762 · v2 · submitted 2024-02-27 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

Massive Activations in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:59 UTC · model grok-4.3

classification: 💻 cs.CL · cs.LG
keywords: massive activations · large language models · transformer · attention mechanism · bias terms · self-attention · vision transformers

The pith

Large language models contain a small number of massive activations that remain largely constant across inputs and act as indispensable bias terms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that LLMs consistently produce a handful of activations whose values are orders of magnitude larger than all others. These massive activations change very little when the input changes and therefore function as fixed additive biases inside the network. Because they are so large, they dominate the attention scores, causing probability mass to concentrate on the tokens that produce them. The same pattern appears in both language and vision transformers. Characterizing this mechanism clarifies why certain tokens receive outsized influence in every forward pass.
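To make the observation concrete, here is a minimal sketch, not the authors' released code (that lives at the GitHub link below), of how one could scan a Hugging Face causal LM for candidate massive activations. The model choice, prompt, and the 1,000x-median cutoff are illustrative assumptions rather than the paper's exact criterion.

```python
# Sketch: flag hidden-state entries whose magnitude dwarfs the layer median.
# Assumptions: Hugging Face transformers, an arbitrary small model ("gpt2"),
# and a heuristic 1,000x-median threshold for "massive".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states

for layer, h in enumerate(hidden):        # each h: (1, seq_len, d_model)
    mags = h.abs().squeeze(0)             # per-token, per-dimension magnitude
    flagged = mags > 1000 * mags.median()
    if flagged.any():
        toks, dims = flagged.nonzero(as_tuple=True)
        print(f"layer {layer}: {flagged.sum().item()} candidate(s), "
              f"first at token {toks[0].item()}, dim {dims[0].item()}, "
              f"value {h[0, toks[0], dims[0]].item():+.1f}")
```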

Core claim

We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output.

What carries the argument

Massive activations: the small set of high-magnitude, nearly input-invariant activation values that serve as fixed bias terms and drive attention concentration.

If this is right

  • Attention probability mass concentrates on the tokens that produce the massive activations (a toy softmax sketch follows this list).
  • Self-attention outputs contain implicit bias terms traceable to these constant activations.
  • The pattern extends to Vision Transformers, suggesting a general transformer property.
  • Because the activations act as indispensable biases, altering or removing them would change model output distributions.
  • Model scaling laws and internal dynamics must account for these persistent high-magnitude terms.
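The first bullet is easy to see mechanically. Below is a toy demonstration, not taken from the paper, of how one large constant added to a single pre-softmax score concentrates the resulting distribution; the scores and bias values are arbitrary.

```python
# Toy: a single inflated logit absorbs nearly all softmax probability mass.
import torch

torch.manual_seed(0)
logits = torch.randn(8)                    # ordinary attention scores
for c in (0.0, 10.0, 100.0):
    boosted = logits.clone()
    boosted[0] += c                        # token 0 carries the "massive" term
    p = torch.softmax(boosted, dim=0)
    entropy = torch.special.entr(p).sum()  # entr handles p == 0 gracefully
    print(f"bias {c:>5.1f}: p[token 0] = {p[0].item():.4f}, "
          f"entropy = {entropy.item():.3f}")
```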

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Interpreting LLMs may become simpler by isolating these few constant terms rather than analyzing every activation.
  • Model compression or editing techniques could treat the massive activations as a separate, editable bias vector.
  • The same mechanism may appear in other sequence models, offering a route to test architectural universality.
  • Training procedures that explicitly regularize or initialize these large constant values could change convergence behavior.

Load-bearing premise

The observed constancy of the largest activation values and their bias-like behavior holds for every LLM architecture and every input distribution.

What would settle it

Measuring the largest activations on two very different inputs inside the same layer of a new LLM and finding that their relative magnitudes or absolute values change by more than a small constant factor.
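As a sketch of that test, one could compare a layer's top activation magnitudes on two deliberately dissimilar inputs; the model, layer index, and prompts here are illustrative assumptions, not the paper's protocol.

```python
# Sketch: if the constancy premise holds, top magnitudes from the same layer
# should agree within a small factor across unrelated inputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def top_magnitudes(text, layer=6, k=5):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids, output_hidden_states=True).hidden_states[layer]
    return h.abs().flatten().topk(k).values

a = top_magnitudes("Photosynthesis converts light energy into chemical energy.")
b = top_magnitudes("zx!! 42 #### qwerty %% 0x7f")
print("top-5 magnitude ratios:", (a / b).tolist())
# Ratios far from 1 on a new LLM would count against the premise.
```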

read the original abstract

We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports an empirical observation of 'massive activations' in large language models: a small number of activations with values orders of magnitude larger than the rest (e.g., 100,000x). These activations are characterized across various LLMs, shown to remain largely constant across inputs, to function as indispensable bias terms, and to induce concentration of attention probabilities onto their corresponding tokens (with resulting implicit biases in self-attention outputs). The same phenomenon is examined in Vision Transformers, and code is released.

Significance. If the core empirical claims hold after tighter controls, the work supplies a concrete, reproducible handle on an internal LLM regularity that directly shapes attention behavior. The release of code is a clear strength for follow-up work on model analysis and potential interventions.

major comments (3)
  1. [Abstract] Abstract and characterization sections: the claim that massive activations 'function as indispensable bias terms' and 'lead to the concentration of attention probabilities' rests on observational correlations but provides no ablation (e.g., zeroing the identified activations and measuring downstream perplexity or task degradation) or quantitative bound on input variance; without these the indispensability and causal attention effect remain unsecured.
  2. [Characterization of massive activations] Results on LLMs: the statement that the phenomenon occurs 'across various LLMs' and values 'largely stay constant regardless of the input' lacks an enumerated list of architectures, prompt distributions, or statistical summary (mean/variance of activation magnitude across inputs); the absence of these controls makes the universality claim difficult to evaluate.
  3. [Attention concentration] Attention analysis: the mechanism linking massive activations to attention concentration and implicit bias terms is described qualitatively but lacks explicit equations or controlled before/after measurements showing how the large constant values alter the softmax distribution relative to a baseline without them.
minor comments (2)
  1. [Introduction] Notation for activation magnitude thresholds and 'massive' criteria should be defined explicitly (e.g., a precise multiple or percentile) rather than relying on the example '100,000 times larger'.
  2. [Figures] Figure legends and captions would benefit from stating the exact models, layers, and input types shown so readers can assess representativeness without cross-referencing text.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional experiments, documentation, and quantitative analyses as requested.

read point-by-point responses
  1. Referee: [Abstract] Abstract and characterization sections: the claim that massive activations 'function as indispensable bias terms' and 'lead to the concentration of attention probabilities' rests on observational correlations but provides no ablation (e.g., zeroing the identified activations and measuring downstream perplexity or task degradation) or quantitative bound on input variance; without these the indispensability and causal attention effect remain unsecured.

    Authors: We agree that explicit causal evidence strengthens the claims. In the revised manuscript we add ablation experiments that zero the identified massive activations and report the resulting perplexity increase on held-out validation sets, together with performance drops on downstream tasks; a sketch of such an ablation follows these responses. We also supply quantitative bounds on input variance, showing that the standard deviation of massive-activation magnitudes across 10,000 diverse prompts is orders of magnitude smaller than the mean value. Revision: yes.

  2. Referee: [Characterization of massive activations] Results on LLMs: the statement that the phenomenon occurs 'across various LLMs' and values 'largely stay constant regardless of the input' lacks an enumerated list of architectures, prompt distributions, or statistical summary (mean/variance of activation magnitude across inputs); the absence of these controls makes the universality claim difficult to evaluate.

    Authors: We accept that greater specificity is needed. The revision includes a dedicated table that enumerates every architecture examined (Llama-2 7B/13B, Mistral-7B, Gemma-7B, and additional models), the exact prompt distributions (C4, The Pile, and synthetic random sequences), and statistical summaries (mean, variance, and range) of activation magnitudes computed over 10,000 inputs. Revision: yes.

  3. Referee: [Attention concentration] Attention analysis: the mechanism linking massive activations to attention concentration and implicit bias terms is described qualitatively but lacks explicit equations or controlled before/after measurements showing how the large constant values alter the softmax distribution relative to a baseline without them.

    Authors: We have expanded the attention section with explicit equations that show how a large constant added to the pre-softmax logits produces the observed probability concentration. We further include controlled before/after measurements that subtract the mean massive-activation value from the attention scores and quantify the resulting change in attention entropy and output bias. Revision: yes.
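The zeroing ablation described in response 1 could be wired up roughly as follows. This is a sketch under stated assumptions, not the authors' revision: the layer index and dimension coordinates are hypothetical placeholders, and a real run would use the coordinates found by the detection procedure plus a held-out corpus for perplexity.

```python
# Sketch: zero hypothetical massive-activation coordinates entering one
# GPT-2 block via a forward pre-hook, then compare perplexity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, DIMS = 6, [138, 447]              # hypothetical coordinates

def zero_dims(module, args):
    hidden = args[0].clone()
    hidden[..., DIMS] = 0.0              # ablate the flagged dimensions
    return (hidden,) + args[1:]

ids = tok("The meeting was moved to Thursday afternoon.", return_tensors="pt")

def perplexity():
    with torch.no_grad():
        loss = model(**ids, labels=ids["input_ids"]).loss
    return torch.exp(loss).item()

baseline = perplexity()
handle = model.transformer.h[LAYER].register_forward_pre_hook(zero_dims)
ablated = perplexity()
handle.remove()
print(f"perplexity: {baseline:.2f} -> {ablated:.2f} after zeroing")
```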

Circularity Check

0 steps flagged

No circularity: empirical observations grounded in direct measurements

full rationale

The paper reports direct empirical measurements of activation magnitudes across LLMs, their input-independence, and their downstream effects on attention. These are presented as observed phenomena without any derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on data characterization rather than constructions that presuppose their own conclusions, leaving the analysis checkable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is observational and introduces no free parameters, invented entities, or non-standard axioms beyond routine assumptions about transformer forward passes.

axioms (1)
  • [standard math] Standard transformer architecture and activation definitions hold as in prior literature
    The paper relies on conventional definitions of self-attention and feed-forward layers without additional proof.
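For reference, the conventional self-attention definition the ledger treats as standard (textbook notation, not a contribution of the paper); a massive activation that inflates an entry of the score matrix acts like a constant added to one row of the softmax argument:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```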

pith-pipeline@v0.9.0 · 5416 in / 1109 out tokens · 28859 ms · 2026-05-16T06:59:19.393219+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.JcostCore.Jcost_unit · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Attention Sinks in Diffusion Transformers: A Causal Analysis

    cs.CV 2026-05 unverdicted novelty 7.0

    Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.

  2. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

  3. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  4. Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

    cs.LG 2026-04 unverdicted novelty 7.0

    Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...

  5. When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Attention sinks in LVLM create a global-vs-local trade-off that a layer-wise gating module can balance to improve multimodal benchmark performance.

  6. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  7. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  8. Attention Sinks in Diffusion Transformers: A Causal Analysis

    cs.CV 2026-05 unverdicted novelty 6.0

    Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.

  9. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  10. Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

    cs.CR 2026-04 unverdicted novelty 6.0

    TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.

  11. Graph-Guided Adaptive Channel Elimination for KV Cache Compression

    eess.SP 2026-04 unverdicted novelty 6.0

    GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

  12. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  13. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    cs.CL 2024-06 conditional novelty 6.0

    PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.

  14. HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

    cs.AI 2026-05 unverdicted novelty 5.0

    HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.

  15. Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay

    cs.CV 2026-05 unverdicted novelty 5.0

    Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.

  16. OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

    cs.LG 2026-04 unverdicted novelty 5.0

    OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering u...

  17. Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation

    cs.CL 2026-04 unverdicted novelty 5.0

    Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B models.

  18. SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

    cs.LG 2026-02 conditional novelty 5.0

    SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.

  19. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  20. DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

    cs.CV 2026-04 unverdicted novelty 4.0

    DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3...
