Massive Activations in Large Language Models

Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu · 2024 · cs.CL · arXiv 2402.17762

43 Pith papers cite this work. Polarity classification is still indexing.

abstract

We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.
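A minimal sketch of how one might flag such activations in a single hidden-state vector. The function name `find_massive_activations` and both thresholds (`abs_threshold`, `ratio_threshold`) are illustrative assumptions rather than values from the abstract; the authors' own code is in the linked repository.

```python
from statistics import median


def find_massive_activations(hidden_state, abs_threshold=100.0, ratio_threshold=1000.0):
    """Return (index, value) pairs for entries that dwarf the rest of the vector.

    hidden_state: flat list of floats for one token at one layer.
    abs_threshold, ratio_threshold: assumed cutoffs, chosen for illustration only.
    """
    magnitudes = [abs(v) for v in hidden_state]
    med = median(magnitudes)
    hits = []
    for i, (v, m) in enumerate(zip(hidden_state, magnitudes)):
        # Flag entries that are large in absolute terms AND relative to the median.
        if m > abs_threshold and m > ratio_threshold * med:
            hits.append((i, v))
    return hits


# Toy hidden state: mostly small values plus one planted outlier at index 5.
toy = [0.5, -0.3, 0.8, -0.6, 0.4, 2500.0, -0.7, 0.2]
print(find_massive_activations(toy))
```

Comparing against the median rather than the mean keeps the baseline from being inflated by the outliers themselves.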

citation-role summary

background: 3 · method: 1

citation-polarity summary

background: 3 · use method: 1

representative citing papers

CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs

cs.CR · 2025-11-27 · conditional · novelty 8.0

CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without the trigger.

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

cs.CV · 2026-06-10 · conditional · novelty 7.0

Reroute turns irreversible visual-token pruning into recoverable routing that reuses existing attention scores, improving grounding performance under aggressive reduction on LLaVA-1.5 and Qwen while preserving TFLOPs and KV-cache budgets.

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

AVLLMs route audio-visual information sequentially in video tasks and via parallel streams for interleaved items, allowing early token discard with little performance loss across models and scales.

Dead Directions: Geometric Singular Learning

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Dead directions recover Watanabe's RLCT contribution and triple (λ, m, ν) from directional Fisher curvature decay rates in original parameter space for singular models, extended via K-FAC to networks and gauge-equivariant optimizers.

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Mechanistic analysis of GLMs shows graph sink tokens have high activation but low importance for predictions, indicating decoupling between saliency and graph-semantic utility.

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.

Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Bayesian Filtering Transformer reframes attention as precision-weighted kriging and residual connections as Kalman updates, delivering gains on cold-start recommendation and noisy LLM fine-tuning tasks.

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

cs.CL · 2026-05-08 · conditional · novelty 7.0 · 2 refs

Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

Scaling and evaluating sparse autoencoders

cs.LG · 2024-06-06 · unverdicted · novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

DynamicPTQ uses new metrics of residual-stream dynamics to apply 8-bit activation precision only to quantization-sensitive layers in W4A4KV4 LLM inference, improving perplexity and QA performance over static smoothing baselines.

ICA Lens: Interpreting Language Models Without Training Another Dictionary

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

ICALens applies an optimized ICA workflow to LLM activations and recovers compact interpretable directions that match or exceed public SAEs on SAEBench probing and perturbation tasks without per-layer dictionary training.

Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

A single dominant layer in LLMs, found by activation outliers, accounts for most ZO fine-tuning gains and can replace full-model updates across models and tasks.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.

A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

A Weibull diagnostic framework classifies transformer weight matrices into consistent functional classes via the shape parameter k and tracks training progress via the scale parameter lambda across multiple architectures.

Registers Matter for Pixel-Space Diffusion Transformers

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

cs.MM · 2026-05-11 · unverdicted · novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

Attention Sinks in Diffusion Transformers: A Causal Analysis

cs.CV · 2026-05-10 · unverdicted · novelty 6.0 · 3 refs

Suppressing attention sinks in Stable Diffusion 3 does not degrade text-image alignment or preference metrics at mild intervention levels, though stronger suppression reveals sink-specific perceptual shifts larger than random masking.

Taming Outlier Tokens in Diffusion Transformers

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

cs.CR · 2026-04-27 · unverdicted · novelty 6.0

TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.

Graph-Guided Adaptive Channel Elimination for KV Cache Compression

eess.SP · 2026-04-18 · unverdicted · novelty 6.0

GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

Prophecy: Inferring Formal Properties from Neuron Activations

cs.LG · 2025-09-25 · unverdicted · novelty 6.0

Prophecy infers formal properties of feed-forward neural networks by extracting rules from neuron activation patterns that imply desirable output behaviors.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · none · ref 16 · 2 links · internal anchor

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04-22 · unreviewed · ref 39 · internal anchor

When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models

cs.CV · 2026-04-01 · unreviewed · ref 36 · internal anchor
