Measuring Maximum Activations in Open Large Language Models

Dawei Yin; Fang Wang; Han Tian; Haoyi Xiong; Jiamin Chen; Jiashu Zhao; Linghe Kong; Luxuan Chen; Rui Kong; Shuaiqiang Wang

arxiv: 2605.15572 · v2 · pith:VRIF64LAnew · submitted 2026-05-15 · 💻 cs.CL

Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Linghe Kong, Dawei Yin

Pith reviewed 2026-05-20 19:31 UTC · model grok-4.3

classification 💻 cs.CL

keywords: activation magnitude · large language models · quantization · mixture of experts · residual stream · model families · activation scaling · low-bit inference

The pith

Maximum activation magnitude in open LLMs is a property of family and architecture rather than size alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures the largest activation values reached during forward passes in a broad set of current open large language models. It runs the same 5,000-example multi-domain test set through 27 checkpoints from eight families, applying identical layer hooks to embeddings, attention, MLP blocks, and norms. The recorded peaks differ by nearly four orders of magnitude at comparable scales, with some families staying below a thousand and others exceeding half a million. This range directly constrains choices for activation scaling and low-bit quantization in deployment. The measurements indicate that these peak values arise from specific design and training decisions instead of following from parameter count.
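As a rough check of the "nearly four orders of magnitude" figure, the arithmetic below assumes the smallest global maximum sits near 10^2 (the lower end of the range reported for Qwen3.5 and MoE checkpoints; the exact minimum is not stated on this page) and takes the largest to be the roughly 7 × 10^5 reported for Gemma3-27B-it.

```latex
\[
\log_{10}\!\left(\frac{7 \times 10^{5}}{10^{2}}\right)
  = \log_{10}\!\left(7 \times 10^{3}\right)
  = 3 + \log_{10} 7
  \approx 3.85
\]
```

Under that assumption the spread is about 3.85 decades, which is consistent with "nearly four".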

Core claim

Global and layerwise maximum activations were recorded across 27 checkpoints from eight open families using a unified pipeline of 5,000 multi-domain samples and fixed hooks at embeddings, hidden states, attention, MLP or MoE blocks, SwiGLU gates, and the final norm. Maxima span almost four orders of magnitude at similar sizes, with Qwen3.5 and MoE models in the 10^2 to 10^3 range while Gemma3-27B-it reaches approximately 7 × 10^5. MoE checkpoints show peaks 14.0 to 23.4 times lower than matched dense models, and the residual stream carries the global maximum in 22 of 24 checkpoints. These patterns establish that maximum activation magnitude is a model property tied to family, architecture, and training stage rather than a simple byproduct of size.

What carries the argument

The unified measurement pipeline that applies identical tokenization and layer hooks to record global and per-layer maximum activation values across model families and training stages.
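A minimal, framework-free sketch of the bookkeeping such a pipeline needs: a running maximum of absolute values per hook site, plus a lookup of which site carries the global peak. The class name `MaxActivationTracker`, the site names, and the toy activation values are all hypothetical and are not taken from the paper's released code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class MaxActivationTracker:
    """Keeps the largest absolute value seen at each named hook site."""

    def __init__(self) -> None:
        self.site_max: Dict[str, float] = defaultdict(float)

    def observe(self, site: str, values: Iterable[float]) -> None:
        # Update the running per-site maximum of |x|.
        peak = max((abs(v) for v in values), default=0.0)
        if peak > self.site_max[site]:
            self.site_max[site] = peak

    def global_max(self) -> Tuple[str, float]:
        # Return the site that carries the largest magnitude overall.
        return max(self.site_max.items(), key=lambda kv: kv[1])


# Toy "forward passes" (hypothetical data): each sample maps a hook site
# to a flattened list of activation values.
samples: List[Dict[str, List[float]]] = [
    {"layer0.attn": [0.5, -1.2], "layer0.mlp": [2.0, -0.3], "layer0.hidden": [3.1, -0.7]},
    {"layer0.attn": [-0.9, 0.4], "layer0.mlp": [-4.5, 1.0], "layer0.hidden": [-88.0, 2.2]},
]

tracker = MaxActivationTracker()
for sample in samples:
    for site, values in sample.items():
        tracker.observe(site, values)

site, peak = tracker.global_max()
print(site, peak)
```

In a real setup, `observe` would presumably be called from a forward hook (for example one registered with PyTorch's `register_forward_hook`) on each instrumented module, with the same hook logic reused across families so that the recorded maxima are comparable.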

If this is right

Activation scaling and quantization choices must be tuned to the specific family rather than assumed from model size.
Mixture-of-experts designs can support more aggressive low-bit formats because their activation peaks remain substantially smaller.
The residual stream must be handled with care in any activation-management scheme since it holds the largest value in nearly all cases.
Open-weight releases should include measured maximum activations to support informed low-bit deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If peak magnitudes shift with training stage, repeated measurements during continued pretraining could reveal when and how activation growth occurs.
Architectural differences such as gating or expert routing appear to control dynamic range and could be adjusted to reduce the need for large activation scales.
Direct measurement of maxima provides a lightweight way to select per-model quantization scales that match observed reconstruction error.
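To illustrate why the measured maximum matters for scale selection, here is a small sketch of symmetric INT-8 quantization where the scale is set from the activation maximum (absmax calibration). It is an assumption that this resembles the paper's sanity check; the activation values and the two peak magnitudes (500 and 700,000, chosen to loosely echo the reported ranges) are hypothetical.

```python
from typing import List


def quantize_int8(values: List[float], scale: float) -> List[int]:
    # Symmetric quantization: round to the nearest step, clamp to [-127, 127].
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


def mean_abs_error(values: List[float], abs_max: float) -> float:
    # The scale is derived from the measured maximum magnitude.
    scale = abs_max / 127.0
    restored = dequantize(quantize_int8(values, scale), scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical "typical" activations; only the assumed peak differs.
typical = [12.0, -35.5, 60.2, -8.4, 27.9]
err_small_peak = mean_abs_error(typical, abs_max=500.0)
err_large_peak = mean_abs_error(typical, abs_max=700000.0)
print(err_small_peak, err_large_peak)
```

With the larger peak, the quantization step becomes so coarse that every typical value rounds to zero, so the reconstruction error equals the mean magnitude of the inputs. Per-channel or per-token scales are common ways to soften this effect, but they are outside what this sketch covers.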

Load-bearing premise

The 5,000-sample multi-domain corpus together with the chosen layer hooks is sufficient to capture the true global maximum activations for each model checkpoint.

What would settle it

Running any of the tested models on a substantially larger or more diverse input set and obtaining activation values several times higher than the reported maxima.
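One lightweight way to probe this is a saturation curve: the running maximum as a function of how many samples have been processed. The sketch below uses synthetic, heavy-tailed per-sample peaks drawn from a log-normal distribution; that distribution, the seed, and the checkpoint sizes are assumptions for illustration only and say nothing about the real models.

```python
import random
from typing import List


def running_max_curve(per_sample_max: List[float], checkpoints: List[int]) -> List[float]:
    # Maximum observed after the first n samples, for each n in checkpoints.
    return [max(per_sample_max[:n]) for n in checkpoints]


rng = random.Random(0)
# Hypothetical per-sample peak activations (one value per input sample).
per_sample_max = [rng.lognormvariate(3.0, 1.0) for _ in range(5000)]

checkpoints = [100, 500, 1000, 2500, 5000]
curve = running_max_curve(per_sample_max, checkpoints)
for n, peak in zip(checkpoints, curve):
    print(n, round(peak, 1))
```

A curve that flattens well before the full corpus size suggests the maximum has saturated for this input distribution. A curve still rising at the largest n would indicate that the corpus has not yet surfaced the true maximum.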

Figures

Figures reproduced from arXiv: 2605.15572 by Dawei Yin, Fang Wang, Han Tian, Haoyi Xiong, Jiamin Chen, Jiashu Zhao, Linghe Kong, Luxuan Chen, Rui Kong, Shuaiqiang Wang, Xinran Chen, Yuchen Li.

**Figure 2.** Failure modes for the four checkpoints that do not satisfy the Sun criterion. Colors indicate …

**Figure 3.** Layerwise heatmap of hidden-state peak magnitudes. The horizontal axis is normalized …

**Figure 4.** Representative layerwise trajectories for the two main emergence patterns. Left: jump …

**Figure 5.** Within-family scaling effects. The figure compares size changes only within the same …

**Figure 6.** Global maximum activation magnitudes for the 24 main-analysis checkpoints. The vertical …

**Figure 7.** Generational evolution at similar sizes. Left: Qwen shows a Qwen2.5 …

**Figure 8.** Matched-scale comparison of MoE and dense checkpoints. Each bar group fixes model …

**Figure 9.** Matched-scale comparison between Qwen2.5-VL and text-only Qwen2.5 checkpoints. The …

**Figure 10.** Matched-backbone comparison of Qwen2.5 Base and Instruct checkpoints. Each bar …

**Figure 11.** Global maximum activation across Ling-mini training stages. The horizontal axis is …

**Figure 12.** INT-8 activation quantization sanity check for eight representative models. Grouped …

**Figure 13.** Deployment-oriented tiers based on global maximum activation magnitude. The horizontal …

**Figure 14.** Hidden-state layerwise maximum-activation trajectories within each model family. Each …

**Figure 15.** Component-level maximum-activation trajectories for representative models. The three …

Original abstract

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 × 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Activation maxima differ substantially across recent open LLM families at similar scales, with MoE models lower and some like Gemma3 much higher, but the 5k-sample corpus leaves room for doubt on whether these are true global maxima.

The main thing to know is that this paper measures activation maxima across 27 recent open checkpoints from eight families and finds they vary by nearly four orders of magnitude even at comparable sizes, with MoE models showing 14-23x lower peaks than dense ones and the residual stream usually carrying the global max. That pattern breaks simple size-based scaling and could matter for quantization and scaling choices in deployment.

Referee Report

1 major / 2 minor

Summary. The paper introduces a unified measurement pipeline using a fixed 5,000-sample multi-domain corpus, family-specific tokenization, and identical layer hooks to compute global and layerwise maximum activation magnitudes across 27 checkpoints from 8 open LLM families (dense, MoE, vision-language, intermediate and instruction-tuned). It reports that these maxima span nearly four orders of magnitude at comparable scales (Qwen3.5/MoE in 10^2–10^3 vs. Gemma3-27B-it at ~7×10^5), that cross-family and cross-generation trends break simple size-based scaling, that MoE models show 14–23× lower peaks than matched dense counterparts, and that the residual stream carries the global maximum in most cases. A lightweight INT-8 check links the measured maxima to quantization reconstruction error. The central conclusion is that maximum activation magnitude is an intrinsic model property tied to family, architecture, and training stage rather than parameter count alone; code is released publicly.

Significance. If the measured values are representative of true global maxima, the work is significant for low-bit quantization, activation scaling, and stable inference: it shows that activation dynamic range is not uniform across the post-LLaMA open-model landscape and must be measured per release. The public code and the INT-8 sanity check are concrete strengths that allow direct reproduction and practical validation. The four-order-of-magnitude spread and the MoE-vs-dense gap, if robust, would be useful empirical facts for the deployment community.

major comments (1)

[unified pipeline description] The central claim that maximum activation magnitude is a model property independent of size rests on the 5,000-sample corpus actually capturing (or closely approximating) the global maximum for each checkpoint. The manuscript describes the corpus and hooks but provides no ablation on sample size, no saturation analysis, and no comparison against high-entropy or targeted inputs known from prior outlier-feature literature to elicit larger peaks. This omission directly affects the validity of the reported four-order spread and the 14–23× MoE gap.

minor comments (2)

[methods] The abstract and methods would benefit from an explicit statement of how many tokens or sequences were actually processed per model after tokenization, to allow readers to judge coverage.
[results] Table or figure showing per-family maxima should include error bars or min/max across multiple random seeds of the corpus if any subsampling was performed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for highlighting an important methodological consideration. We address the major comment point by point below and outline concrete revisions to strengthen the manuscript.

Referee: [unified pipeline description] The central claim that maximum activation magnitude is a model property independent of size rests on the 5,000-sample corpus actually capturing (or closely approximating) the global maximum for each checkpoint. The manuscript describes the corpus and hooks but provides no ablation on sample size, no saturation analysis, and no comparison against high-entropy or targeted inputs known from prior outlier-feature literature to elicit larger peaks. This omission directly affects the validity of the reported four-order spread and the 14–23× MoE gap.

Authors: We agree that explicit validation of corpus saturation would strengthen the central claim. Our 5,000-sample multi-domain corpus was assembled to maximize input diversity across domains, and the observed consistency of trends across 27 checkpoints from eight families provides supporting evidence that the reported relative differences are robust. Nevertheless, we did not include sample-size ablations or direct comparisons to high-entropy prompts in the submitted version. In the revision we will add (i) a saturation plot showing measured maxima as a function of corpus size (up to 20,000 samples) for one representative model per family and (ii) a targeted comparison using a small set of high-entropy inputs drawn from the outlier-feature literature. These results will be reported in a new appendix, the main claims will be qualified accordingly, and the public code will be updated to reproduce the new checks. We expect these additions to address the concern while preserving the core empirical observations.

Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements with no derivations or self-referential steps

The paper performs direct empirical measurements of maximum activation magnitudes across 27 checkpoints using a fixed 5,000-sample multi-domain corpus, family-specific tokenization, and identical layer hooks. No mathematical derivations, fitted parameters, equations, or self-citations are used to derive the central claim; the reported variations (e.g., four-order-of-magnitude spread, MoE vs. dense gaps) are presented as observed outcomes from the measurement pipeline. The analysis is self-contained against external benchmarks because the results are falsifiable by re-running the same hooks on the same or expanded corpora, with no load-bearing step reducing to a definition or prior self-result by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical measurement study. It relies on the standard assumption that a fixed multi-domain corpus can surface global activation maxima when hooks are placed at standard locations.

axioms (1)

Domain assumption: the 5,000-sample multi-domain corpus and identical hooks across layers capture representative global maxima. Invoked when the unified pipeline is used to measure and compare maxima across models.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We measure global and layerwise maxima on 27 checkpoints from 8 open families... under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.