pith. machine review for the scientific record.

arxiv: 2604.18128 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:57 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords W4A4 quantization · SwiGLU · depth registers · reader/generator decomposition · activation magnitude control · post-training quantization · language models

The pith

Depth registers with a hinge loss control residual reader magnitudes in SwiGLU, unlocking W4A4 quantization performance while leaving the w2 bilinear generator as the dominant error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the sources of error in post-training 4-bit weight and activation quantization for a SwiGLU decoder-only language model. It finds that a simple training intervention using depth registers and a magnitude hinge loss improves quantized perplexity dramatically from 1727 to 119, and to 39.9 when composed with SmoothQuant. By decomposing the SwiGLU block into residual-axis readers and block-internal generators, norm arguments and experiments show that the intervention bounds reader activations tightly but leaves generators unchanged, resulting in a consistent residual loss of about 0.28 nats dominated by w2. This identifies the bilinear input to w2 as the key remaining challenge that neither magnitude control nor orthogonal rotations fully address.
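For readers unused to the unit: the excess-NLL observable behind the "0.28 nats" figure can be reconstructed from the abstract and figure captions roughly as follows. The notation is ours, a hedged reading rather than the paper's own equations.

```latex
% \ell(\cdot): validation cross-entropy in nats/token; Q(S): the model with the linears
% in the set S fake-quantized to W4A4; \mathcal{L}: all trainable block linears.
\Delta\mathrm{NLL}(S) = \ell\big(Q(S)\big) - \ell(\mathrm{FP16}),
\qquad
\Delta_{\mathrm{remove}}(w_2) = \ell\big(Q(\mathcal{L})\big) - \ell\big(Q(\mathcal{L}\setminus\{w_2\})\big).
```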

Core claim

Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds. Empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating.
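A hedged reconstruction of that norm asymmetry (symbols are ours, not the paper's): reader inputs are pinned by RMSNorm, while the w2 input is a bilinear product controlled only by the product of its factors' bounds.

```latex
% Readers (qkv, w1, w3) ingest the RMSNorm output, whose per-token scale is fixed:
\hat{x} = \frac{x}{\mathrm{RMS}(x)} \odot g, \qquad \|\hat{x}\|_2 \le \sqrt{d}\,\|g\|_\infty .
% The generator w2 ingests the SwiGLU bilinear product, with only the trivial product bound:
u = \mathrm{SiLU}(W_1\hat{x}) \odot (W_3\hat{x}), \qquad
\|u\|_\infty \le \|\mathrm{SiLU}(W_1\hat{x})\|_\infty \, \|W_3\hat{x}\|_\infty .
```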

What carries the argument

Depth Registers with register-magnitude hinge loss (DR+sink) applied to residual-axis readers (qkv, w1, w3) versus block-internal generators (o_proj, w2) in SwiGLU blocks.
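The review does not spell the hinge out, so the following is a minimal sketch of what a register-magnitude hinge penalty could look like in PyTorch; the register layout, the L2 reduction, and the threshold tau are our assumptions, not the paper's code.

```python
import torch

def register_hinge_loss(registers: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Hinge penalty on depth-register magnitudes (illustrative sketch, not the paper's code).

    registers: (num_layers, num_registers, d_model) learned register vectors
    tau:       magnitude threshold above which the hinge activates (assumed)
    """
    norms = registers.norm(dim=-1)                 # (num_layers, num_registers)
    # Penalize only the excess over tau, so registers can soak up outlier mass
    # without being pushed toward zero.
    return torch.clamp(norms - tau, min=0.0).mean()
```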

If this is right

  • DR+sink reduces W4A4 validation perplexity by a factor of about 14 at matched FP16 performance.
  • Combining DR+sink with SmoothQuant further reduces perplexity to 39.9.
  • Per-Linear QuaRot nearly matches DR+sink on the reader axis as a post-hoc method.
  • Full QuaRot with value Hadamard and w2-input rotation fails to close the remaining gap, confirming that orthogonal rotation cannot bound the bilinear SwiGLU tail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar magnitude control techniques could be applied to other activation quantization problems in transformer variants with residual paths.
  • The flat residue suggests that w2-specific quantization strategies, such as input-aware scaling, might close the remaining FP16 gap (a minimal sketch follows this list).
  • The reader-generator split may generalize to other feed-forward architectures beyond SwiGLU.
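Expanding the input-aware scaling bullet above: a SmoothQuant-style per-channel scale migration applied only at w2's input might look like the sketch below. This is an editorial illustration under our own assumptions (module names, alpha = 0.5, folding 1/s into w3), not anything the paper reports.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def smooth_w2_input(w2: nn.Linear, calib_acts: torch.Tensor, alpha: float = 0.5):
    """SmoothQuant-style scale migration at w2's input only (illustrative).

    calib_acts: (tokens, d_ff) calibration activations of the SwiGLU bilinear product
    alpha:      migration strength; 0.5 is the common default, not a value from the paper
    """
    act_max = calib_acts.abs().amax(dim=0).clamp(min=1e-5)   # per-channel activation range
    w_max = w2.weight.abs().amax(dim=0).clamp(min=1e-5)      # per-input-channel weight range
    s = act_max.pow(alpha) / w_max.pow(1.0 - alpha)          # divide activations, multiply weights
    w2.weight.mul_(s)                                        # fold s into w2's columns
    return s  # the bilinear product must then be divided by s, e.g. by folding 1/s into w3's rows
```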

Load-bearing premise

The partition into residual-axis readers and block-internal generators, together with the hinge loss, isolates the reader effect without confounding generator behavior, though the experiments do not fully separate the two.

What would settle it

An experiment that quantizes only the readers versus only w2 after DR+sink training and checks whether the flat 0.28 nats residue and kurtosis collapse persist independently for the readers alone.
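A minimal harness for that experiment, assuming PyTorch and round-to-nearest fake quantization; the module names and the per-row/per-token scale choices are our assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

READERS = ("q_proj", "k_proj", "v_proj", "w1", "w3")  # residual-axis readers (assumed module names)
GENERATORS = ("o_proj", "w2")                         # block-internal generators

def rtn_w4a4(linear: nn.Linear) -> None:
    """Replace a Linear's forward with symmetric round-to-nearest W4A4 fake quantization."""
    def quant(t: torch.Tensor, bits: int = 4) -> torch.Tensor:
        qmax = 2 ** (bits - 1) - 1
        scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
        return (t / scale).round().clamp(-qmax - 1, qmax) * scale
    w_q = quant(linear.weight.data)                   # per-output-row weight quantization
    def forward(x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(quant(x), w_q, linear.bias)  # per-token activation quantization
    linear.forward = forward

def quantize_selected(model: nn.Module, targets) -> nn.Module:
    """Fake-quantize only the named linears; everything else stays in full precision."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and name.split(".")[-1] in targets:
            rtn_w4a4(module)
    return model

# Readers-only vs w2-only runs would then compare validation NLL against FP16:
# quantize_selected(model, READERS)   vs.   quantize_selected(model, ("w2",))
```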

Figures

Figures reproduced from arXiv: 2604.18128 by Ziyang Liu.

Figure 1. The five input-activation sites of a SwiGLU block. Readers (qkv, w1, w3; teal) ingest the RMSNorm of the residual stream; generators (o_proj, w2; amber) ingest block-internal quantities: the softmax·V aggregate and the SwiGLU bilinear product.

Figure 2. Readers and generators respond differently to residual-axis fixes. (a) Per-layer channel-max excess kurtosis, layer mean: DR+sink collapses reader kurtosis while generator kurtosis is essentially unchanged. (b) W4A4 excess-NLL decomposition, conditional on the rest being quantized: the reader-rescued residue is flat at ≈ 0.28 nats; naive ΔNLL shrinks ≈ 2.7×.

Figure 3. W4A4 skip-linear ablation, log-scale PPL by skip target. Skipping all readers drops every checkpoint to the ∼31 PPL floor; generator-only skips leave near-naive damage; w2 dominates o_proj within each model.

Figure 4. Per-layer input-activation kurtosis (kurt_c per linear per layer, log color scale). Under DR+sink (k = 64 + sink, bottom row), reader rows (qkv, w1, w3) dim uniformly across all 20 layers, while generator rows (o_proj, w2) are indistinguishable from the k = 0 baseline at every layer. Severity (max-channel / median-channel ratio, not shown) varies across layers and linears by 2–8×.

Figure 5. Sensitivity of validation loss to Gaussian noise injected at generator inputs. For each of the three checkpoints, Gaussian noise with per-channel relative standard deviation σrel is added to the input of either o_proj (left) or w2 (right) at every layer, and Δloss is measured against FP16. Noise at o_proj's input grows close to linearly in σrel; noise at w2's input grows superlinearly.
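A hedged reconstruction of the Figure 5 probe: per-channel relative Gaussian noise injected at a chosen generator's input via forward-pre hooks. Reading σrel as a multiple of each channel's own standard deviation is our interpretation, not a statement from the paper.

```python
import torch

def add_input_noise_hooks(model: torch.nn.Module, target: str, sigma_rel: float):
    """Add noise to the input of every linear named `target` ('o_proj' or 'w2'); illustrative only."""
    def hook(module, inputs):
        (x,) = inputs
        # per-channel std over all leading (batch/sequence) dims, scaled by sigma_rel
        std = x.float().std(dim=tuple(range(x.dim() - 1)), keepdim=True)
        noise = torch.randn_like(x, dtype=torch.float32) * std * sigma_rel
        return (x + noise.to(x.dtype),)
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and name.split(".")[-1] == target:
            handles.append(module.register_forward_pre_hook(hook))
    return handles  # call h.remove() on each handle to restore the clean model
```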
original abstract

We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention -- Depth Registers with a register-magnitude hinge loss (DR+sink) -- reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot -- adding online per-head value Hadamard plus online w2-input rotation -- does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.
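For reference, the per-channel excess-kurtosis statistic the kurtosis claims rest on could be computed roughly as below; the exact "channel-max, layer mean" reduction used in the figures is our assumption.

```python
import torch

def per_channel_excess_kurtosis(acts: torch.Tensor) -> torch.Tensor:
    """Per-channel excess kurtosis of one linear's input activations (illustrative).

    acts: (tokens, channels) activations captured at that linear's input.
    """
    x = acts.float()
    z = (x - x.mean(dim=0)) / x.std(dim=0, unbiased=False).clamp(min=1e-12)
    return z.pow(4).mean(dim=0) - 3.0  # 0 for a Gaussian channel; large when a channel is outlier-heavy

# The figures' per-layer number would then be something like
# per_channel_excess_kurtosis(acts).max() for each linear, averaged over layers (our reading).
```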

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies post-training W4A4 quantization on a 300M SwiGLU decoder-only LM trained on 5B FineWeb-Edu tokens. Naive W4A4 raises validation PPL from 23.6 to 1727; a training-time Depth Registers + register-magnitude hinge (DR+sink) reduces it to 119 at matched FP16 PPL, and composes with SmoothQuant to 39.9. The central claim decomposes the residual ~2 PPL gap by site: residual-axis readers (qkv, w1, w3) are tightly bounded by magnitude control while the block-internal generator w2 is only bounded by the product of its factors; DR+sink collapses reader kurtosis, leaves generators unchanged, and leaves a flat ~0.28 nats residue whose dominant term is Delta-remove(w2). Full QuaRot fails to close the gap, consistent with the bilinear-tail prediction. All claims are scoped to the single 300M, single-seed, 5B-token setting, and the paper states that experiments do not isolate the reader/generator partition from the hinge.

Significance. If the decomposition is robust, the work supplies a concrete diagnostic for why SwiGLU W4A4 error concentrates on the w2 bilinear input and shows that residual-axis magnitude control recovers most of the damage. The matched-checkpoint PPL numbers, kurtosis measurements, and direct Delta-remove(w2) observations are strengths; the failure of full QuaRot supplies a falsifiable test of the norm argument. The single-model scale and explicit non-isolation of hinge from partition, however, keep the result exploratory rather than definitive for larger models or deployment.

major comments (2)
  1. [Abstract] The paper states that 'our experiments do not isolate the partition from the hinge.' Because DR+sink is a training-time intervention applied before the post-hoc reader/generator split and quantization-error measurements, any hinge-induced shift in generator activation statistics or error propagation directly confounds the claim that Delta-remove(w2) dominance is an inherent property of the bilinear SwiGLU tail rather than an artifact of the hinge. This is load-bearing for the central attribution of the flat ~0.28 nats residue.
  2. [Abstract and experimental sections] The central empirical claim (flat ~0.28 nats residue with w2 dominance across three matched checkpoints) rests on a single 300M model and single seed. No ablation varies model scale, data volume, or random seed while keeping the DR+sink hinge fixed, so it is unclear whether the reader kurtosis collapse and w2 dominance generalize or are specific to this training run.
minor comments (2)
  1. [Abstract] The abstract reports 'about 14x' improvement; an exact ratio (1727/119) should be stated for precision.
  2. [Main text] Notation for 'Delta-remove(w2)' and the exact definition of the reader/generator partition should be introduced with an equation or table in the main text rather than only in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments. We respond to each major comment below, maintaining the scoped nature of our claims.

point-by-point responses
  1. Referee: [Abstract] The paper states that 'our experiments do not isolate the partition from the hinge.' Because DR+sink is a training-time intervention applied before the post-hoc reader/generator split and quantization-error measurements, any hinge-induced shift in generator activation statistics or error propagation directly confounds the claim that Delta-remove(w2) dominance is an inherent property of the bilinear SwiGLU tail rather than an artifact of the hinge. This is load-bearing for the central attribution of the flat ~0.28 nats residue.

    Authors: We agree that the training-time application of DR+sink precludes fully isolating the reader/generator partition from any effects of the hinge. Our manuscript already states this limitation explicitly. Kurtosis measurements in the paper show generators remain essentially unchanged under DR+sink, and the observed w2 dominance holds within this trained model. We will revise the abstract to further emphasize that the attribution applies specifically under DR+sink training. revision: partial

  2. Referee: [Abstract and experimental sections] The central empirical claim (flat ~0.28 nats residue with w2 dominance across three matched checkpoints) rests on a single 300M model and single seed. No ablation varies model scale, data volume, or random seed while keeping the DR+sink hinge fixed, so it is unclear whether the reader kurtosis collapse and w2 dominance generalize or are specific to this training run.

    Authors: We acknowledge that results are from a single 300M model, single seed, and fixed 5B-token training run. The three matched checkpoints offer internal consistency for the reported residue and dominance, but we make no claims of generalization. All statements in the manuscript are already scoped to this specific setting. revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's load-bearing steps are (1) elementary norm arguments bounding residual-axis readers (qkv, w1, w3) versus the product-bound bilinear input to w2, presented as first-principles analysis, and (2) direct empirical observations of kurtosis collapse, unchanged generator statistics, and flat ~0.28 nats residue with Delta-remove(w2) dominance across matched checkpoints. These do not reduce by construction to the DR+sink hinge parameters or to any fitted input renamed as prediction. The paper explicitly states its experiments do not isolate the partition from the hinge, which is an experimental-design caveat rather than a self-definitional loop. No self-citations, uniqueness theorems, or ansatzes appear in the provided text. The central claims remain independent of the intervention by the paper's own equations and measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that the 300M SwiGLU decoder-only model trained on FineWeb-Edu is representative and that the hinge loss does not alter generator statistics in ways that confound the reader rescue measurement. No free parameters are explicitly fitted in the reported results beyond standard training; the hinge threshold is an implicit design choice.

axioms (1)
  • domain assumption Residual-axis magnitude control bounds reader activations but only trivially bounds the product input to w2.
    Invoked via elementary norm arguments in the abstract to explain why DR+sink helps readers but not generators.
invented entities (1)
  • Depth Registers · no independent evidence
    purpose: Training-time auxiliary vectors whose magnitudes are penalized to control residual-axis activations.
    New construct introduced to probe quantization sensitivity; no independent evidence outside the reported PPL and kurtosis measurements.

pith-pipeline@v0.9.0 · 5655 in / 1687 out tokens · 35416 ms · 2026-05-10T04:57:30.913578+00:00 · methodology

discussion (0)

