
arXiv: 2605.25203 · v1 · pith:LKAUCOYPnew · submitted 2026-05-24 · 💻 cs.LG · cs.AI · cs.LO

Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization

Pith reviewed 2026-06-30 12:10 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.LO
keywords LLM quantization · low-bit weights · Walsh-Hadamard transform · spectral rotation · perplexity · decoder-only models · weight-only quantization · activation energy

The pith

A Walsh-Hadamard rotation plus column rescaling by activation energy biases 2-bit weight rounding and cuts perplexity 15-58 percent on decoder-only LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a single linear transformation applied before a standard reconstruction-error quantizer improves extreme low-bit weight quantization. Each linear layer's weight matrix is rotated by the Walsh-Hadamard transform (WHT), and its columns are rescaled according to per-coordinate activation energy in the Walsh basis. The transformed weights are then handed to an off-the-shelf quantizer, Intel auto-round. On four decoder-only models ranging from 135 million to 1.5 billion parameters, the procedure lowers WikiText-2 perplexity by 15 to 58 percent relative to untransformed auto-round at 2-bit weights and 16-bit activations (W2A16). The same change produces no measurable gain at 4-bit weights, consistent with the idea that the benefit appears only when the noise budget is tight.

Core claim

The central claim is that the influence-adaptive Walsh geometry supplies a math-invariant rotation and rescaling that biases the per-group integer rounding decisions of a downstream quantizer toward channels carrying higher spectral energy. This bias is claimed to yield lower perplexity at W2A16 on the tested decoder-only models while remaining compatible with existing export pipelines to OpenVINO IR.

What carries the argument

The WHT-rotate-plus-Walsh-energy-rescale step applied to each linear layer's weight matrix before reconstruction-error quantization.
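
A minimal sketch of what such a step could look like, in NumPy. The paper's exact formula is not reproduced on this page, so several pieces are assumptions made for illustration only: the function names (`hadamard`, `walsh_energy_scales`, `rotate_and_rescale`, `transform_input`), the choice of scale (square root of the mean squared Walsh coefficient, normalized to mean 1), and the placement of the inverse on the input side. The aim is only to show why a rotate-and-rescale step of this kind can leave the layer's output unchanged before any quantization happens.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def walsh_energy_scales(X: np.ndarray, H: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-coordinate scales from calibration activations X (samples x in_features).

    ASSUMPTION: scale_j = sqrt(mean squared j-th Walsh coefficient), normalized
    to mean 1. The paper's actual scaling formula may differ.
    """
    energy = np.mean((X @ H.T) ** 2, axis=0)  # energy per Walsh coordinate
    s = np.sqrt(energy + eps)
    return s / s.mean()

def rotate_and_rescale(W: np.ndarray, X: np.ndarray):
    """Rotate W (out x in) into the Walsh basis, then scale column j by s[j]."""
    H = hadamard(W.shape[1])
    s = walsh_energy_scales(X, H)
    W_t = (W @ H.T) * s
    return W_t, H, s

def transform_input(x: np.ndarray, H: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Input-side inverse: rotate x into the Walsh basis and undo the scaling."""
    return (H @ x) / s

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
# Synthetic calibration activations with uneven per-channel magnitude.
X = rng.standard_normal((64, 16)) * np.linspace(0.1, 3.0, 16)
W_t, H, s = rotate_and_rescale(W, X)

x = rng.standard_normal(16)
# Before quantization the transformed layer computes the same output.
assert np.allclose(W @ x, W_t @ transform_input(x, H, s))
```

In this sketch `W_t` is what a downstream quantizer would see, and the input-side inverse would have to be absorbed somewhere upstream for inference to stay unchanged. Whether larger scales should go to high-energy or low-energy coordinates is a design choice the page does not specify.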

If this is right

  • The redistribution payoff is visible only at W2 and falls within the ±0.5 PPL noise floor at W4.
  • Three architecture-specific extensions (per-head PCA replacement, SO(2) rotation commuting with RoPE, and MoE input absorption) transfer the recipe to models where the base version failed.
  • All resulting quantized weights run on Intel NPU, Arc dGPU, and CPU, with perplexity invariant to device within ±0.1.
  • The paper does not claim a formal transfer of the Boolean majorization argument from the companion theory paper.
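
The SO(2) extension in the list above rests on a standard fact: planar rotations commute with each other, so a fixed rotation applied to each coordinate pair commutes with RoPE's position-dependent rotation of the same pair. The sketch below checks this numerically. It is an illustration, not the paper's construction: the helper names (`rot2`, `rope`, `pair_rotate`) and the interleaved pairing of coordinates (2i, 2i+1) are assumptions, and some implementations pair coordinates differently.

```python
import math
import random

def rot2(angle, v):
    """Rotate the 2-vector v by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def rope(vec, pos, base=10000.0):
    """Rotate pair (2i, 2i+1) by pos * base**(-2i/d). Pairing is an assumption."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        out.extend(rot2(theta, (vec[2 * i], vec[2 * i + 1])))
    return out

def pair_rotate(vec, angles):
    """Apply a fixed SO(2) rotation angles[i] to pair (2i, 2i+1)."""
    out = []
    for i, phi in enumerate(angles):
        out.extend(rot2(phi, (vec[2 * i], vec[2 * i + 1])))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
d = 8
q = [random.gauss(0.0, 1.0) for _ in range(d)]
k = [random.gauss(0.0, 1.0) for _ in range(d)]
angles = [random.uniform(-math.pi, math.pi) for _ in range(d // 2)]

# Order of application does not matter: rotate-then-RoPE equals RoPE-then-rotate.
a = rope(pair_rotate(q, angles), pos=5)
b = pair_rotate(rope(q, pos=5), angles)
commutes = all(math.isclose(u, v, abs_tol=1e-9) for u, v in zip(a, b))

# Rotating q and k by the same per-pair angles leaves the attention score unchanged.
score_before = dot(rope(q, 7), rope(k, 3))
score_after = dot(rope(pair_rotate(q, angles), 7), rope(pair_rotate(k, angles), 3))
print(commutes, math.isclose(score_before, score_after, abs_tol=1e-9))
```

A general orthogonal rotation mixing coordinates across pairs would not commute with RoPE in this way, which is presumably why the extension restricts itself to per-pair rotations.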

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the spectral rescaling consistently improves low-bit rounding, the same transform might be tested as a preprocessing step for other reconstruction-based quantizers beyond auto-round.
  • The absence of gain at W4 suggests the technique targets regimes where rounding error dominates, which could guide when to apply it versus when to use higher-bit baselines.
  • Because the transformation is linear and invertible, it can be folded into the model weights without changing inference arithmetic, making it a drop-in engineering adjustment.
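
To make the "rounding error dominates" point concrete, the sketch below compares plain round-to-nearest error at 2 and 4 bits on one synthetic weight group. It is a toy illustration under stated assumptions: auto-round learns its rounding rather than using round-to-nearest, the group size of 128 and the Gaussian weights are arbitrary choices, and `quantize_group` and `mse` are hypothetical helper names.

```python
import random

def quantize_group(w, bits):
    """Asymmetric round-to-nearest over one group; returns dequantized values."""
    lo, hi = min(w), max(w)
    levels = 2 ** bits - 1          # number of steps between lo and hi
    scale = (hi - lo) / levels
    if scale == 0.0:                # constant group: nothing to round
        return list(w)
    return [lo + round((x - lo) / scale) * scale for x in w]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(128)]

err_w2 = mse(group, quantize_group(group, 2))
err_w4 = mse(group, quantize_group(group, 4))
print(f"W2 MSE / W4 MSE = {err_w2 / err_w4:.1f}")
```

The step size at 2 bits is five times the step size at 4 bits (3 steps versus 15 over the same range), so under a uniform-error model the mean squared error differs by a factor on the order of 25. The ratio printed for this particular Gaussian sample will differ from that idealized figure.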

Load-bearing premise

That rescaling columns by per-coordinate Walsh-basis activation energy will steer the quantizer's rounding choices toward channels that improve overall model perplexity.

What would settle it

Running the identical auto-round calibration on the same four models at W2A16 but without the Walsh rotation and rescaling step, and obtaining WikiText-2 perplexity values within 5 percent of the transformed results.

Original abstract

We apply the influence-adaptive Walsh geometry of a companion theory paper (arXiv:2605.01637) to extreme low-bit weight-only LLM quantization. The recipe is one math-invariant transformation: WHT-rotate each linear layer's weight matrix and rescale its columns by per-coordinate Walsh-basis activation energy before handing off to a reconstruction-error quantizer (Intel auto-round). This biases per-group integer rounding toward high-spectral-energy channels. On four pretrained decoder-only models from 135M to 1.5B parameters, BBT-spectral reduces wikitext-2 perplexity by 15-58% relative to vanilla auto-round at W2A16; we also report a TinyLlama-1.1B auxiliary data point. Three extensions transfer the recipe to families it failed on: a per-head PCA matrix-Gamma replacement of q_norm/k_norm for Qwen3 attention (PPL 136.76 -> 88.99 on Qwen3-0.6B); an SO(2) per-pair rotation that commutes with RoPE (PPL 36.93 -> 21.84 on Qwen2.5-1.5B); and an MoE-aware input-side absorption fix identified by architectural fuzzing of Laguna-style fused-expert layouts. A W2-vs-W4 ablation gives a deliberate negative control: the redistribution payoff falls within the +/-0.5 PPL noise floor at W4, consistent with the Schur-convexity intuition that the cost of unconcentrated influence vanishes as the noise budget shrinks. All quantized weights export to OpenVINO IR and run on Intel NPU + Arc dGPU + CPU with PPL invariant to device within +/-0.1. We do not claim a formal Boolean-to-real-valued transfer of the theory paper's majorization argument: the WHT activation energy used here is not the Boolean influence of the theory paper, the link is intuitive, and the contribution is engineering value rather than a transferred theorem. Head-to-head benchmarks against SpinQuant, QuaRot, QuIP-sharp, AQLM, OmniQuant, and ButterflyQuant at matched calibration are the main future-work item.
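
For orientation, the two perplexity pairs quoted in the abstract can be converted to relative reductions with a few lines of Python. The helper names `relative_reduction` and `within_tolerance` are hypothetical, and the abstract does not say which configuration produced the first number in each pair, so these figures describe only the quoted before/after values. The second helper is one possible reading of the "within 5 percent" settling criterion stated earlier.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before

def within_tolerance(a: float, b: float, tol: float = 0.05) -> bool:
    """True if a is within `tol` (as a fraction) of b."""
    return abs(a - b) / b <= tol

# PPL pairs as quoted in the abstract (before -> after).
pairs = {
    "Qwen3-0.6B, per-head PCA": (136.76, 88.99),
    "Qwen2.5-1.5B, SO(2) rotation": (36.93, 21.84),
}

for name, (before, after) in pairs.items():
    drop = relative_reduction(before, after)
    close = within_tolerance(before, after)
    print(f"{name}: {drop:.1f}% lower, within 5%: {close}")
# Qwen3-0.6B, per-head PCA: 34.9% lower, within 5%: False
# Qwen2.5-1.5B, SO(2) rotation: 40.9% lower, within 5%: False
```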

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that a single math-invariant transformation biases per-group integer rounding toward high-spectral-energy channels: each linear layer's weight matrix is WHT-rotated and its columns are rescaled by per-coordinate Walsh-basis activation energy before being passed to the Intel auto-round quantizer. The reported result is a 15-58% reduction in WikiText-2 perplexity relative to vanilla auto-round at W2A16 on four decoder-only models (135M–1.5B parameters), with architecture-specific extensions for Qwen attention and MoE layouts, a W4 negative control showing gains within noise, and confirmed OpenVINO export with device-invariant PPL.

Significance. If reproducible, the concrete perplexity numbers, W4 negative control, and hardware deployment results constitute a practical engineering contribution to extreme low-bit weight-only quantization. The explicit acknowledgment that the activation-energy link is intuitive rather than a transferred theorem from the companion paper (arXiv:2605.01637) appropriately frames the work as empirical rather than theoretical.

major comments (3)
  1. [Abstract and §3 (method description)] The precise definition and computation of the 'Walsh-basis activation energy' proxy (including calibration data, activation collection, and scaling formula) are not supplied with equations or pseudocode, preventing independent verification of the central claim that this rescaling biases auto-round rounding decisions.
  2. [Empirical evaluation] No ablation isolates the per-coordinate rescaling step from the WHT rotation itself, so the 15-58% PPL reduction cannot be attributed specifically to the influence-inspired component; the manuscript itself notes the link is intuitive and supplies no majorization argument or isolation experiment.
  3. [Future-work paragraph] Head-to-head results against SpinQuant, QuaRot, QuIP-sharp, AQLM, OmniQuant, and ButterflyQuant at matched calibration are deferred, leaving the practical advantage of BBT-spectral over existing rotation/quantization baselines unestablished despite the reported gains versus vanilla auto-round.
minor comments (1)
  1. [Abstract] The abstract references a 'TinyLlama-1.1B auxiliary data point' without stating the numerical result or exact model identifier.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment point by point below, agreeing where revisions are needed for clarity and reproducibility while defending the manuscript's scope on empirical contributions.

  1. Referee: [Abstract and §3 (method description)] The precise definition and computation of the 'Walsh-basis activation energy' proxy (including calibration data, activation collection, and scaling formula) are not supplied with equations or pseudocode, preventing independent verification of the central claim that this rescaling biases auto-round rounding decisions.

    Authors: We agree that explicit equations and pseudocode are required for the Walsh-basis activation energy proxy to support independent verification. The revised manuscript will expand §3 with a dedicated description of the calibration dataset, activation collection procedure, and scaling formula, including pseudocode for the full transformation pipeline.

    Revision: yes

  2. Referee: [Empirical evaluation] No ablation isolates the per-coordinate rescaling step from the WHT rotation itself, so the 15-58% PPL reduction cannot be attributed specifically to the influence-inspired component; the manuscript itself notes the link is intuitive and supplies no majorization argument or isolation experiment.

    Authors: The manuscript already states that the activation-energy link is intuitive rather than a formal transfer of the majorization argument from the companion paper, and the reported gains apply to the combined transformation. To strengthen attribution, the revision will add an ablation comparing WHT rotation alone against the full WHT-plus-rescaling pipeline on the same models and calibration.

    Revision: yes

  3. Referee: [Future-work paragraph] Head-to-head results against SpinQuant, QuaRot, QuIP-sharp, AQLM, OmniQuant, and ButterflyQuant at matched calibration are deferred, leaving the practical advantage of BBT-spectral over existing rotation/quantization baselines unestablished despite the reported gains versus vanilla auto-round.

    Authors: We acknowledge that matched-calibration comparisons would better establish relative advantages. These are explicitly noted as the primary future-work item because the current scope centers on gains versus vanilla auto-round plus the W4 negative control. The revision will expand the discussion paragraph to contextualize the reported results against the broader literature and restate the scope limitations.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical recipe stands on measured results.


The manuscript describes a concrete transformation (WHT rotation + per-coordinate activation-energy rescaling) applied before a standard reconstruction-error quantizer, then reports measured WikiText-2 perplexity reductions on four models. It explicitly disclaims formal transfer of any majorization argument from the companion paper, stating the link is intuitive and the contribution is engineering value. No equation or claim reduces a prediction to a fitted input by construction, no uniqueness theorem is invoked, and the self-citation is not load-bearing for any formal derivation. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of the described spectral preprocessing step; the activation-energy scales are data-dependent quantities computed during calibration rather than free parameters chosen by hand, and the link to Boolean influence is treated as an intuitive domain assumption.

free parameters (1)
  • Walsh-basis activation energy scales
    Per-coordinate values computed from activation statistics during calibration to rescale columns after rotation; treated as observed rather than fitted constants.
axioms (1)
  • domain assumption: Walsh-Hadamard rotation plus activation-energy rescaling biases the quantizer toward high-spectral-energy channels in a way that improves downstream perplexity
    Invoked as the mechanism for the reported gains; the abstract states the connection to the companion theory paper is intuitive rather than a transferred theorem.



Reference graph

Works this paper leans on

16 extracted references · 15 canonical work pages · 7 internal anchors

  1. G. Pavlov. The Banach-Butterfly Invariant: Influence-adaptive Walsh geometry for ternary polynomial threshold functions. arXiv:2605.01637, 2026.

  2. J. Lin, J. Tang, H. Tang, et al. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of MLSys, 2024. arXiv:2306.00978.

  3. W. Cheng, W. Zhang, et al. Optimize weight rounding via signed gradient descent for the quantization of LLMs. arXiv:2309.05516, 2023.

  4. Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, T. Blankevoort. SpinQuant: LLM quantization with learned rotations. arXiv:2405.16406, 2024.

  5. S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, J. Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In NeurIPS, 2024. arXiv:2404.00456.

  6. E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In ICLR, 2023. arXiv:2210.17323.

  7. J. Chee, Y. Cai, V. Kuleshov, C. De Sa. QuIP: 2-bit quantization of large language models with guarantees. In NeurIPS, 2023. arXiv:2307.13304.

  8. H. Wang, S. Ma, L. Dong, S. Huang, H. Wang, L. Ma, F. Yang, R. Wang, Y. Wu, F. Wei. BitNet: Scaling 1-bit transformers for large language models. arXiv:2310.11453, 2023.

  9. T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS, 2022. arXiv:2208.07339.

  10. G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, S. Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, 2023. arXiv:2211.10438.

  11. W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, P. Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. In ICLR,

  12. A. Tseng, J. Chee, Q. Sun, V. Kuleshov, C. De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. In ICML, 2024. arXiv:2402.04396.

  13. V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, D. Alistarh. Extreme compression of large language models via additive quantization. In ICML, 2024. arXiv:2401.06118.

  14. S. Ma, H. Wang, L. Ma, L. Wang, W. Wang, S. Huang, L. Dong, R. Wang, J. Xue, F. Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits. arXiv:2402.17764, 2024.

  15. C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, A. Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In NeurIPS, 2024. arXiv:2401.18079.

  16. B. Xu, Z. Dong, O. Elachqar, Y. Shang. ButterflyQuant: Ultra-low-bit LLM quantization through learnable orthogonal butterfly transforms. arXiv:2509.09679, 2025.