Recognition: 2 theorem links
Lean Theorem · Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
Pith reviewed 2026-05-10 18:27 UTC · model grok-4.3
The pith
A router-norm-based mixed-precision quantization scheme for MoE models achieves higher accuracy at lower inference cost than existing methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a theoretically grounded expert-wise mixed-precision strategy that assigns a bit-width to each expert primarily based on the change in its router L2 norm during training. Experts with smaller changes capture less frequent but critical features and require higher precision, because model performance is more sensitive to their quantization. Experts with large maximum intra-neuron variance are also given higher precision to limit quantization noise. This yields better accuracy than prior quantization schemes on large MoE models while cutting inference cost.
What carries the argument
Expert-wise bit-width allocation based on the magnitude of change in each expert's router L2 norm during training, combined with a check for maximum intra-neuron variance.
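To make the mechanism concrete, here is a minimal sketch of such an allocation rule. It is not the authors' implementation: the median split, the candidate bit-widths, and the zeta threshold are illustrative assumptions; only the two statistics themselves (change in router L2 norm, maximum intra-neuron weight variance) come from the paper's description.

```python
import numpy as np

def assign_bitwidths(router_norm_change, expert_weights,
                     low_bits=2, high_bits=4, zeta=3.0):
    """Hypothetical expert-wise bit-width allocation.

    router_norm_change: per-expert |change in router L2 norm| over training
        (a smaller change is read as a more sensitive expert).
    expert_weights: list of (out_features, in_features) weight matrices.
    zeta: illustrative variance threshold (the paper tunes a similar knob).
    """
    router_norm_change = np.asarray(router_norm_change)
    bits = np.full(len(expert_weights), low_bits)

    # Rule 1: experts whose router norm changed least get higher precision.
    sensitive = router_norm_change <= np.median(router_norm_change)
    bits[sensitive] = high_bits

    # Rule 2: experts whose worst neuron has large intra-neuron variance would
    # inject heavy quantization noise at low bits, so they are also promoted.
    for e, W in enumerate(expert_weights):
        max_intra_neuron_var = np.var(W, axis=1).max()
        if max_intra_neuron_var > zeta * np.var(W):
            bits[e] = high_bits
    return bits
```

A real allocator would also have to respect a target average bit budget per expert; the sketch only shows how the two signals interact.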
If this is right
- Quantized MoE models incur lower memory and computation costs at inference time.
- The bit-width decisions add only negligible overhead.
- Accuracy exceeds that of uniform quantization and other mixed-precision baselines on Switch Transformer and Mixtral.
- The approach supplies theoretical generalization guarantees for the quantized models.
Where Pith is reading between the lines
- Training dynamics captured by router norms could serve as a signal for identifying sensitive parameters in other neural architectures.
- The method might be combined with quantization-aware fine-tuning to push bit widths even lower.
- Similar norm-change heuristics could extend to pruning or other compression techniques in sparse models.
Load-bearing premise
That smaller changes in an expert's router L2 norm during training indicate it captures critical infrequent features to which the model's performance is particularly sensitive.
What would settle it
Either outcome would settle it: quantizing low router-norm-change experts to low bit-widths and finding smaller accuracy drops than for high-change experts would refute the sensitivity premise, while the proposed allocation failing to beat uniform quantization on the same models would undercut the practical claim.
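A minimal sketch of that test, assuming a list of per-expert weight matrices, a per-expert router-norm-change array, and an `evaluate` function returning task accuracy for a given set of expert weights; the symmetric uniform quantizer and the 25% split are illustrative choices, not the paper's protocol.

```python
import numpy as np

def quantize_uniform(W, bits):
    """Symmetric uniform quantizer (illustrative, not the paper's scheme)."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

def sensitivity_probe(experts, router_norm_change, evaluate, bits=2, frac=0.25):
    """Accuracy drop from low-bit quantization of low-change vs high-change experts."""
    order = np.argsort(router_norm_change)   # ascending: smallest change first
    k = max(1, int(frac * len(order)))
    baseline = evaluate(experts)
    drops = {}
    for name, idx in (("low_change", order[:k]), ("high_change", order[-k:])):
        quantized = list(experts)
        for e in idx:
            quantized[e] = quantize_uniform(experts[e], bits)
        drops[name] = baseline - evaluate(quantized)
    return drops  # the premise predicts drops["low_change"] > drops["high_change"]
```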
Original abstract
Sparse Mixture-of-Experts (MoE) allows scaling of language and vision models efficiently by activating only a small subset of experts per input. While this reduces computation, the large number of parameters still incurs substantial memory overhead during inference. Post-training quantization has been explored to address this issue. Because uniform quantization suffers from significant accuracy loss at low bit-widths, mixed-precision methods have been recently explored; however, they often require substantial computation for bit-width allocation and overlook the varying sensitivity of model performance to the quantization of different experts. We propose a theoretically grounded expert-wise mixed precision strategy that assigns bit-width to each expert primarily based on their change in routers l2 norm during training. Experts with smaller changes are shown to capture less frequent but critical features, and model performance is more sensitive to the quantization of these experts, thus requiring higher precision. Furthermore, to avoid allocating experts to lower precision that inject high quantization noise, experts with large maximum intra-neuron variance are also allocated higher precision. Experiments on large-scale MoE models, including Switch Transformer and Mixtral, show that our method achieves higher accuracy than existing approaches, while also reducing inference cost and incurring only negligible overhead for bit-width assignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a mixed-precision post-training quantization method for Sparse Mixture-of-Experts (MoE) models. Bit-width assignment to each expert is determined primarily by the change in its router L2 norm over training (smaller changes receive higher precision on the grounds that such experts capture infrequent but critical features) together with a secondary rule that assigns higher precision to experts with large maximum intra-neuron variance. The approach is asserted to carry theoretical generalization guarantees and is reported to outperform prior quantization schemes on Switch Transformer and Mixtral while lowering inference cost with negligible assignment overhead.
Significance. If the proxy relationship between router-norm change and quantization sensitivity can be placed on a rigorous footing and the promised generalization bounds derived, the method would supply a low-overhead, theoretically motivated route to memory-efficient inference for large MoE architectures. The empirical evaluation on production-scale models is a necessary first step, yet the absence of supporting derivations and experimental controls currently limits the strength of that claim.
major comments (3)
- [Abstract] Abstract: the claim of 'theoretical generalization guarantees' is unsupported; no derivation, proof sketch, or quantitative link is supplied that connects observed router L2-norm change to either the magnitude of introduced quantization error or any modification of a generalization bound.
- [Method] Method description: the allocation rule is defined directly from observable training statistics (router L2 change and intra-neuron variance) rather than from a parameter that is subsequently shown to control a bound; consequently the performance gain is not reduced to a non-circular expression of the input data.
- [Experiments] Experiments: results are summarized without error bars, explicit baseline implementations, or ablation tables that isolate the contribution of the router-norm rule versus the variance rule, preventing assessment of whether the reported accuracy advantage is robust.
minor comments (1)
- [Abstract] Abstract contains minor grammatical issues ('routers l2 norm' should read 'router L2 norm'; 'allocating experts to lower precision' is awkward).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and describe the revisions we will implement.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'theoretical generalization guarantees' is unsupported; no derivation, proof sketch, or quantitative link is supplied that connects observed router L2-norm change to either the magnitude of introduced quantization error or any modification of a generalization bound.
Authors: We acknowledge that the current manuscript does not supply an explicit derivation or proof sketch supporting the generalization guarantees claim. The router L2-norm change is motivated as a proxy for expert sensitivity based on activation patterns in MoE models. In the revised version we will add a proof sketch in the appendix that quantitatively connects the observed router L2-norm change to the magnitude of quantization error and its effect on a standard generalization bound via perturbation analysis (a schematic version of this kind of argument is sketched after these responses). revision: yes
-
Referee: [Method] Method description: the allocation rule is defined directly from observable training statistics (router L2 change and intra-neuron variance) rather than from a parameter that is subsequently shown to control a bound; consequently the performance gain is not reduced to a non-circular expression of the input data.
Authors: The allocation rule is derived from training statistics chosen because they reflect expert importance. We will revise the method section to frame the router L2-norm change explicitly as a sensitivity parameter inside a generalization bound, showing that the bit-width assignment minimizes an upper bound on generalization error. This will clarify that the performance improvement follows from a non-circular, theoretically controlled expression rather than purely empirical observation. revision: yes
-
Referee: [Experiments] Experiments: results are summarized without error bars, explicit baseline implementations, or ablation tables that isolate the contribution of the router-norm rule versus the variance rule, preventing assessment of whether the reported accuracy advantage is robust.
Authors: We agree that the experimental presentation lacks sufficient controls. The revised manuscript will include error bars from multiple random seeds, detailed descriptions (or code references) for reproducing all baselines, and dedicated ablation tables that isolate the router-norm rule from the intra-neuron variance rule. These additions will allow direct assessment of robustness. revision: yes
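For orientation, the schematic perturbation argument referenced in the first response could take the following generic shape. This is an editorial sketch under standard assumptions (symmetric uniform quantization, output Lipschitz in each expert's weights, labels in {±1}), not the authors' forthcoming derivation, and L_h, R_h are placeholder constants.

```latex
% Quantizing expert h's m x d weight matrix to b_h bits over the range [-R_h, R_h]
% moves each entry by at most \Delta_h / 2, where \Delta_h = 2 R_h / (2^{b_h} - 1):
\[
  \|\Delta W_h\|_F \le \sqrt{m d}\,\frac{R_h}{2^{b_h} - 1},
  \qquad
  |f_Q(x) - f(x)| \le \sum_h L_h \,\|\Delta W_h\|_F =: \varepsilon .
\]
% A margin-style bound then degrades by exactly this perturbation budget:
\[
  \Pr_{(x,y)\sim\mathcal{D}}\!\big[\, y\, f_Q(x) \le 0 \,\big]
  \;\le\;
  \Pr_{(x,y)\sim\mathcal{D}}\!\big[\, y\, f(x) \le \varepsilon \,\big],
\]
% so, at a fixed average bit budget, assigning larger b_h to experts with larger
% sensitivity L_h (proxied in the paper by a small change in router L2 norm)
% minimizes \varepsilon and hence the degradation of the bound.
```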
Circularity Check
No significant circularity; allocation rule uses direct training observables without reducing claims to fitted inputs or self-citation chains
full rationale
The bit-width assignment is defined directly from measured router L2-norm changes and intra-neuron variance statistics collected during training. These quantities are independent observables rather than parameters fitted to the target accuracy metric and then re-labeled as predictions. The paper's empirical results on Switch Transformer and Mixtral supply external validation of the accuracy gains, and no equation or self-citation is shown to make the claimed generalization bound equivalent to the input statistics by construction. The theoretical grounding is asserted but does not collapse the central construction into a tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Router L2-norm change during training indicates how critical an expert's features are and how sensitive model performance is to its quantization.
- domain assumption Experts with large maximum intra-neuron variance inject high quantization noise if assigned low precision.
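A standard back-of-the-envelope version of the second assumption, for orientation; the notation is generic rather than the paper's.

```latex
% Uniform b-bit quantization of a neuron whose weights span a range R uses step
% \Delta = R / (2^b - 1); modeling rounding error as uniform on [-\Delta/2, \Delta/2]:
\[
  \mathbb{E}\big[(w - Q_b(w))^2\big] \;\approx\; \frac{\Delta^2}{12}
  \;=\; \frac{R^2}{12\,(2^b - 1)^2}.
\]
% R grows with the spread of weights inside the neuron, so an expert whose worst
% neuron has large intra-neuron variance suffers large noise at small b; each
% extra bit cuts this mean squared error by roughly a factor of four.
```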
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: "assigns bit-width to each expert primarily based on their change in router’s l2 norm during training. Experts with smaller changes are shown to capture less frequent but critical features"
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: Theorem 4.4 … $b_h \ge \log_2\!\left(1 + \Omega\!\left(d\sqrt{\log(kmd^2)}\,/\,(l^2 \log l)\right)\right)$ … generalization $\Pr\!\left[\forall (x,y)\sim\mathcal{D}:\; y\, f_Q^{(T)}(x) > 0\right] = 1$
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URL: https://openreview.net/forum?id=Uuf2q9TfXGA. Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019. James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan Puigcerver, Neil Houlsby, Ghassen Jerfel, Vinc...
-
[2]
As we can see, the performance peaks around ζ = 3.0, which justifies our selection. Table 5 (avg. accuracy %, at an average of 2.0 bits/expert): ζ = 1.0 → 60.44; ζ = 2.0 → 61.56; ζ = 2.5 → 62.28; ζ = 3.0 → 62.56; ζ = 4.0 → 61.32; ζ = 5.0 → 61.74. Appendix D: more results on Mixtral 8x7B. Figure 7: Expert-wise mixed-precision results for Mixtral 8x7B on eight benchmark LLM tasks; expert bit-choices: ...
-
[3]
$\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ hide the factor $\log(\mathrm{poly}(d))$ with a sufficiently large polynomial $\mathrm{poly}(\cdot)$.
-
[4]
With high probability (abbreviated as w.h.p.) refers to probability $1 - \frac{1}{\mathrm{poly}(d)}$. Definitions: for any $q \in \mathcal{P}\setminus\{o_1, o_2\}$, we define the activation of the expert $s \in [k]$ by $q$ as $\sigma_q^{(s,t)} := \sum_{r=1}^{m} \mathrm{ReLU}\!\left(\langle w_r^{(s,t)}, q \rangle\right)$. For any $v \in \mathcal{P}_r$, we define a complementary expert proficiency measure for the expert $s$ at time $t$ as, ...
2026
-
[5]
… $\log l + O\!\left(\tfrac{l^2}{d}\log l\right)$ as $\langle w_s^{(T-1)}, q - q' \rangle \le \tfrac{1}{\sqrt{2}}\log l + O\!\left(\tfrac{l^2}{d}\log l\right)$. This creates a contradiction. Therefore, for all $T' \le t' \le T$ and every task-irrelevant pattern $q$, $\langle w_s^{(t')}, o_1 - q \rangle \le 2\log l$. Now $\langle w_s^{(T)}, o_1 \rangle > \tfrac{3}{2}\log l$, so $\langle w_s^{(T)}, -o_1 \rangle < -\tfrac{3}{2}\log l$. Therefore, for any $q \in \mathcal{P}\setminus\{o_1, o_2\}$, $\langle w_s^{(T')}, -o_1 - q \rangle < -\tfrac{3}{2}\log l - \langle w_s^{(T)}, q \rangle$. On the other hand, ...
2026
-
[6]
Furthermore, from statement (iii) of Lemma I.1, $\sigma_{o_1}^{(s_1,T)} = \Omega\!\left(\tfrac{m l C_s^2 \log l}{C_p}\right)$. Therefore, for any $(x, y) \sim \mathcal{D}$ such that $\exists j \in [n]$ with $x^{(j)} = o_1$, $\sum_{s_1 \in S_{o_1}} f_{s_1}^{(T)}(x) = \Omega\!\left(\gamma_{o_1}\tfrac{m l C_s^2 \log l}{C_p}\right)$. On the other hand, from statement (i) of Lemma I.1, for any $s_3 \in S_{-o_1}$, $p_{o_1}^{(s_3,T)} = 0$. Therefore, for any $(x, y) \sim \mathcal{D}$ such that $\exists j \in [n]$ with $x^{(j)} = o_1$, $\sum_{s \in S_+} f_s^{(T)}(x) = \Omega(\gamma_{o_1}$ ...
2026
discussion (0)