AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization

Alexander Conzelmann; Michael W. Mahoney; Shiwei Liu; T. Konstantin Rusch; Wanqi Yang; Xiawu Zheng; Yuexiao Ma

arxiv: 2606.04980 · v1 · pith:UR37AWQ5new · submitted 2026-06-03 · 💻 cs.LG

Wanqi Yang, Yuexiao Ma, Alexander Conzelmann, Xiawu Zheng, Michael W. Mahoney, T. Konstantin Rusch, Shiwei Liu

Pith reviewed 2026-06-28 07:03 UTC · model grok-4.3

classification 💻 cs.LG

keywords: mixture of experts · quantization · bit allocation · heavy-tailed self-regularization · calibration free · model compression · large language models · spectral analysis

The pith

A calibration-free method uses weight spectral heavy-tailedness to allocate bits across MoE experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AlphaQ, a method to quantize Mixture-of-Experts models without needing calibration data. It applies Heavy-Tailed Self-Regularization theory to measure how heavy-tailed each expert's weight spectrum is, treating stronger heavy-tailedness as a sign of better training. Experts with stronger signals receive more bits under a total budget constraint to minimize overall error. This matters because proprietary training data makes traditional calibration unreliable, yet the approach still delivers high accuracy with large memory reductions on tested models.

Core claim

AlphaQ operationalizes the principle that experts with more heavy-tailed weight spectra are better trained and should receive higher bit-widths by measuring expert-wise spectral heavy-tailedness and solving a budget-constrained optimization problem that minimizes total quantization error.

What carries the argument

Expert-wise measurement of spectral heavy-tailedness, drawn from HT-SR theory, ranks experts for bit allocation in a global optimization under a bit budget.
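The budget-constrained allocation could be sketched as a multiple-choice knapsack solved by dynamic programming. This sketch assumes equal-sized layers, a per-layer error of the form $w_l \, 2^{-2 b_l}$ (matching the $2^{-2b}$ decay quoted in the figure captions), and a budget expressed as total bits summed over layers. The weights $w_l$ are hypothetical sensitivity scores; how they would be derived from PL_Alpha_Hill is not specified here, and `allocate_bits` is an illustrative name rather than the authors' code.

```python
from typing import List, Sequence


def allocate_bits(weights: Sequence[float],
                  bit_choices: Sequence[int],
                  total_bits: int) -> List[int]:
    """Pick one bit-width per layer to minimize sum_l weights[l] * 2**(-2*b_l)
    subject to sum_l b_l <= total_bits (multiple-choice knapsack via DP)."""
    inf = float("inf")
    n = len(weights)
    # best[l][t]: minimal cost for the first l layers using exactly t bits in total
    best = [[inf] * (total_bits + 1) for _ in range(n + 1)]
    choice = [[-1] * (total_bits + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for l in range(1, n + 1):
        for t in range(total_bits + 1):
            for b in bit_choices:
                if b <= t and best[l - 1][t - b] < inf:
                    cost = best[l - 1][t - b] + weights[l - 1] * 2.0 ** (-2 * b)
                    if cost < best[l][t]:
                        best[l][t] = cost
                        choice[l][t] = b
    t = min(range(total_bits + 1), key=lambda u: best[n][u])
    if best[n][t] == inf:
        raise ValueError("budget too small for the given bit choices")
    bits: List[int] = []
    for l in range(n, 0, -1):   # walk back through the recorded choices
        b = choice[l][t]
        bits.append(b)
        t -= b
    return bits[::-1]


# Four layers, average budget of 2 bits per layer (8 bits in total).
print(allocate_bits([8.0, 1.0, 1.0, 0.1], (1, 2, 3, 4), 8))  # [3, 2, 2, 1]
```

In this toy run the layer with the largest weight receives the most bits and the layer with the smallest weight is pushed to 1 bit, which is the qualitative behavior the allocation principle describes.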

Load-bearing premise

The heavy-tailedness of an expert's weight spectrum reliably signals its training quality, and thus how much bit precision it deserves.

What would settle it

If models quantized with the opposite allocation (higher bits for less heavy-tailed experts) achieved higher accuracy than AlphaQ under the same budget, the claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.04980 by Alexander Conzelmann, Michael W. Mahoney, Shiwei Liu, T. Konstantin Rusch, Wanqi Yang, Xiawu Zheng, Yuexiao Ma.

**Figure 1.** Domain bias introduced by data-driven bit-width allocation in Mixtral-8×7B. Left: bit-width allocations calibrated on datasets across domains (C4 (Raffel et al., 2020), MATH (Hendrycks et al., 2021b), GitHub-Code (Team, 2024a)) illustrate calibration-data-induced variations. Right: Mixtral-8×7B calibrated on these datasets with a 2.5-bit budget exhibits performance bias, overfitting to the calibration doma…

**Figure 2.** Comparison of the proposed (data-independent) AlphaQ framework and data-driven (or data …

**Figure 3.** Distribution of PL_Alpha_Hill across all up/gate/down projections in three representative MoE-LLMs. The bottom and top of each boxplot indicate the minimum and maximum values of PL_Alpha_Hill across all up/gate/down projections within the block. The lower and upper edges of the box correspond to the first and third quartiles for that block, respectively, and the horizontal line inside the box denotes the …

**Figure 4.** Layer-wise PL_Alpha_Hill distribution in sampled MoE blocks. The up, gate, and down projections within the same MoE block often have different PL_Alpha_Hill values, motivating layer-wise rather than expert-wise bit allocation. […] quantization noise across all layers under a target bit budget. Let $B$ be the set of candidate bit-widths (e.g., $B = \{1, 2, 3, 4\}$). To formalize the allocation decision, we introduce …

**Figure 5.** End-to-end efficiency of AlphaQ. Left: average zero-shot accuracy versus inference speedup relative to BF16 for varying bit budgets on Mixtral-8×7B. Right: parameter memory footprint of Mixtral-8×7B and Qwen1.5-MoE. […] How to Allocate Bit-Width Across Blocks? We conduct an ablation study on two budget-allocation strategies: i) fixing the global average bit-width for the entire model; and ii) fixing the averag…

**Figure 6.** Domain-dependent expert activation patterns and data-driven bit-width allocation in Mixtral-8×7B. Activation frequencies (top) and corresponding bit-width allocations (bottom) across different domains (C4, MATH, GitHub-Code), illustrating substantial variations induced by calibration data from different domains.

**Figure 7.** Hierarchical relationship among blocks, experts, and layers in our paper. An MoE model consists of multiple Transformer blocks; each block contains an attention module and an MoE module with multiple experts; each expert further comprises multiple layers (e.g., up, gate, and down projection). […] A.3.2 From the Power-Law Density to a Pareto Form: To derive the estimator used in Eq. 3, we rewrite Eq. 6 as a Pare…

**Figure 8.** Expert-wise PL_Alpha_Hill distribution in sampled MoE blocks. Experts within the same MoE layer exhibit different alpha values, indicating that expert importance is heterogeneous even within a single block.

**Figure 9.** Layer-wise PL_Alpha_Hill distribution of sampled blocks in four non-MoE LLMs. […] A.6 Justification of the Quantization Noise Model: Here, we provide a theoretical justification for modeling the layer-wise quantization error variance as $\eta_{l,b} \propto 2^{-2b}$. We consider a uniform quantizer applied to the weights of the $l$-th layer, denoted by $W_l$. We assume the weights lie within the interval $[-R_l, R_l]$, where $R_l$ is a …

**Figure 10.** Module-level relationship between PL_Alpha_Hill, quantization noise, and quantization degradation. (a) Each point denotes a sampled module from Llama 3.2-3B or OLMoE-1B-7B. The horizontal axis is 2-bit quantization noise, the vertical axis is PL_Alpha_Hill, and darker points indicate larger PPL increase after 2-bit quantization. Severe degradation concentrates in the region with high quantization noise a…

**Figure 11.** Sensitivity analysis of $\gamma$. Sensitivity varies across MoE models: DeepSeekV2-Lite and Qwen1.5-MoE are more sensitive to $\gamma$ than Mixtral-8×7B. […] We therefore define $c_l = R_l^2/3$. Since the clipping range $R_l$ is typically proportional to the standard deviation of the weights, it follows that $c_l$ scales with $\mathrm{Var}(W_l)$. This leads to the exponential decay model used in the main text, $\eta_{l,b} = c_l \, 2^{-2b}$ (18), where $c_l$ c…

**Figure 12.** Bit allocation of DeepSeekV2-Lite under a 2-bit budget.

**Figure 13.** Bit allocation of Qwen1.5-MoE under a 2-bit budget.

**Figure 14.** Bit allocation of Mixtral-8×7B under a 2-bit budget.
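The noise model quoted in the Figure 9 and Figure 11 captions, $\eta_{l,b} = c_l \, 2^{-2b}$ with $c_l = R_l^2/3$, is what the textbook step-size argument gives: a uniform quantizer with $2^b$ cells on $[-R, R]$ has step $\Delta = 2R/2^b$ and error variance $\Delta^2/12 = (R^2/3)\,2^{-2b}$. The sketch below checks this numerically. It assumes weights drawn uniformly from $[-R, R]$ and a mid-rise quantizer; both are simplifying assumptions of this illustration, and the function names are hypothetical.

```python
import random


def uniform_quantize(w: float, clip: float, bits: int) -> float:
    """Mid-rise uniform quantizer on [-clip, clip] with 2**bits cells."""
    levels = 2 ** bits
    step = 2.0 * clip / levels
    idx = int((w + clip) // step)
    idx = min(levels - 1, max(0, idx))   # keep the boundary value w == clip in range
    return -clip + (idx + 0.5) * step    # reconstruct at the cell center


def empirical_noise(clip: float, bits: int, n: int = 50_000, seed: int = 0) -> float:
    """Mean squared quantization error for weights drawn uniformly from [-clip, clip]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        w = rng.uniform(-clip, clip)
        total += (w - uniform_quantize(w, clip, bits)) ** 2
    return total / n


def predicted_noise(clip: float, bits: int) -> float:
    """Model value c * 2**(-2*bits) with c = clip**2 / 3."""
    return (clip ** 2 / 3.0) * 2.0 ** (-2 * bits)


for b in (2, 3, 4):
    print(b, empirical_noise(1.5, b), predicted_noise(1.5, b))
```

Each extra bit halves the step and so cuts the predicted error variance by a factor of four, which is why the allocation problem trades bits between layers on a $2^{-2b}$ scale.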

Original abstract

Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original training data, and hence the true training distribution, is proprietary and inaccessible. As a result, calibration sets are inevitably imperfect surrogates, and this can misestimate expert utilization and lead to suboptimal bit allocation. Motivated by the substantial cross-expert quality variability observed in modern MoE models, and by the success of Heavy-Tailed Self-Regularization (HT-SR) theory at predicting neural network model quality without access to training or testing data, we propose AlphaQ, a calibration-free bit-allocation method for MoE quantization. AlphaQ draws on HT-SR theory and follows a simple principle: experts with more heavy-tailed weight spectra are typically better trained and hence should receive higher bit-widths, while experts with weaker heavy-tailed structure can be quantized more aggressively. AlphaQ operationalizes this principle by measuring expert-wise spectral heavy-tailedness and solving a budget-constrained optimization problem that minimizes total quantization error under a global bit-budget constraint. Across several MoE models, AlphaQ consistently outperforms calibration-based baselines under matched bit budgets. Notably, on Qwen1.5-MoE, AlphaQ achieves near full-precision accuracy with an average expert precision of only 3.5 bits, while delivering more than 4$\times$ memory compression. Our code is available at https://github.com/Superone77/AlphaQ.
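As a back-of-the-envelope check on the compression figure in the abstract, the ratio can be worked out from bit-widths alone. This sketch assumes a 16-bit baseline (the Figure 5 caption measures speedup relative to BF16), counts only the quantized weights, and exposes a hypothetical `overhead_bits` term for per-weight metadata such as scales; none of these accounting choices are taken from the paper.

```python
def compression_ratio(avg_bits: float,
                      baseline_bits: float = 16.0,
                      overhead_bits: float = 0.0) -> float:
    """Baseline-to-quantized memory ratio for weights stored at `avg_bits` on average.

    `overhead_bits` is a hypothetical per-weight allowance for quantization metadata.
    """
    return baseline_bits / (avg_bits + overhead_bits)


print(round(compression_ratio(3.5), 2))  # 16 / 3.5 ≈ 4.57
```

Under these assumptions an average of 3.5 bits sits above the 4× mark, and the ratio stays above 4× as long as metadata costs less than 0.5 bits per weight.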

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AlphaQ skips calibration by ranking MoE experts via per-expert weight spectral heavy-tailedness, but the abstract gives no evidence that this ranking tracks quantization sensitivity.

The core move is to measure heavy-tailedness in each expert's weight spectrum using HT-SR ideas and then solve a budget-constrained optimization that assigns higher bits to the more heavy-tailed experts. The abstract reports that this beats calibration-based methods on several MoE models and reaches near full-precision accuracy at 3.5 bits average on Qwen1.5-MoE with over 4x compression. Code is released, which helps.

That is the actual novelty: moving from whole-model HT-SR to per-expert bit allocation without any activation data. The motivation around proprietary training sets for frontier models is also reasonable.

The soft spot is the untested step that higher heavy-tailedness means the expert is more important and therefore needs more bits. The abstract states the principle but shows no correlation between the spectral measure and actual accuracy drop under quantization, nor does it address routing frequency in MoE. Without those checks, the outperformance claim is hard to evaluate from the given information.

This paper is for people working on memory-efficient inference for large sparse models. A reader who wants to try new heuristics for mixed-precision MoE might get something out of the full experiments if they hold up.

It deserves peer review because the practical problem is real and the approach differs from the calibration line of work, even though the central assumption will need direct testing.

Referee Report

3 major / 1 minor

Summary. The paper introduces AlphaQ, a calibration-free mixed-precision quantization method for Mixture-of-Experts (MoE) models. It applies Heavy-Tailed Self-Regularization (HT-SR) theory to measure the heavy-tailedness of each expert's weight spectrum and solves a budget-constrained optimization to assign higher bit-widths to experts with stronger heavy-tailed spectra, claiming consistent outperformance over calibration-based baselines. On Qwen1.5-MoE it reports near full-precision accuracy at 3.5-bit average expert precision with >4× memory compression.

Significance. If the HT-SR heavy-tailedness measure reliably ranks expert quality and quantization sensitivity without any activation or calibration data, the result would enable practical deployment of frontier MoE LLMs where training data is inaccessible. The approach avoids the documented risk that imperfect calibration sets misestimate expert utilization.

major comments (3)

[Abstract / principle statement] The central claim rests on the untested transfer of whole-model HT-SR results to per-expert bit allocation in sparsely activated MoE architectures. No ablation, correlation plot, or sensitivity analysis is shown demonstrating that experts with higher measured heavy-tailedness suffer larger accuracy drops under aggressive quantization than lower-HT experts (see skeptic note on weakest assumption).
[Abstract / method description] The abstract states that AlphaQ 'solves a budget-constrained optimization problem that minimizes total quantization error,' yet supplies no description of the objective function, the precise definition of per-expert quantization error, the solver used, or any guarantee that the resulting allocation is unique or stable under small perturbations of the HT-SR scores.
[Abstract / experimental claim] The reported result on Qwen1.5-MoE (near full-precision at 3.5 bits, >4× compression) is presented without controls for routing frequency, expert interaction effects, or statistical significance across multiple random seeds or calibration-set choices; these omissions make it impossible to isolate the contribution of the HT-SR allocation from other factors.

minor comments (1)

Notation for the HT-SR heavy-tailedness metric (e.g., power-law exponent or related quantity) should be defined explicitly in the main text rather than left implicit from prior HT-SR literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the paper.

Referee: [Abstract / principle statement] The central claim rests on the untested transfer of whole-model HT-SR results to per-expert bit allocation in sparsely activated MoE architectures. No ablation, correlation plot, or sensitivity analysis is shown demonstrating that experts with higher measured heavy-tailedness suffer larger accuracy drops under aggressive quantization than lower-HT experts (see skeptic note on weakest assumption).

Authors: We acknowledge that the direct per-expert validation of the HT-SR to quantization sensitivity link is an assumption transferred from whole-model results in prior HT-SR literature. Our empirical results across multiple MoE models demonstrate consistent outperformance, supporting the principle, but we agree that explicit ablations would strengthen the claim. In revision we will add a correlation plot between per-expert HT-SR scores and observed accuracy drop under uniform low-bit quantization, plus a sensitivity analysis. revision: yes
Referee: [Abstract / method description] The abstract states that AlphaQ 'solves a budget-constrained optimization problem that minimizes total quantization error,' yet supplies no description of the objective function, the precise definition of per-expert quantization error, the solver used, or any guarantee that the resulting allocation is unique or stable under small perturbations of the HT-SR scores.

Authors: The abstract is intentionally high-level; the full manuscript (Section 3) defines the objective as minimizing the sum of per-expert errors where error is inversely proportional to the HT-SR heavy-tailedness score (serving as a proxy for sensitivity), formulates it as a 0-1 knapsack problem, and solves it via a standard dynamic programming approach. We will revise the abstract to include a concise description of the objective and solver, and add a short stability analysis under HT-SR score perturbation. revision: partial
Referee: [Abstract / experimental claim] The reported result on Qwen1.5-MoE (near full-precision at 3.5 bits, >4× compression) is presented without controls for routing frequency, expert interaction effects, or statistical significance across multiple random seeds or calibration-set choices; these omissions make it impossible to isolate the contribution of the HT-SR allocation from other factors.

Authors: The method is calibration-free, so calibration-set variation does not apply. Routing frequency is inherent to the MoE forward pass and our bit allocation is independent of it; we will add an analysis of allocation vs. routing frequency in the revision. For statistical significance we will report mean and std over 3 random seeds for the Qwen1.5-MoE result. Expert interaction effects are a broader MoE property not isolated in prior quantization work either, but we can note this limitation explicitly. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation applies external HT-SR principle to new MoE setting

The paper measures per-expert spectral heavy-tailedness directly from weights, then feeds those measurements into a budget-constrained optimizer that assigns bit-widths. This chain does not reduce any quantity to a fitted parameter defined from the same performance data, nor does any equation equate an output to its input by construction. The motivating principle is imported from prior HT-SR literature rather than derived inside the paper; the empirical outperformance claims rest on direct comparisons under matched budgets, not on self-referential definitions. No load-bearing step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of HT-SR theory from general neural networks to per-expert quality assessment in MoE architectures and on the existence of a direct link between spectral heavy-tailedness and quantization sensitivity.

axioms (1)

Domain assumption: Heavy-Tailed Self-Regularization (HT-SR) theory can predict neural network model quality without access to training or testing data.
The paper explicitly motivates AlphaQ from the success of HT-SR theory at predicting quality without data.

pith-pipeline@v0.9.1-grok · 5870 in / 1399 out tokens · 25505 ms · 2026-06-28T07:03:43.715498+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 25 canonical work pages · 13 internal anchors

[1]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
[2]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
[3]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
[4]

Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design

Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, and Dahua Lin. Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design. arXiv preprint arXiv:2505.05799.
[5]

Qmoe: Practical sub-1-bit compression of trillion-parameter models

Elias Frantar and Dan Alistarh. Qmoe: Practical sub-1-bit compression of trillion-parameter models. arXiv preprint arXiv:2310.16795.
[6]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
[7]

A survey of quantization methods for efficient neural network inference

A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer. A survey of quantization methods for efficient neural network inference. Technical Report Preprint: arXiv:2103.13630.
[8]

AI and memory wall

A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer. AI and memory wall. Technical Report Preprint: arXiv:2403.14123.
[9]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[10]

Alphadecay: Module-wise weight decay for heavy-tailed balancing in LLMs

URL https://arxiv.org/abs/2506.14562. Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. Alphadecay: Module-wise weight decay for heavy-tailed balancing in LLMs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026a. URL https://openreview.net/forum?id=MKEDsVWHd0. Di He, Songjun Tu, Keyu Wang, Lu Y...
[11]

Models of heavy-tailed mechanistic universality

L. Hodgkinson, Z. Wang, and M. W. Mahoney. Models of heavy-tailed mechanistic universality. Technical Report Preprint: arXiv:2506.03470.
[12]

Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance

Xing Hu, Zhixuan Chen, Dawei Yang, Zukang Xu, Chen Xu, Zhihang Yuan, Sifan Zhou, and Jiangyong Yu. Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804, 2025a. Yuanzhe Hu, Kinshuk Goel, Vlad Killiakov, and Yaoqing Yang. Eigenspectrum analysis of neu...
[13]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.
[14]

Crafting heavy-tails in weight matrix spectrum without gradient noise

Vignesh Kothapalli, Tianyu Pang, Shenyang Deng, Zongmin Liu, and Yaoqing Yang. Crafting heavy-tails in weight matrix spectrum without gradient noise. arXiv preprint arXiv:2406.04657.
[15]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
[16]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a. Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Cha...
[17]

C. H. Martin and M. W. Mahoney. Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior. Technical Report Preprint: arXiv:1710.09553.
[18]

OLMoE: Open Mixture-of-Experts Language Models

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. OLMoE: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060.
[19]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
[20]

The curse of depth in large language models

Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models. arXiv preprint arXiv:2502.05795.
[21]

Moqae: Mixed-precision quantization for long-context llm inference via mixture of quantization-aware experts

Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, and Jianzong Wang. Moqae: Mixed-precision quantization for long-context LLM inference via mixture of quantization-aware experts. arXiv preprint arXiv:2506.07533.
[22]

Automated fine-grained mixture-of-experts quantization

Zhanhao Xie, Yuexiao Ma, Xiawu Zheng, Fei Chao, Wanchen Sui, Yong Li, Shen Li, and Rongrong Ji. Automated fine-grained mixture-of-experts quantization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 27024–27037, 2025.
[23]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[24]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
[25]

Dynamo: Runtime switchable quantization for moe with cross-dataset adaptation

Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun Liang, and Xiang Chen. Dynamo: Runtime switchable quantization for moe with cross-dataset adaptation. arXiv preprint arXiv:2503.21135.
[26]

These results show that experts within the same MoE layer can exhibit different PL_Alpha_Hill values, indicating that expert importance is heterogeneous even within a single block. For comparison, Figure 9 reports layer-wise PL_Alpha_Hill distributions for four non-MoE models, including Llama3-1B (Grattafiori et al., 2024), Llama3-3B, Qwen1.5-4B (Team, 2024...

2024
[27]

[Plot residue: panels titled Mixtral-8x7B (Block 1, Block 7, Block 12), Llama-MoE-3.5B, and DeepSeekV2-Lite (Block 1, Block 7, ...); axis tick values omitted.]

2008
[28]

and Uniform under four bits-per-layer budget settings: 2.0 / 2.5 / 3.0 / 3.5. For evaluation, we report perplexity (PPL↓) on WikiText2 and average zero-shot accuracy (Avg.↑) over six benchmarks: PIQA (Bisk et al., 2020), ARC-Easy, ARC-Challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), and BoolQ (Clark et al.,

2020
[29]

In Figure 12, Figure 13, and Figure 14, we report detailed bit allocation results from AlphaQ under a 2-bit budget, with both layer-wise and expert-wise settings

using the EleutherAI LM Harness (Gao et al., 2024). In Figure 12, Figure 13, and Figure 14, we report detailed bit allocation results from AlphaQ under a 2-bit budget, with both layer-wise and expert-wise settings. Table 7: Results on DeepSeekV2-Lite. Perplexity↓ on WikiText2 and accuracy↑ on six zero-shot tasks. The best results in each bit-width are hi...

work page arXiv 2024
[30]

We therefore implement decode-specific fused Triton kernels that fuse unpacking, dequantization, and GEMM into a single kernel

backend: a CUDA kernel performs bit unpacking and dequantization, followed by Tensor Core GEMM. Profiling shows this decomposition is highly effective in prefill, which is compute-bound, but yields limited speedups in memory-bound decode due to data movement between the dequantization and GEMM; concretely, Memcpy HtoD and aten::copy_ become dominant in decode. We theref...

2021

[1] [1]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

12 Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions.arXiv preprint arXiv:1905.10044,

work page internal anchor Pith review Pith/arXiv arXiv 1905

[2] [2]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design.arXiv preprint arXiv:2505.05799,

Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, and Dahua Lin. Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design.arXiv preprint arXiv:2505.05799,

work page arXiv

[5] [5]

Qmoe: Practical sub-1-bit compression of trillion-parameter models.arXiv preprint arXiv:2310.16795,

Elias Frantar and Dan Alistarh. Qmoe: Practical sub-1-bit compression of trillion-parameter models.arXiv preprint arXiv:2310.16795,

work page arXiv

[6] [6]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

URL https://zenodo.org/records/12608602. A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer. A survey of quantization methods for efficient neural network inference. Technical Report Preprint: arXiv:2103.13630,

work page arXiv

[8] [8]

Gholami, Z

A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer. AI and memory wall. Technical Report Preprint: arXiv:2403.14123,

work page arXiv

[9] [9]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[10]

URL https://arxiv.org/abs/2506.14562.

Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026a. URL https://openreview.net/forum?id=MKEDsVWHd0.

Di He, Songjun Tu, Keyu Wang, Lu Y...

[11]

L. Hodgkinson, Z. Wang, and M. W. Mahoney. Models of heavy-tailed mechanistic universality. Technical Report Preprint: arXiv:2506.03470.

[12]

Xing Hu, Zhixuan Chen, Dawei Yang, Zukang Xu, Chen Xu, Zhihang Yuan, Sifan Zhou, and Jiangyong Yu. MoEQuant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804, 2025a.

Yuanzhe Hu, Kinshuk Goel, Vlad Killiakov, and Yaoqing Yang. Eigenspectrum analysis of neu...

[13]

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.

[14]

Vignesh Kothapalli, Tianyu Pang, Shenyang Deng, Zongmin Liu, and Yaoqing Yang. Crafting heavy-tails in weight matrix spectrum without gradient noise. arXiv preprint arXiv:2406.04657.

[15]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

[16]

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a.

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Cha...

[17]

C. H. Martin and M. W. Mahoney. Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior. Technical Report Preprint: arXiv:1710.09553.

[18]

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. OLMoE: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060.

[19]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

[20]

Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models. arXiv preprint arXiv:2502.05795.

[21]

Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, and Jianzong Wang. MoQAE: Mixed-precision quantization for long-context LLM inference via mixture of quantization-aware experts. arXiv preprint arXiv:2506.07533.

[22]

Zhanhao Xie, Yuexiao Ma, Xiawu Zheng, Fei Chao, Wanchen Sui, Yong Li, Shen Li, and Rongrong Ji. Automated fine-grained mixture-of-experts quantization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 27024–27037, 2025.

[23]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[24]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

[25]

Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun Liang, and Xiang Chen. Dynamo: Runtime switchable quantization for MoE with cross-dataset adaptation. arXiv preprint arXiv:2503.21135.

[26]

These results show that experts within the same MoE layer can exhibit different PL_Alpha_Hill values, indicating that expert importance is heterogeneous even within a single block. For comparison, Figure 9 reports layer-wise PL_Alpha_Hill distributions for four non-MoE models, including Llama3-1B (Grattafiori et al., 2024), Llama3-3B, Qwen1.5-4B (Team, 2024...

[27]

[Figure: panels for Mixtral-8x7B (Block 1, Block 7, Block 12), Llama-MoE-3.5B, and DeepSeekV2-Lite (Block 1, Block 7, ...); axis tick values omitted.]

[28]

and Uniform under four bits-per-layer budget settings: 2.0 / 2.5 / 3.0 / 3.5. For evaluation, we report perplexity (PPL↓) on WikiText2 and average zero-shot accuracy (Avg.↑) over six benchmarks: PIQA (Bisk et al., 2020), ARC-Easy, ARC-Challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), and BoolQ (Clark et al.,


[29]

using the EleutherAI LM Harness (Gao et al., 2024). In Figure 12, Figure 13, and Figure 14, we report detailed bit allocation results from AlphaQ under a 2-bit budget, with both layer-wise and expert-wise settings.

Table 7: Results on DeepSeekV2-Lite. Perplexity↓ on WikiText2 and accuracy↑ on six zero-shot tasks. The best results in each bit-width are hi...

[30]

backend: a CUDA kernel performs bit unpacking and dequantization, followed by Tensor Core GEMM. Profiling shows this decomposition is highly effective in prefill, which is compute-bound, but yields limited speedups in memory-bound decode due to data movement between the dequantization and GEMM; concretely, Memcpy HtoD and aten::copy_ become dominant in decode. We therefore implement decode-specific fused Triton kernels that fuse unpacking, dequantization, and GEMM into a single kernel.