pith. machine review for the scientific record.

arxiv: 2605.07317 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Amortized-Precision Quantization for Early-Exit Vision Transformers

Hsi-Wen Chen, Ming-Syan Chen, Rui Fang

Pith reviewed 2026-05-11 01:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Amortized-Precision Quantization · Early-Exit Vision Transformers · Bi-level Optimization · Model Quantization · Efficient Inference · Low-Precision Deployment · Pareto Trade-off

The pith

Amortized-Precision Quantization stabilizes low-precision early-exit Vision Transformers by modeling each layer's stochastic exposure to noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix the fragility of quantizing early-exit Vision Transformers, where noise from low precision perturbs exit decisions and amplifies errors along variable-length paths. It introduces Amortized-Precision Quantization as a utilization-aware approach that factors in the probability each layer will actually execute under stochastic exits, thereby surfacing explicit depth-precision trade-offs. Building on this, the authors present MAQEE, a bi-level optimizer that jointly tunes exit thresholds and bit-widths while enforcing explicit risk bounds. A sympathetic reader would care because this makes high-performing Vision Transformers practical to deploy at very low compute budgets on edge hardware without accuracy collapse.
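The utilization-aware idea admits a compact sketch (all functions and numbers below are our own illustration, not the paper's implementation): a layer's quantization noise only matters in proportion to the probability that inference actually reaches that layer.

```python
# Hypothetical sketch of the utilization-aware ("amortized") idea:
# a layer's precision budget is weighted by the probability that
# inference survives past all earlier exits and actually runs it.

def layer_utilization(exit_probs):
    """P(layer l executes) = P(no earlier exit fired).

    exit_probs[l] is the probability of exiting *at* layer l,
    conditional on having reached it.
    """
    survive = 1.0
    utilization = []
    for p in exit_probs:
        utilization.append(survive)   # layer runs iff we survived to it
        survive *= (1.0 - p)          # chance we continue past this exit
    return utilization

def expected_bops(exit_probs, bops_per_layer):
    """Expected compute = sum over layers of utilization * per-layer BOPs."""
    return sum(u * c for u, c in
               zip(layer_utilization(exit_probs), bops_per_layer))

# Toy example: 4 layers, exits firing with increasing probability.
u = layer_utilization([0.2, 0.3, 0.5, 1.0])
# -> approximately [1.0, 0.8, 0.56, 0.28]: deep layers are rarely
# reached, so an amortized objective can spend fewer bits on them.
```

Under this toy model, lowering precision at rarely reached deep layers is cheap in expected error, which is one reading of the depth-precision trade-off the formulation is said to reveal.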

Core claim

Vision Transformers achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. We introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability.

What carries the argument

Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs.

Load-bearing premise

The bi-level optimization of exit thresholds and bit-widths under explicit risk control will remain stable and not introduce new instabilities when quantization noise perturbs dynamic paths.
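The paper's actual MAQEE algorithm is not reproduced on this page; a generic alternating bi-level loop of the kind the premise describes might look like the following toy sketch (the objectives and candidate grids are hypothetical stand-ins, not the paper's):

```python
# Generic alternating bi-level sketch (not the paper's MAQEE algorithm):
# the inner step re-tunes exit thresholds for fixed bit-widths; the
# outer step picks the cheapest bit-width assignment whose risk stays
# under an explicit cap at the freshly tuned thresholds.

def tune(bit_candidates, thr_candidates, risk, cost, risk_cap, steps=5):
    bits = max(bit_candidates)   # start conservatively: highest precision
    thr = thr_candidates[0]
    for _ in range(steps):
        # Inner level: thresholds that minimize risk for current bits.
        thr = min(thr_candidates, key=lambda t: risk(bits, t))
        # Outer level: cheapest feasible bit-width under the risk cap.
        feasible = [b for b in bit_candidates if risk(b, thr) <= risk_cap]
        if feasible:
            bits = min(feasible, key=lambda b: cost(b, thr))
    return bits, thr

# Toy stand-ins: risk falls with precision and with a higher exit
# threshold; cost grows quadratically in bit-width, as BOPs do.
risk = lambda b, t: 1.0 / (b * t)
cost = lambda b, t: b * b
bits, thr = tune([4, 6, 8], [0.5, 0.7, 0.9], risk, cost, risk_cap=0.3)
```

The premise at issue is precisely whether such alternation stays stable once the risk surrogate is itself perturbed by quantization noise, which a toy loop like this cannot settle.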

What would settle it

Measure accuracy and BOPs for a fixed early-exit ViT on a standard vision benchmark; if applying MAQEE quantization produces accuracy drops larger than the risk-controlled allowance compared with full-precision early-exit execution, the stability claim is falsified.
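Operationally, the test reduces to a comparison against the risk allowance (the accuracies and allowance below are illustrative, not measurements from the paper):

```python
# Hypothetical falsification check for the stability claim: under a
# risk-controlled allowance epsilon, MAQEE's accuracy drop relative to
# full-precision early-exit execution must stay within epsilon.

def stability_claim_holds(acc_full_precision, acc_maqee, epsilon):
    """True iff the observed accuracy drop is within the risk allowance."""
    drop = acc_full_precision - acc_maqee
    return drop <= epsilon

# Toy numbers: a ~0.4-point drop under a 0.5-point allowance supports
# the claim; a ~0.8-point drop would falsify it.
assert stability_claim_holds(81.3, 80.9, epsilon=0.5)
assert not stability_claim_holds(81.3, 80.5, epsilon=0.5)
```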

Figures

Figures reproduced from arXiv: 2605.07317 by Hsi-Wen Chen, Ming-Syan Chen, Rui Fang.

Figure 1
Figure 1. Overview of MAQEE. Left: Quantization error perturbs early-exit decisions, causing either premature or delayed exits. Middle: Risk modeling for early exiting and quantization, including performance gap risk (PGR), boundary sensitivity risk (BSR), inverse SQNR, and quantization-induced drift (QID). Right: MAQEE solves Amortized-Precision Quantization (APQ) via bi-level optimization, where exit thresholds a…
Figure 2
Figure 2. Accuracy–throughput/BOPs results. MAQEE stabilizes exit behavior and reduces exit depth and BOPs by up to 50% relative to strong baselines, consistent with the utilization-aware principle in Theorem 3. Segmentation and Detection: to assess cross-task generalization, …
original abstract

Vision Transformers (ViTs) achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. In this paper, we introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20% across classification, detection, and segmentation tasks.
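For intuition on how reductions of this magnitude arise (our own arithmetic, not figures from the paper): BOPs for a layer are conventionally counted as MACs × weight bits × activation bits, so low bit-widths and a shorter expected exit depth compound multiplicatively.

```python
# Illustrative BOPs arithmetic (our numbers, not the paper's).
# BOPs are commonly counted as MACs * weight_bits * activation_bits,
# so quantization and early exiting compound multiplicatively.

def bops(macs, w_bits, a_bits):
    return macs * w_bits * a_bits

full = bops(macs=1.0, w_bits=32, a_bits=32)   # FP32, full depth (MACs normalized to 1)
quant = bops(macs=0.5, w_bits=4, a_bits=8)    # 4x8-bit weights/activations, half expected depth
reduction = 1.0 - quant / full                # 0.984375, i.e. ~98% fewer BOPs
```

At these settings, even a modest depth saving on top of 4-to-8-bit precision clears the 95% mark, which is why the abstract's headline number is arithmetically plausible; whether accuracy survives it is the empirical question.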

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Amortized-Precision Quantization (APQ), a utilization-aware formulation for quantizing early-exit Vision Transformers that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, it proposes Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level optimization framework that jointly optimizes exit thresholds and bit-widths under explicit risk control. The paper claims that MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20% across classification, detection, and segmentation tasks.

Significance. If the bi-level optimization remains stable under quantization noise and the empirical gains are reproducible, this work would offer a principled approach to handling quantization in dynamic early-exit ViT architectures, potentially enabling more reliable low-precision deployment on resource-constrained hardware.

major comments (2)
  1. [§4] §4 (MAQEE bi-level framework): the risk-control term is introduced to improve inference stability, yet no derivation or bound is provided showing that it prevents quantization noise from systematically shifting exit decisions; this leaves the amortized utilization statistics potentially inconsistent with realized dynamic paths, which is load-bearing for the 95% BOP reduction and stability claims.
  2. [Experiments] Experimental results (Tables 1–3, Figures 3–5): the reported Pareto improvements and outperformance figures lack error bars, multiple-run statistics, or ablation on the risk-control hyperparameter, making it difficult to assess whether the gains are robust or sensitive to the quantization noise that the method aims to mitigate.
minor comments (2)
  1. [Abstract] Abstract: the high-level claims would be strengthened by a one-sentence reference to the risk-control formulation or the key APQ equation.
  2. [§2] Notation in §2: the distinction between 'amortized precision' and conventional layer-wise bit-width assignment could be made more explicit with a short comparison table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of theoretical justification and experimental robustness. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

point-by-point responses
  1. Referee: [§4] §4 (MAQEE bi-level framework): the risk-control term is introduced to improve inference stability, yet no derivation or bound is provided showing that it prevents quantization noise from systematically shifting exit decisions; this leaves the amortized utilization statistics potentially inconsistent with realized dynamic paths, which is load-bearing for the 95% BOP reduction and stability claims.

    Authors: We agree that the manuscript would benefit from a formal derivation or bound on the risk-control term to rigorously demonstrate its effect on preventing systematic shifts in exit decisions due to quantization noise. The term was introduced to regularize the bi-level optimization toward stable dynamic paths, but the original submission relied primarily on empirical motivation without an explicit bound relating amortized and realized utilization. In the revised manuscript, we will add a derivation in §4 showing that the risk-control term bounds the expected deviation between amortized statistics and realized paths under standard assumptions on quantization noise (e.g., bounded variance). This addition will directly address the potential inconsistency and strengthen support for the reported efficiency gains. revision: yes

  2. Referee: [Experiments] Experimental results (Tables 1–3, Figures 3–5): the reported Pareto improvements and outperformance figures lack error bars, multiple-run statistics, or ablation on the risk-control hyperparameter, making it difficult to assess whether the gains are robust or sensitive to the quantization noise that the method aims to mitigate.

    Authors: We acknowledge that the lack of error bars, multiple-run statistics, and ablation on the risk-control hyperparameter limits the ability to evaluate robustness against quantization noise. The original results were obtained from single runs due to computational constraints in the bi-level optimization. In the revision, we will rerun the primary experiments across multiple random seeds (at least five) and report means with standard deviations in Tables 1–3 along with error bars in Figures 3–5. We will also include a dedicated ablation study on the risk-control hyperparameter, showing its influence on the Pareto frontier and inference stability to confirm that the gains remain consistent across reasonable hyperparameter choices. revision: yes
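The promised multi-seed protocol amounts to standard mean-and-deviation reporting; a minimal sketch (the per-seed accuracies below are hypothetical):

```python
# Sketch of the promised multi-seed reporting (accuracies are made up):
# run each configuration across seeds, then report the mean and sample
# standard deviation for tables, and use the deviation for error bars.
from statistics import mean, stdev

def summarize(runs):
    """runs: list of per-seed accuracies -> (mean, sample std dev)."""
    return mean(runs), stdev(runs)

m, s = summarize([80.9, 81.1, 80.8, 81.0, 81.2])  # five seeds, as promised
# -> approximately (81.0, 0.158)
```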

Circularity Check

0 steps flagged

No circularity: new formulations and empirical claims are independent of inputs

full rationale

The paper introduces APQ as a utilization-aware formulation accounting for stochastic quantization noise exposure and MAQEE as a bi-level joint optimization of thresholds and bit-widths under risk control. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that reduce the 95% BOP reduction or 20% outperformance claims to tautological definitions or prior self-references. The derivation chain remains self-contained, with performance results framed as outcomes of the proposed methods rather than constructions equivalent to the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on an unstated model of how quantization noise perturbs exit decisions and on the assumption that bi-level optimization can jointly tune thresholds and precisions without new failure modes. No free parameters are explicitly named, but exit thresholds and per-layer bit-widths are optimized quantities. No new physical entities are introduced.

free parameters (2)
  • exit thresholds
    Tuned jointly in the bi-level framework; values not reported in abstract.
  • bit-widths
    Optimized per layer under utilization-aware formulation; specific values not given.
axioms (2)
  • domain assumption Quantization noise can perturb exit decisions and amplify errors along dynamic inference paths
    Invoked to explain why static quantization fails for early-exit models.
  • ad hoc to paper Layer-wise stochastic exposure to quantization noise admits a utilization-aware formulation
    Core modeling choice behind APQ.

pith-pipeline@v0.9.0 · 5455 in / 1363 out tokens · 72940 ms · 2026-05-11T01:32:36.865344+00:00 · methodology
