pith. machine review for the scientific record.

arxiv: 2605.07317 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Amortized-Precision Quantization for Early-Exit Vision Transformers

Hsi-Wen Chen, Ming-Syan Chen, Rui Fang

Pith reviewed 2026-05-11 01:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Amortized-Precision Quantization · Early-Exit Vision Transformers · Bi-level Optimization · Model Quantization · Efficient Inference · Low-Precision Deployment · Pareto Trade-off

The pith

Amortized-Precision Quantization stabilizes low-precision early-exit Vision Transformers by modeling each layer's stochastic exposure to noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix the fragility of quantizing early-exit Vision Transformers, where noise from low precision perturbs exit decisions and amplifies errors along variable-length paths. It introduces Amortized-Precision Quantization as a utilization-aware approach that factors in the probability each layer will actually execute under stochastic exits, thereby surfacing explicit depth-precision trade-offs. Building on this, the authors present MAQEE, a bi-level optimizer that jointly tunes exit thresholds and bit-widths while enforcing explicit risk bounds. A sympathetic reader would care because this makes high-performing Vision Transformers practical to deploy at very low compute budgets on edge hardware without accuracy collapse.
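The utilization-aware idea admits a compact sketch (all functions and numbers below are our own illustration, not the paper's implementation): a layer's quantization noise only matters in proportion to the probability that inference actually reaches that layer.

```python
# Hypothetical sketch of the utilization-aware ("amortized") idea:
# a layer's precision budget is weighted by the probability that
# inference survives past all earlier exits and actually runs it.

def layer_utilization(exit_probs):
    """P(layer l executes) = P(no earlier exit fired).

    exit_probs[l] is the probability of exiting *at* layer l,
    conditional on having reached it.
    """
    survive = 1.0
    utilization = []
    for p in exit_probs:
        utilization.append(survive)   # layer runs iff we survived to it
        survive *= (1.0 - p)          # chance we continue past this exit
    return utilization

def expected_bops(exit_probs, bops_per_layer):
    """Expected compute = sum over layers of utilization * per-layer BOPs."""
    return sum(u * c for u, c in
               zip(layer_utilization(exit_probs), bops_per_layer))

# Toy example: 4 layers, exits firing with increasing probability.
u = layer_utilization([0.2, 0.3, 0.5, 1.0])
# -> approximately [1.0, 0.8, 0.56, 0.28]: deep layers are rarely
# reached, so an amortized objective can spend fewer bits on them.
```

Under this toy model, lowering precision at rarely reached deep layers is cheap in expected error, which is one reading of the depth-precision trade-off the formulation is said to reveal.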

Core claim

Vision Transformers achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. We introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability.

What carries the argument

Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs.

Load-bearing premise

The bi-level optimization of exit thresholds and bit-widths under explicit risk control will remain stable and not introduce new instabilities when quantization noise perturbs dynamic paths.
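The paper's actual MAQEE algorithm is not reproduced on this page; a generic alternating bi-level loop of the kind the premise describes might look like the following toy sketch (the objectives and candidate grids are hypothetical stand-ins, not the paper's):

```python
# Generic alternating bi-level sketch (not the paper's MAQEE algorithm):
# the inner step re-tunes exit thresholds for fixed bit-widths; the
# outer step picks the cheapest bit-width assignment whose risk stays
# under an explicit cap at the freshly tuned thresholds.

def tune(bit_candidates, thr_candidates, risk, cost, risk_cap, steps=5):
    bits = max(bit_candidates)   # start conservatively: highest precision
    thr = thr_candidates[0]
    for _ in range(steps):
        # Inner level: thresholds that minimize risk for current bits.
        thr = min(thr_candidates, key=lambda t: risk(bits, t))
        # Outer level: cheapest feasible bit-width under the risk cap.
        feasible = [b for b in bit_candidates if risk(b, thr) <= risk_cap]
        if feasible:
            bits = min(feasible, key=lambda b: cost(b, thr))
    return bits, thr

# Toy stand-ins: risk falls with precision and with a higher exit
# threshold; cost grows quadratically in bit-width, as BOPs do.
risk = lambda b, t: 1.0 / (b * t)
cost = lambda b, t: b * b
bits, thr = tune([4, 6, 8], [0.5, 0.7, 0.9], risk, cost, risk_cap=0.3)
```

The premise at issue is precisely whether such alternation stays stable once the risk surrogate is itself perturbed by quantization noise, which a toy loop like this cannot settle.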

What would settle it

Measure accuracy and BOPs for a fixed early-exit ViT on a standard vision benchmark; if applying MAQEE quantization produces accuracy drops larger than the risk-controlled allowance compared with full-precision early-exit execution, the stability claim is falsified.
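Operationally, the test reduces to a comparison against the risk allowance (the accuracies and allowance below are illustrative, not measurements from the paper):

```python
# Hypothetical falsification check for the stability claim: under a
# risk-controlled allowance epsilon, MAQEE's accuracy drop relative to
# full-precision early-exit execution must stay within epsilon.

def stability_claim_holds(acc_full_precision, acc_maqee, epsilon):
    """True iff the observed accuracy drop is within the risk allowance."""
    drop = acc_full_precision - acc_maqee
    return drop <= epsilon

# Toy numbers: a ~0.4-point drop under a 0.5-point allowance supports
# the claim; a ~0.8-point drop would falsify it.
assert stability_claim_holds(81.3, 80.9, epsilon=0.5)
assert not stability_claim_holds(81.3, 80.5, epsilon=0.5)
```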

Figures

Figures reproduced from arXiv: 2605.07317 by Hsi-Wen Chen, Ming-Syan Chen, Rui Fang.

Figure 1
Figure 1. Overview of MAQEE. Left: Quantization error perturbs early-exit decisions, causing either premature or delayed exits. Middle: Risk modeling for early exiting and quantization, including performance gap risk (PGR), boundary sensitivity risk (BSR), inverse SQNR, and quantization-induced drift (QID). Right: MAQEE solves Amortized-Precision Quantization (APQ) via bi-level optimization, where exit thresholds a…
Figure 2
Figure 2. Accuracy–throughput/BOPs results. MAQEE stabilizes exit behavior and reduces exit depth and BOPs by up to 50% relative to strong baselines, consistent with the utilization-aware principle in Theorem 3. Segmentation and Detection: to assess cross-task generalization, …
original abstract

Vision Transformers (ViTs) achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. In this paper, we introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20% across classification, detection, and segmentation tasks.
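For intuition on how reductions of this magnitude arise (our own arithmetic, not figures from the paper): BOPs for a layer are conventionally counted as MACs × weight bits × activation bits, so low bit-widths and a shorter expected exit depth compound multiplicatively.

```python
# Illustrative BOPs arithmetic (our numbers, not the paper's).
# BOPs are commonly counted as MACs * weight_bits * activation_bits,
# so quantization and early exiting compound multiplicatively.

def bops(macs, w_bits, a_bits):
    return macs * w_bits * a_bits

full = bops(macs=1.0, w_bits=32, a_bits=32)   # FP32, full depth (MACs normalized to 1)
quant = bops(macs=0.5, w_bits=4, a_bits=8)    # 4x8-bit weights/activations, half expected depth
reduction = 1.0 - quant / full                # 0.984375, i.e. ~98% fewer BOPs
```

At these settings, even a modest depth saving on top of 4-to-8-bit precision clears the 95% mark, which is why the abstract's headline number is arithmetically plausible; whether accuracy survives it is the empirical question.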

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Amortized-Precision Quantization (APQ), a utilization-aware formulation for quantizing early-exit Vision Transformers that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, it proposes Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level optimization framework that jointly optimizes exit thresholds and bit-widths under explicit risk control. The paper claims that MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20% across classification, detection, and segmentation tasks.

Significance. If the bi-level optimization remains stable under quantization noise and the empirical gains are reproducible, this work would offer a principled approach to handling quantization in dynamic early-exit ViT architectures, potentially enabling more reliable low-precision deployment on resource-constrained hardware.

major comments (2)
  1. [§4] §4 (MAQEE bi-level framework): the risk-control term is introduced to improve inference stability, yet no derivation or bound is provided showing that it prevents quantization noise from systematically shifting exit decisions; this leaves the amortized utilization statistics potentially inconsistent with realized dynamic paths, which is load-bearing for the 95% BOP reduction and stability claims.
  2. [Experiments] Experimental results (Tables 1–3, Figures 3–5): the reported Pareto improvements and outperformance figures lack error bars, multiple-run statistics, or ablation on the risk-control hyperparameter, making it difficult to assess whether the gains are robust or sensitive to the quantization noise that the method aims to mitigate.
minor comments (2)
  1. [Abstract] Abstract: the high-level claims would be strengthened by a one-sentence reference to the risk-control formulation or the key APQ equation.
  2. [§2] Notation in §2: the distinction between 'amortized precision' and conventional layer-wise bit-width assignment could be made more explicit with a short comparison table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of theoretical justification and experimental robustness. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

point-by-point responses
  1. Referee: [§4] §4 (MAQEE bi-level framework): the risk-control term is introduced to improve inference stability, yet no derivation or bound is provided showing that it prevents quantization noise from systematically shifting exit decisions; this leaves the amortized utilization statistics potentially inconsistent with realized dynamic paths, which is load-bearing for the 95% BOP reduction and stability claims.

    Authors: We agree that the manuscript would benefit from a formal derivation or bound on the risk-control term to rigorously demonstrate its effect on preventing systematic shifts in exit decisions due to quantization noise. The term was introduced to regularize the bi-level optimization toward stable dynamic paths, but the original submission relied primarily on empirical motivation without an explicit bound relating amortized and realized utilization. In the revised manuscript, we will add a derivation in §4 showing that the risk-control term bounds the expected deviation between amortized statistics and realized paths under standard assumptions on quantization noise (e.g., bounded variance). This addition will directly address the potential inconsistency and strengthen support for the reported efficiency gains. revision: yes

  2. Referee: [Experiments] Experimental results (Tables 1–3, Figures 3–5): the reported Pareto improvements and outperformance figures lack error bars, multiple-run statistics, or ablation on the risk-control hyperparameter, making it difficult to assess whether the gains are robust or sensitive to the quantization noise that the method aims to mitigate.

    Authors: We acknowledge that the lack of error bars, multiple-run statistics, and ablation on the risk-control hyperparameter limits the ability to evaluate robustness against quantization noise. The original results were obtained from single runs due to computational constraints in the bi-level optimization. In the revision, we will rerun the primary experiments across multiple random seeds (at least five) and report means with standard deviations in Tables 1–3 along with error bars in Figures 3–5. We will also include a dedicated ablation study on the risk-control hyperparameter, showing its influence on the Pareto frontier and inference stability to confirm that the gains remain consistent across reasonable hyperparameter choices. revision: yes
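The promised multi-seed protocol amounts to standard mean-and-deviation reporting; a minimal sketch (the per-seed accuracies below are hypothetical):

```python
# Sketch of the promised multi-seed reporting (accuracies are made up):
# run each configuration across seeds, then report the mean and sample
# standard deviation for tables, and use the deviation for error bars.
from statistics import mean, stdev

def summarize(runs):
    """runs: list of per-seed accuracies -> (mean, sample std dev)."""
    return mean(runs), stdev(runs)

m, s = summarize([80.9, 81.1, 80.8, 81.0, 81.2])  # five seeds, as promised
# -> approximately (81.0, 0.158)
```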

Circularity Check

0 steps flagged

No circularity: new formulations and empirical claims are independent of inputs

full rationale

The paper introduces APQ as a utilization-aware formulation accounting for stochastic quantization noise exposure and MAQEE as a bi-level joint optimization of thresholds and bit-widths under risk control. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that reduce the 95% BOP reduction or 20% outperformance claims to tautological definitions or prior self-references. The derivation chain remains self-contained, with performance results framed as outcomes of the proposed methods rather than constructions equivalent to the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on an unstated model of how quantization noise perturbs exit decisions and on the assumption that bi-level optimization can jointly tune thresholds and precisions without new failure modes. No free parameters are explicitly named, but exit thresholds and per-layer bit-widths are optimized quantities. No new physical entities are introduced.

free parameters (2)
  • exit thresholds
    Tuned jointly in the bi-level framework; values not reported in abstract.
  • bit-widths
    Optimized per layer under utilization-aware formulation; specific values not given.
axioms (2)
  • domain assumption Quantization noise can perturb exit decisions and amplify errors along dynamic inference paths
    Invoked to explain why static quantization fails for early-exit models.
  • ad hoc to paper Layer-wise stochastic exposure to quantization noise admits a utilization-aware formulation
    Core modeling choice behind APQ.

pith-pipeline@v0.9.0 · 5455 in / 1363 out tokens · 72940 ms · 2026-05-11T01:32:36.865344+00:00 · methodology
