pith. machine review for the scientific record.

arxiv: 2605.01742 · v1 · submitted 2026-05-03 · 💻 cs.CV

Recognition: 2 theorem links

Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords architecture · token · first · semiconductor · while · accuracy · bit-width · compression

The pith

Joint architecture-token-bitwidth optimization of Vision Transformers delivers a more than 10x throughput gain, together with over 10x reductions in parameters, FLOPs, and energy, on a semiconductor defect classification task while preserving the required accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision Transformers are powerful but heavy models for image tasks. The authors combine three efficiency tricks: they search for smaller network shapes, merge similar image patches so fewer tokens are processed, and run calculations in lower-precision numbers. They start from a standard DeiT model, test the combinations on a big public image dataset, then move the best settings to a private factory dataset of 3D X-ray images of chip packages. The result is a model that runs more than ten times faster, uses far less memory and power, and still meets the accuracy needed to catch defects on the production line. The work focuses on making these models usable inside actual semiconductor manufacturing equipment rather than just academic benchmarks.
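A minimal sketch of how the three axes compose, assuming the public timm package and facebookresearch/ToMe (installed from GitHub); the AutoFormer-searched backbone is approximated here by an off-the-shelf DeiT, since the searched configuration is not spelled out in the abstract. This illustrates the recipe, not the authors' code.

```python
# Hedged sketch: three-axis efficiency recipe on a stock ViT backbone.
# Assumes: pip-installed timm and the ToMe package from facebookresearch/ToMe.
import timm
import tome
import torch

# Axis 1 (architecture): a compact backbone standing in for the NAS-found one.
model = timm.create_model("deit_small_patch16_224", pretrained=True)

# Axis 2 (tokens): patch ToMe into the timm ViT and merge r tokens per block.
tome.patch.timm(model)
model.r = 16

# Axis 3 (bit-width): run inference in fp16 on the GPU.
model = model.half().cuda().eval()

x = torch.randn(32, 3, 224, 224, device="cuda", dtype=torch.float16)
with torch.no_grad():
    logits = model(x)  # (32, 1000) class scores
```

Each knob (backbone choice, per-block merge count r, numeric precision) can be swept separately, which is presumably what keeps the joint exploration tractable.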

Core claim

Starting from a DeiT-B/16 baseline, the proposed multi-axis framework achieves more than 10 times improvement in throughput along with over 10 times reductions in parameter count, FLOPs, and energy consumption, while maintaining the required accuracy on the downstream industrial task.
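The abstract does not say how throughput was measured; one conventional way to back a ">10x throughput" figure is a synchronized images-per-second benchmark along the lines of the sketch below (the function name and defaults are ours, not the paper's).

```python
import time
import torch

@torch.no_grad()
def images_per_second(model, batch=64, size=224, warmup=10, iters=50,
                      device="cuda", dtype=torch.float16):
    """Rough throughput benchmark: timed forward passes on random data,
    after warm-up and with explicit CUDA synchronization."""
    model = model.to(device, dtype).eval()
    x = torch.randn(batch, 3, size, size, device=device, dtype=dtype)
    for _ in range(warmup):          # warm up kernels before timing
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch * iters / (time.perf_counter() - start)
```

Comparing this number for the DeiT-B/16 baseline against the compressed configuration would yield a throughput ratio of the kind the claim reports.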

Load-bearing premise

That the accuracy-efficiency trade-offs identified on ImageNet-1K under aggressive compression transfer directly to the in-house 3D X-ray semiconductor dataset without substantial accuracy loss or hidden deployment costs.

Figures

Figures reproduced from arXiv: 2605.01742 by Kaixin Xu, Ngai-Man Cheung, Phat Nguyen, Wang Zhe, Xue Geng, Xulei Yang.

Figure 1: 3D X-ray semiconductor defect (sub-figures available at the source ↗).
read the original abstract

Vision Transformers (ViTs) have achieved strong performance in visual recognition, yet their deployment in resource-constrained industrial environments remains limited. Some main challenges are their high computational cost, memory requirement, and energy consumption. While individual efficiency techniques such as neural architecture search (NAS), token compression, and low-precision inference have been extensively studied, most prior work targets only a single optimization axis, limiting overall deployment gains while preserving accuracy. In this paper, we present one of the first holistic frameworks that jointly optimizes three complementary axes: architecture, token, and bit-width. Specifically, the framework identifies compact backbones via Neural Architecture Search (AutoFormer), reduces information processing via token merging (ToMe), and accelerates per-operation execution via fp16 mixed-precision inference. Starting from a DeiT-B/16 baseline, we first analyze accuracy-efficiency trade-offs on ImageNet-1K under aggressive compression. Then, we apply the selected configurations to a real-world in-house 3D X-ray semiconductor defect classification dataset for IC chip packaging inspection. Results show that the proposed multi-axis framework achieves more than 10 times improvement in throughput along with over 10 times reductions in parameter count, FLOPs, and energy consumption, while maintaining the required accuracy on the downstream industrial task. To the best of our knowledge, this is among the earliest works to jointly optimize architecture, token, and bit-width dimensions in ViTs and the first such resource-efficient, deployment-focused study tailored to semiconductor manufacturing.
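For readers unfamiliar with ToMe, the token axis boils down to a bipartite matching step repeated in every transformer block: tokens are split into two sets, each token in the first set finds its most similar partner in the second, and the r most redundant ones are averaged into their partners. The sketch below is a simplified, illustrative version (class-token protection, proportional merge weights, and tie handling are omitted), not the library's or the paper's implementation.

```python
import torch
import torch.nn.functional as F

def bipartite_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Simplified ToMe-style merge: x is (batch, n, dim) of patch tokens
    (class token assumed excluded); returns (batch, n - r, dim)."""
    B = x.shape[0]
    src, dst = x[:, 0::2], x[:, 1::2]                  # split tokens into two alternating sets
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).transpose(1, 2)
    best_sim, best_dst = sim.max(dim=-1)               # most similar dst token for each src token
    order = best_sim.argsort(dim=-1, descending=True)
    merge_idx, keep_idx = order[:, :r], order[:, r:]   # merge away the r most redundant src tokens
    batch = torch.arange(B, device=x.device)[:, None]
    dst = dst.clone()
    tgt = best_dst.gather(1, merge_idx)
    # average each merged token into its partner (duplicate targets simply overwrite in this sketch)
    dst[batch, tgt] = 0.5 * (dst[batch, tgt] + src[batch, merge_idx])
    return torch.cat([src[batch, keep_idx], dst], dim=1)
```

Because r tokens disappear in every block, later layers attend over progressively fewer tokens, which is where most of the FLOP savings on the token axis come from.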

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the approach relies on existing NAS (AutoFormer), token merging (ToMe), and mixed-precision inference techniques whose internal assumptions are not restated here.

pith-pipeline@v0.9.0 · 5582 in / 1114 out tokens · 25208 ms · 2026-05-08T19:37:06.789072+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021

  2. [2]

    Vision transformers on the edge: A comprehensive survey of model compression and acceleration strategies,

    S. Saha and L. Xu, “Vision transformers on the edge: A comprehensive survey of model compression and acceleration strategies,” Neurocomputing, vol. 643, p. 130417, 2025

  3. [3]

    Autoformer: Searching transformers for visual recognition,

    M. Chen, H. Peng, J. Fu, and H. Ling, “Autoformer: Searching transformers for visual recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12270–12280

  4. [4]

    Nasvit: Neural architecture search for efficient vision transformers with gradient conflict-aware supernet training,

    C. Gong and D. Wang, “Nasvit: Neural architecture search for efficient vision transformers with gradient conflict-aware supernet training,” in International Conference on Learning Representations (ICLR), 2022

  5. [5]

    Dynamicvit: Efficient vision transformers with dynamic token sparsification,

    Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “Dynamicvit: Efficient vision transformers with dynamic token sparsification,” in Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021. [Online]. Available: https://openreview.net/forum?id=jB0Nlbwlybm

  6. [6]

    Token merging: Your ViT but faster,

    D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your ViT but faster,” in International Conference on Learning Representations, 2023

  7. [7]

    Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization,

    Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, “Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization,” in European Conference on Computer Vision, Springer, 2022, pp. 191–207

  8. [8]

    I-vit: Integer-only quantization for efficient vision transformer inference,

    Z. Li and Q. Gu, “I-vit: Integer-only quantization for efficient vision transformer inference,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17065–17075

  9. [9]

    Training data-efficient image transformers & distillation through attention,

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning, PMLR, 2021, pp. 10347–10357

  10. [10]

    Not all tokens are equal: Human-centric visual analysis via token clustering transformer,

    W. Zeng, S. Jin, W. Liu, et al., “Not all tokens are equal: Human-centric visual analysis via token clustering transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11101–11111

  11. [11]

    EVit: Expediting vision transformers via token reorganizations,

    Y. Liang, C. Ge, Z. Tong, Y. Song, J. Wang, and P. Xie, “EVit: Expediting vision transformers via token reorganizations,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=BjyvwnXXVn

  12. [12]

    Spvit: Enabling faster vision transformers via latency-aware soft token pruning,

    Z. Kong, P. Dong, X. Ma, et al., “Spvit: Enabling faster vision transformers via latency-aware soft token pruning,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022

  13. [13]

    Revisit multimodal meta-learning through the lens of multi-task learning,

    M. Abdollahzadeh, T. Malekzadeh, and N. M. Cheung, “Revisit multimodal meta-learning through the lens of multi-task learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021

  14. [14]

    Vct: A video compression transformer,

    F. Mentzer, G. Toderici, D. Minnen, et al., “Vct: A video compression transformer,” in Advances in Neural Information Processing Systems (NeurIPS), 2022

  15. [15]

    Highly parallel rate-distortion optimized intra-mode decision on multicore graphics processors,

    N. M. Cheung, O. C. Au, M. C. Kung, P. H. W. Wong, and C. H. Liu, “Highly parallel rate-distortion optimized intra-mode decision on multicore graphics processors,” IEEE Transactions on Circuits and Systems for Video Technology, 2009

  16. [16]

    On-device scalable image-based localization via prioritized cascade search and fast one-many ransac,

    N.-T. Tran, D.-K. Le Tan, A.-D. Doan, et al., “On-device scalable image-based localization via prioritized cascade search and fast one-many ransac,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1675–1690, 2018

  17. [17]

    On accelerating edge ai: Optimizing resource-constrained environments,

    J. Sander, A. Cohen, V. R. Dasari, B. Venable, and B. Jalaian, “On accelerating edge ai: Optimizing resource-constrained environments,” arXiv preprint arXiv:2501.15014, 2025

  18. [18]

    A Survey on Efficient Inference for Large Language Models

    Z. Zhou, X. Ning, K. Hong, et al., “A survey on efficient inference for large language models,” arXiv preprint arXiv:2404.14294, 2024

  19. [19]

    Rlrc: Reinforcement learning-based recovery for compressed vision-language-action models,

    Y. Chen and X. Li, “Rlrc: Reinforcement learning-based recovery for compressed vision-language-action models,” arXiv preprint arXiv:2506.17639, 2025

  20. [20]

    Nvit: Vision transformer compression and parameter redistribution,

    H. Yang, H. Yin, P. Molchanov, H. Li, and J. Kautz, “Nvit: Vision transformer compression and parameter redistribution,” 2021

  21. [21]

    Width & depth pruning for vision transformers,

    F. Yu, K. Huang, M. Wang, Y. Cheng, W. Chu, and L. Cui, “Width & depth pruning for vision transformers,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 3143–3151

  22. [22]

    Searching the search space of vision transformer,

    M. Chen, K. Wu, B. Ni, et al., “Searching the search space of vision transformer,” Advances in Neural Information Processing Systems, vol. 34, pp. 8714–8726, 2021

  23. [23]

    Vitas: Vision transformer architecture search,

    X. Su, S. You, J. Xie, et al., “Vitas: Vision transformer architecture search,” in European Conference on Computer Vision, Springer, 2022, pp. 139–157

  24. [24]

    Post-training quantization for vision transformer,

    Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” Advances in Neural Information Processing Systems, vol. 34, 2021

  25. [25]

    Ptq4vit: Post-training quantization framework for vision transformers with twin uniform quantization,

    Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, “Ptq4vit: Post-training quantization framework for vision transformers,” arXiv preprint arXiv:2111.12293, 2021

  26. [26]

    Towards accurate post-training quantization for vision transformer,

    Y. Ding, H. Qin, Q. Yan, et al., “Towards accurate post-training quantization for vision transformer,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 5380–5388

  27. [27]

    Fq-vit: Post-training quantization for fully quantized vision transformer,

    Y. Lin, T. Zhang, P. Sun, Z. Li, and S. Zhou, “Fq-vit: Post-training quantization for fully quantized vision transformer,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 2022, pp. 1173–1179

  28. [28]

    Noisyquant: Noisy bias-enhanced post-training activation quantization for vision transformers,

    Y. Liu, H. Yang, Z. Dong, K. Keutzer, L. Du, and S. Zhang, “Noisyquant: Noisy bias-enhanced post-training activation quantization for vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20321–20330