Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging
Pith reviewed 2026-05-08 19:37 UTC · model grok-4.3
The pith
A joint architecture-token-bitwidth optimization of Vision Transformers delivers a more than 10x gain in throughput together with over 10x reductions in parameter count, FLOPs, and energy on a semiconductor defect classification task while preserving the required accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from a DeiT-B/16 baseline, the proposed multi-axis framework achieves a more than 10x improvement in throughput along with over 10x reductions in parameter count, FLOPs, and energy consumption, while maintaining the required accuracy on the downstream industrial task.
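For scale, DeiT-B/16 is commonly reported at roughly 86M parameters and about 17.6 GFLOPs at 224x224 input, so a tenfold reduction implies a deployed model under ~9M parameters and ~1.8 GFLOPs. Below is a minimal throughput harness one could use to check the speedup side of such a claim; this is a hypothetical sketch, with a smaller timm model standing in for the paper's searched backbone, not the authors' benchmark code.

```python
import time
import torch
import timm

def throughput(model: torch.nn.Module, batch: int = 64, iters: int = 50) -> float:
    """Images per second under fp16 autocast on GPU (rough harness, not the paper's)."""
    model = model.eval().cuda()
    x = torch.randn(batch, 3, 224, 224, device="cuda")
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
        for _ in range(10):  # warm-up so lazy init / cudnn tuning is excluded
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return batch * iters / (time.perf_counter() - start)

# deit_tiny is a stand-in for the searched compact backbone, not the paper's model.
baseline = timm.create_model("deit_base_patch16_224", pretrained=True)
compact = timm.create_model("deit_tiny_patch16_224", pretrained=True)
print(f"throughput gain: {throughput(compact) / throughput(baseline):.1f}x")
```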
Load-bearing premise
That the accuracy-efficiency trade-offs identified on ImageNet-1K under aggressive compression transfer directly to the in-house 3D X-ray semiconductor dataset without substantial accuracy loss or hidden deployment costs.
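One way to probe this premise is to score the baseline and each compressed configuration directly on the downstream data rather than extrapolating from ImageNet deltas. A hedged sketch, assuming the non-public in-house 3D X-ray images are arranged as class-labeled folders; the path, dataset layout, and fine-tuned checkpoints are all placeholders:

```python
import torch
import timm
from torchvision import datasets, transforms

# Hypothetical layout for the non-public 3D X-ray data:
# xray_defects/val/<defect_class>/*.png
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
val = datasets.ImageFolder("xray_defects/val", transform=tfm)
loader = torch.utils.data.DataLoader(val, batch_size=64, num_workers=4)

@torch.inference_mode()
def downstream_accuracy(model: torch.nn.Module) -> float:
    """Top-1 accuracy on the downstream task, measured at deployment precision."""
    model = model.eval().cuda()
    correct = total = 0
    with torch.autocast("cuda", dtype=torch.float16):
        for x, y in loader:
            pred = model(x.cuda()).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total

# Fine-tuned checkpoints are assumed loaded; compare the baseline against each
# compressed configuration on the downstream labels, not on ImageNet deltas.
baseline = timm.create_model("deit_base_patch16_224", num_classes=len(val.classes))
print(f"baseline top-1: {downstream_accuracy(baseline):.3f}")
```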
Original abstract
Vision Transformers (ViTs) have achieved strong performance in visual recognition, yet their deployment in resource-constrained industrial environments remains limited. Some main challenges are their high computational cost, memory requirement, and energy consumption. While individual efficiency techniques such as neural architecture search (NAS), token compression, and low-precision inference have been extensively studied, most prior work targets only a single optimization axis, limiting overall deployment gains while preserving accuracy. In this paper, we present one of the first holistic frameworks that jointly optimizes three complementary axes: architecture, token, and bit-width. Specifically, the framework identifies compact backbones via Neural Architecture Search (AutoFormer), reduces information processing via token merging (ToMe), and accelerates per-operation execution via fp16 mixed-precision inference. Starting from a DeiT-B/16 baseline, we first analyze accuracy-efficiency trade-offs on ImageNet-1K under aggressive compression. Then, we apply the selected configurations to a real-world in-house 3D X-ray semiconductor defect classification dataset for IC chip packaging inspection. Results show that the proposed multi-axis framework achieves more than 10 times improvement in throughput along with over 10 times reductions in parameter count, FLOPs, and energy consumption, while maintaining the required accuracy on the downstream industrial task. To the best of our knowledge, this is among the earliest works to jointly optimize architecture, token, and bit-width dimensions in ViTs and the first such resource-efficient, deployment-focused study tailored to semiconductor manufacturing.
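The abstract names three concrete levers: an AutoFormer-searched backbone, ToMe token merging, and fp16 mixed-precision inference. A minimal sketch of how the three axes compose in PyTorch/timm, using the public ToMe patch API and an off-the-shelf DeiT variant as a stand-in for the searched architecture (the paper's exact configuration and merge schedule are not given here):

```python
import timm
import torch
import tome  # facebookresearch/ToMe; patches timm ViTs for token merging

# Axis 1 (architecture): a compact backbone stands in for the
# AutoFormer-searched model, whose exact config is not public here.
model = timm.create_model("deit_small_patch16_224", pretrained=True)

# Axis 2 (tokens): patch the forward pass so each block merges the
# r most similar token pairs (ToMe's documented timm integration).
tome.patch.timm(model)
model.r = 16  # tokens merged per layer; the accuracy-throughput knob

# Axis 3 (bit-width): fp16 mixed-precision inference via autocast.
model = model.eval().cuda()
images = torch.randn(32, 3, 224, 224, device="cuda")  # dummy batch
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    logits = model(images)
print(logits.shape)  # torch.Size([32, 1000]) for the ImageNet-1K head
```

The three axes are largely independent, which is what makes the joint optimization tractable: the merge ratio r and the inference precision can be toggled on whatever backbone the architecture search returns.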
Editorial analysis
A structured set of objections, weighed in public.
Reference graph
Works this paper leans on
- [1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in ICLR, 2021.
- [2] S. Saha and L. Xu, "Vision transformers on the edge: A comprehensive survey of model compression and acceleration strategies," Neurocomputing, vol. 643, p. 130417, 2025.
- [3] M. Chen, H. Peng, J. Fu, and H. Ling, "AutoFormer: Searching transformers for visual recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12270–12280.
- [4] C. Gong and D. Wang, "NASViT: Neural architecture search for efficient vision transformers with gradient conflict-aware supernet training," in ICLR, 2022.
- [5] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, "DynamicViT: Efficient vision transformers with dynamic token sparsification," in Advances in Neural Information Processing Systems, 2021. [Online]. Available: https://openreview.net/forum?id=jB0Nlbwlybm
- [6] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, "Token merging: Your ViT but faster," in International Conference on Learning Representations, 2023.
- [7] Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, "PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization," in European Conference on Computer Vision, Springer, 2022, pp. 191–207.
- [8] Z. Li and Q. Gu, "I-ViT: Integer-only quantization for efficient vision transformer inference," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17065–17075.
- [9] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in International Conference on Machine Learning, PMLR, 2021, pp. 10347–10357.
- [10] W. Zeng, S. Jin, W. Liu, et al., "Not all tokens are equal: Human-centric visual analysis via token clustering transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11101–11111.
- [11] Y. Liang, C. Ge, Z. Tong, Y. Song, J. Wang, and P. Xie, "EViT: Expediting vision transformers via token reorganizations," in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=BjyvwnXXVn
- [12] Z. Kong, P. Dong, X. Ma, et al., "SPViT: Enabling faster vision transformers via latency-aware soft token pruning," in Proceedings of the European Conference on Computer Vision (ECCV), 2022.
- [13] M. Abdollahzadeh, T. Malekzadeh, and N. M. Cheung, "Revisit multimodal meta-learning through the lens of multi-task learning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021.
- [14] F. Mentzer, G. Toderici, D. Minnen, et al., "VCT: A video compression transformer," in Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [15] N. M. Cheung, O. C. Au, M. C. Kung, P. H. W. Wong, and C. H. Liu, "Highly parallel rate-distortion optimized intra-mode decision on multicore graphics processors," IEEE Transactions on Circuits and Systems for Video Technology, 2009.
- [16] N.-T. Tran, D.-K. Le Tan, A.-D. Doan, et al., "On-device scalable image-based localization via prioritized cascade search and fast one-many RANSAC," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1675–1690, 2018.
- [17] J. Sander, A. Cohen, V. R. Dasari, B. Venable, and B. Jalaian, "On accelerating edge AI: Optimizing resource-constrained environments," arXiv preprint arXiv:2501.15014, 2025.
- [18] Z. Zhou, X. Ning, K. Hong, et al., "A survey on efficient inference for large language models," arXiv preprint arXiv:2404.14294, 2024.
- [19] Y. Chen and X. Li, "RLRC: Reinforcement learning-based recovery for compressed vision-language-action models," arXiv preprint arXiv:2506.17639, 2025.
- [20] H. Yang, H. Yin, P. Molchanov, H. Li, and J. Kautz, "NViT: Vision transformer compression and parameter redistribution," 2021.
- [21] F. Yu, K. Huang, M. Wang, Y. Cheng, W. Chu, and L. Cui, "Width & depth pruning for vision transformers," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 3143–3151.
- [22] M. Chen, K. Wu, B. Ni, et al., "Searching the search space of vision transformer," Advances in Neural Information Processing Systems, vol. 34, pp. 8714–8726, 2021.
- [23] X. Su, S. You, J. Xie, et al., "ViTAS: Vision transformer architecture search," in European Conference on Computer Vision, Springer, 2022, pp. 139–157.
- [24] Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, "Post-training quantization for vision transformer," Advances in Neural Information Processing Systems, vol. 34, 2021.
- [25] Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, "PTQ4ViT: Post-training quantization framework for vision transformers," arXiv preprint arXiv:2111.12293, 2021.
- [26] Y. Ding, H. Qin, Q. Yan, et al., "Towards accurate post-training quantization for vision transformer," in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 5380–5388.
- [27] Y. Lin, T. Zhang, P. Sun, Z. Li, and S. Zhou, "FQ-ViT: Post-training quantization for fully quantized vision transformer," in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 2022, pp. 1173–1179.
- [28] Y. Liu, H. Yang, Z. Dong, K. Keutzer, L. Du, and S. Zhang, "NoisyQuant: Noisy bias-enhanced post-training activation quantization for vision transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20321–20330.