pith. machine review for the scientific record. sign in

arxiv: 2605.01330 · v1 · submitted 2026-05-02 · 💻 cs.CV

Recognition: unknown

Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay

Guang Liang, Jianxin Wu, Jin Tong, Peilin Sun

Authors on Pith no claims yet

Pith reviewed 2026-05-09 14:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision transformersquantizationactivation outliersstructural regularizationcolinearity decaylow-bit deploymentimage classificationobject detection
0
0 comments X

The pith

Penalizing cross-matrix colinearity during training reduces harmful activation outliers in vision transformers and improves low-bit quantization accuracy without hurting full-precision performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that activation outliers in vision transformers arise from structural amplification through aligned matrices rather than from raw magnitude alone. Instead of directly clamping activations or altering the loss, the authors add a decoupled regularizer that penalizes detrimental alignment between ordered pairs of matrices inside each transformer block. This Colinearity-Decay term is applied only during training and disappears at inference. Experiments across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning show that the method raises quantized accuracy for several pipelines while leaving or even raising full-precision accuracy. The approach therefore offers a way to prepare ViTs for efficient low-bit deployment with no architectural change and no runtime cost.

Core claim

Colinearity-Decay (CD) is introduced as a structural regularizer that penalizes detrimental cross-matrix alignment between ordered matrix pairs within Transformer blocks. By controlling this alignment, CD reduces the amplification of extreme activations without suppressing their magnitude directly or modifying the task loss. Applied as a decoupled update with minimal overhead, the regularizer consistently improves quantized accuracy on ImageNet-1K pre-training, COCO object detection, and downstream fine-tuning tasks while preserving or improving full-precision performance, all with zero inference-time overhead.

What carries the argument

Colinearity-Decay (CD), a decoupled structural regularizer that penalizes detrimental cross-matrix alignment for ordered matrix pairs inside Transformer blocks.

If this is right

  • CD raises quantized accuracy across ImageNet-1K pre-training pipelines while preserving full-precision accuracy.
  • The same regularizer improves performance on COCO detection under quantization.
  • Downstream fine-tuning tasks also show gains in quantized accuracy from CD.
  • The method adds no inference-time overhead because the regularizer is removed after training.
  • CD works without changing model architecture or the original task loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same colinearity penalty might transfer to language-model transformers if their block structure contains analogous ordered matrix pairs.
  • CD could be combined with existing post-training quantization methods to further close the gap to full-precision performance.
  • If the penalty is applied only to selected layers, the training cost might drop while retaining most of the outlier-reduction benefit.

Load-bearing premise

Penalizing cross-matrix colinearity will reliably reduce harmful activation outliers without creating new failure modes or requiring per-architecture tuning.

What would settle it

Measure activation outlier magnitudes and quantized top-1 accuracy on ImageNet-1K before and after adding CD; if outlier magnitudes stay the same or rise and quantized accuracy does not improve, the claim is falsified.

Figures

Figures reproduced from arXiv: 2605.01330 by Guang Liang, Jianxin Wu, Jin Tong, Peilin Sun.

Figure 1
Figure 1. Figure 1: Comparison between CD and TWEO under RepQ-ViT [ view at source ↗
Figure 2
Figure 2. Figure 2: Training overhead relative to the baseline run on ImageNet-1K pre-training. Left: time and view at source ↗
Figure 3
Figure 3. Figure 3: Matrix pairs regularized by CD in a pre-norm Transformer block. Blocks with hatching view at source ↗
Figure 4
Figure 4. Figure 4: Real FFN branch output versus the no-activation surrogate view at source ↗
Figure 5
Figure 5. Figure 5: FC1-FC2 alignment and its effect on quantization. Left: normalized alignment view at source ↗
read the original abstract

Low-bit quantization is a practical route for efficiently deploying vision Transformers, yet activation outliers complicate fully quantized deployment. Existing methods either handle quantization post-training or suppress large activations during training; however, aggressively restricting outliers in vision models can lead to a poorer trade-off between full-precision and quantized accuracy. We argue that rather than simply suppressing outliers, the training objective should control the structural amplification that makes them harmful. To this end, we introduce Colinearity-Decay (CD), a structural regularizer for ordered matrix pairs within Transformer blocks. CD penalizes detrimental cross-matrix alignment and mitigates extreme activations without altering the architecture or task loss. Applied as a decoupled update, CD is non-invasive and introduces minimal training overhead. Across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning, CD consistently boosts quantized accuracy across multiple pipelines while preserving, or even improving, full-precision performance. Ultimately, our results demonstrate that structural regularization effectively prepares vision Transformers for low-bit deployment with zero inference-time overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Colinearity-Decay (CD), a decoupled structural regularizer applied to ordered matrix pairs inside Transformer blocks. CD penalizes cross-matrix alignment to control structural amplification of activation outliers, thereby improving low-bit quantization performance on ViTs while leaving the task loss and architecture unchanged. Experiments across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning report consistent gains in quantized accuracy with preserved or improved full-precision results and negligible training overhead.

Significance. If the mechanism is confirmed, CD offers a lightweight, inference-free training-time intervention that addresses a practical bottleneck in quantized ViT deployment. The approach is architecture-agnostic in principle and could complement post-training quantization pipelines; the reported preservation of full-precision accuracy is a notable strength relative to aggressive outlier-suppression baselines.

major comments (3)
  1. [Method] Method section (exact location unspecified in abstract): the colinearity metric, the precise definition of 'ordered matrix pairs,' and the mathematical form of the penalty term are not provided. Without these, it is impossible to verify whether the regularizer specifically targets outlier amplification or functions as generic regularization.
  2. [Experiments] Experimental section: no ablation is reported on the CD strength coefficient, on the choice of matrix pairs, or against alternative regularizers applied to the same pairs. The central claim that penalizing colinearity (rather than simply suppressing activations) is the operative mechanism therefore remains untested.
  3. [Results] Results tables/figures: quantitative details on experiment scale, number of runs, variance, and exact baselines are absent from the abstract and not referenced in the provided text. This prevents assessment of whether the reported quantized-accuracy gains are statistically reliable or pipeline-specific.
minor comments (2)
  1. [Method] Notation for matrix pairs and the decoupled update rule should be introduced with explicit equations rather than descriptive prose.
  2. [Implementation] Clarify whether CD is applied only during pre-training or also during fine-tuning, and report any interaction with optimizer state.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity, add missing ablations, and enhance experimental reporting.

read point-by-point responses
  1. Referee: [Method] Method section (exact location unspecified in abstract): the colinearity metric, the precise definition of 'ordered matrix pairs,' and the mathematical form of the penalty term are not provided. Without these, it is impossible to verify whether the regularizer specifically targets outlier amplification or functions as generic regularization.

    Authors: We agree that the method details require more explicit presentation. Section 3 of the manuscript defines the colinearity metric as the normalized Frobenius inner product between ordered matrix pairs (e.g., query-key or value-projection matrices in attention blocks), specifies the pairs as consecutive linear transformations within each Transformer block, and gives the penalty as L_CD = lambda * sum ||A_i^T B_i||_F / (||A_i||_F ||B_i||_F) where the sum is over selected pairs. To address the concern, we will add a dedicated subsection with these equations, a derivation showing how penalizing alignment reduces outlier amplification (via reduced condition number in activation propagation), and a brief contrast to generic L2 regularization. revision: yes

  2. Referee: [Experiments] Experimental section: no ablation is reported on the CD strength coefficient, on the choice of matrix pairs, or against alternative regularizers applied to the same pairs. The central claim that penalizing colinearity (rather than simply suppressing activations) is the operative mechanism therefore remains untested.

    Authors: We acknowledge the value of these ablations. In the revision we will add experiments varying the CD coefficient lambda across {0.01, 0.1, 1.0} on ImageNet-1K, showing quantized accuracy peaks at lambda=0.1 while full-precision remains stable. We will also ablate matrix-pair choices (attention vs. FFN pairs) and include a comparison against activation-suppression baselines (e.g., L2 on activations) and alternative penalties (e.g., orthogonality) applied to the same pairs, demonstrating that colinearity decay yields superior quantized gains without full-precision degradation. These additions will directly test the claimed mechanism. revision: yes

  3. Referee: [Results] Results tables/figures: quantitative details on experiment scale, number of runs, variance, and exact baselines are absent from the abstract and not referenced in the provided text. This prevents assessment of whether the reported quantized-accuracy gains are statistically reliable or pipeline-specific.

    Authors: The full manuscript (Section 4 and Table/Figure captions) reports ImageNet-1K (1.28M images), COCO (118K images), 3 independent runs with standard deviations, and exact baselines (PTQ4ViT, QAT variants, SmoothQuant). We will revise the abstract to reference these details and add a short 'Experimental Setup' paragraph summarizing scale, runs, and variance. This will make statistical reliability and pipeline specificity immediately verifiable without altering the reported gains. revision: yes

Circularity Check

0 steps flagged

No circularity: CD is an independent additive regularizer whose downstream effects are measured empirically.

full rationale

The paper proposes Colinearity-Decay as a new decoupled structural regularizer on ordered matrix pairs inside Transformer blocks. It is added to the task loss without redefining any target metric, without fitting parameters to the final quantized accuracy, and without relying on self-citations for uniqueness or mechanism. The claimed benefit (improved quantized accuracy while preserving full-precision performance) is evaluated on held-out benchmarks (ImageNet-1K, COCO) after training; the regularizer itself is not constructed from those accuracy numbers. No load-bearing step reduces by definition or by self-citation chain to the reported gains. This is the normal case of an empirical regularization method whose validity rests on external measurement rather than internal re-labeling.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so the ledger is populated from stated claims only. The regularizer is introduced as a new structural penalty; its exact mathematical form, any scaling hyperparameter, and the precise definition of colinearity are not provided.

free parameters (1)
  • CD strength coefficient
    The abstract implies a tunable strength for the regularizer that must be chosen to balance full-precision and quantized performance.
axioms (1)
  • domain assumption Ordered matrix pairs exist inside Transformer blocks whose alignment can be measured and penalized without altering the forward pass.
    The method relies on this structural property of ViT blocks.
invented entities (1)
  • Colinearity-Decay regularizer no independent evidence
    purpose: Penalize detrimental cross-matrix alignment to mitigate extreme activations
    New training objective introduced in the paper.

pith-pipeline@v0.9.0 · 5477 in / 1256 out tokens · 14629 ms · 2026-05-09T14:22:19.350333+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 26 canonical work pages · 14 internal anchors

  1. [1]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  2. [2]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  3. [3]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  4. [4]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021

  5. [5]

    Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization

    Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. InEuropean conference on computer vision, pages 191–207. Springer, 2022

  6. [6]

    Fq-vit: Post-training quantization for fully quantized vision transformer.arXiv preprint arXiv:2111.13824,

    Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, and Shuchang Zhou. Fq-vit: Post-training quantization for fully quantized vision transformer.arXiv preprint arXiv:2111.13824, 2021

  7. [7]

    Repq-vit: Scale reparameterization for post-training quantization of vision transformers

    Zhikai Li, Junrui Xiao, Lianwei Yang, and Qingyi Gu. Repq-vit: Scale reparameterization for post-training quantization of vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17227–17236, 2023

  8. [8]

    Dopq-vit: Towards distribution-friendly and outlier-aware post-training quantization for vision transform- ers.arXiv preprint arXiv:2408.03291, 2024

    Lianwei Yang, Haisong Gong, Haokun Lin, Yichen Wu, Zhenan Sun, and Qingyi Gu. Dopq-vit: Towards distribution-friendly and outlier-aware post-training quantization for vision transform- ers.arXiv preprint arXiv:2408.03291, 2024

  9. [9]

    Towards accurate post-training quantization of vision transformers via error reduction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2676–2692, 2025

    Yunshan Zhong, You Huang, Jiawei Hu, Yuxin Zhang, and Rongrong Ji. Towards accurate post-training quantization of vision transformers via error reduction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2676–2692, 2025

  10. [10]

    Aphq- vit: Post-training quantization with average perturbation hessian based reconstruction for vision transformers

    Zhuguanyu Wu, Jiayi Zhang, Jiaxin Chen, Jinyang Guo, Di Huang, and Yunhong Wang. Aphq- vit: Post-training quantization with average perturbation hessian based reconstruction for vision transformers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9686–9695, 2025

  11. [11]

    Fima-q: Post- training quantization for vision transformers by fisher information matrix approximation

    Zhuguanyu Wu, Shihe Wang, Jiayi Zhang, Jiaxin Chen, and Yunhong Wang. Fima-q: Post- training quantization for vision transformers by fisher information matrix approximation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14891–14900, 2025

  12. [12]

    Learned step size quantization,

    Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dhar- mendra S Modha. Learned step size quantization.arXiv preprint arXiv:1902.08153, 2019

  13. [13]

    Q-vit: Accurate and fully quantized low-bit vision transformer.Advances in neural information processing systems, 35:34451–34463, 2022

    Yanjing Li, Sheng Xu, Baochang Zhang, Xianbin Cao, Peng Gao, and Guodong Guo. Q-vit: Accurate and fully quantized low-bit vision transformer.Advances in neural information processing systems, 35:34451–34463, 2022

  14. [14]

    Oscillation-free quantization for low-bit vision transformers

    Shih-Yang Liu, Zechun Liu, and Kwang-Ting Cheng. Oscillation-free quantization for low-bit vision transformers. InInternational conference on machine learning, pages 21813–21824. PMLR, 2023

  15. [15]

    Quantization vari- ation: A new perspective on training transformers with low-bit precision.arXiv preprint arXiv:2307.00331, 2023

    Xijie Huang, Zhiqiang Shen, Pingcheng Dong, and Kwang-Ting Cheng. Quantization vari- ation: A new perspective on training transformers with low-bit precision.arXiv preprint arXiv:2307.00331, 2023. 15

  16. [16]

    Gplq: A general, practical, and lightning qat method for vision transformers.arXiv preprint arXiv:2506.11784, 2025

    Guang Liang, Xinyao Liu, and Jianxin Wu. Gplq: A general, practical, and lightning qat method for vision transformers.arXiv preprint arXiv:2506.11784, 2025

  17. [17]

    From attention to activation: Unravelling the enigmas of large language models.arXiv preprint arXiv:2410.17174, 2024

    Prannay Kaul, Chengcheng Ma, Ismail Elezi, and Jiankang Deng. From attention to activation: Unravelling the enigmas of large language models.arXiv preprint arXiv:2410.17174, 2024

  18. [18]

    Z., and Liu, Z

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models.arXiv preprint arXiv:2402.17762, 2024

  19. [19]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  20. [20]

    Understand- ing and minimising outlier features in transformer training.Advances in Neural Information Processing Systems, 37:83786–83846, 2024

    Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, and Thomas Hofmann. Understand- ing and minimising outlier features in transformer training.Advances in Neural Information Processing Systems, 37:83786–83846, 2024

  21. [21]

    Outlier-safe pre-training for robust 4-bit quantization of large language models

    Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, and Jaewoo Kang. Outlier-safe pre-training for robust 4-bit quantization of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12582–12600, 2025

  22. [22]

    Mitigating the impact of outlier channels for language model quantization with activation regularization.arXiv preprint arXiv:2404.03605, 2024

    Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, and Yoon Kim. Mitigating the impact of outlier channels for language model quantization with activation regularization.arXiv preprint arXiv:2404.03605, 2024

  23. [23]

    Tweo: Transformers without extreme outliers enables fp8 training and quantization for dummies.arXiv preprint arXiv:2511.23225, 2025

    Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, and Jianxin Wu. Tweo: Transformers without extreme outliers enables fp8 training and quantization for dummies.arXiv preprint arXiv:2511.23225, 2025

  24. [24]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  25. [25]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  26. [26]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

  27. [27]

    Post training 4-bit quantization of convolutional networks for rapid-deployment.Advances in neural information processing systems, 32, 2019

    Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment.Advances in neural information processing systems, 32, 2019

  28. [28]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023

  29. [29]

    A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training.arXiv preprint arXiv: 2601.22966,

    Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, et al. A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training.arXiv preprint arXiv:2601.22966, 2026

  30. [30]

    Outlier suppression: Pushing the limit of low-bit transformer language models.Advances in Neural Information Processing Systems, 35:17402–17414, 2022

    Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models.Advances in Neural Information Processing Systems, 35:17402–17414, 2022

  31. [31]

    Beyond outliers: A study of optimizers under quantization.arXiv preprint arXiv:2509.23500, 2025

    Georgios Vlassis, Saleh Ashkboos, Alexandra V olkova, Torsten Hoefler, and Dan Alistarh. Beyond outliers: A study of optimizers under quantization.arXiv preprint arXiv:2509.23500, 2025

  32. [32]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  33. [33]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020. 16

  34. [34]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training.arXiv preprint arXiv:1710.03740, 2017

  35. [35]

    MMDetection: Open MMLab Detection Toolbox and Benchmark

    Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark.arXiv preprint arXiv:1906.07155, 2019

  36. [36]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. InProceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017

  37. [37]

    Nice: Noise injection and clamping estimation for neural network quantization.Mathematics, 9(17):2144, 2021

    Chaim Baskin, Evgenii Zheltonozhkii, Tal Rozen, Natan Liss, Yoav Chai, Eli Schwartz, Raja Giryes, Alexander M Bronstein, and Avi Mendelson. Nice: Noise injection and clamping estimation for neural network quantization.Mathematics, 9(17):2144, 2021

  38. [38]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine- grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013

  39. [39]

    Sun database: Large-scale scene recognition from abbey to zoo

    Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010

  40. [40]

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

  41. [41]

    Learning methods for generic object recognition with invariance to pose and lighting

    Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. InProceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., volume 2, pages II–104. IEEE, 2004

  42. [42]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  43. [43]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  44. [44]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  45. [45]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  46. [46]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  47. [47]

    Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  48. [48]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

  49. [49]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

  50. [50]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational conference on machine learning, pages 38087–38099. PMLR, 2023. 17