pith. sign in

arxiv: 2607.02158 · v1 · pith:ZFNVIKQInew · submitted 2026-07-02 · 💻 cs.CV

Efficient PEFT Methods with Adaptive Checkpointing for Vision Models and VLMs on Resource Constrained Consumer-GPUs

Pith reviewed 2026-07-03 15:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords parameter-efficient fine-tuningQLoRABitFitgradient checkpointingvision transformersMamba vision modelsenergy efficiencyconsumer GPUs
0
0 comments X

The pith

QLoRA and BitFit cut energy use 20-30% for vision model fine-tuning on 2GB GPUs at a 1-2% accuracy cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates five parameter-efficient fine-tuning methods and three gradient-checkpointing strategies on transformer and Mamba vision backbones under a strict 2GB VRAM limit. It reports that QLoRA and BitFit deliver 20-30% energy savings compared with full fine-tuning while losing only 1-2% accuracy on CIFAR-100 and DTD. A new memory-budget-aware adaptive checkpointing algorithm reduces peak memory by 43-79% at the price of 9-30% extra energy. The work also shows that DINOv2 reaches 0.917 accuracy on CIFAR-100 versus 0.897 for the fine-tuned models while using far less energy. These measurements matter because they identify concrete ways to run modern vision models on ordinary consumer hardware instead of high-end servers.

Core claim

Under a fixed 2 GB VRAM budget the paper claims that QLoRA and BitFit achieve 20-30% lower energy consumption than full fine-tuning at a 1-2% accuracy penalty on CIFAR-100 and DTD. The memory-budget-aware adaptive gradient-checkpointing algorithm reduces peak memory usage 43-79% while adding only 9-30% energy overhead. DINOv2 records 0.917 accuracy on CIFAR-100 against 0.897 for the best fine-tuned models and does so at a fraction of the energy cost. The comparisons cover ViT-Small, TinyViT, Vim-Small and MambaVision-T together with zero-shot and lightly evaluated foundation-model baselines.

What carries the argument

memory-budget-aware adaptive gradient-checkpointing algorithm that dynamically selects checkpoint locations to stay inside a target VRAM limit

If this is right

  • QLoRA and BitFit become practical defaults for energy-constrained fine-tuning of both transformer and Mamba vision backbones.
  • The adaptive checkpointing method makes fine-tuning feasible inside 2 GB VRAM envelopes that would otherwise cause out-of-memory failures.
  • DINOv2 supplies a lower-energy alternative that can exceed the accuracy of PEFT-tuned models on CIFAR-100.
  • Extended NetScore metrics that incorporate deployment-aware memory and energy terms become usable for ranking methods under consumer-GPU constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptive checkpointing logic could be ported to other memory-constrained training regimes such as on-device personalization of mobile vision models.
  • If the energy advantage of DINOv2 persists across additional datasets it would reduce the incentive to fine-tune at all in low-resource settings.
  • Mamba-based vision backbones appear to respond to PEFT and checkpointing in the same qualitative way as transformers, suggesting the techniques are backbone-agnostic within the tested size range.

Load-bearing premise

The accuracy, energy and memory numbers measured on ViT-Small, TinyViT, Vim-Small, MambaVision-T with CIFAR-100 and DTD under a fixed 2 GB VRAM budget will generalize to other models, datasets and real consumer-GPU deployments.

What would settle it

Re-running the identical protocol on ImageNet or with a 4 GB VRAM budget and observing that QLoRA and the adaptive checkpointing no longer produce the reported energy and memory gains.

Figures

Figures reproduced from arXiv: 2607.02158 by Altay Toktassyn, Jurn-Gyu Park.

Figure 1
Figure 1. Figure 1: Methodology overview. Two evaluation paradigms ( [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Fine-tuning time (solid, left axis) and accuracy (dashed, right axis) on [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Peak VRAM (solid, left axis) and accuracy (dashed, right axis) on [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Because this breaks the uniform-growth assumption [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 4
Figure 4. Figure 4: Layer-by-layer memory profile for TinyViT / CIFAR-100 / Full-FT at epoch 10. Left: no checkpointing at batch size 32 (peak 4855 MiB). Center: [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
read the original abstract

Modern pretrained vision models achieve strong accuracy but demand substantial GPU memory for fine-tuning, making edge deployment impractical. This paper compares five parameter-efficient fine-tuning (PEFT) methods (Full FT, LoRA, AdaLoRA, QLoRA, BitFit) on Transformers- (ViT-Small, TinyViT) and Mamba-based vision backbones (Vim-Small, MambaVision-T) under an on-device VRAM budget (e.g., 2 GB), together with three gradient-checkpointing strategies (none, static, and a proposed memory-budget-aware adaptive algorithm); and we evaluate three families of foundation-model baselines: zero-shot contrastive vision language models (OpenCLIP, SigLIP), self-supervised vision backbones with lightweight evaluation protocols (DINOv2), and autoregressive VLMs for prompt-based classification (PaliGemma, MobileVLM, SmolVLM). Experiments on CIFAR-100 and DTD report accuracy, training time, energy, and the NetScore family of multi-objective metrics, which we extend with two deployment-aware variants. QLoRA and BitFit cut energy 20-30% at a 1-2% accuracy cost; the adaptive algorithm reduces peak memory 43-79% with 9-30% energy overhead. DINOv2 surpasses fine-tuned models on CIFAR-100 (0.917 vs. 0.897) at a fraction of the energy, while small autoregressive VLMs remain uncompetitive.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. This paper presents an empirical comparison of parameter-efficient fine-tuning (PEFT) methods (Full FT, LoRA, AdaLoRA, QLoRA, BitFit) and gradient checkpointing strategies (none, static, adaptive) for vision Transformers and Mamba models under a constrained 2GB VRAM budget. It evaluates these on CIFAR-100 and DTD datasets, reporting accuracy, energy consumption, training time, and extended NetScore metrics, while also benchmarking against zero-shot VLMs and self-supervised models like DINOv2. Key claims include 20-30% energy savings with QLoRA/BitFit at minor accuracy cost, 43-79% peak memory reduction via adaptive checkpointing, and superior performance of DINOv2 on CIFAR-100 with lower energy.

Significance. If the empirical results hold under broader conditions, this work provides practical guidance for energy-efficient fine-tuning of vision models on consumer GPUs with limited VRAM. The extension of the NetScore family with deployment-aware variants and the direct comparison to foundation-model baselines (including DINOv2) are strengths that support multi-objective evaluation.

major comments (2)
  1. [Abstract] Abstract: the abstract states specific quantitative outcomes (QLoRA/BitFit energy savings of 20-30%, adaptive checkpointing memory reduction of 43-79%) but the manuscript provides no full methods description, error bars, statistical details, or raw data to allow verification of support for these central claims.
  2. [Experimental evaluation] Experimental evaluation: the claims rest on a narrow setup (ViT-Small/TinyViT/Vim-Small/MambaVision-T, CIFAR-100/DTD, fixed 2GB VRAM) with no scaling curves, resolution ablations, or cross-hardware tests; this undermines generalization to broader consumer-GPU deployment scenarios asserted in the introduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the abstract states specific quantitative outcomes (QLoRA/BitFit energy savings of 20-30%, adaptive checkpointing memory reduction of 43-79%) but the manuscript provides no full methods description, error bars, statistical details, or raw data to allow verification of support for these central claims.

    Authors: Section 3 and the appendix fully specify the models, PEFT ranks, quantization bits, adaptive checkpointing logic (including the memory-budget threshold and recomputation policy), training hyperparameters, and NetScore extensions. We agree that error bars and run counts are absent. In revision we will report mean ± std over three random seeds for the headline energy and memory figures and state the number of trials explicitly. Code and per-run logs will be released with the camera-ready version. revision: partial

  2. Referee: [Experimental evaluation] Experimental evaluation: the claims rest on a narrow setup (ViT-Small/TinyViT/Vim-Small/MambaVision-T, CIFAR-100/DTD, fixed 2GB VRAM) with no scaling curves, resolution ablations, or cross-hardware tests; this undermines generalization to broader consumer-GPU deployment scenarios asserted in the introduction.

    Authors: The introduction explicitly positions the work as addressing the 2 GB VRAM consumer-GPU regime; the chosen backbones and datasets are the largest that fit this budget while remaining representative. We do not assert results outside this constraint. We will add a limitations subsection that reiterates the scoped setting and notes the absence of scaling or cross-device experiments as future work. revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical comparisons with no derivations

full rationale

The manuscript contains no equations, derivations, or predictive claims that could reduce to inputs by construction. All reported results (accuracy, energy, memory deltas for QLoRA/BitFit/adaptive checkpointing) are direct experimental measurements on fixed models (ViT-Small, TinyViT, Vim-Small, MambaVision-T), datasets (CIFAR-100, DTD), and a 2 GB VRAM budget. No fitted parameters are relabeled as predictions, no self-citation chains support load-bearing premises, and no ansatzes or uniqueness theorems are invoked. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical derivations, fitted parameters, axioms, or new postulated entities are described. The work is an empirical comparison study.

pith-pipeline@v0.9.1-grok · 5818 in / 1168 out tokens · 24682 ms · 2026-07-03T15:18:56.771022+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 24 canonical work pages · 17 internal anchors

  1. [1]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

  2. [2]

    Tinyvit: Fast pretraining distillation for small vision transformers,

    K. Wu, J. Zhang, H. Peng, M. Liu, B. Xiao, J. Fu, and L. Yuan, “Tinyvit: Fast pretraining distillation for small vision transformers,” inEuropean Conference on Computer Vision, 2022, pp. 68–85

  3. [3]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,”arXiv preprint arXiv:2312.00752, 2023

  4. [4]

    Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,”arXiv preprint arXiv:2401.09417, 2024

  5. [5]

    Mambavision: A hybrid mamba- transformer vision backbone,

    A. Hatamizadeh and J. Kautz, “Mambavision: A hybrid mamba- transformer vision backbone,”arXiv preprint arXiv:2407.08083, 2024

  6. [6]

    Jetson nano developer kit: Module data sheet,

    NVIDIA Corporation, “Jetson nano developer kit: Module data sheet,” https://developer.nvidia.com/embedded/jetson-nano-developer-kit, 2020, 5–10 W power profiles; Accessed 2026

  7. [7]

    GeForce GTX 1650 graphics card: Specifications,

    ——, “GeForce GTX 1650 graphics card: Specifications,” https://www. nvidia.com/en-us/geforce/graphics-cards/16-series/geforce-gtx-1650/, 2019, 75 W TDP; Accessed 2026

  8. [8]

    Carbon Emissions and Large Neural Network Training

    D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon emissions and large neural network training,”arXiv preprint arXiv:2104.10350, 2021

  9. [9]

    NVIDIA H100 tensor core gpu datasheet,

    NVIDIA Corporation, “NVIDIA H100 tensor core gpu datasheet,” https://resources.nvidia.com/en-us-tensor-core/ nvidia-tensor-core-gpu-datasheet, 2023, up to 700 W TDP (SXM5); Accessed 2026

  10. [10]

    Green ai,

    R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, “Green ai,” Communications of the ACM, vol. 63, no. 12, pp. 54–63, 2020

  11. [11]

    LoRA: Low-Rank Adaptation of Large Language Models

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021

  12. [12]

    AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

    Q. Zhang, M. Chen, A. Bukharin, P. He, Y . Cheng, W. Chen, and T. Zhao, “Adalora: Adaptive budget allocation for parameter-efficient fine-tuning,”arXiv preprint arXiv:2303.10512, 2023

  13. [13]

    Qlora: Efficient finetuning of quantized language models,

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized language models,”Advances in Neural Information Processing Systems, vol. 36, 2023

  14. [14]

    Bitfit: Simple parameter- efficient fine-tuning for transformer-based masked language-models,

    E. Ben-Zaken, Y . Goldberg, and S. Ravfogel, “Bitfit: Simple parameter- efficient fine-tuning for transformer-based masked language-models,” arXiv preprint arXiv:2106.10199, 2021

  15. [15]

    Training Deep Nets with Sublinear Memory Cost

    T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,”arXiv preprint arXiv:1604.06174, 2016

  16. [16]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems, vol. 30, 2017

  17. [17]

    Training data-efficient image transformers & distillation through attention,

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” inInternational Conference on Machine Learning, 2021, pp. 10 347–10 357

  18. [18]

    Swin Transformer: Hierarchical vision transformer using shifted win- dows,

    Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted win- dows,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10 012–10 022

  19. [19]

    Efficiently modeling long sequences with structured state spaces,

    A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” inInternational Conference on Learning Representations, 2022. PREPRINT, 2026 20

  20. [20]

    Parameter-efficient transfer learning for NLP,

    N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” inProceedings of the 36th International Conference on Machine Learning (ICML), 2019, pp. 2790–2799

  21. [21]

    AdapterHub: A framework for adapting transformers,

    J. Pfeiffer, A. Rücklé, C. Poth, A. Kamath, I. Vuli ´c, S. Ruder, K. Cho, and I. Gurevych, “AdapterHub: A framework for adapting transformers,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 46–54

  22. [22]

    Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    Z. Han, C. Gao, J. Liu, J. Zhang, and S. Q. Zhang, “Parameter-efficient fine-tuning for large models: A comprehensive survey,”arXiv preprint arXiv:2403.14608, 2024

  23. [23]

    Dynamic tensor rematerialization,

    M. Kirisame, S. Lyubomirsky, A. Haan, J. Brennan, M. He, J. Roesch, T. Chen, and Z. Tatlock, “Dynamic tensor rematerialization,”arXiv preprint arXiv:2006.09616, 2020

  24. [24]

    Checkmate: Breaking the memory wall with optimal tensor rematerialization,

    P. Jain, A. Jain, A. Nrusimha, A. Gholami, P. Abbeel, J. Gonzalez, K. Keutzer, and I. Stoica, “Checkmate: Breaking the memory wall with optimal tensor rematerialization,” inProceedings of Machine Learning and Systems, vol. 2, 2020, pp. 497–511

  25. [25]

    Learning transferable visual models from natural language supervi- sion,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” inInternational Conference on Machine Learning, 2021, pp. 8748–8763

  26. [26]

    Reproducible scaling laws for contrastive language-image learning,

    M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,”arXiv preprint arXiv:2212.07143, 2023

  27. [27]

    Sigmoid Loss for Language Image Pre-Training

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,”arXiv preprint arXiv:2303.15343, 2023

  28. [28]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning robust visual features without supervision,”arXiv preprint arXiv:2304.07193, 2023

  29. [29]

    Emerging properties in self-supervised vision transformers,

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9650–9660

  30. [30]

    PaliGemma: A versatile 3B VLM for transfer

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello et al., “Paligemma: A versatile 3b vlm for transfer,”arXiv preprint arXiv:2407.07726, 2024

  31. [31]

    MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

    X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y . Yang, X. Sun, Y . Hu, X. Lin, B. Zhanget al., “Mobilevlm v2: Faster and stronger baseline for vision language model,”arXiv preprint arXiv:2402.03766, 2024

  32. [32]

    SmolVLM: Redefining small and efficient multimodal models

    L. B. Allal, A. Lozhkov, O. Penber, T. Wolf, and R. Lacroix, “Smolvlm: Redefining small and efficient multimodal models,”arXiv preprint arXiv:2504.05299, 2025

  33. [33]

    Energy and policy consid- erations for deep learning in nlp,

    E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy consid- erations for deep learning in nlp,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3645–3650

  34. [34]

    Towards the systematic reporting of the energy and carbon footprints of machine learning,

    P. Henderson, J. Hu, J. Romoff, E. Brunskill, D. Jurafsky, and J. Pineau, “Towards the systematic reporting of the energy and carbon footprints of machine learning,”Journal of Machine Learning Research, vol. 21, no. 248, pp. 1–43, 2020

  35. [35]

    The com- putational limits of deep learning,

    N. C. Thompson, K. Greenewald, K. Lee, and G. F. Manso, “The com- putational limits of deep learning,”arXiv preprint arXiv:2007.05558, 2020

  36. [36]

    NetScore: Towards Universal Metrics for Large-scale Performance Analysis of Deep Neural Networks for Practical On-Device Edge Usage

    A. Wong, “NetScore: Towards universal metrics for large-scale perfor- mance analysis of deep neural networks for practical on-device edge usage,”arXiv preprint arXiv:1806.05512, 2018

  37. [37]

    AttoNets: Compact and Efficient Deep Neural Networks for the Edge via Human-Machine Collaborative Design

    A. Wong, M. Famouri, M. Pavlova, and M. J. Shafiee, “AttoNets: Compact and efficient deep neural networks for the edge via human- machine collaborative design,”arXiv preprint arXiv:1903.07209, 2019

  38. [38]

    SLNet: A super-lightweight geometry-adaptive network for 3D point cloud recognition,

    Anonymous, “SLNet: A super-lightweight geometry-adaptive network for 3D point cloud recognition,”arXiv preprint arXiv:2603.07454, 2026

  39. [39]

    How green is continual learning, really? analyzing the energy consumption in continual training of vision foundation models,

    C. González, T. Deschamps, M. Cord, and P. Pérez, “How green is continual learning, really? analyzing the energy consumption in continual training of vision foundation models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024

  40. [40]

    An attention-based feature memory design for energy-efficient continual learning,

    Y . Shi and J. G. Park, “An attention-based feature memory design for energy-efficient continual learning,”IEEE Access, 2024

  41. [41]

    PyTorch: An imperative style, high-performance deep learning library,

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antigaet al., “PyTorch: An imperative style, high-performance deep learning library,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019

  42. [42]

    Pytorch image models,

    R. Wightman, “Pytorch image models,” https://github.com/rwightman/ pytorch-image-models, 2019, accessed: 2024

  43. [43]

    Peft: State-of-the-art parameter-efficient fine-tuning methods,

    S. Mangrulkar, S. Gugger, L. Debut, Y . Belkada, and S. Paul, “Peft: State-of-the-art parameter-efficient fine-tuning methods,” https://github. com/huggingface/peft, 2023, accessed: 2024

  44. [44]

    8-bit optimizers via block-wise quantization,

    T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, “8-bit optimizers via block-wise quantization,” inInternational Conference on Learning Representations, 2022

  45. [45]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inInternational Conference on Learning Representations, 2019

  46. [46]

    Learning multiple layers of features from tiny images,

    A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009

  47. [47]

    ImageNet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255

  48. [48]

    Describing textures in the wild,

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3606– 3613

  49. [49]

    Why are visually-grounded language models bad at image classification?

    Y . Zhang, A. Unell, X. Wang, D. Ghosh, Y . Su, L. Schmidt, and S. Yeung-Levy, “Why are visually-grounded language models bad at image classification?” inAdvances in Neural Information Processing Systems (NeurIPS), 2024

  50. [50]

    Rethinking VLMs and LLMs for image classification,

    A. Epstein, T. Reusch, O. Caelen, and S. Maréchal, “Rethinking VLMs and LLMs for image classification,”Scientific Reports, 2025, arXiv preprint arXiv:2410.14690

  51. [51]

    Neural Machine Translation of Rare Words with Subword Units

    R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” inProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016, arXiv preprint arXiv:1508.07909

  52. [52]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

    T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text process- ing,” inProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, 2018, arXiv preprint arXiv:1808.06226

  53. [53]

    What does a platypus look like? Generating customized prompts for zero-shot image classification,

    S. Pratt, I. Covert, R. Liu, and A. Farhadi, “What does a platypus look like? Generating customized prompts for zero-shot image classification,” inIEEE/CVF International Conference on Computer Vision (ICCV), 2023

  54. [54]

    Visual classification via description from large language models,

    S. Menon and C. V ondrick, “Visual classification via description from large language models,” inInternational Conference on Learning Rep- resentations (ICLR), 2023