arxiv: 2504.14988 · v4 · submitted 2025-04-21 · 💻 cs.CV

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Pith reviewed 2026-05-22 18:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords large vision-language models · fine-grained evaluation · benchmark · LVLMs · semantic recognition · image tasks · multimodal evaluation · model limitations

The pith

A new benchmark of 1.01 million questions shows how training paradigms and modality alignment shape LVLM performance on fine-grained image tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FG-BMK, a benchmark with 1.01 million questions and 0.33 million images, to evaluate large vision-language models on detailed image tasks that prior studies have largely overlooked. It tests twelve representative LVLMs and VLMs through both human-oriented and machine-oriented lenses to assess semantic recognition and fine-grained feature representation. Experiments reveal that training paradigms, the quality of vision-language alignment, sensitivity to image perturbations, and skill at fine-grained category reasoning all affect outcomes. The work identifies current model limitations and points toward better approaches for data and model design.

Core claim

By constructing FG-BMK with 1.01 million questions and 0.33 million images, the authors systematically evaluate LVLMs on fine-grained tasks from human and machine perspectives, demonstrating that training paradigms exert strong influence on performance, modality alignment determines effective multimodal integration, models remain highly susceptible to perturbations, and fine-grained category reasoning constitutes a persistent weakness.

What carries the argument

FG-BMK benchmark, a dataset of 1.01 million questions paired with 0.33 million images that probes semantic recognition and fine-grained feature representation capabilities through human-oriented and machine-oriented evaluations.

If this is right

  • Certain training paradigms produce measurably higher performance on fine-grained image tasks than others.
  • Better alignment between visual and textual modalities directly improves results on semantic recognition and feature representation.
  • Current LVLMs exhibit clear drops in accuracy under small image perturbations.
  • Fine-grained category reasoning remains a bottleneck that limits overall task success.
  • Benchmark results can directly inform choices in future data construction and model architecture decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Developers could test whether adding targeted fine-grained category examples during training closes the performance gaps identified here.
  • The benchmark's scale suggests that similar large-scale evaluations might uncover comparable patterns in related multimodal tasks such as visual question answering.
  • Models that improve perturbation robustness on FG-BMK may also show gains on real-world image variations not included in the current set.

Load-bearing premise

The 1.01 million questions and 0.33 million images in FG-BMK, together with the selected human and machine evaluation perspectives, capture the full range of fine-grained image task capabilities without selection biases or annotation artifacts.

What would settle it

An independent test set of fine-grained image questions and images, constructed separately from FG-BMK, on which top-performing models from the benchmark score significantly lower would show that the original benchmark does not fully measure the targeted capabilities.
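
One way to make "significantly lower" concrete is a one-sided two-proportion z-test comparing a model's accuracy on FG-BMK with its accuracy on the independent set. The sketch below is illustrative only: the function name and the counts in the example are hypothetical, not taken from the paper, and the test assumes the questions in each set are independent samples.

```python
import math


def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """One-sided z-test for H1: accuracy on set A > accuracy on set B.

    Returns (z, p_value). Hypothetical helper, not from the paper.
    """
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled accuracy under the null hypothesis of equal accuracy.
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if se == 0.0:
        raise ValueError("pooled accuracy is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Made-up counts: 800/1000 correct on FG-BMK vs 700/1000 on the new set.
z, p = two_proportion_z_test(800, 1000, 700, 1000)
print(f"z = {z:.3f}, one-sided p = {p:.2e}")
```

A small p-value here would only show that accuracy differs between the two sets; whether that reflects a gap in the benchmark or a harder independent set is a separate judgment.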

Figures

Figures reproduced from arXiv: 2504.14988 by Hong-Tao Yu, Serge Belongie, Xiu-Shen Wei, Yuxin Peng.

Figure 1: Our proposed benchmark: The human-oriented evaluation tests the model’s ability to handle …
Figure 2: Results of InternVL3 [55] on true/false and multiple-choice questions across different levels of granularity on the CUB-200-2011 [41] dataset. The x-axis denotes the granularity of the recognition questions. LVLMs struggle to distinguish excessively fine-grained categories. As shown in …
Figure 3: Comparison of the original and fine-tuned LLaVA models on occurrence-balanced fine-grained bird categories. True/false question accuracy for each category is ranked, with blue dots representing the original model and yellow dots the fine-tuned model. To further validate this hypothesis, we examined the occurrence frequency of fine-grained categories in the LVLM’s training data. Interestingly, we found th…
Figure 5: Classification results of LVLM visual features on twelve fine-grained datasets. Different colors represent different models.
Figure 6: Nemenyi statistical test results for fine-grained retrieval. Black horizontal lines indicate the critical distance (CD), grouping models with no significant performance differences.
Figure 8: Classification results of LVLM visual features on fine-grained datasets. “Single” denotes …
Figure 10: Question templates for each task in human-oriented evaluation.
Figure 11: Results of GPT-4o [1], Gemini-2.0-flash [13], Qwen2.5-VL [3], LLaVA [24] and InternVL [8] on true/false and multiple-choice questions across different levels of granularity on the CUB-200-2011 [41] and iNat2021 [40] datasets. The x-axis denotes the granularity of the recognition questions.
Figure 12: Knowledge bias estimation results of two closed-source models. True/false question …
Figure 13: Comparison of the original and fine-tuned Qwen2.5-VL [ …
Figure 14: Qualitative analysis of granularity inconsistencies in LVLMs’ alignment data and a …
Figure 15: Classification results of LVLM visual features on fine-grained datasets. “Single” denotes …
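
The critical distance (CD) mentioned in the Figure 6 caption is conventionally computed as CD = q_alpha * sqrt(k(k+1) / (6N)) for k models ranked over N datasets. The sketch below assumes that standard formulation; the paper's exact settings are not given on this page, so the model count, dataset count, and average ranks in the example are hypothetical. The value q_alpha = 3.164 is the commonly tabulated Nemenyi constant for k = 10 at alpha = 0.05.

```python
import math


def nemenyi_critical_distance(q_alpha, n_models, n_datasets):
    """Critical distance for the Nemenyi post-hoc test.

    q_alpha is the studentized-range-based constant for the chosen
    significance level and number of models (looked up from a table).
    """
    return q_alpha * math.sqrt(n_models * (n_models + 1) / (6.0 * n_datasets))


def significantly_different(avg_rank_a, avg_rank_b, cd):
    """Two models differ if their average ranks are more than CD apart."""
    return abs(avg_rank_a - avg_rank_b) > cd


# Hypothetical setting: 10 models compared on 12 datasets at alpha = 0.05.
cd = nemenyi_critical_distance(q_alpha=3.164, n_models=10, n_datasets=12)
print(f"critical distance = {cd:.3f}")
# Hypothetical average ranks for two models.
print(significantly_different(2.1, 7.4, cd))
```

Models whose average ranks fall within one CD of each other are the ones a horizontal bar would group together in such a diagram.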
Original abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces FG-BMK, a large-scale benchmark comprising 1.01 million questions and 0.33 million images for evaluating large vision-language models on fine-grained image tasks. It performs systematic experiments on twelve representative LVLMs/VLMs from both human-oriented and machine-oriented perspectives, focusing on semantic recognition and fine-grained feature representation. The work reports key findings on the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance, while highlighting limitations of current models and offering guidance for future data construction and model design. The code is released open-source.

Significance. If the benchmark construction is shown to be free of selection biases and annotation artifacts, the evaluation would provide useful empirical insights into LVLMs' capabilities on fine-grained tasks central to computer vision. The reported influences could inform improvements in training and alignment strategies. The open-source code at https://github.com/SEU-VIPGroup/FG-BMK is a clear strength that supports reproducibility.

major comments (1)
  1. §3 (FG-BMK Construction): The manuscript provides insufficient detail on the source datasets for the 0.33 million images, the question-generation process for the 1.01 million questions (e.g., templates versus LLM-assisted), and the human verification protocol including quantitative inter-annotator agreement. These omissions are load-bearing for the central claim that the experiments reliably uncover influences of training paradigms, modality alignment, perturbation susceptibility, and fine-grained reasoning, because unaddressed curation biases could produce spurious correlations between model properties and scores.
minor comments (1)
  1. Abstract: The summary of contributions could explicitly mention the number of evaluated models and the open-source release to better foreground the scale and reproducibility aspects.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive feedback on benchmark construction details. We agree that greater transparency in Section 3 is warranted to support the reliability of our empirical findings and will revise the manuscript to address this point.

Point-by-point responses
  1. Referee: §3 (FG-BMK Construction): The manuscript provides insufficient detail on the source datasets for the 0.33 million images, the question-generation process for the 1.01 million questions (e.g., templates versus LLM-assisted), and the human verification protocol including quantitative inter-annotator agreement. These omissions are load-bearing for the central claim that the experiments reliably uncover influences of training paradigms, modality alignment, perturbation susceptibility, and fine-grained reasoning, because unaddressed curation biases could produce spurious correlations between model properties and scores.

    Authors: We acknowledge that the current description in Section 3, while outlining the overall scale and high-level process, does not provide the granular information needed to fully evaluate potential curation biases. In the revised manuscript we will expand this section with: (1) an explicit enumeration of all source datasets contributing to the 0.33 million images together with their original task formulations and any filtering criteria applied; (2) a precise account of the question-generation pipeline, including the proportion of questions produced via hand-crafted templates versus LLM-assisted generation and the exact prompting strategies used; and (3) the complete human verification protocol, including the number of annotators, the annotation interface, and quantitative inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement percentages). These additions will allow readers to assess selection biases directly and thereby strengthen the link between model properties and observed performance differences. revision: yes
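
For readers unfamiliar with the agreement statistic named in the rebuttal, the following is a minimal sketch of Fleiss' kappa in plain Python. It assumes every question is labeled by the same number of annotators; the function name and the toy rating table are hypothetical and do not reflect the authors' actual verification protocol or data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: fraction of agreeing annotator pairs per item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_categories = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_categories)

    if p_expected == 1.0:
        return float("nan")  # only one category ever used; kappa undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy table: 4 questions, 3 annotators, categories (valid, invalid).
ratings = [[3, 0], [2, 1], [0, 3], [3, 0]]
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")
```

Pairwise agreement percentages, the alternative the rebuttal mentions, correspond to the observed-agreement term alone, without the chance correction.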

Circularity Check

0 steps flagged

Empirical benchmark evaluation is self-contained with no derivation chain

Full rationale

This paper introduces the FG-BMK benchmark (1.01M questions, 0.33M images) and reports experimental results on twelve external LVLMs/VLMs. No equations, fitted parameters, or first-principles derivations appear in the provided text or abstract. Central claims rest on observed performance differences across training paradigms, modality alignment, and perturbation susceptibility rather than any self-referential construction or self-citation load-bearing step. The work is therefore an external evaluation against benchmarks and models, qualifying for the default non-circularity outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical benchmarking study. No mathematical derivations, new physical entities, or fitted parameters are introduced; the claims rest on the assumption that the benchmark questions and chosen models are representative.

pith-pipeline@v0.9.0 · 5730 in / 1093 out tokens · 43732 ms · 2026-05-22T18:20:56.103508+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

    cs.CV 2026-05 conditional novelty 7.0

    FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.

  2. FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

    cs.CV 2026-05 unverdicted novelty 7.0

    FIKA-Bench is a leakage-aware benchmark of 311 instances showing that even the best large multimodal models and tool-equipped agents reach only 25.1% accuracy on fine-grained recognition questions that require externa...

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  4. [4]

    Products-10K: A large-scale product recognition dataset

    Yalong Bai, Yuxiang Chen, Wei Yu, Linfang Wang, and Wei Zhang. Products-10K: A large-scale product recognition dataset. arXiv preprint arXiv:2008.10545, 2020

  5. [5]

    Context-aware attentional pooling (cap) for fine-grained visual classification

    Ardhendu Behera, Zachary Wharton, Pradeep RPG Hewage, and Asish Bera. Context-aware attentional pooling (cap) for fine-grained visual classification. In Proc. Conf. AAAI, number 2, pages 929–937, 2021

  6. [6]

    SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization

    Asish Bera, Zachary Wharton, Yonghuai Liu, Nik Bessis, and Ardhendu Behera. SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization. IEEE Trans. Image Process., 31:6017–6031, 2022

  7. [7]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Proc. Eur. Conf. Comp. Vis., pages 446–461, 2014

  8. [8]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 24185–24198, 2024

  9. [9]

    SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis

    Roxana Daneshjou, Mert Yuksekgonul, Zhuo Ran Cai, Roberto Novoa, and James Y Zou. SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis. In Advances in Neural Inf. Process. Syst., pages 18157–18167, 2022

  10. [10]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 248–255, 2009

  11. [11]

    MetaFormer: A unified meta framework for fine-grained recognition

    Qishuai Diao, Yi Jiang, Bin Wen, Jia Sun, and Zehuan Yuan. MetaFormer: A unified meta framework for fine-grained recognition. arXiv preprint arXiv:2203.02751, 2022

  12. [12]

    African or european swallow? benchmarking large vision-language models for fine-grained object classification

    Gregor Geigle, Radu Timofte, and Goran Glavaš. African or european swallow? benchmarking large vision-language models for fine-grained object classification. arXiv preprint arXiv:2406.14496, 2024

  13. [13]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Google Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  14. [14]

    VegFru: A domain-specific dataset for fine-grained visual categorization

    Saihui Hou, Yushan Feng, and Zilei Wang. VegFru: A domain-specific dataset for fine-grained visual categorization. In Proc. IEEE Int. Conf. Comp. Vis., pages 541–549, 2017

  15. [15]

    GQA: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 6700–6709, 2019

  16. [16]

    FineCLIP: Self-distilled region-based clip for better fine-grained understanding

    Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Guoxing Yang, Wei Wei, Huiwen Zhao, and Zhiwu Lu. FineCLIP: Self-distilled region-based clip for better fine-grained understanding. In Advances in Neural Inf. Process. Syst., pages 27896–27918, 2024

  17. [17]

    Animal-Bench: Benchmarking multimodal video models for animal-centric video understanding

    Yinuo Jing, Ruxu Zhang, Kongming Liang, Yongxiang Li, Zhongjiang He, Zhanyu Ma, and Jun Guo. Animal-Bench: Benchmarking multimodal video models for animal-centric video understanding. In Advances in Neural Inf. Process. Syst., pages 23457–23469, 2024

  18. [18]

    Novel dataset for fine-grained image categorization

    A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In CVPR Workshop on Fine-Grained Visual Categorization, pages 806–813, 2011

  19. [19]

    3D object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proc. IEEE Int. Conf. Comp. Vis., pages 554–561, 2013

  20. [20]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009

  21. [21]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. Int. Conf. Mach. Learn., pages 19730–19742, 2023

  22. [22]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proc. Int. Conf. Mach. Learn., pages 12888–12900, 2022

  23. [23]

    Progressive multi-task anti-noise learning and distilling frameworks for fine-grained vehicle recognition

    Dichao Liu. Progressive multi-task anti-noise learning and distilling frameworks for fine-grained vehicle recognition. IEEE Trans. Intell. Transp. Syst., 25(9):10667–10678, 2024

  24. [24]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 26296–26306, 2024

  25. [25]

    OCRBench: On the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12), 2024

  26. [26]

    DeepFashion: Powering robust clothes recognition and retrieval with rich annotations

    Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1096–1104, 2016

  27. [27]

    MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proc. Int. Conf. Learn. Representations, 2024

  28. [28]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In Proc. Int. Conf. Learn. Representations, 2018

  29. [29]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

  30. [30]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Proc. Conf. Association for Computational Linguistics, pages 2263–2279, 2022

  31. [31]

    DocVQA: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for vqa on document images. In Proc. Winter Conf. Applications of Comp. Vis., pages 2200–2209, 2021

  32. [32]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Proc. IEEE Int. Conf. Comp. Vis., pages 722–729, 2008

  33. [33]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Hervé Jegou, Julien Mairal, Patrick Laba...

  34. [34]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proc. Int. Conf. Mach. Learn., pages 8748–8763, 2021

  35. [35]

    SEMICON: A learning-to-hash solution for large-scale fine-grained image retrieval

    Yang Shen, Xuhao Sun, Xiu-Shen Wei, Qing-Yuan Jiang, and Jian Yang. SEMICON: A learning-to-hash solution for large-scale fine-grained image retrieval. In Proc. Eur. Conf. Comp. Vis., pages 531–548, 2022

  36. [36]

    Interweaving insights: High-order feature interaction for fine-grained visual recognition

    Arindam Sikdar, Yonghuai Liu, Siddhardha Kedarisetty, Yitian Zhao, Amr Ahmed, and Ardhendu Behera. Interweaving insights: High-order feature interaction for fine-grained visual recognition. In Proc. IEEE Int. Conf. Comp. Vis., pages 1755–1779, 2024

  37. [37]

    Bi-modal progressive mask attention for fine-grained recognition

    Kaitao Song, Xiu-Shen Wei, Xiangbo Shu, Ren-Jie Song, and Jianfeng Lu. Bi-modal progressive mask attention for fine-grained recognition. IEEE Trans. Image Process., 29:7006–7018, 2020

  38. [38]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023

  39. [39]

    Bottled wine defect detection data set, 2021

    Tianchi. Bottled wine defect detection data set, 2021

  40. [40]

    Benchmarking representation learning for natural world image collections

    Grant Van Horn, Elijah Cole, Sara Beery, Kimberly Wilber, Serge Belongie, and Oisin Mac Aodha. Benchmarking representation learning for natural world image collections. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 12884–12893, 2021

  41. [41]

    The Caltech-UCSD birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD birds-200-2011 dataset. Technical report, California Institute of Technology, 2011

  42. [42]

    Image as a foreign language: BEiT pretraining for vision and vision-language tasks

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 19175–19186, 2023

  43. [43]

    RPC: A large-scale and fine-grained retail product checkout dataset

    Xiu-Shen Wei, Quan Cui, Lei Yang, Peng Wang, Lingqiao Liu, and Jian Yang. RPC: A large-scale and fine-grained retail product checkout dataset. Science China. Information Sciences, 65(9):197101, 2022

  44. [44]

    Fine-grained image analysis with deep learning: A survey

    Xiu-Shen Wei, Yi-Zhe Song, Oisin Mac Aodha, Jianxin Wu, Yuxin Peng, Jinhui Tang, Jian Yang, and Serge Belongie. Fine-grained image analysis with deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell., (12):8927–8948, 2022

  45. [45]

    MECOM: A meta-completion network for fine-grained recognition with incomplete multi-modalities

    Xiu-Shen Wei, Hong-Tao Yu, Anqi Xu, Faen Zhang, and Yuxin Peng. MECOM: A meta-completion network for fine-grained recognition with incomplete multi-modalities. IEEE Trans. Image Process., 33:3456–3469, 2024

  46. [46]

    Pytorch image models

    Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019

  47. [47]

    FiVA: Fine-grained visual attribute dataset for text-to-image diffusion models

    Tong Wu, Yinghao Xu, Ryan Po, Mengchen Zhang, Guandao Yang, Jiaqi Wang, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. FiVA: Fine-grained visual attribute dataset for text-to-image diffusion models. In Advances in Neural Inf. Process. Syst., pages 31990–32011, 2024

  48. [48]

    LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. IEEE Trans. Pattern Anal. Mach. Intell., 47(3):1877–1893, 2025

  49. [49]

    Dual attention networks for few-shot fine-grained recognition

    Shu-Lin Xu, Faen Zhang, Xiu-Shen Wei, and Jianhua Wang. Dual attention networks for few-shot fine-grained recognition. In Proc. Conf. AAAI, pages 2911–2919, 2022

  50. [50]

    CoCa: Contrastive captioners are image-text foundation models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022

  51. [51]

    MMBench: Is your multi-modal model an all-around player?

    Liu Yuan, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. MMBench: Is your multi-modal model an all-around player? In Proc. Eur. Conf. Comp. Vis., pages 216–233, 2024

  52. [52]

    MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  53. [53]

    Fine-grained image-to-lidar contrastive distillation with visual foundation models

    Yifan Zhang and Junhui Hou. Fine-grained image-to-lidar contrastive distillation with visual foundation models. In Advances in Neural Inf. Process. Syst., pages 25467–25489, 2024

  54. [54]

    Why are visually-grounded language models bad at image classification?

    Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, and Serena Yeung-Levy. Why are visually-grounded language models bad at image classification? arXiv preprint arXiv:2405.18415, 2024

  55. [55]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...