Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Hong-Tao Yu; Serge Belongie; Xiu-Shen Wei; Yuxin Peng

arxiv: 2504.14988 · v4 · submitted 2025-04-21 · 💻 cs.CV

Pith reviewed 2026-05-22 18:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords large vision-language models · fine-grained evaluation · benchmark · LVLMs · semantic recognition · image tasks · multimodal evaluation · model limitations

The pith

A new benchmark of 1.01 million questions shows how training paradigms and modality alignment shape LVLM performance on fine-grained image tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FG-BMK, a benchmark with 1.01 million questions and 0.33 million images, to evaluate large vision-language models on detailed image tasks that prior studies have largely overlooked. It tests twelve representative LVLMs and VLMs through both human-oriented and machine-oriented lenses to assess semantic recognition and fine-grained feature representation. Experiments reveal that training paradigms, the quality of vision-language alignment, sensitivity to image perturbations, and skill at fine-grained category reasoning all affect outcomes. The work identifies current model limitations and points toward better approaches for data and model design.

Core claim

By constructing FG-BMK with 1.01 million questions and 0.33 million images, the authors systematically evaluate LVLMs on fine-grained tasks from human and machine perspectives, demonstrating that training paradigms exert strong influence on performance, modality alignment determines effective multimodal integration, models remain highly susceptible to perturbations, and fine-grained category reasoning constitutes a persistent weakness.

What carries the argument

FG-BMK benchmark, a dataset of 1.01 million questions paired with 0.33 million images that probes semantic recognition and fine-grained feature representation capabilities through human-oriented and machine-oriented evaluations.
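The figure captions on this page refer to true/false and multiple-choice recognition questions posed at different levels of granularity, and to per-task question templates. As a rough illustration of how such templated questions could be produced from a taxonomy label, here is a minimal sketch. The taxonomy entry, the template wording, and the function names are all assumptions made for illustration; they are not taken from FG-BMK.

```python
import random
import string

# Hypothetical taxonomy entry for one image; levels and names are illustrative only.
LABEL = {
    "order": "Passeriformes",
    "family": "Icteridae",
    "species": "Red-winged Blackbird",
}


def true_false_question(label, level, candidate):
    """Build a yes/no question asking whether the image shows `candidate` at `level`."""
    question = f"Is the {level} of the bird in this image {candidate}? Answer yes or no."
    answer = "yes" if label[level] == candidate else "no"
    return question, answer


def multiple_choice_question(label, level, distractors, rng):
    """Build a lettered multiple-choice question with the true label shuffled among distractors."""
    options = [label[level]] + list(distractors)
    rng.shuffle(options)
    letters = string.ascii_uppercase
    lines = [f"{letters[i]}. {option}" for i, option in enumerate(options)]
    question = f"What is the {level} of the bird in this image?\n" + "\n".join(lines)
    answer = letters[options.index(label[level])]
    return question, answer


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the option order is reproducible
    print(true_false_question(LABEL, "family", "Icteridae"))
    print(multiple_choice_question(LABEL, "species", ["Bobolink", "Rusty Blackbird"], rng))
```

Changing `level` from `"order"` to `"species"` moves the same image from a coarse question to a fine one, which is the axis along which the granularity figures compare accuracy.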

If this is right

Certain training paradigms produce measurably higher performance on fine-grained image tasks than others.
Better alignment between visual and textual modalities directly improves results on semantic recognition and feature representation.
Current LVLMs exhibit clear drops in accuracy under small image perturbations.
Fine-grained category reasoning remains a bottleneck that limits overall task success.
Benchmark results can directly inform choices in future data construction and model architecture decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could test whether adding targeted fine-grained category examples during training closes the performance gaps identified here.
The benchmark's scale suggests that similar large-scale evaluations might uncover comparable patterns in related multimodal tasks such as visual question answering.
Models that improve perturbation robustness on FG-BMK may also show gains on real-world image variations not included in the current set.

Load-bearing premise

The 1.01 million questions and 0.33 million images in FG-BMK, together with the selected human and machine evaluation perspectives, capture the full range of fine-grained image task capabilities without selection biases or annotation artifacts.

What would settle it

An independent test set of fine-grained image questions and images, constructed separately from FG-BMK, on which top-performing models from the benchmark score significantly lower would show that the original benchmark does not fully measure the targeted capabilities.
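One possible way to make "score significantly lower" concrete is a one-sided two-proportion z-test on accuracy. The sketch below assumes each question is an independent trial, which templated questions that share images may violate, and the counts in the usage example are invented for illustration rather than drawn from the paper.

```python
import math


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """One-sided z-test for H1: accuracy on set A is higher than on set B.

    Returns (z, p_value). Assumes independent questions and large samples.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Upper-tail probability of the standard normal: 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Illustrative counts only: 80% on the original benchmark vs 70% on an independent set.
    z, p = two_proportion_z(800, 1000, 700, 1000)
    print(f"z = {z:.3f}, one-sided p = {p:.2e}")
```

A small p-value here would only say the gap is unlikely to be sampling noise. Whether the gap reflects a flaw in the benchmark or a harder independent set is a separate question.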

Figures

Figures reproduced from arXiv: 2504.14988 by Hong-Tao Yu, Serge Belongie, Xiu-Shen Wei, Yuxin Peng.

**Figure 2.** Results of InternVL3 [55] on true/false and multiple-choice questions across different levels of granularity on the CUB-200-2011 [41] dataset. The x-axis denotes the granularity of the recognition questions. LVLMs struggle to distinguish excessively fine-grained categories.

**Figure 3.** Comparison of the original and fine-tuned LLaVA models on occurrence-balanced fine-grained bird categories. True/false question accuracy for each category is ranked, with blue dots representing the original model and yellow dots the fine-tuned model.

**Figure 5.** Classification results of LVLM visual features on twelve fine-grained datasets. Different colors represent different models.

**Figure 6.** Nemenyi statistical test results for fine-grained retrieval. Black horizontal lines indicate the critical distance (CD), grouping models with no significant performance differences.

**Figure 8.** Classification results of LVLM visual features on fine-grained datasets. “Single” denotes …

**Figure 10.** Question templates for each task in human-oriented evaluation.

**Figure 11.** Results of GPT-4o [1], Gemini-2.0-flash [13], Qwen2.5-VL [3], LLaVA [24] and InternVL [8] on true/false and multiple-choice questions across different levels of granularity on the CUB-200-2011 [41] and iNat2021 [40] datasets. The x-axis denotes the granularity of the recognition questions.

**Figure 12.** Knowledge bias estimation results of two closed-source models. True/false question …

**Figure 13.** Comparison of the original and fine-tuned Qwen2.5-VL [ …

**Figure 14.** Qualitative analysis of granularity inconsistencies in LVLMs’ alignment data and a …

**Figure 15.** Classification results of LVLM visual features on fine-grained datasets. “Single” denotes …
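The Figure 6 caption relies on the Nemenyi critical distance (CD): two models are treated as not significantly different when their average ranks differ by less than CD. The standard formula is CD = q_α · sqrt(k(k+1) / (6N)) for k models compared over N datasets. The sketch below is a minimal illustration of that formula. The value q_α = 2.728 is the commonly tabulated constant for k = 5 at α = 0.05, and the model count, dataset count, and ranks are illustrative assumptions, not the paper's settings.

```python
import math


def nemenyi_critical_distance(q_alpha, num_models, num_datasets):
    """Critical distance for the Nemenyi post-hoc test.

    q_alpha is the tabulated critical value for the given number of models and
    significance level; it must be looked up, it is not computed here.
    """
    return q_alpha * math.sqrt(num_models * (num_models + 1) / (6.0 * num_datasets))


def significantly_different(avg_rank_a, avg_rank_b, critical_distance):
    """Two models differ significantly when their average ranks differ by more than CD."""
    return abs(avg_rank_a - avg_rank_b) > critical_distance


if __name__ == "__main__":
    # Illustrative setting: 5 models ranked on 12 datasets at alpha = 0.05.
    cd = nemenyi_critical_distance(q_alpha=2.728, num_models=5, num_datasets=12)
    print(f"CD = {cd:.3f}")
    print(significantly_different(1.5, 3.8, cd))  # hypothetical average ranks
```

In a CD diagram, the horizontal bars connect exactly those models for which `significantly_different` would return `False`.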

Original abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FG-BMK is a large new benchmark for fine-grained LVLM tasks with experiments on twelve models, but its value hinges on missing construction details that could introduce bias.

Hi, the main thing to know is that this paper builds FG-BMK, a benchmark with 1.01 million questions and 0.33 million images, then runs it across twelve LVLMs to examine effects from training paradigms, modality alignment, perturbations, and fine-grained reasoning. The scale and the focus on fine-grained image tasks stand out as the actual addition here. Prior LVLM benchmarks have stayed more general, so targeting semantic recognition and detailed feature representation fills a documented gap. They also report patterns from both human-oriented and machine-oriented views, and releasing the code helps anyone who wants to check or extend the work. That part is straightforward and useful for people tracking where current models fall short on precise visual tasks.

The soft spots sit in the benchmark construction itself. The abstract gives the totals but says little about source datasets, how questions were created, or any validation steps for the annotations. At this volume, automated or aggregated methods are likely, and without numbers on diversity, inter-annotator agreement, or checks against selection bias, the claimed influences on training and perturbation susceptibility rest on an unverified assumption. If the full methods section supplies those details and shows the patterns survive basic robustness tests, the findings gain weight; otherwise the correlations could be artifacts.

This paper is mainly for researchers who build or evaluate multimodal models and need concrete data on fine-grained weaknesses. It is not a theoretical advance, but the empirical scope makes it worth a close look. I would send it to peer review so referees can press on the construction protocol and statistical reporting before the results get used to guide data curation or alignment work.

Referee Report

1 major / 1 minor

Summary. The paper introduces FG-BMK, a large-scale benchmark comprising 1.01 million questions and 0.33 million images for evaluating large vision-language models on fine-grained image tasks. It performs systematic experiments on twelve representative LVLMs/VLMs from both human-oriented and machine-oriented perspectives, focusing on semantic recognition and fine-grained feature representation. The work reports key findings on the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance, while highlighting limitations of current models and offering guidance for future data construction and model design. The code is released open-source.

Significance. If the benchmark construction is shown to be free of selection biases and annotation artifacts, the evaluation would provide useful empirical insights into LVLMs' capabilities on fine-grained tasks central to computer vision. The reported influences could inform improvements in training and alignment strategies. The open-source code at https://github.com/SEU-VIPGroup/FG-BMK is a clear strength that supports reproducibility.

major comments (1)

§3 (FG-BMK Construction): The manuscript provides insufficient detail on the source datasets for the 0.33 million images, the question-generation process for the 1.01 million questions (e.g., templates versus LLM-assisted), and the human verification protocol including quantitative inter-annotator agreement. These omissions are load-bearing for the central claim that the experiments reliably uncover influences of training paradigms, modality alignment, perturbation susceptibility, and fine-grained reasoning, because unaddressed curation biases could produce spurious correlations between model properties and scores.

minor comments (1)

Abstract: The summary of contributions could explicitly mention the number of evaluated models and the open-source release to better foreground the scale and reproducibility aspects.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive feedback on benchmark construction details. We agree that greater transparency in Section 3 is warranted to support the reliability of our empirical findings and will revise the manuscript to address this point.

Referee: §3 (FG-BMK Construction): The manuscript provides insufficient detail on the source datasets for the 0.33 million images, the question-generation process for the 1.01 million questions (e.g., templates versus LLM-assisted), and the human verification protocol including quantitative inter-annotator agreement. These omissions are load-bearing for the central claim that the experiments reliably uncover influences of training paradigms, modality alignment, perturbation susceptibility, and fine-grained reasoning, because unaddressed curation biases could produce spurious correlations between model properties and scores.

Authors: We acknowledge that the current description in Section 3, while outlining the overall scale and high-level process, does not provide the granular information needed to fully evaluate potential curation biases. In the revised manuscript we will expand this section with: (1) an explicit enumeration of all source datasets contributing to the 0.33 million images together with their original task formulations and any filtering criteria applied; (2) a precise account of the question-generation pipeline, including the proportion of questions produced via hand-crafted templates versus LLM-assisted generation and the exact prompting strategies used; and (3) the complete human verification protocol, including the number of annotators, the annotation interface, and quantitative inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement percentages). These additions will allow readers to assess selection biases directly and thereby strengthen the link between model properties and observed performance differences.

revision: yes
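For readers unfamiliar with the agreement statistic named in the response, Fleiss' kappa compares the observed agreement among a fixed number of raters with the agreement expected by chance. The sketch below is a minimal pure-Python version of the standard formula. The input layout (one row per question, one column per answer category, each cell holding the number of annotators who chose that category) and the toy counts are assumptions for illustration, not data from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    Undefined (division by zero) when every rating falls in a single category.
    """
    num_items = len(counts)
    num_raters = sum(counts[0])
    if any(sum(row) != num_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    num_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    category_share = [
        sum(row[j] for row in counts) / (num_items * num_raters)
        for j in range(num_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    item_agreement = [
        (sum(c * c for c in row) - num_raters) / (num_raters * (num_raters - 1))
        for row in counts
    ]
    observed = sum(item_agreement) / num_items
    expected = sum(share * share for share in category_share)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy example: 4 questions, 3 annotators, 2 categories (valid / invalid).
    print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))  # full agreement
    print(fleiss_kappa([[2, 1], [1, 2]]))                  # weak agreement
```

Values near 1 indicate agreement well above chance, while values near or below 0 indicate agreement no better than chance.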

Circularity Check

0 steps flagged

Empirical benchmark evaluation is self-contained with no derivation chain

This paper introduces the FG-BMK benchmark (1.01M questions, 0.33M images) and reports experimental results on twelve external LVLMs/VLMs. No equations, fitted parameters, or first-principles derivations appear in the provided text or abstract. Central claims rest on observed performance differences across training paradigms, modality alignment, and perturbation susceptibility rather than any self-referential construction or self-citation load-bearing step. The work is therefore an external evaluation against benchmarks and models, qualifying for the default non-circularity outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical benchmarking study. No mathematical derivations, new physical entities, or fitted parameters are introduced; the claims rest on the assumption that the benchmark questions and chosen models are representative.

pith-pipeline@v0.9.0 · 5730 in / 1093 out tokens · 43732 ms · 2026-05-22T18:20:56.103508+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
cs.CV 2026-05 conditional novelty 7.0

FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
cs.CV 2026-05 unverdicted novelty 7.0

FIKA-Bench is a leakage-aware benchmark of 311 instances showing that even the best large multimodal models and tool-equipped agents reach only 25.1% accuracy on fine-grained recognition questions that require externa...

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. a...
[4] Yalong Bai, Yuxiang Chen, Wei Yu, Linfang Wang, and Wei Zhang. Products-10K: A large-scale product recognition dataset. arXiv preprint arXiv:2008.10545, 2020.
[5] Ardhendu Behera, Zachary Wharton, Pradeep RPG Hewage, and Asish Bera. Context-aware attentional pooling (CAP) for fine-grained visual classification. In Proc. Conf. AAAI, number 2, pages 929–937, 2021.
[6] Asish Bera, Zachary Wharton, Yonghuai Liu, Nik Bessis, and Ardhendu Behera. SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization. IEEE Trans. Image Process., 31:6017–6031, 2022.
[7] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Proc. Eur. Conf. Comp. Vis., pages 446–461, 2014.
[8] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 24185–24198, 2024.
[9] Roxana Daneshjou, Mert Yuksekgonul, Zhuo Ran Cai, Roberto Novoa, and James Y Zou. SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis. In Advances in Neural Inf. Process. Syst., pages 18157–18167, 2022.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 248–255, 2009.
[11] Qishuai Diao, Yi Jiang, Bin Wen, Jia Sun, and Zehuan Yuan. MetaFormer: A unified meta framework for fine-grained recognition. arXiv preprint arXiv:2203.02751, 2022.
[12] Gregor Geigle, Radu Timofte, and Goran Glavaš. African or European swallow? Benchmarking large vision-language models for fine-grained object classification. arXiv preprint arXiv:2406.14496, 2024.
[13] Google Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
[14] Saihui Hou, Yushan Feng, and Zilei Wang. VegFru: A domain-specific dataset for fine-grained visual categorization. In Proc. IEEE Int. Conf. Comp. Vis., pages 541–549, 2017.
[15] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 6700–6709, 2019.
[16] Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Guoxing Yang, Wei Wei, Huiwen Zhao, and Zhiwu Lu. FineCLIP: Self-distilled region-based CLIP for better fine-grained understanding. In Advances in Neural Inf. Process. Syst., pages 27896–27918, 2024.
[17] Yinuo Jing, Ruxu Zhang, Kongming Liang, Yongxiang Li, Zhongjiang He, Zhanyu Ma, and Jun Guo. Animal-Bench: Benchmarking multimodal video models for animal-centric video understanding. In Advances in Neural Inf. Process. Syst., pages 23457–23469, 2024.
[18] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In CVPR Workshop on Fine-Grained Visual Categorization, pages 806–813, 2011.
[19] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proc. IEEE Int. Conf. Comp. Vis., pages 554–561, 2013.
[20] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[21] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. Int. Conf. Mach. Learn., pages 19730–19742, 2023.
[22] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proc. Int. Conf. Mach. Learn., pages 12888–12900, 2022.
[23] Dichao Liu. Progressive multi-task anti-noise learning and distilling frameworks for fine-grained vehicle recognition. IEEE Trans. Intell. Transp. Syst., 25(9):10667–10678, 2024.
[24] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 26296–26306, 2024.
[25] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences, 67(12), 2024.
[26] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1096–1104, 2016.
[27] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proc. Int. Conf. Learn. Representations, 2024.
[28] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In Proc. Int. Conf. Learn. Representations, 2018.
[29] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[30] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Proc. Conf. Association for Computational Linguistics, pages 2263–2279, 2022.
[31] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for VQA on document images. In Proc. Winter Conf. Applications of Comp. Vis., pages 2200–2209, 2021.
[32] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Proc. IEEE Int. Conf. Comp. Vis., pages 722–729, 2008.
[33] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Hervé Jegou, Julien Mairal, Patrick Laba...
[34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proc. Int. Conf. Mach. Learn., pages 8748–8763, 2021.
[35] Yang Shen, Xuhao Sun, Xiu-Shen Wei, Qing-Yuan Jiang, and Jian Yang. SEMICON: A learning-to-hash solution for large-scale fine-grained image retrieval. In Proc. Eur. Conf. Comp. Vis., pages 531–548, 2022.
[36] Arindam Sikdar, Yonghuai Liu, Siddhardha Kedarisetty, Yitian Zhao, Amr Ahmed, and Ardhendu Behera. Interweaving insights: High-order feature interaction for fine-grained visual recognition. In Proc. IEEE Int. Conf. Comp. Vis., pages 1755–1779, 2024.
[37] Kaitao Song, Xiu-Shen Wei, Xiangbo Shu, Ren-Jie Song, and Jianfeng Lu. Bi-modal progressive mask attention for fine-grained recognition. IEEE Trans. Image Process., 29:7006–7018, 2020.
[38] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
[39] Tianchi. Bottled wine defect detection data set, 2021.
[40] Grant Van Horn, Elijah Cole, Sara Beery, Kimberly Wilber, Serge Belongie, and Oisin Mac Aodha. Benchmarking representation learning for natural world image collections. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 12884–12893, 2021.
[41] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
[42] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 19175–19186, 2023.
[43] Xiu-Shen Wei, Quan Cui, Lei Yang, Peng Wang, Lingqiao Liu, and Jian Yang. RPC: A large-scale and fine-grained retail product checkout dataset. Science China Information Sciences, 65(9):197101, 2022.
[44] Xiu-Shen Wei, Yi-Zhe Song, Oisin Mac Aodha, Jianxin Wu, Yuxin Peng, Jinhui Tang, Jian Yang, and Serge Belongie. Fine-grained image analysis with deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell., (12):8927–8948, 2022.
[45] Xiu-Shen Wei, Hong-Tao Yu, Anqi Xu, Faen Zhang, and Yuxin Peng. MECOM: A meta-completion network for fine-grained recognition with incomplete multi-modalities. IEEE Trans. Image Process., 33:3456–3469, 2024.
[46] Ross Wightman. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
[47] Tong Wu, Yinghao Xu, Ryan Po, Mengchen Zhang, Guandao Yang, Jiaqi Wang, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. FiVA: Fine-grained visual attribute dataset for text-to-image diffusion models. In Advances in Neural Inf. Process. Syst., pages 31990–32011, 2024.
[48] Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. IEEE Trans. Pattern Anal. Mach. Intell., 47(3):1877–1893, 2025.
[49] Shu-Lin Xu, Faen Zhang, Xiu-Shen Wei, and Jianhua Wang. Dual attention networks for few-shot fine-grained recognition. In Proc. Conf. AAAI, pages 2911–2919, 2022.
[50] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022.
[51]

MMBench: Is your multi-modal model an all-around player? In Proc

Liu Yuan, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. MMBench: Is your multi-modal model an all-around player? In Proc. Eur. Conf. Comp. Vis., pages 216–233, 2024

[52]

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

[53]

Yifan Zhang and Junhui Hou. Fine-grained image-to-lidar contrastive distillation with visual foundation models. In Advances in Neural Inf. Process. Syst., pages 25467–25489, 2024

[54]

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, and Serena Yeung-Levy. Why are visually-grounded language models bad at image classification? arXiv preprint arXiv:2405.18415, 2024

[55]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...

