Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
Pith reviewed 2026-05-22 18:20 UTC · model grok-4.3
The pith
A new benchmark of 1.01 million questions shows how training paradigms and modality alignment shape LVLM performance on fine-grained image tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing FG-BMK with 1.01 million questions and 0.33 million images, the authors systematically evaluate LVLMs on fine-grained tasks from human and machine perspectives, demonstrating that training paradigms exert strong influence on performance, modality alignment determines effective multimodal integration, models remain highly susceptible to perturbations, and fine-grained category reasoning constitutes a persistent weakness.
What carries the argument
FG-BMK benchmark, a dataset of 1.01 million questions paired with 0.33 million images that probes semantic recognition and fine-grained feature representation capabilities through human-oriented and machine-oriented evaluations.
If this is right
- Certain training paradigms produce measurably higher performance on fine-grained image tasks than others.
- Better alignment between visual and textual modalities directly improves results on semantic recognition and feature representation.
- Current LVLMs exhibit clear drops in accuracy under small image perturbations.
- Fine-grained category reasoning remains a bottleneck that limits overall task success.
- Benchmark results can directly inform choices in future data construction and model architecture decisions.
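The perturbation finding above can be made concrete as a drop in accuracy between clean and perturbed inputs. The following is a minimal sketch, not the paper's evaluation code: the function names `accuracy` and `perturbation_drop`, the exact-match scoring, and the toy labels are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def perturbation_drop(
    clean_preds: Sequence[str],
    perturbed_preds: Sequence[str],
    labels: Sequence[str],
) -> Dict[str, float]:
    """Compare accuracy on clean images with accuracy on perturbed copies.

    Hypothetical helper: assumes both prediction lists answer the same
    questions in the same order, so they share one label list.
    """
    clean = accuracy(clean_preds, labels)
    perturbed = accuracy(perturbed_preds, labels)
    absolute = clean - perturbed
    # Relative drop is undefined when clean accuracy is zero; report 0.0 then.
    relative = absolute / clean if clean > 0 else 0.0
    return {
        "clean": clean,
        "perturbed": perturbed,
        "absolute_drop": absolute,
        "relative_drop": relative,
    }


# Toy example with made-up answers, for illustration only.
labels = ["a", "b", "c", "d"]
result = perturbation_drop(["a", "b", "c", "x"], ["a", "x", "x", "x"], labels)
print(result)
```

Reporting both the absolute and the relative drop is a design choice: a model that starts from low clean accuracy can show a small absolute drop while still losing most of what it got right.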
Where Pith is reading between the lines
- Developers could test whether adding targeted fine-grained category examples during training closes the performance gaps identified here.
- The benchmark's scale suggests that similar large-scale evaluations might uncover comparable patterns in related multimodal tasks such as visual question answering.
- Models that improve perturbation robustness on FG-BMK may also show gains on real-world image variations not included in the current set.
Load-bearing premise
The 1.01 million questions and 0.33 million images in FG-BMK, together with the selected human and machine evaluation perspectives, capture the full range of fine-grained image task capabilities without selection biases or annotation artifacts.
What would settle it
An independent test set of fine-grained image questions and images, constructed separately from FG-BMK, on which top-performing models from the benchmark score significantly lower would show that the original benchmark does not fully measure the targeted capabilities.
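One way to read "significantly lower" is as a one-sided two-proportion z-test comparing a model's accuracy on FG-BMK with its accuracy on the independent set. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; the function name and the counts are hypothetical. It also assumes questions are independent samples, which is optimistic when many questions share one image.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    correct_a: int, total_a: int, correct_b: int, total_b: int
) -> Tuple[float, float]:
    """One-sided test of H1: accuracy on set A is higher than on set B.

    Returns the z statistic and the upper-tail p-value under the
    pooled-variance normal approximation.
    """
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Upper-tail probability of a standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Made-up counts: 80/100 correct on the benchmark, 60/100 on the new set.
z, p = two_proportion_z_test(80, 100, 60, 100)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

A small p-value here would only indicate a gap between the two sets. Attributing that gap to the benchmark's coverage would still require ruling out differences in difficulty or annotation quality.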
Original abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FG-BMK, a large-scale benchmark comprising 1.01 million questions and 0.33 million images for evaluating large vision-language models on fine-grained image tasks. It performs systematic experiments on twelve representative LVLMs/VLMs from both human-oriented and machine-oriented perspectives, focusing on semantic recognition and fine-grained feature representation. The work reports key findings on the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance, while highlighting limitations of current models and offering guidance for future data construction and model design. The code is released open-source.
Significance. If the benchmark construction is shown to be free of selection biases and annotation artifacts, the evaluation would provide useful empirical insights into LVLMs' capabilities on fine-grained tasks central to computer vision. The reported influences could inform improvements in training and alignment strategies. The open-source code at https://github.com/SEU-VIPGroup/FG-BMK is a clear strength that supports reproducibility.
major comments (1)
- §3 (FG-BMK Construction): The manuscript provides insufficient detail on the source datasets for the 0.33 million images, the question-generation process for the 1.01 million questions (e.g., templates versus LLM-assisted), and the human verification protocol including quantitative inter-annotator agreement. These omissions are load-bearing for the central claim that the experiments reliably uncover influences of training paradigms, modality alignment, perturbation susceptibility, and fine-grained reasoning, because unaddressed curation biases could produce spurious correlations between model properties and scores.
minor comments (1)
- Abstract: The summary of contributions could explicitly mention the number of evaluated models and the open-source release to better foreground the scale and reproducibility aspects.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for the constructive feedback on benchmark construction details. We agree that greater transparency in Section 3 is warranted to support the reliability of our empirical findings and will revise the manuscript to address this point.
- Referee: §3 (FG-BMK Construction): The manuscript provides insufficient detail on the source datasets for the 0.33 million images, the question-generation process for the 1.01 million questions (e.g., templates versus LLM-assisted), and the human verification protocol including quantitative inter-annotator agreement. These omissions are load-bearing for the central claim that the experiments reliably uncover influences of training paradigms, modality alignment, perturbation susceptibility, and fine-grained reasoning, because unaddressed curation biases could produce spurious correlations between model properties and scores.
Authors: We acknowledge that the current description in Section 3, while outlining the overall scale and high-level process, does not provide the granular information needed to fully evaluate potential curation biases. In the revised manuscript we will expand this section with: (1) an explicit enumeration of all source datasets contributing to the 0.33 million images together with their original task formulations and any filtering criteria applied; (2) a precise account of the question-generation pipeline, including the proportion of questions produced via hand-crafted templates versus LLM-assisted generation and the exact prompting strategies used; and (3) the complete human verification protocol, including the number of annotators, the annotation interface, and quantitative inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement percentages). These additions will allow readers to assess selection biases directly and thereby strengthen the link between model properties and observed performance differences.
Revision: yes
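For readers unfamiliar with the agreement statistic named in the rebuttal, the sketch below computes Fleiss' kappa from a table of rating counts. It is an illustrative implementation of the standard formula, not the authors' verification code. The function name, the input layout (one row per question, one column per category, each row summing to the number of annotators), and the toy counts are assumptions.

```python
from typing import List, Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is how many annotators assigned item i to category j.
    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("at least one item is required")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two annotators per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j: List[float] = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_i: List[float] = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy table: 2 questions, 3 annotators, 2 categories (made-up numbers).
print(fleiss_kappa([[3, 0], [0, 3]]))  # full agreement on both items
print(fleiss_kappa([[2, 1], [1, 2]]))  # split votes on both items
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.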
Circularity Check
Empirical benchmark evaluation is self-contained with no derivation chain
This paper introduces the FG-BMK benchmark (1.01M questions, 0.33M images) and reports experimental results on twelve external LVLMs/VLMs. No equations, fitted parameters, or first-principles derivations appear in the provided text or abstract. Central claims rest on observed performance differences across training paradigms, modality alignment, and perturbation susceptibility rather than any self-referential construction or self-citation load-bearing step. The work is therefore an external evaluation against benchmarks and models, qualifying for the default non-circularity outcome.
Forward citations
Cited by 2 Pith papers
- FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.
- FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
FIKA-Bench is a leakage-aware benchmark of 311 instances showing that even the best large multimodal models and tool-equipped agents reach only 25.1% accuracy on fine-grained recognition questions that require externa...