WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark
Pith reviewed 2026-06-28 02:59 UTC · model grok-4.3
The pith
WorldBench uses a visual concept taxonomy to curate diverse images and shows even top MLLMs reach only 64 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WorldBench is built by defining a taxonomy of visual concepts across domains, then manually selecting images and writing questions that frontier models fail. This produces a benchmark with measurably higher visual diversity than existing ones. Evaluation of 15 MLLMs finds the best model at 64.0 percent accuracy, with several others only marginally above random guessing, exposing limits in current visual understanding.
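To make "marginally above random guessing" concrete, the sketch below computes a one-sided exact binomial p-value against chance. It is an illustration only: the function name `binomial_tail`, the question count of 1,000, and the four-option format (chance = 0.25) are assumptions, since none of these figures is given here.

```python
from math import exp, lgamma, log

def binomial_tail(correct: int, total: int, chance: float) -> float:
    """P(X >= correct) for X ~ Binomial(total, chance), with 0 < chance < 1.

    Terms are summed in log space so large `total` values do not underflow.
    """
    tail = 0.0
    for k in range(correct, total + 1):
        log_term = (
            lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
            + k * log(chance) + (total - k) * log(1.0 - chance)
        )
        tail += exp(log_term)
    return min(tail, 1.0)

# Hypothetical setup (not from the paper): 1,000 four-option questions.
total, chance = 1000, 0.25
for correct in (260, 290, 640):
    p_value = binomial_tail(correct, total, chance)
    print(f"accuracy={correct / total:.1%}  one-sided p-value vs chance={p_value:.3g}")
```

Under these assumed numbers, a score one point above chance is not distinguishable from guessing, while a score a few points higher is. The actual threshold depends on the benchmark's real size and answer format.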
What carries the argument
A taxonomy of thousands of visual concepts that guides curation of images from search engines and existing datasets, combined with trial-and-error manual question design.
If this is right
- MLLMs require stronger mechanisms for handling varied visual inputs to reach reliable real-world performance.
- Benchmark construction should prioritize explicit coverage of visual concepts over simply adding more task types.
- Models that score near chance on diverse images are unlikely to generalize safely to open-ended visual settings.
- Evaluation protocols for multimodal systems should include diversity metrics alongside accuracy.
Where Pith is reading between the lines
- Training data that lacks similar visual breadth may be a root cause of the observed performance gaps.
- The benchmark could serve as a filter for selecting models intended for deployment in uncontrolled visual environments.
- Extending the taxonomy approach to video or 3D inputs would test whether the same diversity issues appear in other modalities.
- Human performance baselines on the same questions would clarify how far current models remain from human-level visual reasoning.
Load-bearing premise
The manually chosen images and questions, shaped by the taxonomy, test general visual understanding rather than just the specific concepts or sources selected.
What would settle it
Quantitative diversity scores showing that WorldBench's diversity is no higher than that of prior benchmarks, or a new model reaching above 80 percent accuracy on WorldBench while still performing strongly on older tests.
Original abstract
In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform marginally above chance-level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WorldBench, a multimodal reasoning benchmark for MLLMs. It builds a taxonomy of visual concepts, curates diverse images from search engines and datasets, and designs challenging questions via structured trial-and-error against frontier models. The work claims higher visual diversity than prior benchmarks on quantitative and human evaluations, and reports that 15 evaluated MLLMs achieve at most 64.0% accuracy (with some near chance), indicating weaknesses in visual understanding.
Significance. If the question selection process can be shown not to introduce bias toward current model failure modes, WorldBench could serve as a useful diagnostic for visual diversity gaps in MLLMs and inform more robust benchmark construction practices.
Major comments (1)
- [Abstract] The claim that low accuracy (64.0% for the strongest model) reveals 'weaknesses in visual understanding' depends on the questions being a fair sample of the visual-concept space. The described procedure of 'structured trial-and-error', which retains only items that frontier MLLMs fail on, creates a selection filter that may preferentially capture model-specific weaknesses rather than intrinsic visual-understanding deficits; this is load-bearing for the central interpretation of the results.
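The selection effect described in the major comment can be illustrated with a toy simulation. This is a sketch under invented assumptions, not a model of the paper's procedure: two hypothetical models, A and B, share the same skill; each item has a shared difficulty plus an independent per-model quirk; and an item is kept only if model A fails a single trial attempt. All names and numeric ranges are made up for illustration.

```python
import random

def p_correct(skill: float, difficulty: float, quirk: float) -> float:
    """Probability of a correct answer; `quirk` is a model-specific item effect."""
    return min(1.0, max(0.0, skill - difficulty + quirk))

def mean_p(rows: list, index: int) -> float:
    """Expected accuracy on a fresh attempt: the mean success probability."""
    return sum(row[index] for row in rows) / len(rows)

def simulate(n_items: int = 20000, seed: int = 0) -> dict:
    rng = random.Random(seed)
    skill = 0.9  # identical underlying skill for both hypothetical models
    pool, kept = [], []
    for _ in range(n_items):
        difficulty = rng.uniform(0.0, 0.6)  # shared by both models
        quirk_a = rng.uniform(-0.3, 0.3)    # specific to model A
        quirk_b = rng.uniform(-0.3, 0.3)    # specific to model B
        p_a = p_correct(skill, difficulty, quirk_a)
        p_b = p_correct(skill, difficulty, quirk_b)
        pool.append((p_a, p_b))
        # Selection filter: keep the item only if model A fails one trial attempt.
        if rng.random() >= p_a:
            kept.append((p_a, p_b))
    return {
        "pool_a": mean_p(pool, 0),
        "pool_b": mean_p(pool, 1),
        "kept_a": mean_p(kept, 0),
        "kept_b": mean_p(kept, 1),
    }

result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy setting both models lose accuracy on the kept items because the filter favors intrinsically hard items, but model A loses more because the filter also selects for A's own quirks. The gap between `kept_a` and `kept_b` is the model-specific component the referee is concerned about.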
Minor comments (2)
- [Abstract] The quantitative diversity metric used to claim superiority over existing benchmarks is referenced but not defined or reported with values, preventing verification of the diversity claim.
- [Abstract] The exact question design process, taxonomy details, and any controls for confounds (e.g., concept selection bias) are not specified, limiting assessment of whether the benchmark comprehensively tests visual understanding.
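As an example of the kind of reportable number the first minor comment asks for, the sketch below computes an "effective number of concepts": the exponential of the Shannon entropy of concept-label frequencies. This is one generic diversity measure chosen for illustration; it is not claimed to be the metric the paper uses, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import exp, log

def effective_concepts(labels: list[str]) -> float:
    """exp(Shannon entropy) of the label distribution.

    Equals the number of distinct labels when they are equally frequent,
    and shrinks toward 1 as a few labels dominate.
    """
    counts = Counter(labels)
    n = len(labels)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return exp(entropy)

# Toy label sets (hypothetical): same size, very different concept coverage.
narrow = ["dog"] * 8 + ["cat"] * 2
broad = ["dog", "cat", "fern", "bridge", "violin",
         "glacier", "moth", "kiln", "reef", "loom"]

print(f"narrow set: {effective_concepts(narrow):.2f} effective concepts")
print(f"broad set:  {effective_concepts(broad):.2f} effective concepts")
```

Reporting a value of this kind for WorldBench and for each comparison benchmark, together with the labeling procedure, would let readers check the diversity claim directly.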
Simulated Author's Rebuttal
We thank the referee for highlighting this important point regarding the interpretation of our results. We agree that the question selection procedure requires careful framing and will revise the abstract and related discussion to address the concern.
Point-by-point responses
- Referee: [Abstract] The claim that low accuracy (64.0% for the strongest model) reveals 'weaknesses in visual understanding' depends on the questions being a fair sample of the visual-concept space. The described procedure of 'structured trial-and-error', which retains only items that frontier MLLMs fail on, creates a selection filter that may preferentially capture model-specific weaknesses rather than intrinsic visual-understanding deficits; this is load-bearing for the central interpretation of the results.
Authors: We acknowledge the validity of this concern. The trial-and-error process is explicitly designed to produce questions that current frontier models cannot solve, which intentionally filters for items that expose limitations rather than representing a uniform or random sample of the visual-concept space. This means the benchmark is best interpreted as a diagnostic tool for current gaps rather than a comprehensive measure of intrinsic visual understanding deficits across all possible questions. We will revise the abstract to replace the phrasing 'reveals weaknesses in visual understanding' with language that more precisely states the results demonstrate limitations of existing MLLMs on visually diverse reasoning tasks. We will also expand the methods and discussion sections to describe the selection process more explicitly and note its implications for interpretation.
Revision: yes
Circularity Check
No circularity; benchmark paper contains no derivation chain
The paper introduces WorldBench via taxonomy-guided curation and trial-and-error question design, followed by direct evaluation of 15 MLLMs. No equations, parameters, predictions, or first-principles derivations are present. Diversity claims rest on separate quantitative and human evaluations. The low-accuracy observation is a straightforward measurement on the constructed set rather than a reduction of any claimed result to its own inputs by construction. No self-citations or ansatzes are load-bearing in a mathematical sense. This is a standard benchmark paper whose central claims do not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Visual diversity across concepts is required for reliable performance on open-ended visual inputs in real-world applications.