SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
Pith reviewed 2026-05-14 21:46 UTC · model grok-4.3
The pith
The SciVQR benchmark reveals that leading multimodal models fall short on complex scientific reasoning across 54 subfields.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SciVQR is a benchmark that covers 54 subfields across six scientific disciplines and pairs domain-specific visuals with tasks that require visual comprehension plus multi-step inference. It evaluates not only the correctness of answers but also the traceability of the reasoning process, with expert-authored solutions supplied for 46 percent of the items. When applied to leading multimodal large language models, the benchmark exposes significant shortcomings in complex multimodal scientific reasoning.
What carries the argument
The SciVQR benchmark, which supplies domain-specific visuals and tasks that demand both visual understanding and multi-step reasoning across 54 subfields.
If this is right
- Models will require stronger multi-step reasoning mechanisms to reach high performance on the benchmark.
- Effective integration of knowledge across different scientific disciplines will become necessary for success.
- Evaluation methods must track reasoning processes in addition to final answers.
- Public release of the dataset and code enables direct testing and training improvements.
- Progress toward scientific intelligence in multimodal models can be measured against this standard.
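As a concrete illustration of the third point, a combined answer-plus-process score might look like the sketch below. The item fields, step-matching rule, and weighting are invented here for illustration; they are not SciVQR's published protocol:

```python
# Minimal sketch of answer-plus-process scoring (hypothetical rubric).

def normalize(text: str) -> str:
    """Crude normalization so superficial formatting differences don't count."""
    return " ".join(text.lower().split())

def score_item(model_answer: str, model_steps: list[str],
               gold_answer: str, gold_steps: list[str],
               answer_weight: float = 0.5) -> float:
    """Blend final-answer correctness with expert-step coverage (assumed weights)."""
    answer_score = float(normalize(model_answer) == normalize(gold_answer))
    if gold_steps:
        # Credit each expert step that appears somewhere in the model's trace.
        matched = sum(
            any(normalize(g) in normalize(m) for m in model_steps)
            for g in gold_steps
        )
        process_score = matched / len(gold_steps)
    else:
        process_score = answer_score  # no expert trace: fall back to answer only
    return answer_weight * answer_score + (1 - answer_weight) * process_score

score = score_item(
    "x = 4", ["substitute y", "solve linear equation", "x = 4"],
    "x = 4", ["substitute y = 2x", "x = 4"],
)  # → 0.75: correct answer, one of two expert steps recovered
```

Any real rubric would need semantic rather than substring matching of steps, but the structure — separate answer and process scores combined into one number — is the point.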
Where Pith is reading between the lines
- Models that perform well on SciVQR may show improved ability to support real research workflows.
- The approach could be extended to create comparable benchmarks in applied domains such as engineering.
- Limitations observed here suggest that simply increasing model size may not resolve the gaps without targeted reasoning enhancements.
- Baseline human performance data on SciVQR would help quantify how far current models remain from expert level.
Load-bearing premise
The tasks, visuals, and expert solutions chosen for SciVQR accurately reflect the complexity and traceability of real scientific reasoning.
What would settle it
A study in which domain experts solve a sample of SciVQR tasks while documenting their reasoning steps, then compare those steps and success rates against model outputs on the same items.
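A minimal harness for such a study would tabulate expert and model success on the same sampled items and surface the items where they diverge. The item IDs and outcomes below are fabricated purely for illustration:

```python
# Sketch: compare expert vs. model success on the same sampled items
# (all item IDs and results here are fabricated for illustration).
from statistics import mean

expert_correct = {"item1": True, "item2": True, "item3": False, "item4": True}
model_correct  = {"item1": True, "item2": False, "item3": False, "item4": False}

items = sorted(expert_correct)
expert_rate = mean(expert_correct[i] for i in items)  # 0.75
model_rate  = mean(model_correct[i] for i in items)   # 0.25
gap = expert_rate - model_rate                        # 0.5

# Items experts solve but the model misses are the most diagnostic cases
# for inspecting documented reasoning steps side by side.
diagnostic = [i for i in items if expert_correct[i] and not model_correct[i]]
```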
read the original abstract
Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SciVQR, a multimodal benchmark spanning 54 subfields across mathematics, physics, chemistry, geography, astronomy, and biology. It incorporates domain-specific visuals (equations, charts, diagrams) and tasks ranging from factual recall to multi-step inference, with 46% featuring expert-authored solutions. The benchmark evaluates both final answers and reasoning processes in leading proprietary and open-source MLLMs, concluding that current models exhibit significant limitations in complex multimodal scientific reasoning and calling for advances in multi-step reasoning and interdisciplinary knowledge integration. The dataset and evaluation code are released publicly.
Significance. If the benchmark construction and evaluation protocols are shown to be reliable, SciVQR could provide a useful multidisciplinary testbed that extends beyond existing MLLM benchmarks by emphasizing traceable reasoning processes and visual integration across many domains. The public release of data and code supports reproducibility and community follow-up work.
major comments (3)
- [Abstract] The claim that the evaluation 'reveals significant limitations' in MLLMs is presented without any quantitative metrics, error breakdowns, or inter-annotator agreement statistics, making it impossible to assess whether the observed shortcomings are robust or merely artifacts of task selection.
- [Benchmark construction] The paper states that tasks 'accurately capture the complexity and traceability of real scientific reasoning processes' across 54 subfields, yet provides no details on expert validation procedures, pilot testing, or agreement scores; this directly affects the load-bearing assumption that the benchmark is a faithful proxy for scientific intelligence.
- [Evaluation] Without reported baselines, per-subfield performance tables, or process-tracing rubrics (e.g., how partial credit is assigned to reasoning steps), the assertion that models lack 'multi-step reasoning and better integration of interdisciplinary knowledge' cannot be verified or compared to prior work.
minor comments (2)
- [Abstract] Adding the total number of questions or examples would give readers immediate scale context.
- [Conclusion] The GitHub link is provided, but the manuscript does not specify the exact license or maintenance plan for the released dataset.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional details on metrics, validation, and evaluation protocols.
read point-by-point responses
- Referee: [Abstract] The claim that the evaluation 'reveals significant limitations' in MLLMs is presented without any quantitative metrics, error breakdowns, or inter-annotator agreement statistics, making it impossible to assess whether the observed shortcomings are robust or merely artifacts of task selection.
Authors: We agree the abstract would be strengthened by quantitative support. The full manuscript reports model accuracies, error patterns, and task-type breakdowns in the evaluation section; we will add a concise summary of key metrics (e.g., average accuracy across models and representative error rates) to the abstract. For inter-annotator agreement, we will include a description of the expert review process used for the 46% expert-authored solutions and any consistency checks performed on the remainder. revision: yes
- Referee: [Benchmark construction] The paper states that tasks 'accurately capture the complexity and traceability of real scientific reasoning processes' across 54 subfields, yet provides no details on expert validation procedures, pilot testing, or agreement scores; this directly affects the load-bearing assumption that the benchmark is a faithful proxy for scientific intelligence.
Authors: We will expand the benchmark construction section with explicit details on expert validation: each subfield task was reviewed by at least one domain specialist for scientific accuracy and reasoning traceability, followed by pilot testing on a small set of models and human subjects to calibrate difficulty. We will also report agreement scores for the expert-authored subset and describe the curation workflow that ensures coverage of real scientific processes. revision: yes
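One standard way to report the agreement scores promised here is Cohen's kappa on binary annotator judgments. The sketch below assumes two annotators and binary accept/reject labels, which may not match the paper's actual validation protocol:

```python
# Cohen's kappa for two annotators on binary labels (illustrative sketch).

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement; assumes equal-length binary label lists."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    pa_yes = sum(a) / n
    pb_yes = sum(b) / n
    # Expected agreement under independent annotators with these base rates.
    p_expected = pa_yes * pb_yes + (1 - pa_yes) * (1 - pb_yes)
    return (p_observed - p_expected) / (1 - p_expected)

kappa = cohens_kappa([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

For more than two annotators, Fleiss' kappa or Krippendorff's alpha would be the usual substitutes.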
- Referee: [Evaluation] Without reported baselines, per-subfield performance tables, or process-tracing rubrics (e.g., how partial credit is assigned to reasoning steps), the assertion that models lack 'multi-step reasoning and better integration of interdisciplinary knowledge' cannot be verified or compared to prior work.
Authors: We will augment the evaluation section with per-subfield performance tables, explicit baseline comparisons (including random guessing and human expert performance where measured), and a detailed scoring rubric that specifies how partial credit is awarded for intermediate reasoning steps. These additions will make the evidence for limitations in multi-step and interdisciplinary reasoning directly verifiable and comparable to prior benchmarks. revision: yes
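A per-subfield table of the kind promised here can be aggregated directly from item-level results. The field names below are assumptions about the shape of the released data, not its documented schema:

```python
# Sketch: per-subfield accuracy table from item-level results
# (the "subfield"/"correct" field names are assumptions, not SciVQR's schema).
from collections import defaultdict

results = [
    {"subfield": "organic chemistry", "correct": True},
    {"subfield": "organic chemistry", "correct": False},
    {"subfield": "astrophysics", "correct": True},
]

by_subfield = defaultdict(list)
for r in results:
    by_subfield[r["subfield"]].append(r["correct"])

# Fraction correct per subfield; with 54 subfields this becomes the full table.
table = {s: sum(v) / len(v) for s, v in by_subfield.items()}
# → {"organic chemistry": 0.5, "astrophysics": 1.0}
```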
Circularity Check
No significant circularity: benchmark paper with no derivations or fitted predictions
full rationale
The paper introduces SciVQR, a new multimodal benchmark covering 54 subfields with domain visuals and expert solutions. It evaluates existing MLLMs on this benchmark and reports limitations in complex reasoning. No equations, parameters, or predictive derivations are present. The central claims rest on the benchmark construction and direct model assessments, which are self-contained contributions without reduction to self-citations, self-definitions, or fitted inputs called predictions. Standard benchmark practices apply with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Scientific reasoning requires integration of multimodal inputs, domain expertise, and multi-step inference across subjects.