Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

Bart Selman; Carla Gomes; Jiahao Huo; Jingheng Ye; Philip S. Yu; Qingsong Wen; Shen Wang; Xuming Hu; Yibo Yan; Zhendong Chu

arxiv: 2502.02871 · v2 · submitted 2025-02-05 · 💻 cs.CL · cs.AI

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

Yibo Yan , Shen Wang , Jiahao Huo , Jingheng Ye , Zhendong Chu , Xuming Hu , Philip S. Yu , Carla Gomes

show 2 more authors

Bart Selman Qingsong Wen

This is my paper

Pith reviewed 2026-05-23 04:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multimodal large language modelsscientific reasoningmathematicsphysicschemistrybiologyartificial general intelligence

0 comments

The pith

Multimodal large language models can advance scientific reasoning by integrating text and images across math, physics, chemistry, and biology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This position paper argues that multimodal large language models offer a way to overcome the generalization and multimodal perception shortfalls of current scientific reasoning systems. These systems limit progress in exploring phenomena across fields. By processing diverse data types together, MLLMs can support logic and evidence-based reasoning in mathematics, physics, chemistry, and biology. A sympathetic reader would care because better scientific reasoning tools could speed up knowledge advances in multiple disciplines.

Core claim

The paper claims that MLLMs, which integrate text, images, and other modalities, present an opportunity to overcome limitations in generalization and multimodal perception and thereby significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. It proposes a four-stage research roadmap, summarizes the current state of MLLM applications, identifies key remaining challenges, and offers actionable suggestions toward artificial general intelligence.

What carries the argument

A four-stage research roadmap of scientific reasoning capabilities that tracks MLLM progress in integrating and reasoning over diverse data types.

If this is right

MLLMs will support reasoning that draws on both textual evidence and visual data in scientific tasks.
A staged roadmap will guide incremental improvements from basic perception to full cross-domain generalization.
Addressing listed challenges will move MLLM use closer to artificial general intelligence.
Applications will expand in mathematics, physics, chemistry, and biology through combined modality processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the roadmap holds, MLLMs could eventually propose and test new hypotheses without human prompts in lab settings.
Persistent perception gaps may require targeted datasets of paired scientific images and equations beyond current training.
Cross-field transfer success would imply similar gains in non-scientific multimodal tasks like medical diagnosis.

Load-bearing premise

The assumption that MLLMs' ability to integrate and reason over diverse data types will overcome current limitations in generalization and multimodal perception.

What would settle it

A controlled comparison on multimodal scientific reasoning benchmarks where MLLMs show no improvement over text-only models in generalization or accuracy.

Figures

Figures reproduced from arXiv: 2502.02871 by Bart Selman, Carla Gomes, Jiahao Huo, Jingheng Ye, Philip S. Yu, Qingsong Wen, Shen Wang, Xuming Hu, Yibo Yan, Zhendong Chu.

**Figure 1.** Figure 1: The big picture of our position. We focus on multimodal scientific fields, especially mathematics, physics, [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of MLLM-based scientific reasoning [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Challenges for MLLM-based scientific reasoning. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Eight prospects for the future of MLLMs in the [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Illustrations of alternative view 1 (a) and view 2 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Scientific reasoning, the process through which humans apply logic, evidence, and critical thinking to explore and interpret scientific phenomena, is essential in advancing knowledge reasoning across diverse fields. However, despite significant progress, current scientific reasoning models still struggle with generalization across domains and often fall short of multimodal perception. Multimodal Large Language Models (MLLMs), which integrate text, images, and other modalities, present an exciting opportunity to overcome these limitations and enhance scientific reasoning. Therefore, this position paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. First, we propose a four-stage research roadmap of scientific reasoning capabilities, and highlight the current state of MLLM applications in scientific reasoning, noting their ability to integrate and reason over diverse data types. Second, we summarize the key challenges that remain obstacles to achieving MLLM's full potential. To address these challenges, we propose actionable insights and suggestions for the future. Overall, our work offers a novel perspective on MLLM integration with scientific reasoning, providing the LLM community with a valuable vision for achieving Artificial General Intelligence (AGI).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This position paper argues that Multimodal Large Language Models (MLLMs) can significantly advance scientific reasoning across mathematics, physics, chemistry, and biology. It proposes a four-stage research roadmap for scientific reasoning capabilities, reviews the current state of MLLM applications noting their multimodal integration, summarizes remaining challenges, and offers actionable suggestions to realize their potential toward AGI.

Significance. If the roadmap and suggestions were grounded in concrete mechanisms or existing results, the paper could provide a useful organizing vision for multimodal AI in science. As written, the lack of any analysis connecting MLLM integration to gains in generalization or perception limits its potential impact on the field.

major comments (2)

[Abstract] Abstract: the claim that MLLMs 'present an exciting opportunity to overcome these limitations' (generalization across domains and multimodal perception) is asserted without any supporting analysis, derivation, illustrative case, or reference to specific MLLM results demonstrating such gains.
[Four-stage research roadmap] Four-stage research roadmap section: the stages are enumerated and current MLLM applications are noted, but no mechanism is supplied showing how multimodal integration produces the claimed improvements in cross-domain generalization; the challenges are listed but never connected back to a concrete MLLM capability or result.

minor comments (1)

The manuscript would benefit from explicit definitions or criteria for each of the four stages so that the roadmap can be evaluated for novelty and testability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. As this is a position paper, our aim is to articulate a forward-looking vision rather than deliver new empirical derivations. We address the major comments below and indicate where revisions are feasible.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that MLLMs 'present an exciting opportunity to overcome these limitations' (generalization across domains and multimodal perception) is asserted without any supporting analysis, derivation, illustrative case, or reference to specific MLLM results demonstrating such gains.

Authors: We agree the abstract states the central thesis without an immediate supporting reference or example. The manuscript is a position paper, so the claim is developed through the review of current MLLM applications in later sections rather than through new analysis. We will revise the abstract to include one or two concise references to existing MLLM results that illustrate multimodal gains in scientific tasks. revision: yes
Referee: [Four-stage research roadmap] Four-stage research roadmap section: the stages are enumerated and current MLLM applications are noted, but no mechanism is supplied showing how multimodal integration produces the claimed improvements in cross-domain generalization; the challenges are listed but never connected back to a concrete MLLM capability or result.

Authors: The roadmap is intentionally high-level to organize future research directions. Explicit causal mechanisms are not derived because the paper does not claim to introduce new technical results. We will revise the section to add brief, literature-based connections between specific MLLM capabilities (such as diagram interpretation) and potential generalization benefits, and to tie listed challenges more directly to those capabilities. revision: partial

Circularity Check

0 steps flagged

No circularity: position paper contains no derivations or self-referential constructions

full rationale

The paper is a position paper that advances an argument about MLLMs advancing scientific reasoning. It contains no equations, no fitted parameters, no derivations, and no load-bearing self-citations that reduce the central claim to its own inputs by construction. The four-stage roadmap and listed challenges are enumerated as suggestions rather than derived results. The claim is presented as a perspective, not as a prediction or theorem that collapses to prior definitions or fits within the paper itself. This is the expected outcome for a non-technical position paper with no quantitative chain to inspect.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Position paper contains no formal model, parameters, or derivations; the argument rests on domain assumptions about MLLM capabilities that are not formalized.

pith-pipeline@v0.9.0 · 5756 in / 978 out tokens · 25579 ms · 2026-05-23T04:28:29.849262+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
cs.SD 2026-05 unverdicted novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

Reference graph

Works this paper leans on

268 extracted references · 268 canonical work pages · cited by 1 Pith paper · 35 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

Improving multimodal interactive agents with reinforcement learning from human feedback

Abramson, J., Ahuja, A., Carnevale, F., Georgiev, P., Goldin, A., Hung, A., Landon, J., Lhotka, J., Lillicrap, T., Muldal, A., et al. Improving multimodal interactive agents with reinforcement learning from human feedback. arXiv preprint arXiv:2211.11602, 2022

work page arXiv 2022
[3]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

and Verma, B

Agarwal, L. and Verma, B. From methods to datasets: A survey on image-caption generators. Multimedia Tools and Applications, 83 0 (9): 0 28077--28123, 2024

work page 2024
[5]

S., Krishnan, N., and Jablonka, K

Alampara, N., Schilling-Wilhelmi, M., R \' os-Garc \' a, M., Mandal, I., Khetarpal, P., Grover, H. S., Krishnan, N., and Jablonka, K. M. Probing the limitations of multimodal language models for chemistry and materials research. arXiv preprint arXiv:2411.16955, 2024

work page arXiv 2024
[6]

arXiv preprint arXiv:2402.16827

Albalak, A., Elazar, Y., Xie, S. M., Longpre, S., Lambert, N., Wang, X., Muennighoff, N., Hou, B., Pan, L., Jeong, H., et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024

work page arXiv 2024
[7]

Multimodal large language models in health care: applications, challenges, and future outlook

AlSaad, R., Abd-Alrazaq, A., Boughorbel, S., Ahmed, A., Renault, M.-A., Damseh, R., and Sheikh, J. Multimodal large language models in health care: applications, challenges, and future outlook. Journal of medical Internet research, 26: 0 e59505, 2024

work page 2024
[8]

Claude 3.5 sonnet model card addendum

Anthropic, A. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card, 3, 2024

work page 2024
[9]

G., et al

Arora, D., Singh, H. G., et al. Have llms advanced enough? a challenging problem solving benchmark for large language models. arXiv preprint arXiv:2305.15074, 2023

work page arXiv 2023
[10]

Llemma: An Open Language Model For Mathematics

Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D., McAleer, S. M., Jiang, A. Q., Deng, J., Biderman, S., and Welleck, S. Llemma: An open language model for mathematics. ArXiv, abs/2310.10631, 2023. URL https://api.semanticscholar.org/CorpusID:264172303

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

K., and Priyakumar, U

Bagal, V., Aggarwal, R., Vinod, P. K., and Priyakumar, U. D. Molgpt: Molecular generation using a transformer-decoder model. Journal of chemical information and modeling, 2021. URL https://api.semanticscholar.org/CorpusID:263484152

work page 2021
[12]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

A survey of multimodal large language model from a data-centric perspective

Bai, T., Liang, H., Wan, B., Xu, Y., Li, X., Li, S., Yang, L., Li, B., Wang, Y., Cui, B., et al. A survey of multimodal large language model from a data-centric perspective. arXiv preprint arXiv:2405.16640, 2024 a

work page arXiv 2024
[14]

Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., and Shou, M. Z. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Learning and scientific reasoning

Bao, L., Cai, T., Koenig, K., Fang, K., Han, J., Wang, J., Liu, Q., Ding, L., Cui, L., Luo, Y., et al. Learning and scientific reasoning. Science, 323 0 (5914): 0 586--587, 2009

work page 2009
[16]

a rber, M., Fr \

Barman, K. G., Caron, S., Sullivan, E., de Regt, H. W., de Austri, R. R., Boon, M., F \"a rber, M., Fr \"o se, S., Hasibi, F., Ipp, A., et al. Large physics models: Towards a collaborative approach with large language models and foundation models. arXiv preprint arXiv:2501.05382, 2025

work page arXiv 2025
[17]

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Bayoudh, K., Knani, R., Hamdaoui, F., and Mtibaa, A. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer, 38 0 (8): 0 2939--2970, 2022

work page 2022
[18]

C., Jiang, A., Li, J., Lipkin, B., Qina, Z., Rasul, K., Shen, Z., Soletskyi, R., and Tunstall, L

Beeching, E., Huang, S. C., Jiang, A., Li, J., Lipkin, B., Qina, Z., Rasul, K., Shen, Z., Soletskyi, R., and Tunstall, L. Numinamath 7b tir. https://huggingface.co/AI-MO/NuminaMath-7B-TIR, 2024

work page 2024
[19]

Reasoning language models: A blueprint

Besta, M., Barth, J., Schreiber, E., Kubicek, A., Catarino, A., Gerstenberger, R., Nyczyk, P., Iff, P., Li, Y., Houliston, S., et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025

work page arXiv 2025
[20]

Science in the age of large language models

Birhane, A., Kasirzadeh, A., Leslie, D., and Wachter, S. Science in the age of large language models. Nature Reviews Physics, 5 0 (5): 0 277--280, 2023

work page 2023
[21]

Xai meets llms: A survey of the relation between explainable ai and large language models

Cambria, E., Malandri, L., Mercorio, F., Nobani, N., and Seveso, A. Xai meets llms: A survey of the relation between explainable ai and large language models. arXiv preprint arXiv:2407.15248, 2024

work page arXiv 2024
[23]

Preprint, arXiv:2311.16208

Cao, H., Liu, Z., Lu, X., Yao, Y., and Li, Y. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2311.16208, 2023 b

work page arXiv 2023
[24]

Hallucination detection in foundation models for decision-making: A flexible definition and review of the state of the art

Chakraborty, N., Ornik, M., and Driggs-Campbell, K. Hallucination detection in foundation models for decision-making: A flexible definition and review of the state of the art. arXiv preprint arXiv:2403.16527, 2024

work page arXiv 2024
[25]

and Ye, J.-C

Chang, J. and Ye, J.-C. Bidirectional generation of structure and properties through a single molecular foundation model. Nature Communications, 15, 2022. URL https://api.semanticscholar.org/CorpusID:256827263

work page 2022
[26]

Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers

Chefer, H., Gur, S., and Wolf, L. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 397--406, 2021

work page 2021
[27]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

Chen, W., Su, Y., Zuo, J., Yang, C., Yuan, C., Chan, C.-M., Yu, H., Lu, Y., Hung, Y.-H., Qian, C., et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2023 a

work page 2023
[28]

Theoremqa: A theorem-driven question answering dataset

Chen, W., Yin, M., Ku, M., Lu, P., Wan, Y., Ma, X., Xu, J., Wang, X., and Xia, T. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 7889--7901, 2023 b

work page 2023
[29]

Fine-tuning large language models in education

Chen, Y., Chen, H., and Su, S. Fine-tuning large language models in education. In 2023 13th International Conference on Information Technology in Medicine and Education (ITME), pp.\ 718--723. IEEE, 2023 c

work page 2023
[30]

Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery

Chen, Z., Chen, S., Ning, Y., Zhang, Q., Wang, B., Yu, B., Li, Y., Liao, Z., Wei, C., Lu, Z., et al. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024 a

work page arXiv 2024
[31]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67 0 (12): 0 220101, 2024 b

work page 2024
[32]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 24185--24198, 2024 c

work page 2024
[33]

Generative ai for math: Abel

Chern, E., Zou, H., Li, X., Hu, J., Feng, K., Li, J., and Liu, P. Generative ai for math: Abel. https://github.com/GAIR-NLP/abel, 2023

work page 2023
[34]

Unveiling causal reasoning in large language models: Reality or mirage? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

Chi, H., Li, H., Yang, W., Liu, F., Lan, L., Ren, X., Liu, T., and Han, B. Unveiling causal reasoning in large language models: Reality or mirage? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[35]

Exploring response uncertainty in mllms: An empirical evaluation under misleading scenarios

Dang, Y., Gao, M., Yan, Y., Zou, X., Gu, Y., Liu, A., and Hu, X. Exploring response uncertainty in mllms: An empirical evaluation under misleading scenarios. arXiv preprint arXiv:2411.02708, 2024 a

work page arXiv 2024
[36]

Explainable and interpretable multimodal large language models: A comprehensive survey

Dang, Y., Huang, K., Huo, J., Yan, Y., Huang, S., Liu, D., Gao, M., Zhang, J., Qian, C., Wang, K., et al. Explainable and interpretable multimodal large language models: A comprehensive survey. arXiv preprint arXiv:2412.02104, 2024 b

work page arXiv 2024
[37]

EXAMS - V : A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models

Das, R., Hristov, S., Li, H., Dimitrov, D., Koychev, I., and Nakov, P. EXAMS - V : A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 7768--7...

work page doi:10.18653/v1/2024.acl-long.420 2024
[38]

Dehimi, N. E. H. and Tolba, Z. Attention mechanisms in deep learning: Towards explainable artificial intelligence. In 2024 6th International Conference on Pattern Analysis and Intelligent Systems (PAIS), pp.\ 1--7. IEEE, 2024

work page 2024
[39]

Desirable characteristics for ai teaching assistants in programming education

Denny, P., MacNeil, S., Savelka, J., Porter, L., and Luxton-Reilly, A. Desirable characteristics for ai teaching assistants in programming education. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, pp.\ 408--414. 2024

work page 2024
[40]

Codefuse-13b: A pretrained multi-lingual code large language model

Di, P., Li, J., Yu, H., Jiang, W., Cai, W., Cao, Y., Chen, C., Chen, D., Chen, H., Chen, L., et al. Codefuse-13b: A pretrained multi-lingual code large language model. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, pp.\ 418--429, 2024

work page 2024
[41]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

A survey on rag meeting llms: Towards retrieval-augmented large language models

Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., Chua, T.-S., and Li, Q. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.\ 6491--6501, 2024

work page 2024
[43]

Towards artificial general intelligence via a multimodal foundation model

Fei, N., Lu, Z., Gao, Y., Yang, G., Huo, Y., Wen, J., Lu, H., Song, R., Gao, X., Xiang, T., et al. Towards artificial general intelligence via a multimodal foundation model. Nature Communications, 13 0 (1): 0 3094, 2022

work page 2022
[44]

How far are we from agi

Feng, T., Jin, C., Liu, J., Zhu, K., Tu, H., Cheng, Z., Lin, G., and You, J. How far are we from agi. arXiv preprint arXiv:2405.10313, 2024

work page arXiv 2024
[45]

Flores, L., Kim, S., and Young, S. D. Addressing bias in artificial intelligence for public health surveillance. Journal of Medical Ethics, 50 0 (3): 0 190--194, 2024

work page 2024
[46]

Mme-survey: A comprehensive survey on evaluation of multimodal llms

Fu, C., Zhang, Y.-F., Yin, S., Li, B., Fang, X., Zhao, S., Duan, H., Sun, X., Liu, Z., Wang, L., et al. Mme-survey: A comprehensive survey on evaluation of multimodal llms. arXiv preprint arXiv:2411.15296, 2024

work page arXiv 2024
[47]

Kwaiyiimath: Technical report

Fu, J.-Y., Lin, L., Gao, X., Liu, P., Chen, Z., Yang, Z., Zhang, S., Zheng, X., Li, Y., Liu, Y., Ye, X., Liao, Y., Liao, C., Chen, B., Song, C., Wan, J., Lin, Z., Zhang, F., Wang, Z., Zhang, D., and Gai, K. Kwaiyiimath: Technical report. ArXiv, abs/2310.07488, 2023. URL https://api.semanticscholar.org/CorpusID:263834833

work page arXiv 2023
[48]

Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al

Gadre, S. Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[49]

Large language models empowered agent-based modeling and simulation: A survey and perspectives

Gao, C., Lan, X., Li, N., Yuan, Y., Ding, J., Zhou, Z., Xu, F., and Li, Y. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11 0 (1): 0 1--24, 2024 a

work page 2024
[50]

Physically grounded vision-language models for robotic manipulation

Gao, J., Sarkar, B., Xia, F., Xiao, T., Wu, J., Ichter, B., Majumdar, A., and Sadigh, D. Physically grounded vision-language models for robotic manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 12462--12469. IEEE, 2024 b

work page 2024
[51]

Goodman, S. N. Aligning statistical and scientific reasoning. Science, 352 0 (6290): 0 1180--1181, 2016

work page 2016
[52]

and Wang, G

Guan, S. and Wang, G. Drug discovery and development in the era of artificial intelligence: From machine learning to large language models. Artificial Intelligence Chemistry, 2 0 (1): 0 100070, 2024

work page 2024
[53]

Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting

Guan, X., Liu, Y., Lin, H., Lu, Y., He, B., Han, X., and Sun, L. Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 18126--18134, 2024

work page 2024
[54]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al. Deepseek-coder: When the large language model meets programming--the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

What can large language models do in chemistry? a comprehensive benchmark on eight tasks

Guo, T., Nan, B., Liang, Z., Guo, Z., Chawla, N., Wiest, O., Zhang, X., et al. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. Advances in Neural Information Processing Systems, 36: 0 59662--59688, 2023

work page 2023
[56]

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., Wiest, O., and Zhang, X. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

Reasoning with Language Model is Planning with World Model

Hao, S., Gu, Y., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction

Hao, X., Chen, W., Yan, Y., Zhong, S., Wang, K., Wen, Q., and Liang, Y. Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction. arXiv preprint arXiv:2403.16831, 2024

work page arXiv 2024
[59]

W., Li, L., Yang, Z., Wang, L., and Cheng, Y

Hao, Y., Gu, J., Wang, H. W., Li, L., Yang, Z., Wang, L., and Cheng, Y. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark. arXiv preprint arXiv:2501.05444, 2025

work page arXiv 2025
[60]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

Mmworld: Towards multi-discipline multi-faceted world model evaluation in videos

He, X., Feng, W., Zheng, K., Lu, Y., Zhu, W., Li, J., Fan, Y., Wang, J., Li, L., Yang, Z., et al. Mmworld: Towards multi-discipline multi-faceted world model evaluation in videos. arXiv preprint arXiv:2406.08407, 2024 b

work page arXiv 2024
[62]

Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning

He, Z., Wu, X., Zhou, P., Xuan, R., Liu, G., Yang, X., Zhu, Q., and Huang, H. Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning. arXiv preprint arXiv:2401.14011, 2024 c

work page arXiv 2024
[63]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2009
[64]

Hocky, G. M. Connecting molecular properties with plain language. Nature Machine Intelligence, 6 0 (3): 0 249--250, 2024

work page 2024
[65]

SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions

Horawalavithana, S., Munikoti, S., Stewart, I., and Kvinge, H. Scitune: Aligning large language models with scientific multimodal instructions. arXiv preprint arXiv:2307.01139, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[66]

Teaching plan generation and evaluation with gpt-4: Unleashing the potential of llm in instructional design

Hu, B., Zheng, L., Zhu, J., Ding, L., Wang, Y., and Gu, X. Teaching plan generation and evaluation with gpt-4: Unleashing the potential of llm in instructional design. IEEE Transactions on Learning Technologies, 2024

work page 2024
[67]

and Yu, K

Hu, S. and Yu, K. Learning robust rationales for model explainability: A guidance-based approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 18243--18251, 2024

work page 2024
[68]

Towards Reasoning in Large Language Models: A Survey

Huang, J. and Chang, K. C.-C. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[69]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 2023

work page 2023
[70]

Adapting large language models for biomedicine though retrieval-augmented generation with documents scoring

Huang, Y., Gao, T., Zhang, J., Liu, X., and Wang, G. Adapting large language models for biomedicine though retrieval-augmented generation with documents scoring. In 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp.\ 5770--5775. IEEE, 2024 a

work page 2024
[71]

Olympicarena medal ranks: Who is the most intelligent ai so far? arXiv preprint arXiv:2406.16772, 2024 b

Huang, Z., Wang, Z., Xia, S., and Liu, P. Olympicarena medal ranks: Who is the most intelligent ai so far? arXiv preprint arXiv:2406.16772, 2024 b

work page arXiv 2024
[72]

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[73]

Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model

Huo, J., Yan, Y., Hu, B., Yue, Y., and Hu, X. Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model. arXiv preprint arXiv:2406.11193, 2024

work page arXiv 2024
[74]

F., Shovon, M

Ishmam, M. F., Shovon, M. S. H., Mridha, M. F., and Dey, N. From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities. Information Fusion, pp.\ 102270, 2024

work page 2024
[75]

OpenAI o1 System Card

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[76]

P., Anand, A., Dharmadhikari, A., Marathe, A., and Shah, R

Jaiswal, R., Jain, D., Popat, H. P., Anand, A., Dharmadhikari, A., Marathe, A., and Shah, R. R. Improving physics reasoning in large language models using mixture of refinement agents. arXiv preprint arXiv:2412.00821, 2024

work page arXiv 2024
[77]

A survey on large language model hallucination via a creativity perspective

Jiang, X., Tian, Y., Hua, F., Xu, C., Wang, Y., and Guo, J. A survey on large language model hallucination via a creativity perspective. arXiv preprint arXiv:2402.06647, 2024

work page arXiv 2024
[78]

Reasoning grasping via multimodal large language model

Jin, S., Xu, J., Lei, Y., and Zhang, L. Reasoning grasping via multimodal large language model. arXiv preprint arXiv:2402.06798, 2024 a

work page arXiv 2024
[79]

Cladder: A benchmark to assess causal reasoning capabilities of language models

Jin, Z., Chen, Y., Leeb, F., Gresele, L., Kamal, O., Lyu, Z., Blin, K., Gonzalez Adauto, F., Kleiman-Weiner, M., Sachan, M., et al. Cladder: A benchmark to assess causal reasoning capabilities of language models. Advances in Neural Information Processing Systems, 36, 2024 b

work page 2024
[80]

C., De Sousa Ribeiro, F., Oktay, O., McCradden, M., and Glocker, B

Jones, C., Castro, D. C., De Sousa Ribeiro, F., Oktay, O., McCradden, M., and Glocker, B. A causal perspective on dataset bias in machine learning for medical imaging. Nature Machine Intelligence, 6 0 (2): 0 138--146, 2024

work page 2024
[81]

Highly accurate protein structure prediction with alphafold

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Z \' dek, A., Potapenko, A., et al. Highly accurate protein structure prediction with alphafold. nature, 596 0 (7873): 0 583--589, 2021

work page 2021

Showing first 80 references.

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

Improving multimodal interactive agents with reinforcement learning from human feedback

Abramson, J., Ahuja, A., Carnevale, F., Georgiev, P., Goldin, A., Hung, A., Landon, J., Lhotka, J., Lillicrap, T., Muldal, A., et al. Improving multimodal interactive agents with reinforcement learning from human feedback. arXiv preprint arXiv:2211.11602, 2022

work page arXiv 2022

[3] [3]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

and Verma, B

Agarwal, L. and Verma, B. From methods to datasets: A survey on image-caption generators. Multimedia Tools and Applications, 83 0 (9): 0 28077--28123, 2024

work page 2024

[5] [5]

S., Krishnan, N., and Jablonka, K

Alampara, N., Schilling-Wilhelmi, M., R \' os-Garc \' a, M., Mandal, I., Khetarpal, P., Grover, H. S., Krishnan, N., and Jablonka, K. M. Probing the limitations of multimodal language models for chemistry and materials research. arXiv preprint arXiv:2411.16955, 2024

work page arXiv 2024

[6] [6]

arXiv preprint arXiv:2402.16827

Albalak, A., Elazar, Y., Xie, S. M., Longpre, S., Lambert, N., Wang, X., Muennighoff, N., Hou, B., Pan, L., Jeong, H., et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024

work page arXiv 2024

[7] [7]

Multimodal large language models in health care: applications, challenges, and future outlook

AlSaad, R., Abd-Alrazaq, A., Boughorbel, S., Ahmed, A., Renault, M.-A., Damseh, R., and Sheikh, J. Multimodal large language models in health care: applications, challenges, and future outlook. Journal of medical Internet research, 26: 0 e59505, 2024

work page 2024

[8] [8]

Claude 3.5 sonnet model card addendum

Anthropic, A. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card, 3, 2024

work page 2024

[9] [9]

G., et al

Arora, D., Singh, H. G., et al. Have llms advanced enough? a challenging problem solving benchmark for large language models. arXiv preprint arXiv:2305.15074, 2023

work page arXiv 2023

[10] [10]

Llemma: An Open Language Model For Mathematics

Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D., McAleer, S. M., Jiang, A. Q., Deng, J., Biderman, S., and Welleck, S. Llemma: An open language model for mathematics. ArXiv, abs/2310.10631, 2023. URL https://api.semanticscholar.org/CorpusID:264172303

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

K., and Priyakumar, U

Bagal, V., Aggarwal, R., Vinod, P. K., and Priyakumar, U. D. Molgpt: Molecular generation using a transformer-decoder model. Journal of chemical information and modeling, 2021. URL https://api.semanticscholar.org/CorpusID:263484152

work page 2021

[12] [12]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

A survey of multimodal large language model from a data-centric perspective

Bai, T., Liang, H., Wan, B., Xu, Y., Li, X., Li, S., Yang, L., Li, B., Wang, Y., Cui, B., et al. A survey of multimodal large language model from a data-centric perspective. arXiv preprint arXiv:2405.16640, 2024 a

work page arXiv 2024

[14] [14]

Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., and Shou, M. Z. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Learning and scientific reasoning

Bao, L., Cai, T., Koenig, K., Fang, K., Han, J., Wang, J., Liu, Q., Ding, L., Cui, L., Luo, Y., et al. Learning and scientific reasoning. Science, 323 0 (5914): 0 586--587, 2009

work page 2009

[16] [16]

a rber, M., Fr \

Barman, K. G., Caron, S., Sullivan, E., de Regt, H. W., de Austri, R. R., Boon, M., F \"a rber, M., Fr \"o se, S., Hasibi, F., Ipp, A., et al. Large physics models: Towards a collaborative approach with large language models and foundation models. arXiv preprint arXiv:2501.05382, 2025

work page arXiv 2025

[17] [17]

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Bayoudh, K., Knani, R., Hamdaoui, F., and Mtibaa, A. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer, 38 0 (8): 0 2939--2970, 2022

work page 2022

[18] [18]

C., Jiang, A., Li, J., Lipkin, B., Qina, Z., Rasul, K., Shen, Z., Soletskyi, R., and Tunstall, L

Beeching, E., Huang, S. C., Jiang, A., Li, J., Lipkin, B., Qina, Z., Rasul, K., Shen, Z., Soletskyi, R., and Tunstall, L. Numinamath 7b tir. https://huggingface.co/AI-MO/NuminaMath-7B-TIR, 2024

work page 2024

[19] [19]

Reasoning language models: A blueprint

Besta, M., Barth, J., Schreiber, E., Kubicek, A., Catarino, A., Gerstenberger, R., Nyczyk, P., Iff, P., Li, Y., Houliston, S., et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025

work page arXiv 2025

[20] [20]

Science in the age of large language models

Birhane, A., Kasirzadeh, A., Leslie, D., and Wachter, S. Science in the age of large language models. Nature Reviews Physics, 5 0 (5): 0 277--280, 2023

work page 2023

[21] [21]

Xai meets llms: A survey of the relation between explainable ai and large language models

Cambria, E., Malandri, L., Mercorio, F., Nobani, N., and Seveso, A. Xai meets llms: A survey of the relation between explainable ai and large language models. arXiv preprint arXiv:2407.15248, 2024

work page arXiv 2024

[22] [23]

Preprint, arXiv:2311.16208

Cao, H., Liu, Z., Lu, X., Yao, Y., and Li, Y. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2311.16208, 2023 b

work page arXiv 2023

[23] [24]

Hallucination detection in foundation models for decision-making: A flexible definition and review of the state of the art

Chakraborty, N., Ornik, M., and Driggs-Campbell, K. Hallucination detection in foundation models for decision-making: A flexible definition and review of the state of the art. arXiv preprint arXiv:2403.16527, 2024

work page arXiv 2024

[24] [25]

and Ye, J.-C

Chang, J. and Ye, J.-C. Bidirectional generation of structure and properties through a single molecular foundation model. Nature Communications, 15, 2022. URL https://api.semanticscholar.org/CorpusID:256827263

work page 2022

[25] [26]

Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers

Chefer, H., Gur, S., and Wolf, L. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 397--406, 2021

work page 2021

[26] [27]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

Chen, W., Su, Y., Zuo, J., Yang, C., Yuan, C., Chan, C.-M., Yu, H., Lu, Y., Hung, Y.-H., Qian, C., et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2023 a

work page 2023

[27] [28]

Theoremqa: A theorem-driven question answering dataset

Chen, W., Yin, M., Ku, M., Lu, P., Wan, Y., Ma, X., Xu, J., Wang, X., and Xia, T. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 7889--7901, 2023 b

work page 2023

[28] [29]

Fine-tuning large language models in education

Chen, Y., Chen, H., and Su, S. Fine-tuning large language models in education. In 2023 13th International Conference on Information Technology in Medicine and Education (ITME), pp.\ 718--723. IEEE, 2023 c

work page 2023

[29] [30]

Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery

Chen, Z., Chen, S., Ning, Y., Zhang, Q., Wang, B., Yu, B., Li, Y., Liao, Z., Wei, C., Lu, Z., et al. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024 a

work page arXiv 2024

[30] [31]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67 0 (12): 0 220101, 2024 b

work page 2024

[31] [32]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 24185--24198, 2024 c

work page 2024

[32] [33]

Generative ai for math: Abel

Chern, E., Zou, H., Li, X., Hu, J., Feng, K., Li, J., and Liu, P. Generative ai for math: Abel. https://github.com/GAIR-NLP/abel, 2023

work page 2023

[33] [34]

Unveiling causal reasoning in large language models: Reality or mirage? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

Chi, H., Li, H., Yang, W., Liu, F., Lan, L., Ren, X., Liu, T., and Han, B. Unveiling causal reasoning in large language models: Reality or mirage? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[34] [35]

Exploring response uncertainty in mllms: An empirical evaluation under misleading scenarios

Dang, Y., Gao, M., Yan, Y., Zou, X., Gu, Y., Liu, A., and Hu, X. Exploring response uncertainty in mllms: An empirical evaluation under misleading scenarios. arXiv preprint arXiv:2411.02708, 2024 a

work page arXiv 2024

[35] [36]

Explainable and interpretable multimodal large language models: A comprehensive survey

Dang, Y., Huang, K., Huo, J., Yan, Y., Huang, S., Liu, D., Gao, M., Zhang, J., Qian, C., Wang, K., et al. Explainable and interpretable multimodal large language models: A comprehensive survey. arXiv preprint arXiv:2412.02104, 2024 b

work page arXiv 2024

[36] [37]

EXAMS - V : A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models

Das, R., Hristov, S., Li, H., Dimitrov, D., Koychev, I., and Nakov, P. EXAMS - V : A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 7768--7...

work page doi:10.18653/v1/2024.acl-long.420 2024

[37] [38]

Dehimi, N. E. H. and Tolba, Z. Attention mechanisms in deep learning: Towards explainable artificial intelligence. In 2024 6th International Conference on Pattern Analysis and Intelligent Systems (PAIS), pp.\ 1--7. IEEE, 2024

work page 2024

[38] [39]

Desirable characteristics for ai teaching assistants in programming education

Denny, P., MacNeil, S., Savelka, J., Porter, L., and Luxton-Reilly, A. Desirable characteristics for ai teaching assistants in programming education. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, pp.\ 408--414. 2024

work page 2024

[39] [40]

Codefuse-13b: A pretrained multi-lingual code large language model

Di, P., Li, J., Yu, H., Jiang, W., Cai, W., Cao, Y., Chen, C., Chen, D., Chen, H., Chen, L., et al. Codefuse-13b: A pretrained multi-lingual code large language model. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, pp.\ 418--429, 2024

work page 2024

[40] [41]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [42]

A survey on rag meeting llms: Towards retrieval-augmented large language models

Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., Chua, T.-S., and Li, Q. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.\ 6491--6501, 2024

work page 2024

[42] [43]

Towards artificial general intelligence via a multimodal foundation model

Fei, N., Lu, Z., Gao, Y., Yang, G., Huo, Y., Wen, J., Lu, H., Song, R., Gao, X., Xiang, T., et al. Towards artificial general intelligence via a multimodal foundation model. Nature Communications, 13 0 (1): 0 3094, 2022

work page 2022

[43] [44]

How far are we from agi

Feng, T., Jin, C., Liu, J., Zhu, K., Tu, H., Cheng, Z., Lin, G., and You, J. How far are we from agi. arXiv preprint arXiv:2405.10313, 2024

work page arXiv 2024

[44] [45]

Flores, L., Kim, S., and Young, S. D. Addressing bias in artificial intelligence for public health surveillance. Journal of Medical Ethics, 50 0 (3): 0 190--194, 2024

work page 2024

[45] [46]

Mme-survey: A comprehensive survey on evaluation of multimodal llms

Fu, C., Zhang, Y.-F., Yin, S., Li, B., Fang, X., Zhao, S., Duan, H., Sun, X., Liu, Z., Wang, L., et al. Mme-survey: A comprehensive survey on evaluation of multimodal llms. arXiv preprint arXiv:2411.15296, 2024

work page arXiv 2024

[46] [47]

Kwaiyiimath: Technical report

Fu, J.-Y., Lin, L., Gao, X., Liu, P., Chen, Z., Yang, Z., Zhang, S., Zheng, X., Li, Y., Liu, Y., Ye, X., Liao, Y., Liao, C., Chen, B., Song, C., Wan, J., Lin, Z., Zhang, F., Wang, Z., Zhang, D., and Gai, K. Kwaiyiimath: Technical report. ArXiv, abs/2310.07488, 2023. URL https://api.semanticscholar.org/CorpusID:263834833

work page arXiv 2023

[47] [48]

Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al

Gadre, S. Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[48] [49]

Large language models empowered agent-based modeling and simulation: A survey and perspectives

Gao, C., Lan, X., Li, N., Yuan, Y., Ding, J., Zhou, Z., Xu, F., and Li, Y. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11 0 (1): 0 1--24, 2024 a

work page 2024

[49] [50]

Physically grounded vision-language models for robotic manipulation

Gao, J., Sarkar, B., Xia, F., Xiao, T., Wu, J., Ichter, B., Majumdar, A., and Sadigh, D. Physically grounded vision-language models for robotic manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 12462--12469. IEEE, 2024 b

work page 2024

[50] [51]

Goodman, S. N. Aligning statistical and scientific reasoning. Science, 352 0 (6290): 0 1180--1181, 2016

work page 2016

[51] [52]

and Wang, G

Guan, S. and Wang, G. Drug discovery and development in the era of artificial intelligence: From machine learning to large language models. Artificial Intelligence Chemistry, 2 0 (1): 0 100070, 2024

work page 2024

[52] [53]

Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting

Guan, X., Liu, Y., Lin, H., Lu, Y., He, B., Han, X., and Sun, L. Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 18126--18134, 2024

work page 2024

[53] [54]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al. Deepseek-coder: When the large language model meets programming--the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [55]

What can large language models do in chemistry? a comprehensive benchmark on eight tasks

Guo, T., Nan, B., Liang, Z., Guo, Z., Chawla, N., Wiest, O., Zhang, X., et al. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. Advances in Neural Information Processing Systems, 36: 0 59662--59688, 2023

work page 2023

[55] [56]

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., Wiest, O., and Zhang, X. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024

[56] [57]

Reasoning with Language Model is Planning with World Model

Hao, S., Gu, Y., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[57] [58]

Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction

Hao, X., Chen, W., Yan, Y., Zhong, S., Wang, K., Wen, Q., and Liang, Y. Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction. arXiv preprint arXiv:2403.16831, 2024

work page arXiv 2024

[58] [59]

W., Li, L., Yang, Z., Wang, L., and Cheng, Y

Hao, Y., Gu, J., Wang, H. W., Li, L., Yang, Z., Wang, L., and Cheng, Y. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark. arXiv preprint arXiv:2501.05444, 2025

work page arXiv 2025

[59] [60]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [61]

Mmworld: Towards multi-discipline multi-faceted world model evaluation in videos

He, X., Feng, W., Zheng, K., Lu, Y., Zhu, W., Li, J., Fan, Y., Wang, J., Li, L., Yang, Z., et al. Mmworld: Towards multi-discipline multi-faceted world model evaluation in videos. arXiv preprint arXiv:2406.08407, 2024 b

work page arXiv 2024

[61] [62]

Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning

He, Z., Wu, X., Zhou, P., Xuan, R., Liu, G., Yang, X., Zhu, Q., and Huang, H. Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning. arXiv preprint arXiv:2401.14011, 2024 c

work page arXiv 2024

[62] [63]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2009

[63] [64]

Hocky, G. M. Connecting molecular properties with plain language. Nature Machine Intelligence, 6 0 (3): 0 249--250, 2024

work page 2024

[64] [65]

SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions

Horawalavithana, S., Munikoti, S., Stewart, I., and Kvinge, H. Scitune: Aligning large language models with scientific multimodal instructions. arXiv preprint arXiv:2307.01139, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[65] [66]

Teaching plan generation and evaluation with gpt-4: Unleashing the potential of llm in instructional design

Hu, B., Zheng, L., Zhu, J., Ding, L., Wang, Y., and Gu, X. Teaching plan generation and evaluation with gpt-4: Unleashing the potential of llm in instructional design. IEEE Transactions on Learning Technologies, 2024

work page 2024

[66] [67]

and Yu, K

Hu, S. and Yu, K. Learning robust rationales for model explainability: A guidance-based approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 18243--18251, 2024

work page 2024

[67] [68]

Towards Reasoning in Large Language Models: A Survey

Huang, J. and Chang, K. C.-C. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[68] [69]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 2023

work page 2023

[69] [70]

Adapting large language models for biomedicine though retrieval-augmented generation with documents scoring

Huang, Y., Gao, T., Zhang, J., Liu, X., and Wang, G. Adapting large language models for biomedicine though retrieval-augmented generation with documents scoring. In 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp.\ 5770--5775. IEEE, 2024 a

work page 2024

[70] [71]

Olympicarena medal ranks: Who is the most intelligent ai so far? arXiv preprint arXiv:2406.16772, 2024 b

Huang, Z., Wang, Z., Xia, S., and Liu, P. Olympicarena medal ranks: Who is the most intelligent ai so far? arXiv preprint arXiv:2406.16772, 2024 b

work page arXiv 2024

[71] [72]

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[72] [73]

Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model

Huo, J., Yan, Y., Hu, B., Yue, Y., and Hu, X. Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model. arXiv preprint arXiv:2406.11193, 2024

work page arXiv 2024

[73] [74]

F., Shovon, M

Ishmam, M. F., Shovon, M. S. H., Mridha, M. F., and Dey, N. From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities. Information Fusion, pp.\ 102270, 2024

work page 2024

[74] [75]

OpenAI o1 System Card

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[75] [76]

P., Anand, A., Dharmadhikari, A., Marathe, A., and Shah, R

Jaiswal, R., Jain, D., Popat, H. P., Anand, A., Dharmadhikari, A., Marathe, A., and Shah, R. R. Improving physics reasoning in large language models using mixture of refinement agents. arXiv preprint arXiv:2412.00821, 2024

work page arXiv 2024

[76] [77]

A survey on large language model hallucination via a creativity perspective

Jiang, X., Tian, Y., Hua, F., Xu, C., Wang, Y., and Guo, J. A survey on large language model hallucination via a creativity perspective. arXiv preprint arXiv:2402.06647, 2024

work page arXiv 2024

[77] [78]

Reasoning grasping via multimodal large language model

Jin, S., Xu, J., Lei, Y., and Zhang, L. Reasoning grasping via multimodal large language model. arXiv preprint arXiv:2402.06798, 2024 a

work page arXiv 2024

[78] [79]

Cladder: A benchmark to assess causal reasoning capabilities of language models

Jin, Z., Chen, Y., Leeb, F., Gresele, L., Kamal, O., Lyu, Z., Blin, K., Gonzalez Adauto, F., Kleiman-Weiner, M., Sachan, M., et al. Cladder: A benchmark to assess causal reasoning capabilities of language models. Advances in Neural Information Processing Systems, 36, 2024 b

work page 2024

[79] [80]

C., De Sousa Ribeiro, F., Oktay, O., McCradden, M., and Glocker, B

Jones, C., Castro, D. C., De Sousa Ribeiro, F., Oktay, O., McCradden, M., and Glocker, B. A causal perspective on dataset bias in machine learning for medical imaging. Nature Machine Intelligence, 6 0 (2): 0 138--146, 2024

work page 2024

[80] [81]

Highly accurate protein structure prediction with alphafold

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Z \' dek, A., Potapenko, A., et al. Highly accurate protein structure prediction with alphafold. nature, 596 0 (7873): 0 583--589, 2021

work page 2021