arxiv: 2502.02871 · v2 · submitted 2025-02-05 · 💻 cs.CL · cs.AI

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

Pith reviewed 2026-05-23 04:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multimodal large language models · scientific reasoning · mathematics · physics · chemistry · biology · artificial general intelligence

The pith

Multimodal large language models can advance scientific reasoning by integrating text and images across math, physics, chemistry, and biology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This position paper argues that multimodal large language models (MLLMs) offer a way to overcome the generalization and multimodal perception shortfalls of current scientific reasoning systems. Those shortfalls limit progress in exploring scientific phenomena across fields. By processing diverse data types together, MLLMs can support logic- and evidence-based reasoning in mathematics, physics, chemistry, and biology. A sympathetic reader would care because better scientific reasoning tools could speed up the advance of knowledge in multiple disciplines.

Core claim

The paper claims that MLLMs, which integrate text, images, and other modalities, present an opportunity to overcome limitations in generalization and multimodal perception and thereby significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. It proposes a four-stage research roadmap, summarizes the current state of MLLM applications, identifies key remaining challenges, and offers actionable suggestions toward artificial general intelligence.

What carries the argument

A four-stage research roadmap of scientific reasoning capabilities that tracks MLLM progress in integrating and reasoning over diverse data types.

If this is right

  • MLLMs will support reasoning that draws on both textual evidence and visual data in scientific tasks.
  • A staged roadmap will guide incremental improvements from basic perception to full cross-domain generalization.
  • Addressing listed challenges will move MLLM use closer to artificial general intelligence.
  • Applications will expand in mathematics, physics, chemistry, and biology through combined modality processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the roadmap holds, MLLMs could eventually propose and test new hypotheses without human prompts in lab settings.
  • Persistent perception gaps may require targeted datasets of paired scientific images and equations beyond current training.
  • Cross-field transfer success would imply similar gains in non-scientific multimodal tasks like medical diagnosis.

Load-bearing premise

The assumption that MLLMs' ability to integrate and reason over diverse data types will overcome current limitations in generalization and multimodal perception.

What would settle it

A controlled comparison on multimodal scientific reasoning benchmarks where MLLMs show no improvement over text-only models in generalization or accuracy.
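
One way such a controlled comparison could be scored is sketched below. This is a minimal illustration, not a procedure from the paper: it assumes both systems are graded on the same benchmark items, that per-item correctness labels are available, and it picks an exact McNemar (sign) test as one reasonable paired test among several. The function name `paired_sign_test` and the toy data are hypothetical.

```python
from math import comb


def paired_sign_test(mllm_correct, text_correct):
    """Exact two-sided McNemar (sign) test on paired per-item correctness.

    Returns (items only the MLLM got right,
             items only the text-only model got right,
             two-sided p-value).
    """
    if len(mllm_correct) != len(text_correct):
        raise ValueError("both systems must be scored on the same items")

    # Only discordant pairs (exactly one system correct) carry information.
    mllm_only = sum(1 for m, t in zip(mllm_correct, text_correct) if m and not t)
    text_only = sum(1 for m, t in zip(mllm_correct, text_correct) if t and not m)
    n = mllm_only + text_only
    if n == 0:
        return mllm_only, text_only, 1.0

    # Null hypothesis: each discordant pair favors either system with
    # probability 1/2, so the smaller count follows Binomial(n, 0.5).
    k = min(mllm_only, text_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return mllm_only, text_only, min(1.0, 2 * tail)


# Hypothetical toy data: 1 = item answered correctly, 0 = incorrectly.
mllm = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
text = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(paired_sign_test(mllm, text))
```

On this toy data the MLLM wins 6 discordant items and the text-only model wins 1, giving a two-sided p-value of 0.125. A result like that would not by itself show an improvement, which is the kind of outcome the settling test has in mind.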

Figures

Figures reproduced from arXiv: 2502.02871 by Bart Selman, Carla Gomes, Jiahao Huo, Jingheng Ye, Philip S. Yu, Qingsong Wen, Shen Wang, Xuming Hu, Yibo Yan, Zhendong Chu.

Figure 1: The big picture of our position. We focus on multimodal scientific fields, especially mathematics, physics, ...
Figure 2: Overview of MLLM-based scientific reasoning
Figure 3: Challenges for MLLM-based scientific reasoning.
Figure 4: Eight prospects for the future of MLLMs in the ...
Figure 5: Illustrations of alternative view 1 (a) and view 2 ...
Original abstract

Scientific reasoning, the process through which humans apply logic, evidence, and critical thinking to explore and interpret scientific phenomena, is essential in advancing knowledge reasoning across diverse fields. However, despite significant progress, current scientific reasoning models still struggle with generalization across domains and often fall short of multimodal perception. Multimodal Large Language Models (MLLMs), which integrate text, images, and other modalities, present an exciting opportunity to overcome these limitations and enhance scientific reasoning. Therefore, this position paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. First, we propose a four-stage research roadmap of scientific reasoning capabilities, and highlight the current state of MLLM applications in scientific reasoning, noting their ability to integrate and reason over diverse data types. Second, we summarize the key challenges that remain obstacles to achieving MLLM's full potential. To address these challenges, we propose actionable insights and suggestions for the future. Overall, our work offers a novel perspective on MLLM integration with scientific reasoning, providing the LLM community with a valuable vision for achieving Artificial General Intelligence (AGI).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This position paper argues that Multimodal Large Language Models (MLLMs) can significantly advance scientific reasoning across mathematics, physics, chemistry, and biology. It proposes a four-stage research roadmap for scientific reasoning capabilities, reviews the current state of MLLM applications noting their multimodal integration, summarizes remaining challenges, and offers actionable suggestions to realize their potential toward AGI.

Significance. If the roadmap and suggestions were grounded in concrete mechanisms or existing results, the paper could provide a useful organizing vision for multimodal AI in science. As written, the lack of any analysis connecting MLLM integration to gains in generalization or perception limits its potential impact on the field.

major comments (2)
  1. [Abstract] Abstract: the claim that MLLMs 'present an exciting opportunity to overcome these limitations' (generalization across domains and multimodal perception) is asserted without any supporting analysis, derivation, illustrative case, or reference to specific MLLM results demonstrating such gains.
  2. [Four-stage research roadmap] Four-stage research roadmap section: the stages are enumerated and current MLLM applications are noted, but no mechanism is supplied showing how multimodal integration produces the claimed improvements in cross-domain generalization; the challenges are listed but never connected back to a concrete MLLM capability or result.
minor comments (1)
  1. The manuscript would benefit from explicit definitions or criteria for each of the four stages so that the roadmap can be evaluated for novelty and testability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. As this is a position paper, our aim is to articulate a forward-looking vision rather than deliver new empirical derivations. We address the major comments below and indicate where revisions are feasible.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that MLLMs 'present an exciting opportunity to overcome these limitations' (generalization across domains and multimodal perception) is asserted without any supporting analysis, derivation, illustrative case, or reference to specific MLLM results demonstrating such gains.

    Authors: We agree the abstract states the central thesis without an immediate supporting reference or example. The manuscript is a position paper, so the claim is developed through the review of current MLLM applications in later sections rather than through new analysis. We will revise the abstract to include one or two concise references to existing MLLM results that illustrate multimodal gains in scientific tasks.

    Revision: yes

  2. Referee: [Four-stage research roadmap] Four-stage research roadmap section: the stages are enumerated and current MLLM applications are noted, but no mechanism is supplied showing how multimodal integration produces the claimed improvements in cross-domain generalization; the challenges are listed but never connected back to a concrete MLLM capability or result.

    Authors: The roadmap is intentionally high-level to organize future research directions. Explicit causal mechanisms are not derived because the paper does not claim to introduce new technical results. We will revise the section to add brief, literature-based connections between specific MLLM capabilities (such as diagram interpretation) and potential generalization benefits, and to tie listed challenges more directly to those capabilities.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: position paper contains no derivations or self-referential constructions

Full rationale

The paper is a position paper that advances an argument about MLLMs advancing scientific reasoning. It contains no equations, no fitted parameters, no derivations, and no load-bearing self-citations that reduce the central claim to its own inputs by construction. The four-stage roadmap and listed challenges are enumerated as suggestions rather than derived results. The claim is presented as a perspective, not as a prediction or theorem that collapses to prior definitions or fits within the paper itself. This is the expected outcome for a non-technical position paper with no quantitative chain to inspect.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Position paper contains no formal model, parameters, or derivations; the argument rests on domain assumptions about MLLM capabilities that are not formalized.

pith-pipeline@v0.9.0 · 5756 in / 978 out tokens · 25579 ms · 2026-05-23T04:28:29.849262+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

    cs.SD 2026-05 unverdicted novelty 5.0

    A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
