arxiv: 2507.11810 · v2 · submitted 2025-07-16 · 💻 cs.DL · cs.AI

Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator

Pith reviewed 2026-05-19 05:12 UTC · model grok-4.3

classification 💻 cs.DL cs.AI
keywords large language models · scientific innovation · roles framework · AI in science · autonomy levels · hypothesis generation · research evaluation · survey

The pith

Large language models in science are best understood through four roles: Assistant, Collaborator, Scientist, and Evaluator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a four-role framework to organize how LLMs contribute to scientific work. The roles are distinguished by three dimensions: autonomy level, cognitive function, and scientific innovation. This separation matters because it clarifies the difference between tools that aid routine research and systems aimed at genuine discovery. The survey examines methods, benchmarks, and limitations for each role, noting that Assistants are mature at retrieval but unreliable in open tasks, while Scientists automate workflows yet face safety problems. It argues that real progress requires attention to evaluation, oversight, and institutional fit beyond raw model capability.

Core claim

The central claim is that LLMs in scientific innovation can be classified into four roles (Assistant, Collaborator, Scientist, and Evaluator) by combining autonomy level, cognitive function, and scientific innovation. This framework separates research-oriented support from frontier-oriented discovery. The literature review shows that Assistants excel at retrieval and synthesis but falter in open-ended use; Collaborators broaden hypothesis options yet trade off novelty against grounding; Scientists automate research but hit reliability and safety limits; Evaluators aid verification yet remain weak at novelty judgment. Advancement in AI for science therefore hinges on evaluation practices, human control, accountability, and institutional integration, not on model capability alone.

What carries the argument

The four-role framework that classifies LLM systems by integrating autonomy level, cognitive function, and scientific innovation.

If this is right

  • Assistant systems reach maturity in literature tasks but still need human oversight for open-ended scientific applications.
  • Collaborator systems enlarge the space of possible hypotheses yet must resolve trade-offs between novelty and grounding in known facts.
  • Scientist systems increasingly automate full research workflows but remain constrained by reliability and safety bottlenecks.
  • Evaluator systems support review and verification but continue to underperform when judging true novelty.
  • Progress across all roles depends on developing better evaluation methods, stronger oversight, accountability structures, and institutional integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Research funders could use the four-role lens to decide whether a project targets support tools or discovery engines.
  • Benchmark designers might create role-specific tests rather than general science benchmarks that mix different autonomy levels.
  • Institutions could develop role-tailored oversight policies, such as stricter safety reviews for Scientist-level systems.
  • The same three dimensions might later classify non-LLM AI tools in science to track broader trends.

Load-bearing premise

The body of existing literature on LLMs in science can be partitioned into these four roles with limited overlap, and the three dimensions provide a stable way to separate routine research support from frontier discovery.

What would settle it

A systematic review of recent LLM papers in science that reveals frequent unclassifiable cases, high role overlap, or inconsistent separation of support versus discovery tasks along the three dimensions would undermine the framework.
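
One way such a review could be scored is sketched below. This is a minimal illustration, not a procedure from the paper: the annotation format (each paper mapped to the set of roles an annotator assigned), the sample data, and the function name `partition_stats` are all assumptions. It reports the share of papers that received no role and the share that received more than one.

```python
from collections import Counter

# The four roles named in the paper's framework.
ROLES = {"Assistant", "Collaborator", "Scientist", "Evaluator"}


def partition_stats(assignments):
    """Summarize how cleanly papers fall into the four roles.

    `assignments` maps a paper id to the set of roles an annotator gave it.
    This format is hypothetical; the paper does not define one.
    """
    if not assignments:
        raise ValueError("no papers to score")
    for paper, roles in assignments.items():
        unknown = set(roles) - ROLES
        if unknown:
            raise ValueError(f"{paper}: unknown roles {sorted(unknown)}")

    total = len(assignments)
    # Papers that fit no role at all.
    unclassified = sum(1 for roles in assignments.values() if len(roles) == 0)
    # Papers that fit more than one role.
    overlapping = sum(1 for roles in assignments.values() if len(roles) > 1)
    per_role = Counter(role for roles in assignments.values() for role in roles)

    return {
        "unclassified_rate": unclassified / total,
        "overlap_rate": overlapping / total,
        "per_role": dict(per_role),
    }


# Made-up annotations, for illustration only.
sample = {
    "paper_a": {"Assistant"},
    "paper_b": {"Collaborator", "Evaluator"},
    "paper_c": set(),
    "paper_d": {"Scientist"},
}
stats = partition_stats(sample)
print(stats["unclassified_rate"], stats["overlap_rate"])
```

High values of either rate would count against the premise of limited overlap. What counts as "high" is a judgment the paper does not fix.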

Figures

Figures reproduced from arXiv: 2507.11810 by Haihua Chen, Haoxuan Zhang, Jiangping Chen, Junhua Ding, Ruochi Li, Ting Xiao, Yang Zhang.

Figure 1: Trends in annual publication counts for traditional AI-driven versus LLM-driven scientific innovation.
Figure 2: The dual pathways of scientific innovation-scientific research and discovery, and the evolving roles of ...
Figure 3: The pyramidal framework of large language models’ roles in scientific innovation: evaluators, collaborators, ...
Figure 4: The evolution of large language models’ roles in scientific innovation with demonstration of existing ...
Figure 5: Closed-loop workflow of LLMs as Evaluators. Multimodal embeddings underpin SKS (blue) and SLQA ...
Figure 6: LLMs as collaborators in scientific innovation. LLMs transforming raw knowledge into actionable hypotheses, ...
Figure 7: Taxonomy of LLMs as Scientists. The upper (blue) panel organizes ASR into three strata—fully autonomous ...
Original abstract

Large language models (LLMs) are increasingly used in scientific research and discovery, supporting tasks ranging from literature retrieval and synthesis to hypothesis generation, autonomous experimentation, and research evaluation. Existing surveys often conflate scientific research with scientific discovery and typically organize systems by domain, task, or autonomy level alone. In this survey, we propose a four-role framework for understanding LLMs in scientific innovation: Assistant, Collaborator, Scientist, and Evaluator. The framework integrates three complementary dimensions: autonomy level, cognitive function, and scientific innovation, to distinguish research-oriented support from frontier-oriented discovery. We review representative methods, benchmarks, and evaluation practices for each role, examining their capabilities, limitations, and human oversight requirements. Across the literature, Assistant systems are comparatively mature in retrieval and synthesis but remain unreliable in open-ended applications; Collaborator systems expand the space of candidate hypotheses yet struggle with novelty-grounding trade-offs; Scientist systems increasingly automate research workflows but face reliability and safety bottlenecks; and Evaluator systems support review and verification while remaining weak in novelty assessment. We argue that progress in AI for science depends not only on model capability, but also on evaluation, oversight, accountability, and institutional integration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a four-role framework for LLMs in scientific innovation—Assistant, Collaborator, Scientist, and Evaluator—integrating three dimensions (autonomy level, cognitive function, and scientific innovation) to distinguish routine research support from frontier-oriented discovery. It reviews representative methods, benchmarks, and evaluation practices for each role, discusses their capabilities and limitations (including human oversight needs), and argues that progress in AI for science requires advances in evaluation, oversight, accountability, and institutional integration beyond model capability alone.

Significance. If the taxonomy can be shown to be stable and reproducible, the framework would provide a useful organizing lens that improves on prior autonomy-only surveys by incorporating cognitive function and innovation dimensions. The structured review of capabilities, limitations, and oversight requirements across roles could help identify specific gaps, such as weak novelty assessment in Evaluator systems and safety bottlenecks in Scientist systems.

major comments (2)
  1. [Framework definition section] Section on the four-role framework: The three dimensions are presented as complementary for role separation, but no explicit demarcation rules, decision criteria, thresholds, or handling of boundary cases (e.g., a hypothesis-generation system that also performs self-evaluation) are supplied. Role assignments therefore depend on author judgment rather than reproducible thresholds, which directly affects whether the taxonomy reliably distinguishes research-oriented support from frontier discovery.
  2. [Literature review / methods] Methods or literature review section: No description is given of the literature search strategy, inclusion/exclusion criteria, or process for selecting representative methods and benchmarks for each role. This absence makes it impossible to evaluate selection bias or coverage, which is load-bearing for the survey's claims about comparative maturity, limitations, and trends across the four roles.
minor comments (2)
  1. [Introduction] The abstract states that existing surveys 'often conflate scientific research with scientific discovery,' but the introduction or related-work section should cite specific prior surveys to ground this contrast.
  2. A figure or table summarizing the three dimensions and role mappings would improve clarity; currently the distinctions are described only in prose.
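
The reproducibility concern in major comment 1 could be checked empirically by having two annotators assign roles independently and measuring their agreement. The sketch below computes Cohen's kappa for that purpose. The labels are invented for illustration, and the paper reports no such study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical role assignments for six systems by two annotators.
annotator_1 = ["Assistant", "Assistant", "Collaborator",
               "Scientist", "Evaluator", "Evaluator"]
annotator_2 = ["Assistant", "Collaborator", "Collaborator",
               "Scientist", "Evaluator", "Assistant"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.556
```

A low kappa on a real sample would suggest that role assignment depends on annotator judgment, which is the gap the referee identifies.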

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Framework definition section] Section on the four-role framework: The three dimensions are presented as complementary for role separation, but no explicit demarcation rules, decision criteria, thresholds, or handling of boundary cases (e.g., a hypothesis-generation system that also performs self-evaluation) are supplied. Role assignments therefore depend on author judgment rather than reproducible thresholds, which directly affects whether the taxonomy reliably distinguishes research-oriented support from frontier discovery.

    Authors: We appreciate the referee's point that the current presentation of the three dimensions does not include explicit demarcation rules, decision criteria, or boundary-case handling. While the manuscript uses the dimensions to conceptually separate roles, we agree that the absence of reproducible thresholds leaves role assignment open to author judgment. In the revised version, we will add a dedicated subsection to the framework definition that specifies decision criteria, provides thresholds where feasible, and includes explicit examples of boundary cases such as hybrid hypothesis-generation and self-evaluation systems. This addition will improve the taxonomy's reproducibility and better demonstrate how it distinguishes routine support from frontier discovery. revision: yes

  2. Referee: [Literature review / methods] Methods or literature review section: No description is given of the literature search strategy, inclusion/exclusion criteria, or process for selecting representative methods and benchmarks for each role. This absence makes it impossible to evaluate selection bias or coverage, which is load-bearing for the survey's claims about comparative maturity, limitations, and trends across the four roles.

    Authors: The referee correctly identifies that the manuscript does not describe the literature search strategy, inclusion/exclusion criteria, or selection process for representative methods and benchmarks. This omission limits the ability to assess coverage and potential bias. We will add a new 'Survey Methodology' subsection (placed early in the paper) that details the search strategy, databases and repositories queried, keywords and Boolean strings used, inclusion and exclusion criteria, and the rationale for selecting the representative examples discussed for each role. This revision will increase transparency and allow readers to better evaluate the comparative claims across roles. revision: yes

Circularity Check

0 steps flagged

No significant circularity: framework is an organizing lens drawn from cited literature

full rationale

The paper is a survey that proposes a four-role taxonomy (Assistant, Collaborator, Scientist, Evaluator) by integrating three dimensions (autonomy level, cognitive function, scientific innovation) to partition existing work. No equations, fitted parameters, predictions, or derivations are present. The central claim does not reduce to quantities defined by the authors' own prior work or by construction. Self-citations, if present, are not load-bearing for the taxonomy itself; the framework is presented as a synthesis of the broader literature rather than a self-referential loop. This matches the default expectation for non-mathematical survey papers where the contribution is conceptual organization without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters, no new physical or mathematical axioms, and no invented entities. It rests on the domain assumption that LLMs in science can be usefully classified by the stated dimensions and that the reviewed literature is representative.

axioms (1)
  • domain assumption: Existing LLM systems in science can be partitioned into four distinct roles with limited overlap using the dimensions of autonomy level, cognitive function, and scientific innovation.
    This assumption underpins the entire framework and is invoked when the authors distinguish research-oriented support from frontier-oriented discovery.

Reference graph

Works this paper leans on

237 extracted references · 237 canonical work pages · 8 internal anchors

  1. [1]

    Litllm: A toolkit for scientific literature review

    Shubham Agarwal, Issam Hadj Laradji, Laurent Charlin, and Christopher Pal. Litllm: A toolkit for scientific literature review. ArXiv, abs/2402.01788, 2024

  2. [2]

    Cellvoyager: Ai compbio agent generates new insights by autonomously analyzing biological data

    Samuel Alber, Bowen Chen, Eric Sun, Alina Isakova, Aaron James Wilk, and James Zou. Cellvoyager: Ai compbio agent generates new insights by autonomously analyzing biological data. bioRxiv, pages 2025–06, 2025

  3. [3]

    A survey on hypothesis generation for scientific discovery in the era of large language models

    Atilla Kaan Alkan, Shashwat Sourav, Maja Jablonska, Simone Astarita, Rishabh Chakrabarty, Nikhil Garuda, Pranav Khetarpal, Maciej Pióro, Dimitrios Tanoglidis, Kartheik G Iyer, et al. A survey on hypothesis generation for scientific discovery in the era of large language models. arXiv preprint arXiv:2504.05496, 2025

  4. [4]

    Beyond citations: Measuring novel scientific ideas and their impact in publication text

    Sam Arts, Nicola Melluso, and Reinhilde Veugelers. Beyond citations: Measuring novel scientific ideas and their impact in publication text. Review of Economics and Statistics, 2023. doi: https://doi.org/10.1162/rest_a_01561

  5. [5]

    PPTAgent: Generating and evaluating presentations beyond text-to-slides

    Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human...

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  7. [7]

    Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path? arXiv preprint arXiv:2502.15657, 2025

    Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, et al. Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path? arXiv preprint arXiv:2502.15657, 2025

  8. [8]

    Reasoning language models: A blueprint

    Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025

  9. [9]

    Super: Evaluating agents on setting up and executing tasks from research repositories

    Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, and Tushar Khot. Super: Evaluating agents on setting up and executing tasks from research repositories. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12622–12645, 2024

  10. [10]

    A., MacKnight, R., & Gomes, G

    Daniil A Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of large language models. arXiv preprint arXiv:2304.05332, 2023

  11. [11]

    Autonomous chemical research with large language models

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023. doi: 10.1038/s41586-023-06792-0

  12. [12]

    Generative adversarial reviews: When llms become the critic

    Nicolas Bougie and Narimasa Watanabe. Generative adversarial reviews: When llms become the critic. arXiv preprint arXiv:2412.10415, 2024

  13. [13]

    Generative retrieval-augmented ontologic graph and multiagent strategies for interpretive large language model-based materials design

    Markus J Buehler. Generative retrieval-augmented ontologic graph and multiagent strategies for interpretive large language model-based materials design. ACS Engineering Au , 4(2):241–277, 2024. doi: 10.1021/ acsengineeringau.3c00058

  14. [14]

    Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G

    James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, et al. Microvqa: A multimodal reasoning benchmark for microscopy-based scientific research. arXiv preprint arXiv:2503.13399, 2025

  15. [15]

    Eaira: Establishing a methodology for evaluating ai models as scientific research assistants

    Franck Cappello, Sandeep Madireddy, Robert Underwood, Neil Getty, Nicholas Lee-Ping Chia, Nesar Ramachan- dra, Josh Nguyen, Murat Keçeli, Tanwi Mallick, Zilinghan Li, et al. Eaira: Establishing a methodology for evaluating ai models as scientific research assistants. CoRR, 2025

  16. [16]

    MLE-bench: Evaluating machine learning agents on machine learning engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, 2025

  17. [17]

    A joint framework for identifying the type and ar- guments of scientific contribution

    Wenhan Chao, Mengyuan Chen, Xian Zhou, and Zhunchen Luo. A joint framework for identifying the type and ar- guments of scientific contribution. Scientometrics, 128(6):3347–3376, 2023. doi: 10.1007/s11192-023-04694-6. 50

  18. [18]

    Structuring scientific innovation: A framework for modeling and discovering impactful knowledge combinations

    Junlan Chen, Kexin Zhang, Daifeng Li, Yangyang Feng, Yuxuan Zhang, and Bowen Deng. Structuring scientific innovation: A framework for modeling and discovering impactful knowledge combinations. arXiv preprint arXiv:2503.18865, 2025

  19. [19]

    Ai4research: A survey of artificial intelligence for scientific research

    Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, et al. Ai4research: A survey of artificial intelligence for scientific research. arXiv preprint arXiv:2507.01903, 2025

  20. [20]

    Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. In ...

  21. [21]

    The theoretical and policy implications of knowledge codification

    Patrick Cohendet and Frieder Meyer-Krahmer. The theoretical and policy implications of knowledge codification. Research policy, 30(9):1563–1591, 2001. doi: 10.1016/S0048-7333(01)00168-8

  22. [22]

    Curie: Evaluating llms on multitask scientific long-context understanding and reasoning

    Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Christian Norgaard, Nayantara Mudur, Martyna Beata Plomecka, Paul Raccuglia, et al. Curie: Evaluating llms on multitask scientific long-context understanding and reasoning. In The Thirteenth International Conference on Learning Representations, 2025

  23. [23]

    Structured information extraction from scientific text with large language models

    John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S Rosen, Gerbrand Ceder, Kristin A Persson, and Anubhav Jain. Structured information extraction from scientific text with large language models. Nature Communications, 15(1):1418, 2024. doi: 10.1038/s41467-024-45563-x

  24. [24]

    Marg: Multi-agent review generation for scientific papers

    Mike D’Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers. ArXiv, abs/2401.04259, 2024

  25. [25]

    Organa: a robotic assistant for automated chemistry experimentation and characterization

    Kourosh Darvish, Marta Skreta, Yuchi Zhao, Naruki Yoshikawa, Sagnik Som, Miroslav Bogdanovic, Yang Cao, Han Hao, Haoping Xu, Alán Aspuru-Guzik, et al. Organa: a robotic assistant for automated chemistry experimentation and characterization. Matter, 8(2), 2025. doi: 10.1016/j.matt.2024.10.015

  26. [26]

    Empowering ai as autonomous researchers: Evaluating llms in generating novel research ideas through automated metrics

    Debajyoti Dasgupta, Arijit Mondal, and Partha Pratim Chakrabarti. Empowering ai as autonomous researchers: Evaluating llms in generating novel research ideas through automated metrics. In 2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle, 2025

  27. [27]

    Matexpert: Decomposing materials discovery by mimicking human experts

    Qianggang Ding, Santiago Miret, and Bang Liu. Matexpert: Decomposing materials discovery by mimicking human experts. In The Thirteenth International Conference on Learning Representations, 2024

  28. [28]

    Llms assist nlp researchers: Critique paper (meta-) reviewing

    Jiangshu Du, Yibo Wang, Wenting Zhao, Zhongfen Deng, Shuaiqi Liu, Renze Lou, Henry Zou, Pranav Narayanan Venkit, Nan Zhang, Mukund Srinath, et al. Llms assist nlp researchers: Critique paper (meta-) reviewing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5081–5099, 2024

  29. [29]

    Llm4ed: Large language models for automatic equation discovery

    Mengge Du, Yuntian Chen, Zhongzheng Wang, Longfeng Nie, and Dongxiao Zhang. Llm4ed: Large language models for automatic equation discovery. CoRR, 2024

  30. [30]

    The path to superintelligence: A critical analysis of openai’s five levels of ai progression

    Tom Duenas and Diana Ruiz. The path to superintelligence: A critical analysis of openai’s five levels of ai progression. ResearchGate, 2024b. doi, 10, 2024. doi: http://dx.doi.org/10.13140/RG.2.2.33794.70085

  31. [31]

    Agent ai: Surveying the horizons of multimodal interaction

    Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, et al. Agent ai: Surveying the horizons of multimodal interaction. CoRR, 2024

  32. [32]

    Nlpeer: A unified resource for the computational study of peer review

    Nils Dycke, Ilia Kuznetsov, and Iryna Gurevych. Nlpeer: A unified resource for the computational study of peer review. In Annual Meeting of the Association for Computational Linguistics, 2022

  33. [33]

    mclm: A function-infused and synthesis-friendly modular chemical language model

    Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Bowen Jin, Chetan Kumar Prasad, Sara Szymku´c, Bartosz A Grzybowski, Ying Diao, Jiawei Han, et al. mclm: A function-infused and synthesis-friendly modular chemical language model. arXiv preprint arXiv:2505.12565, 2025

  34. [34]

    Steffen Eger, Yong Cao, Jennifer D’Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, et al. Transforming science with large language models: A survey on ai-assisted scientific discovery, experimentation, content generation, and evaluation.arXiv preprint arXiv:2502.05151, 2025

  35. [35]

    Science of science

    Santo Fortunato, Carl T Bergstrom, Katy Börner, James A Evans, Dirk Helbing, Staša Milojevi´c, Alexander M Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, et al. Science of science. Science, 359(6379):eaao0185,

  36. [36]

    doi: 10.1126/science.aao0185. 51

  37. [37]

    Tradition and innovation in scientists’ research strategies

    Jacob G Foster, Andrey Rzhetsky, and James A Evans. Tradition and innovation in scientists’ research strategies. American sociological review, 80(5):875–908, 2015. doi: 10.1177/0003122415601618

  38. [38]

    Boxinggym: Benchmarking progress in automated experimental design and model discovery

    Kanishk Gandhi, Michael Y Li, Lyle Goodyear, Louise Li, Aditi Bhaskar, Mohammed Zaman, and Noah D Goodman. Boxinggym: Benchmarking progress in automated experimental design and model discovery. arXiv preprint arXiv:2501.01540, 2025

  39. [39]

    Empowering biomedical discovery with ai agents

    Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empowering biomedical discovery with ai agents. Cell, 187 (22):6125–6151, 2024. doi: https://doi.org/10.1016/j.cell.2024.09.022

  40. [40]

    Reviewagents: Bridging the gap between human and ai-generated paper reviews

    Xian Gao, Jiacheng Ruan, Jingsheng Gao, Ting Liu, and Yuzhuo Fu. Reviewagents: Bridging the gap between human and ai-generated paper reviews. CoRR, 2025

  41. [41]

    Reviewer2: Optimizing review generation through prompt generation

    Zhaolin Gao, Kianté Brantley, and Thorsten Joachims. Reviewer2: Optimizing review generation through prompt generation. arXiv preprint arXiv:2402.10886, 2024

  42. [42]

    Atomagents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence

    Alireza Ghafarollahi and Markus J Buehler. Atomagents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence. arXiv preprint arXiv:2407.10022, 2024

  43. [43]

    Sciagents: Automating scientific discovery through bioinspired multi- agent intelligent graph reasoning

    Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through bioinspired multi- agent intelligent graph reasoning. Advanced Materials, page 2413523, 2024. doi: 10.1002/adma.202413523

  44. [44]

    Automating alloy design and discovery with physics-aware multimodal multiagent ai

    Alireza Ghafarollahi and Markus J Buehler. Automating alloy design and discovery with physics-aware multimodal multiagent ai. Proceedings of the National Academy of Sciences, 122(4):e2414074122, 2025. doi: 10.1073/pnas.2414074122

  45. [45]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Fe- lix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025

  46. [46]

    The concept of entropy in scientometrics and innovation research: An indicator for institutional involvement in scientific and technological developments

    Hariolf Grupp. The concept of entropy in scientometrics and innovation research: An indicator for institutional involvement in scientific and technological developments. Scientometrics, 18(3-4):219–239, 1990

  47. [47]

    Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research

    Tianyang Gu, Jingjin Wang, Zhihao Zhang, and HaoHong Li. Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research. arXiv preprint arXiv:2412.14141, 2024

  48. [48]

    Interesting scientific idea generation using knowledge graphs and llms: Evaluations with 100 research group leaders

    Xuemei Gu and Mario Krenn. Interesting scientific idea generation using knowledge graphs and llms: Evaluations with 100 research group leaders. arXiv preprint arXiv:2405.17044, 2024

  49. [49]

    Ideabench: Benchmarking large language models for research idea generation

    Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Albert Huang, Eric Xie, Stefan Bekiranov, and Aidong Zhang. Ideabench: Benchmarking large language models for research idea generation. arXiv preprint arXiv:2411.02429, 2024

  50. [50]

    De novo generation of sars-cov-2 antibody cdrh3 with a pre-trained generative large language model

    Haohuai He, Bing He, Lei Guan, Yu Zhao, Feng Jiang, Guanxing Chen, Qingge Zhu, Calvin Yu-Chian Chen, Ting Li, and Jianhua Yao. De novo generation of sars-cov-2 antibody cdrh3 with a pre-trained generative large language model. Nature Communications, 15(1):6867, 2024. doi: 10.1038/s41467-024-50903-y

  51. [51]

    Scisight: Combining faceted navigation and research group detection for covid-19 exploratory scientific search

    Tom Hope, J Portenoy, K Vasan, J Borchardt, Eric Horvitz, DS Weld, MA Hearst, and Jevin West. Scisight: Combining faceted navigation and research group detection for covid-19 exploratory scientific search. Proceedings of the 2020 EMNLP (Systems Demonstrations), Association for Computational Linguistics, 2020

  52. [52]

    A computational inflection for scientific discovery

    Tom Hope, Doug Downey, Daniel S Weld, Oren Etzioni, and Eric Horvitz. A computational inflection for scientific discovery. Communications of the ACM, 66(8):62–73, 2023. doi: 10.1145/3576896

  53. [53]

    A new method for measuring the originality of academic articles based on knowledge units in semantic networks

    Jianhua Hou, Dongyi Wang, and Jing Li. A new method for measuring the originality of academic articles based on knowledge units in semantic networks. Journal of Informetrics, 16(3):101306, 2022. doi: 10.1016/j.joi.2022.101306

  54. [54]

    Chime: Llm-assisted hierarchical organization of scientific studies for literature review support

    Chao-Chun Hsu, Erin Bransom, Jenna Sparks, Bailey Kuehl, Chenhao Tan, David Wadden, Lucy Lu Wang, and Aakanksha Naik. Chime: Llm-assisted hierarchical organization of scientific studies for literature review support. In Findings of the Association for Computational Linguistics ACL 2024, pages 118–132, 2024

  55. [55]

    A multi-agent framework for materials laws discovery

    Bo Hu, Siyu Liu, Beilin Ye, Yun Hao, and Tongqi Wen. A multi-agent framework for materials laws discovery. arXiv preprint arXiv:2411.16416, 2024

  56. [56]

    Nova: An iterative planning and search approach to enhance novelty and diversity of llm generated ideas

    Xiang Hu, Hongyu Fu, Jinge Wang, Yifeng Wang, Zhikun Li, Renjun Xu, Yu Lu, Yaochu Jin, Lili Pan, and Zhenzhong Lan. Nova: An iterative planning and search approach to enhance novelty and diversity of llm generated ideas. arXiv preprint arXiv:2410.14255, 2024

  57. [57]

    Hireview: Hierarchical taxonomy-driven automatic literature review generation

    Yuntong Hu, Zhuofeng Li, Zheng Zhang, Chen Ling, Raasikh Kanjiani, Boxin Zhao, and Liang Zhao. Hireview: Hierarchical taxonomy-driven automatic literature review generation. arXiv preprint arXiv:2410.03761, 2024

  58. [58]

    From detection to application: Recent advances in understanding scientific tables and figures

    Jiani Huang, Haihua Chen, Fengchang Yu, and Wei Lu. From detection to application: Recent advances in understanding scientific tables and figures. ACM Computing Surveys, 56(10):1–39, 2024. doi: 10.1145/3657285

  59. [59]

    Crispr-gpt: An llm agent for automated design of gene-editing experiments

    Kaixuan Huang, Yuanhao Qu, Henry Cousins, William A Johnson, Di Yin, Mihir Shah, Denny Zhou, Russ Altman, Mengdi Wang, and Le Cong. Crispr-gpt: An llm agent for automated design of gene-editing experiments. arXiv preprint arXiv:2404.18021, 2024

  60. [60]

    Mlagentbench: Evaluating language agents on machine learning experimentation

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2023

  61. [61]

    Are large language models qualified reviewers in originality evaluation?

    Shengzhi Huang, Yong Huang, Yinpeng Liu, Zhuoran Luo, and Wei Lu. Are large language models qualified reviewers in originality evaluation? Information Processing & Management, 62(3):103973, 2025. doi: 10.1016/j.ipm.2024.103973

  62. [62]

    Papereval: A universal, quantitative, and explainable paper evaluation method powered by a multi-agent system

    Shengzhi Huang, Qicong Wang, Wei Lu, Lingyu Liu, Zhenzhen Xu, and Yong Huang. Papereval: A universal, quantitative, and explainable paper evaluation method powered by a multi-agent system. Information Processing & Management, 62(6):104225, 2025

  63. [63]

    Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai

    Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai. Advances in Neural Information Processing Systems, 37:19209–19253, 2024

  64. [64]

    Openreviewer: A specialized large language model for generating critical scientific paper reviews

    Maximilian Idahl and Zahra Ahmadi. Openreviewer: A specialized large language model for generating critical scientific paper reviews. arXiv preprint arXiv:2412.11948, 2024

  65. [65]

    Autonomous llm-driven research—from data to human-verifiable research papers

    Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research—from data to human-verifiable research papers. NEJM AI, 2(1):AIoa2400555, 2025. doi: 10.1056/AIoa2400555

  66. [66]

    Zochi technical report

    Intology. Zochi technical report. arXiv, 2025

  67. [67]

    Scirex: A challenge dataset for document-level information extraction

    Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, and Iz Beltagy. Scirex: A challenge dataset for document-level information extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7506–7516, 2020

  68. [68]

    Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents

    Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. Advances in Neural Information Processing Systems, 37:10088–10116, 2024

  69. [69]

    Codescientist: End-to-end semi-automated scientific discovery with code-based experimentation

    Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Daniel S Weld, and Peter Clark. Codescientist: End-to-end semi-automated scientific discovery with code-based experimentation. arXiv preprint arXiv:2503.22708, 2025

  70. [70]

    Llmatdesign: Autonomous materials discovery with large language models

    Shuyi Jia, Chao Zhang, and Victor Fung. Llmatdesign: Autonomous materials discovery with large language models. arXiv preprint arXiv:2406.13163, 2024

  71. [71]

    Hegta: Leveraging heterogeneous graph-enhanced large language models for few-shot complex table understanding

    Rihui Jin, Yu Li, Guilin Qi, Nan Hu, Yuan-Fang Li, Jiaoyan Chen, Jianan Wang, Yongrui Chen, Dehai Min, and Sheng Bi. Hegta: Leveraging heterogeneous graph-enhanced large language models for few-shot complex table understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24294–24302, 2025

  72. [72]

    Agentreview: Exploring peer review dynamics with llm agents

    Yiqiao Jin, Qinlin Zhao, Yiyang Wang, Hao Chen, Kaijie Zhu, Yijia Xiao, and Jindong Wang. Agentreview: Exploring peer review dynamics with llm agents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1208–1226, 2024

  73. [73]

    DSBench: How far are data science agents from becoming data science experts?

    Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. DSBench: How far are data science agents from becoming data science experts? In The Thirteenth International Conference on Learning Representations, 2025

  74. [74]

    Researcharena: Benchmarking llms’ ability to collect and organize information as research agents

    Hao Kang and Chenyan Xiong. Researcharena: Benchmarking llms’ ability to collect and organize information as research agents. arXiv preprint arXiv:2406.10291, 2024

  75. [75]

    Chatmof: an artificial intelligence system for predicting and generating metal-organic frameworks using large language models

    Yeonghun Kang and Jihan Kim. Chatmof: an artificial intelligence system for predicting and generating metal-organic frameworks using large language models. Nature Communications, 15(1):4705, 2024. doi: 10.1038/s41467-024-48998-4

  76. [76]

    Scireviewgen: A large-scale dataset for automatic literature review generation

    Tetsu Kasanishi, Masaru Isonuma, Junichiro Mori, and Ichiro Sakata. Scireviewgen: A large-scale dataset for automatic literature review generation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6695–6715, 2023

  77. [77]

    Sci-idea: Context-aware scientific ideation using token and sentence embeddings

    Farhana Keya, Gollam Rabby, Prasenjit Mitra, Sahar Vahdati, Sören Auer, and Yaser Jaradeh. Sci-idea: Context-aware scientific ideation using token and sentence embeddings. arXiv preprint arXiv:2503.19257, 2025

  78. [78]

    Curie: Toward rigorous and automated scientific experimentation with ai agents

    Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, and Ang Chen. Curie: Toward rigorous and automated scientific experimentation with ai agents. CoRR, 2025

  79. [79]

    Hypothesis generation for materials discovery and design using goal-driven and constraint-guided LLM agents

    Shrinidhi Kumbhar, Venkatesh Mishra, Kevin Coutinho, Divij Handa, Ashif Iquebal, and Chitta Baral. Hypothesis generation for materials discovery and design using goal-driven and constraint-guided LLM agents. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 7524–7555, Albuquer...

  80. [80]

    Transformer-based highlights extraction from scientific papers

    Moreno La Quatra and Luca Cagliero. Transformer-based highlights extraction from scientific papers. Knowledge-Based Systems, 252:109382, 2022. doi: 10.1016/j.knosys.2022.109382

Showing first 80 references.