Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator
Pith reviewed 2026-05-19 05:12 UTC · model grok-4.3
The pith
Large language models in science are best understood through four roles: Assistant, Collaborator, Scientist, and Evaluator.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LLMs in scientific innovation can be classified into four roles (Assistant, Collaborator, Scientist, and Evaluator) by combining three dimensions: autonomy level, cognitive function, and scientific innovation. This framework separates research-oriented support from frontier-oriented discovery. The literature review finds that Assistants excel at retrieval and synthesis but falter in open-ended use; Collaborators broaden hypothesis options yet trade off novelty against grounding; Scientists automate research but hit reliability and safety limits; Evaluators aid verification yet remain weak at novelty judgment. Advancement in AI for science therefore hinges on evaluation practices, human control, accountability, and institutional integration, not on model capability alone.
What carries the argument
The four-role framework that classifies LLM systems by integrating autonomy level, cognitive function, and scientific innovation.
If this is right
- Assistant systems reach maturity in literature tasks but still need human oversight for open-ended scientific applications.
- Collaborator systems enlarge the space of possible hypotheses yet must resolve trade-offs between novelty and grounding in known facts.
- Scientist systems increasingly automate full research workflows but remain constrained by reliability and safety bottlenecks.
- Evaluator systems support review and verification but continue to underperform when judging true novelty.
- Progress across all roles depends on developing better evaluation methods, stronger oversight, accountability structures, and institutional integration.
Where Pith is reading between the lines
- Research funders could use the four-role lens to decide whether a project targets support tools or discovery engines.
- Benchmark designers might create role-specific tests rather than general science benchmarks that mix different autonomy levels.
- Institutions could develop role-tailored oversight policies, such as stricter safety reviews for Scientist-level systems.
- The same three dimensions might later classify non-LLM AI tools in science to track broader trends.
Load-bearing premise
The body of existing literature on LLMs in science can be partitioned into these four roles with limited overlap, and the three dimensions provide a stable way to separate routine research support from frontier discovery.
What would settle it
A systematic review of recent LLM papers in science that reveals frequent unclassifiable cases, high role overlap, or inconsistent separation of support versus discovery tasks along the three dimensions would undermine the framework.
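One way such a test could be made concrete is sketched below. It is only an illustration, not a procedure from the paper: it assumes two hypothetical annotators independently assign roles to the same set of papers, then measures chance-corrected agreement with Cohen's kappa, plus the share of papers tagged with several roles or with none. The function names and the sample labels are invented for the example.

```python
from collections import Counter

# The four roles named in the paper; the annotation data below is hypothetical.
ROLES = ("Assistant", "Collaborator", "Scientist", "Evaluator")


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who each assign one role per paper."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of papers given the same role by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled at random with their own
    # marginal role frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[r] * freq_b[r] for r in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def overlap_rate(role_sets):
    """Fraction of papers tagged with more than one role."""
    return sum(len(s) > 1 for s in role_sets) / len(role_sets)


def unclassifiable_rate(role_sets):
    """Fraction of papers that received no role at all."""
    return sum(len(s) == 0 for s in role_sets) / len(role_sets)


if __name__ == "__main__":
    # Hypothetical single-role assignments for six papers.
    annotator_a = ["Assistant", "Collaborator", "Scientist", "Evaluator", "Assistant", "Scientist"]
    annotator_b = ["Assistant", "Collaborator", "Scientist", "Evaluator", "Collaborator", "Scientist"]
    print(f"kappa = {cohen_kappa(annotator_a, annotator_b):.3f}")

    # Hypothetical multi-role tagging for four papers.
    tagged = [{"Assistant"}, {"Collaborator", "Evaluator"}, set(), {"Scientist"}]
    print(f"overlap = {overlap_rate(tagged):.2f}, unclassifiable = {unclassifiable_rate(tagged):.2f}")
```

Low kappa, a high overlap rate, or a high unclassifiable rate on a representative sample would each count against the premise that the four roles partition the literature cleanly. Which threshold counts as "too high" is a judgment the sketch deliberately leaves open.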
Original abstract
Large language models (LLMs) are increasingly used in scientific research and discovery, supporting tasks ranging from literature retrieval and synthesis to hypothesis generation, autonomous experimentation, and research evaluation. Existing surveys often conflate scientific research with scientific discovery and typically organize systems by domain, task, or autonomy level alone. In this survey, we propose a four-role framework for understanding LLMs in scientific innovation: Assistant, Collaborator, Scientist, and Evaluator. The framework integrates three complementary dimensions: autonomy level, cognitive function, and scientific innovation, to distinguish research-oriented support from frontier-oriented discovery. We review representative methods, benchmarks, and evaluation practices for each role, examining their capabilities, limitations, and human oversight requirements. Across the literature, Assistant systems are comparatively mature in retrieval and synthesis but remain unreliable in open-ended applications; Collaborator systems expand the space of candidate hypotheses yet struggle with novelty-grounding trade-offs; Scientist systems increasingly automate research workflows but face reliability and safety bottlenecks; and Evaluator systems support review and verification while remaining weak in novelty assessment. We argue that progress in AI for science depends not only on model capability, but also on evaluation, oversight, accountability, and institutional integration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a four-role framework for LLMs in scientific innovation—Assistant, Collaborator, Scientist, and Evaluator—integrating three dimensions (autonomy level, cognitive function, and scientific innovation) to distinguish routine research support from frontier-oriented discovery. It reviews representative methods, benchmarks, and evaluation practices for each role, discusses their capabilities and limitations (including human oversight needs), and argues that progress in AI for science requires advances in evaluation, oversight, accountability, and institutional integration beyond model capability alone.
Significance. If the taxonomy can be shown to be stable and reproducible, the framework would provide a useful organizing lens that improves on prior autonomy-only surveys by incorporating cognitive function and innovation dimensions. The structured review of capabilities, limitations, and oversight requirements across roles could help identify specific gaps, such as weak novelty assessment in Evaluator systems and safety bottlenecks in Scientist systems.
major comments (2)
- [Framework definition section] Section on the four-role framework: The three dimensions are presented as complementary for role separation, but no explicit demarcation rules, decision criteria, thresholds, or handling of boundary cases (e.g., a hypothesis-generation system that also performs self-evaluation) are supplied. Role assignments therefore depend on author judgment rather than reproducible thresholds, which directly affects whether the taxonomy reliably distinguishes research-oriented support from frontier discovery.
- [Literature review / methods] Methods or literature review section: No description is given of the literature search strategy, inclusion/exclusion criteria, or process for selecting representative methods and benchmarks for each role. This absence makes it impossible to evaluate selection bias or coverage, which is load-bearing for the survey's claims about comparative maturity, limitations, and trends across the four roles.
minor comments (2)
- [Introduction] The abstract states that existing surveys 'often conflate scientific research with scientific discovery,' but the introduction or related-work section should cite specific prior surveys to ground this contrast.
- A figure or table summarizing the three dimensions and role mappings would improve clarity; currently the distinctions are described only in prose.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
- Referee: [Framework definition section] Section on the four-role framework: The three dimensions are presented as complementary for role separation, but no explicit demarcation rules, decision criteria, thresholds, or handling of boundary cases (e.g., a hypothesis-generation system that also performs self-evaluation) are supplied. Role assignments therefore depend on author judgment rather than reproducible thresholds, which directly affects whether the taxonomy reliably distinguishes research-oriented support from frontier discovery.
  Authors: We appreciate the referee's point that the current presentation of the three dimensions does not include explicit demarcation rules, decision criteria, or boundary-case handling. While the manuscript uses the dimensions to conceptually separate roles, we agree that the absence of reproducible thresholds leaves role assignment open to author judgment. In the revised version, we will add a dedicated subsection to the framework definition that specifies decision criteria, provides thresholds where feasible, and includes explicit examples of boundary cases such as hybrid hypothesis-generation and self-evaluation systems. This addition will improve the taxonomy's reproducibility and better demonstrate how it distinguishes routine support from frontier discovery.
  Revision: yes
- Referee: [Literature review / methods] Methods or literature review section: No description is given of the literature search strategy, inclusion/exclusion criteria, or process for selecting representative methods and benchmarks for each role. This absence makes it impossible to evaluate selection bias or coverage, which is load-bearing for the survey's claims about comparative maturity, limitations, and trends across the four roles.
  Authors: The referee correctly identifies that the manuscript does not describe the literature search strategy, inclusion/exclusion criteria, or selection process for representative methods and benchmarks. This omission limits the ability to assess coverage and potential bias. We will add a new 'Survey Methodology' subsection (placed early in the paper) that details the search strategy, databases and repositories queried, keywords and Boolean strings used, inclusion and exclusion criteria, and the rationale for selecting the representative examples discussed for each role. This revision will increase transparency and allow readers to better evaluate the comparative claims across roles.
  Revision: yes
Circularity Check
No significant circularity: framework is an organizing lens drawn from cited literature
The paper is a survey that proposes a four-role taxonomy (Assistant, Collaborator, Scientist, Evaluator) by integrating three dimensions (autonomy level, cognitive function, scientific innovation) to partition existing work. No equations, fitted parameters, predictions, or derivations are present. The central claim does not reduce to quantities defined by the authors' own prior work or by construction. Self-citations, if present, are not load-bearing for the taxonomy itself; the framework is presented as a synthesis of the broader literature rather than a self-referential loop. This matches the default expectation for non-mathematical survey papers where the contribution is conceptual organization without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing LLM systems in science can be partitioned into four distinct roles with limited overlap using the dimensions of autonomy level, cognitive function, and scientific innovation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem.
  "We propose a four-role framework ... integrates three complementary dimensions: autonomy level, cognitive function, and scientific innovation"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem.
  "pyramidal framework ... Evaluator, Collaborator, and Scientist"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Litllm: A toolkit for scientific literature review
Shubham Agarwal, Issam Hadj Laradji, Laurent Charlin, and Christopher Pal. Litllm: A toolkit for scientific literature review. ArXiv, abs/2402.01788, 2024
-
[2]
Cellvoyager: Ai compbio agent generates new insights by autonomously analyzing biological data
Samuel Alber, Bowen Chen, Eric Sun, Alina Isakova, Aaron James Wilk, and James Zou. Cellvoyager: Ai compbio agent generates new insights by autonomously analyzing biological data. bioRxiv, pages 2025–06, 2025
work page 2025
-
[3]
A survey on hypothesis generation for scientific discovery in the era of large language models
Atilla Kaan Alkan, Shashwat Sourav, Maja Jablonska, Simone Astarita, Rishabh Chakrabarty, Nikhil Garuda, Pranav Khetarpal, Maciej Pióro, Dimitrios Tanoglidis, Kartheik G Iyer, et al. A survey on hypothesis generation for scientific discovery in the era of large language models. arXiv preprint arXiv:2504.05496, 2025
-
[4]
Beyond citations: Measuring novel scientific ideas and their impact in publication text
Sam Arts, Nicola Melluso, and Reinhilde Veugelers. Beyond citations: Measuring novel scientific ideas and their impact in publication text. Review of Economics and Statistics, 2023. doi: https://doi.org/10.1162/rest_a_01561
-
[5]
PPTAgent: Generating and evaluating presentations beyond text-to-slides
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human...
-
[6]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, et al. Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path? arXiv preprint arXiv:2502.15657, 2025
-
[8]
Reasoning language models: A blueprint
Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025
-
[9]
Super: Evaluating agents on setting up and executing tasks from research repositories
Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, and Tushar Khot. Super: Evaluating agents on setting up and executing tasks from research repositories. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12622–12645, 2024
work page 2024
-
[10]
Daniil A Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of large language models. arXiv preprint arXiv:2304.05332, 2023
-
[11]
Autonomous chemical research with large language models
Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023. doi: 10.1038/s41586-023-06792-0
-
[12]
Generative adversarial reviews: When llms become the critic
Nicolas Bougie and Narimasa Watanabe. Generative adversarial reviews: When llms become the critic. arXiv preprint arXiv:2412.10415, 2024
-
[13]
Markus J Buehler. Generative retrieval-augmented ontologic graph and multiagent strategies for interpretive large language model-based materials design. ACS Engineering Au , 4(2):241–277, 2024. doi: 10.1021/ acsengineeringau.3c00058
work page 2024
-
[14]
Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G
James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, et al. Microvqa: A multimodal reasoning benchmark for microscopy-based scientific research. arXiv preprint arXiv:2503.13399, 2025
-
[15]
Eaira: Establishing a methodology for evaluating ai models as scientific research assistants
Franck Cappello, Sandeep Madireddy, Robert Underwood, Neil Getty, Nicholas Lee-Ping Chia, Nesar Ramachan- dra, Josh Nguyen, Murat Keçeli, Tanwi Mallick, Zilinghan Li, et al. Eaira: Establishing a methodology for evaluating ai models as scientific research assistants. CoRR, 2025
work page 2025
-
[16]
MLE-bench: Evaluating machine learning agents on machine learning engineering
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[17]
A joint framework for identifying the type and ar- guments of scientific contribution
Wenhan Chao, Mengyuan Chen, Xian Zhou, and Zhunchen Luo. A joint framework for identifying the type and ar- guments of scientific contribution. Scientometrics, 128(6):3347–3376, 2023. doi: 10.1007/s11192-023-04694-6. 50
-
[18]
Junlan Chen, Kexin Zhang, Daifeng Li, Yangyang Feng, Yuxuan Zhang, and Bowen Deng. Structuring scientific innovation: A framework for modeling and discovering impactful knowledge combinations. arXiv preprint arXiv:2503.18865, 2025
-
[19]
Ai4research: A survey of artificial intelligence for scientific research
Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, et al. Ai4research: A survey of artificial intelligence for scientific research. arXiv preprint arXiv:2507.01903, 2025
-
[20]
Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun
Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. In ...
work page 2025
-
[21]
The theoretical and policy implications of knowledge codification
Patrick Cohendet and Frieder Meyer-Krahmer. The theoretical and policy implications of knowledge codification. Research policy, 30(9):1563–1591, 2001. doi: 10.1016/S0048-7333(01)00168-8
-
[22]
Curie: Evaluating llms on multitask scientific long-context understanding and reasoning
Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Christian Norgaard, Nayantara Mudur, Martyna Beata Plomecka, Paul Raccuglia, et al. Curie: Evaluating llms on multitask scientific long-context understanding and reasoning. In The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[23]
Structured information extraction from scientific text with large language models
John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S Rosen, Gerbrand Ceder, Kristin A Persson, and Anubhav Jain. Structured information extraction from scientific text with large language models. Nature Communications, 15(1):1418, 2024. doi: 10.1038/s41467-024-45563-x
-
[24]
Marg: Multi-agent review generation for scientific papers
Mike D’Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers. ArXiv, abs/2401.04259, 2024
-
[25]
Organa: a robotic assistant for automated chemistry experimentation and characterization
Kourosh Darvish, Marta Skreta, Yuchi Zhao, Naruki Yoshikawa, Sagnik Som, Miroslav Bogdanovic, Yang Cao, Han Hao, Haoping Xu, Alán Aspuru-Guzik, et al. Organa: a robotic assistant for automated chemistry experimentation and characterization. Matter, 8(2), 2025. doi: 10.1016/j.matt.2024.10.015
-
[26]
Debajyoti Dasgupta, Arijit Mondal, and Partha Pratim Chakrabarti. Empowering ai as autonomous researchers: Evaluating llms in generating novel research ideas through automated metrics. In 2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle, 2025
work page 2025
-
[27]
Matexpert: Decomposing materials discovery by mimicking human experts
Qianggang Ding, Santiago Miret, and Bang Liu. Matexpert: Decomposing materials discovery by mimicking human experts. In The Thirteenth International Conference on Learning Representations, 2024
work page 2024
-
[28]
Llms assist nlp researchers: Critique paper (meta-) reviewing
Jiangshu Du, Yibo Wang, Wenting Zhao, Zhongfen Deng, Shuaiqi Liu, Renze Lou, Henry Zou, Pranav Narayanan Venkit, Nan Zhang, Mukund Srinath, et al. Llms assist nlp researchers: Critique paper (meta-) reviewing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5081–5099, 2024
work page 2024
-
[29]
Llm4ed: Large language models for automatic equation discovery
Mengge Du, Yuntian Chen, Zhongzheng Wang, Longfeng Nie, and Dongxiao Zhang. Llm4ed: Large language models for automatic equation discovery. CoRR, 2024
work page 2024
-
[30]
The path to superintelligence: A critical analysis of openai’s five levels of ai progression
Tom Duenas and Diana Ruiz. The path to superintelligence: A critical analysis of openai’s five levels of ai progression. ResearchGate, 2024b. doi, 10, 2024. doi: http://dx.doi.org/10.13140/RG.2.2.33794.70085
-
[31]
Agent ai: Surveying the horizons of multimodal interaction
Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, et al. Agent ai: Surveying the horizons of multimodal interaction. CoRR, 2024
work page 2024
-
[32]
Nlpeer: A unified resource for the computational study of peer review
Nils Dycke, Ilia Kuznetsov, and Iryna Gurevych. Nlpeer: A unified resource for the computational study of peer review. In Annual Meeting of the Association for Computational Linguistics, 2022
work page 2022
-
[33]
mclm: A function-infused and synthesis-friendly modular chemical language model
Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Bowen Jin, Chetan Kumar Prasad, Sara Szymku´c, Bartosz A Grzybowski, Ying Diao, Jiawei Han, et al. mclm: A function-infused and synthesis-friendly modular chemical language model. arXiv preprint arXiv:2505.12565, 2025
-
[34]
Steffen Eger, Yong Cao, Jennifer D’Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, et al. Transforming science with large language models: A survey on ai-assisted scientific discovery, experimentation, content generation, and evaluation.arXiv preprint arXiv:2502.05151, 2025
-
[35]
Santo Fortunato, Carl T Bergstrom, Katy Börner, James A Evans, Dirk Helbing, Staša Milojevi´c, Alexander M Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, et al. Science of science. Science, 359(6379):eaao0185,
-
[36]
doi: 10.1126/science.aao0185. 51
-
[37]
Tradition and innovation in scientists’ research strategies
Jacob G Foster, Andrey Rzhetsky, and James A Evans. Tradition and innovation in scientists’ research strategies. American sociological review, 80(5):875–908, 2015. doi: 10.1177/0003122415601618
-
[38]
Boxinggym: Benchmarking progress in automated experimental design and model discovery
Kanishk Gandhi, Michael Y Li, Lyle Goodyear, Louise Li, Aditi Bhaskar, Mohammed Zaman, and Noah D Goodman. Boxinggym: Benchmarking progress in automated experimental design and model discovery. arXiv preprint arXiv:2501.01540, 2025
-
[39]
Empowering biomedical discovery with ai agents
Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empowering biomedical discovery with ai agents. Cell, 187 (22):6125–6151, 2024. doi: https://doi.org/10.1016/j.cell.2024.09.022
-
[40]
Reviewagents: Bridging the gap between human and ai-generated paper reviews
Xian Gao, Jiacheng Ruan, Jingsheng Gao, Ting Liu, and Yuzhuo Fu. Reviewagents: Bridging the gap between human and ai-generated paper reviews. CoRR, 2025
work page 2025
-
[41]
Reviewer2: Optimizing review generation through prompt generation
Zhaolin Gao, Kianté Brantley, and Thorsten Joachims. Reviewer2: Optimizing review generation through prompt generation. arXiv preprint arXiv:2402.10886, 2024
-
[42]
Alireza Ghafarollahi and Markus J Buehler. Atomagents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence. arXiv preprint arXiv:2407.10022, 2024
-
[43]
Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through bioinspired multi- agent intelligent graph reasoning. Advanced Materials, page 2413523, 2024. doi: 10.1002/adma.202413523
-
[44]
Automating alloy design and discovery with physics-aware multimodal multiagent ai
Alireza Ghafarollahi and Markus J Buehler. Automating alloy design and discovery with physics-aware multimodal multiagent ai. Proceedings of the National Academy of Sciences, 122(4):e2414074122, 2025. doi: 10.1073/pnas.2414074122
-
[45]
Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Fe- lix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Hariolf Grupp. The concept of entropy in scientometrics and innovation research: An indicator for institutional involvement in scientific and technological developments. Scientometrics, 18(3-4):219–239, 1990
work page 1990
-
[47]
Tianyang Gu, Jingjin Wang, Zhihao Zhang, and HaoHong Li. Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research. arXiv preprint arXiv:2412.14141, 2024
-
[48]
Xuemei Gu and Mario Krenn. Interesting scientific idea generation using knowledge graphs and llms: Evaluations with 100 research group leaders. arXiv preprint arXiv:2405.17044, 2024
-
[49]
Ideabench: Benchmarking large language models for research idea generation
Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Albert Huang, Eric Xie, Stefan Bekiranov, and Aidong Zhang. Ideabench: Benchmarking large language models for research idea generation. arXiv preprint arXiv:2411.02429, 2024
-
[50]
De novo generation of sars-cov-2 antibody cdrh3 with a pre-trained generative large language model
Haohuai He, Bing He, Lei Guan, Yu Zhao, Feng Jiang, Guanxing Chen, Qingge Zhu, Calvin Yu-Chian Chen, Ting Li, and Jianhua Yao. De novo generation of sars-cov-2 antibody cdrh3 with a pre-trained generative large language model. Nature Communications, 15(1):6867, 2024. doi: 10.1038/s41467-024-50903-y
-
[51]
Tom Hope, J Portenoy, K Vasan, J Borchardt, Eric Horvitz, DS Weld, MA Hearst, and Jevin West. Scisight: Com- bining faceted navigation and research group detection for covid-19 exploratory scientific search. Proceedings of the 2020 EMNLP (Systems Demonstrations), Association for Computational Linguistics, 2020
work page 2020
-
[52]
A computational inflection for scientific discovery
Tom Hope, Doug Downey, Daniel S Weld, Oren Etzioni, and Eric Horvitz. A computational inflection for scientific discovery. Communications of the ACM, 66(8):62–73, 2023. doi: 10.1145/3576896
-
[53]
Jianhua Hou, Dongyi Wang, and Jing Li. A new method for measuring the originality of academic articles based on knowledge units in semantic networks. Journal of Informetrics, 16(3):101306, 2022. doi: 10.1016/j.joi.2022. 101306
-
[54]
Chime: Llm-assisted hierarchical organization of scientific studies for literature review support
Chao-Chun Hsu, Erin Bransom, Jenna Sparks, Bailey Kuehl, Chenhao Tan, David Wadden, Lucy Lu Wang, and Aakanksha Naik. Chime: Llm-assisted hierarchical organization of scientific studies for literature review support. In Findings of the Association for Computational Linguistics ACL 2024, pages 118–132, 2024
work page 2024
-
[55]
A multi-agent framework for materials laws discovery
Bo Hu, Siyu Liu, Beilin Ye, Yun Hao, and Tongqi Wen. A multi-agent framework for materials laws discovery. arXiv preprint arXiv:2411.16416, 2024
-
[56]
Xiang Hu, Hongyu Fu, Jinge Wang, Yifeng Wang, Zhikun Li, Renjun Xu, Yu Lu, Yaochu Jin, Lili Pan, and Zhenzhong Lan. Nova: An iterative planning and search approach to enhance novelty and diversity of llm generated ideas. arXiv preprint arXiv:2410.14255, 2024
-
[57]
Hireview: Hierarchical taxonomy-driven automatic literature review generation
Yuntong Hu, Zhuofeng Li, Zheng Zhang, Chen Ling, Raasikh Kanjiani, Boxin Zhao, and Liang Zhao. Hireview: Hierarchical taxonomy-driven automatic literature review generation. arXiv preprint arXiv:2410.03761, 2024. 52
-
[58]
From detection to application: Recent advances in understanding scientific tables and figures
Jiani Huang, Haihua Chen, Fengchang Yu, and Wei Lu. From detection to application: Recent advances in understanding scientific tables and figures. ACM Computing Surveys, 56(10):1–39, 2024. doi: 10.1145/3657285
-
[59]
Crispr-gpt: An llm agent for automated design of gene-editing experiments
Kaixuan Huang, Yuanhao Qu, Henry Cousins, William A Johnson, Di Yin, Mihir Shah, Denny Zhou, Russ Altman, Mengdi Wang, and Le Cong. Crispr-gpt: An llm agent for automated design of gene-editing experiments. arXiv preprint arXiv:2404.18021, 2024
-
[60]
Mlagentbench: Evaluating language agents on machine learning experimentation
Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2023
work page 2023
-
[61]
Data multiplexed and hardware reused architecture for deep neural network accelerator,
Shengzhi Huang, Yong Huang, Yinpeng Liu, Zhuoran Luo, and Wei Lu. Are large language models qualified reviewers in originality evaluation? Information Processing & Management, 62(3):103973, 2025. doi: 10.1016/j. ipm.2024.103973
work page doi:10.1016/j 2025
-
[62]
Shengzhi Huang, Qicong Wang, Wei Lu, Lingyu Liu, Zhenzhen Xu, and Yong Huang. Papereval: A universal, quantitative, and explainable paper evaluation method powered by a multi-agent system. Information Processing & Management, 62(6):104225, 2025
work page 2025
-
[63]
Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai
Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai. Advances in Neural Information Processing Systems, 37:19209–19253, 2024
work page 2024
-
[64]
Openreviewer: A specialized large language model for generating critical scientific paper reviews
Maximilian Idahl and Zahra Ahmadi. Openreviewer: A specialized large language model for generating critical scientific paper reviews. arXiv preprint arXiv:2412.11948, 2024
-
[65]
Autonomous llm-driven research—from data to human-verifiable research papers
Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research—from data to human-verifiable research papers. NEJM AI, 2(1):AIoa2400555, 2025. doi: 10.1056/AIoa2400555
- [66]
-
[67]
Scirex: A challenge dataset for document-level information extraction
Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, and Iz Beltagy. Scirex: A challenge dataset for document-level information extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7506–7516, 2020
-
[68]
Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. Advances in Neural Information Processing Systems, 37:10088–10116, 2024
-
[69]
Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Daniel S Weld, and Peter Clark. Codescientist: End-to-end semi-automated scientific discovery with code-based experimentation. arXiv preprint arXiv:2503.22708, 2025
-
[70]
Llmatdesign: Autonomous materials discovery with large language models
Shuyi Jia, Chao Zhang, and Victor Fung. Llmatdesign: Autonomous materials discovery with large language models. arXiv preprint arXiv:2406.13163, 2024
-
[71]
Rihui Jin, Yu Li, Guilin Qi, Nan Hu, Yuan-Fang Li, Jiaoyan Chen, Jianan Wang, Yongrui Chen, Dehai Min, and Sheng Bi. Hegta: Leveraging heterogeneous graph-enhanced large language models for few-shot complex table understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24294–24302, 2025
-
[72]
Agentreview: Exploring peer review dynamics with llm agents
Yiqiao Jin, Qinlin Zhao, Yiyang Wang, Hao Chen, Kaijie Zhu, Yijia Xiao, and Jindong Wang. Agentreview: Exploring peer review dynamics with llm agents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1208–1226, 2024
-
[73]
Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. DSBench: How far are data science agents from becoming data science experts? In The Thirteenth International Conference on Learning Representations, 2025
-
[74]
Researcharena: Benchmarking llms’ ability to collect and organize information as research agents
Hao Kang and Chenyan Xiong. Researcharena: Benchmarking llms’ ability to collect and organize information as research agents. arXiv preprint arXiv:2406.10291, 2024
-
[75]
Yeonghun Kang and Jihan Kim. Chatmof: an artificial intelligence system for predicting and generating metal-organic frameworks using large language models. Nature communications, 15(1):4705, 2024. doi: 10.1038/s41467-024-48998-4
-
[76]
Scireviewgen: A large-scale dataset for automatic literature review generation
Tetsu Kasanishi, Masaru Isonuma, Junichiro Mori, and Ichiro Sakata. Scireviewgen: A large-scale dataset for automatic literature review generation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6695–6715, 2023
-
[77]
Sci-idea: Context-aware scientific ideation using token and sentence embeddings
Farhana Keya, Gollam Rabby, Prasenjit Mitra, Sahar Vahdati, Sören Auer, and Yaser Jaradeh. Sci-idea: Context-aware scientific ideation using token and sentence embeddings. arXiv preprint arXiv:2503.19257, 2025.
-
[78]
Curie: Toward rigorous and automated scientific experimentation with ai agents
Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, and Ang Chen. Curie: Toward rigorous and automated scientific experimentation with ai agents. CoRR, 2025
-
[79]
Shrinidhi Kumbhar, Venkatesh Mishra, Kevin Coutinho, Divij Handa, Ashif Iquebal, and Chitta Baral. Hypothesis generation for materials discovery and design using goal-driven and constraint-guided LLM agents. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 7524–7555, Albuquer...
-
[80]
Moreno La Quatra and Luca Cagliero. Transformer-based highlights extraction from scientific papers. Knowledge-Based Systems, 252:109382, 2022. doi: 10.1016/j.knosys.2022.109382