AI for Auto-Research: Roadmap & User Guide

arxiv: 2605.18661 · v1 · pith:RA5SONDLnew · submitted 2026-05-18 · 💻 cs.AI

Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, Kevin Qinghong Lin, Xuan Billy Zhang, Song Wang, Rong Li, Qing Wu, Wei Gao, Yingshuo Wang, Shaoyuan Xie, Jiachen Liu, Leigang Qu, Shijie Li, Lai Xing Ng, Benoit R. Cottereau, Ziwei Liu, Tat-Seng Chua, Wei Tsang Ooi

Pith reviewed 2026-05-20 10:26 UTC · model grok-4.3

classification 💻 cs.AI

keywords: AI-assisted research, research automation, LLM limitations, scientific integrity, autonomous research agents, idea generation, peer review, validation

The pith

AI excels at structured research tasks but remains fragile for novel ideas and scientific judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI capability across the research lifecycle splits along a sharp boundary. AI can reliably assist with structured, retrieval-based tasks such as literature review and basic coding, but it falters when asked to generate genuinely new ideas, conduct research-level experiments, or exercise scientific judgment. This matters because, as automation increases, so does the risk of hidden errors and eroded integrity, which suggests that fully autonomous systems are not yet ready for unsupervised use. For a sympathetic reader, the value is practical: the boundary indicates where AI tools can be deployed without compromising research quality.

Core claim

The authors claim that auto-research systems can now generate papers cheaply and that agents can run experiments with little human input, yet their review of developments through April 2026 finds persistent weaknesses. Specifically, AI excels at structured tasks but is unreliable for novelty and judgment: generated ideas often degrade after implementation, and autonomous systems have not yet consistently met major-venue acceptance standards. They conclude that more automation can obscure failures, making human-governed collaboration the most credible approach.

What carries the argument

The stage-dependent boundary between reliable assistance and unreliable autonomy in the four phases of research: creation, writing, validation, and dissemination.

Load-bearing premise

The failures of frontier LLMs observed through April 2026, namely fabricating results and misjudging novelty, represent a stable boundary rather than a temporary limitation.

What would settle it

A fully autonomous AI system that consistently produces research papers accepted at major venues like NeurIPS without human intervention would falsify the central boundary claim.

Original abstract

AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Studying developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, benchmark suite, and tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.
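The four-phase breakdown in the abstract can be written down as a small lookup structure. The sketch below is only an illustration: the phase and task names are copied from the abstract, while `LIFECYCLE`, `HUMAN_SIGNOFF`, `phase_of`, and `requires_human_signoff` are hypothetical names introduced here. In particular, the choice of which tasks go into `HUMAN_SIGNOFF` is one reader's interpretation of "novel ideas, research-level experiments, and scientific judgment", not a classification given by the paper.

```python
from enum import Enum


class Phase(Enum):
    CREATION = "Creation"
    WRITING = "Writing"
    VALIDATION = "Validation"
    DISSEMINATION = "Dissemination"


# Phase -> tasks, as listed in the abstract.
LIFECYCLE = {
    Phase.CREATION: ["idea generation", "literature review",
                     "coding & experiments", "tables & figures"],
    Phase.WRITING: ["paper writing"],
    Phase.VALIDATION: ["peer review", "rebuttal & revision"],
    Phase.DISSEMINATION: ["posters", "slides", "videos", "social media",
                          "project pages", "interactive agents"],
}

# ASSUMPTION (not from the paper): tasks we read as hinging on novelty or
# scientific judgment, and therefore routed to a human before release.
HUMAN_SIGNOFF = {"idea generation", "coding & experiments",
                 "peer review", "rebuttal & revision"}


def phase_of(task: str) -> Phase:
    """Return the phase that contains `task`; raise KeyError if unknown."""
    for phase, tasks in LIFECYCLE.items():
        if task in tasks:
            return phase
    raise KeyError(task)


def requires_human_signoff(task: str) -> bool:
    """True if the illustrative policy above routes `task` to a human."""
    phase_of(task)  # reject task names that are not in the taxonomy
    return task in HUMAN_SIGNOFF


if __name__ == "__main__":
    for name in ("literature review", "idea generation", "slides"):
        print(name, phase_of(name).value, requires_human_signoff(name))
```

Keeping the sign-off set separate from the taxonomy means the policy can be revised as capabilities change without touching the phase structure itself.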

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This roadmap organizes AI across the full research lifecycle into four phases and flags a practical split between reliable tool use and fragile judgment, but the split rests on 2026-era observations rather than fixed limits.

The paper's main contribution is a four-phase breakdown of AI-assisted research—Creation, Writing, Validation, and Dissemination—plus a clear line between tasks where current systems can be trusted and those where they cannot. It argues that AI handles structured retrieval and tool-mediated work but still fabricates results, misses errors, and judges novelty poorly, so human oversight remains essential even as automation grows cheaper and more capable.

Referee Report

2 major / 1 minor

Summary. The manuscript analyzes AI-assisted research across the full lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding/experiments), Writing, Validation (peer review, rebuttal), and Dissemination (posters, slides, videos, social media). Based on observations of frontier LLMs through April 2026, it claims a sharp stage-dependent boundary: AI excels at structured, retrieval-grounded, and tool-mediated tasks but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. It further argues that greater automation can obscure failure modes, advocates human-governed collaboration, and supplies a taxonomy, benchmark suite, tool inventory, cross-stage design principles, and practitioner playbook with an associated project page.

Significance. If the boundary claim holds as a stable epistemological distinction rather than a transient snapshot, the work offers a practical, practitioner-oriented roadmap that could inform responsible AI deployment in research. The emphasis on failure modes, the provision of resources at a project page, and the structured taxonomy add utility as a guide for the field, though its value depends on the durability of the observed limitations.

major comments (2)

[Abstract] The claim of a 'sharp, stage-dependent boundary' between reliable assistance and unreliable autonomy is grounded solely in qualitative observations of LLM fabrication, missed errors, and poor novelty assessment through April 2026. No quantitative benchmarks, systematic error analysis, controlled comparisons, or longitudinal data are referenced to demonstrate why this distinction reflects an enduring limit rather than a current capability gap that scaling or new methods might close.
[Abstract, phases description] The recommendation that 'human-governed collaboration [is] the most credible deployment paradigm' and that 'greater automation can obscure rather than eliminate failure modes' is presented without ablation studies, end-to-end comparisons of fully autonomous vs. hybrid systems, or evidence from the four phases showing that increased automation reliably increases (rather than decreases) undetected errors.

minor comments (1)

[Abstract] The mention of a 'structured taxonomy, benchmark suite, and tool inventory' would benefit from explicit description of construction methodology and validation criteria to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments on the abstract below, clarifying the observational basis of the work while revising the text to better qualify our claims as a current snapshot rather than a proven enduring limit.

Referee: [Abstract] The claim of a 'sharp, stage-dependent boundary' between reliable assistance and unreliable autonomy is grounded solely in qualitative observations of LLM fabrication, missed errors, and poor novelty assessment through April 2026. No quantitative benchmarks, systematic error analysis, controlled comparisons, or longitudinal data are referenced to demonstrate why this distinction reflects an enduring limit rather than a current capability gap that scaling or new methods might close.

Authors: The manuscript is framed as a roadmap and practitioner guide based on direct observations of frontier models through April 2026, not as a controlled empirical evaluation. The stage-dependent boundary is illustrated through the full-text analysis, benchmark suite, and tool inventory, which reference existing quantitative results for subtasks such as retrieval and code generation. We accept that the abstract overstates the distinction as 'sharp' without longitudinal evidence. We have revised the abstract to describe an 'observed stage-dependent pattern' in current systems and to note explicitly that scaling or new methods could narrow these gaps. The project page will be updated with future observations.
Revision: partial

Referee: [Abstract, phases description] The recommendation that 'human-governed collaboration [is] the most credible deployment paradigm' and that 'greater automation can obscure rather than eliminate failure modes' is presented without ablation studies, end-to-end comparisons of fully autonomous vs. hybrid systems, or evidence from the four phases showing that increased automation reliably increases (rather than decreases) undetected errors.

Authors: We agree that ablation studies and direct end-to-end comparisons would provide stronger causal evidence. Such experiments lie outside the scope of this synthesis paper. The recommendation instead rests on documented failure modes across the four phases (e.g., fabricated results in autonomous paper generation and undetected errors in AI-assisted validation), which are detailed with examples in the revised manuscript. We have expanded the cross-stage design principles section to include concrete illustrations of how greater automation can mask issues and have added guidance for hybrid workflows. The claim is presented as the most credible paradigm given present capabilities rather than a universally proven result.
Revision: partial

standing simulated objections not resolved

New ablation studies or fresh quantitative benchmarks comparing fully autonomous versus hybrid systems across the full research lifecycle, which would require a separate large-scale experimental effort.

Circularity Check

0 steps flagged

No significant circularity; claims rest on external observations

The paper offers an end-to-end observational roadmap of AI assistance across research phases, identifying a stage-dependent boundary between reliable structured tasks and fragile performance on novel ideas or judgment. This boundary is presented as an empirical pattern drawn from developments through April 2026 and external literature rather than any derivation, equation, or fitted parameter internal to the manuscript. No self-definitional loop, fitted-input prediction, or load-bearing self-citation chain appears; the analysis remains self-contained against external benchmarks and does not reduce its central claim to inputs defined by the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that current frontier LLMs exhibit persistent fabrication and judgment failures under research pressure; no free parameters or new invented entities are introduced.

axioms (1)

Domain assumption: Under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably.
Invoked in the abstract as the integrity problem motivating the entire analysis.

pith-pipeline@v0.9.0 · 5870 in / 1162 out tokens · 51812 ms · 2026-05-20T10:26:33.810273+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

271 extracted references · 271 canonical work pages · 27 internal anchors

[1]

Abramson, J

J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick, S. W. Bodenstein, D. A. Evans, C.-C. Hung, M. O’Neill, D. Reiman, K. Tunyasuvunakool, Z. Wu, A. Žemgulyt˙ e, E. Arvaniti, C. Beattie, O. Bertolusso, A. Sherwood, J. M. Jumper, and D. Hassabis. Accurate structure prediction of biomolec...

work page 2024
[2]

Agarwal, G

S. Agarwal, G. Sahu, A. Puri, I. H. Laradji, K. D. Dvijotham, J. Stanley, L. Charlin, and C. Pal. LitLLM: A toolkit for literature review with large language models.arXiv preprint arXiv:2402.01788, 2024

work page arXiv 2024
[3]

Aggarwal and A

T. Aggarwal and A. Bhand. PASS: Presentation automation for slide generation and speech.arXiv preprint arXiv:2501.06497, 2025

work page arXiv 2025
[4]

Ajith, M

A. Ajith, M. Xia, A. Chevalier, T. Goyal, D. Chen, and T. Gao. LitSearch: A retrieval benchmark for scientific literature search. InConference on Empirical Methods in Natural Language Processing, 2024

work page 2024
[5]

Al Azher, M

I. Al Azher, M. J. Mokarrama, Z. Guo, S. R. Choudhury, and H. Alhoori. FutureGen: A RAG-based approach to generate the future work of scientific article.arXiv preprint arXiv:2503.16561, 2025

work page arXiv 2025
[6]

Al Azher, Z

I. Al Azher, Z. Guo, and H. Alhoori. Multi-agent LLMs for generating research limitations.arXiv preprint arXiv:2601.11578, 2026

work page arXiv 2026
[7]

Tongyi DeepResearch: An agentic LLM for long-horizon deep information seeking

Alibaba NLP. Tongyi DeepResearch: An agentic LLM for long-horizon deep information seeking. https: //github.com/Alibaba-NLP/DeepResearch, 2025

work page 2025
[8]

FARS: Fully automated research system.https://analemma.ai/blog/introducing-fars, 2026

Analemma.ai. FARS: Fully automated research system.https://analemma.ai/blog/introducing-fars, 2026

work page 2026
[9]

A. Asai, J. He, R. Shao, W. Shi, A. Singh, J. C. Chang, K. Lo, L. Soldaini, S. Feldman, M. D’Arcy, D. Wadden, M. Latzke, J. Sparks, J. D. Hwang, V. Kishore, M. Tian, P. Ji, S. Liu, H. Tong, B. Wu, Y. Xiong, L. Zettlemoyer, G. Neubig, D. S. Weld, D. Downey, W. tau Yih, P. W. Koh, and H. Hajishirzi. Synthesizing scientific literature with retrieval-augmente...

work page 2026
[10]

J. Baek, S. K. Jauhar, S. Cucerzan, and S. J. Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. InConference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 6709–6738, 2025

work page 2025
[11]

Beger and C.-L

C. Beger and C.-L. Henneking. Citegeist: Automated generation of related work analysis on the arXiv corpus. arXiv preprint arXiv:2503.23229, 2025

work page arXiv 2025
[12]

Belouadi, A

J. Belouadi, A. Lauscher, and S. Eger. AutomaTikZ: Text-guided synthesis of scientific vector graphics with TikZ. InInternational Conference on Learning Representations, 2024

work page 2024
[13]

Belouadi, S

J. Belouadi, S. P. Ponzetto, and S. Eger. DeTikZify: Synthesizing graphics programs for scientific figures and sketches with TikZ. InAdvances in Neural Information Processing Systems, volume 37, pages 85074–85108, 2024

work page 2024
[14]

228 hours of non-stop work to produce 100 papers, burning through 11.4 billion tokens: Fars has gone crazy.https://eu.36kr.com/en/p/3696795271966336, 2026

Blog. 228 hours of non-stop work to produce 100 papers, burning through 11.4 billion tokens: Fars has gone crazy.https://eu.36kr.com/en/p/3696795271966336, 2026

work page arXiv 2026
[15]

D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023

work page 2023
[16]

Bragg, M

J. Bragg, M. D’Arcy, N. Balepur, D. Bareket, B. Dalvi, S. Feldman, D. Haddad, J. D. Hwang, P. Jansen, V. Kishore, B. P. Majumder, A. Naik, S. Rahamimov, K. Richardson, A. Singh, H. Surana, A. Tiktinsky, R. Vasu, G. Wiener, C. Anastasiades, S. Candra, J. Dunkelberger, D. Emery, R. Evans, M. Hamada, R. Huff, R. Kinney, M. Latzke, J. Lochner, R. Lozano-Aguil...

work page 2026
[17]

A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller. ChemCrow: Augmenting large language models with chemistry tools.Nature Machine Intelligence, 6(5):525–535, 2024

work page 2024
[18]

DeerFlow: A deep research framework orchestrating sub-agents, memory, and sandboxes.https: //github.com/bytedance/deer-flow, 2025

ByteDance. DeerFlow: A deep research framework orchestrating sub-agents, memory, and sandboxes.https: //github.com/bytedance/deer-flow, 2025

work page 2025
[19]

J. Chai, S. Tang, R. Ye, Y. Du, X. Zhu, M. Zhou, Y. Wang, W. E, Y. Zhang, L. Zhang, and S. Chen. SciMaster: Towards general-purpose scientific AI agents, part I. X-Master as foundation: Can we lead on humanity’s last exam?arXiv preprint arXiv:2507.05241, 2025. 52

work page arXiv 2025
[20]

J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, and A. Madry. MLE-Bench: Evaluating machine learning agents on machine learning engineering. In International Conference on Learning Representations, 2025

work page 2025
[21]

Chen and I

C.-C. Chen and I. Gurevych. Commitment checklist: Auditing author commitments in peer review.arXiv preprint arXiv:2603.00003, 2026

work page arXiv 2026
[22]

D. Chen. AI-generated figures in academic publishing: Policies, tools, and practical guidelines.arXiv preprint arXiv:2603.16159, 2026

work page arXiv 2026
[23]

G. Chen, J. Chen, L. Chen, J. Zhao, F. Meng, W. X. Zhao, R. Song, C. Chen, J.-R. Wen, and K. Jia. Toward autonomous long-horizon engineering for ML research.arXiv preprint arXiv:2604.13018, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

H. Chen, M. Xiong, Y. Lu, W. Han, A. Deng, Y. He, J. Wu, Y. Li, Y. Liu, and B. Hooi. MLR-Bench: Evaluating AI agents on open-ended machine learning research.arXiv preprint arXiv:2505.19955, 2025

work page arXiv 2025
[25]

N. Chen, A. H. Lin, J. Wu, J. Hou, Z. Zhang, Q. Wang, X. Wang, and B. He. XtraGPT: Context-aware and controllable academic paper revision.arXiv preprint arXiv:2505.11336, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Q. Chen, M. Yang, L. Qin, J. Liu, Z. Yan, J. Guan, D. Peng, Y. Ji, H. Li, M. Hu, Y. Zhang, Y. Liang, Y. Zhou, J. Wang, Z. Chen, and W. Che. AI4Research: A survey of artificial intelligence for scientific research.arXiv preprint arXiv:2507.01903, 2025

work page arXiv 2025
[27]

S. Chen, J. Lai, J. Gao, H. Shi, Z. Liu, T. Ye, J. Luo, X. Wei, and L. Zhu. PosterOmni: Generalized artistic poster creation via task distillation and unified reward feedback.arXiv preprint arXiv:2602.12127, 2026

work page arXiv 2026
[28]

S. Chen, J. Lai, J. Gao, T. Ye, H. Chen, H. Shi, S. Shao, Y. Lin, S. Fei, Z. Xing, Y. Jin, J. Luo, X. Wei, and L. Zhu. PosterCraft: Rethinking high-quality aesthetic poster generation in a unified framework. InInternational Conference on Learning Representations, 2026

work page 2026
[29]

S. Chen, S. Zhong, D. P. Brumby, and A. L. Cox. What happens when reviewers receive AI feedback in their reviews? InCHI Conference on Human Factors in Computing Systems, pages 1–19, 2026

work page 2026
[30]

Y. Chen, T. Lv, S. Zhang, Y. Yin, Y. Wan, P. S. Yu, and D. Chen. Paper2Web: Let’s make your paper alive! arXiv preprint arXiv:2510.15842, 2025

work page arXiv 2025
[31]

Z. Chen, J. Chen, S. O. Arik, M. Sra, T. Pfister, and J. Yoon. CoDA: Agentic systems for collaborative data visualization.arXiv preprint arXiv:2510.03194, 2025

work page arXiv 2025
[32]

Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao, C. Wei, Z. Lu, V. Dey, M. Xue, F. N. Baker, B. Burns, D. Adu-Ampratwum, X. Huang, X. Ning, S. Gao, Y. Su, and H. Sun. ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery. InInternational Conference on Learning Representations, 2025

work page 2025
[33]

J. Choi, S. Park, S. Song, and H. Shim. PosterForest: Hierarchical multi-agent collaboration for scientific poster generation.arXiv preprint arXiv:2508.21720, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

P. H. Couto, Q. P. Ho, N. Kumari, B. K. Rachmat, T. G. H. Khuong, I. Ullah, and L. Sun-Hosoya. RelevAI- Reviewer: A benchmark on AI reviewers for survey paper relevance.arXiv preprint arXiv:2406.10294, 2024

work page arXiv 2024
[35]

D’Arcy, T

M. D’Arcy, T. Hope, L. Birnbaum, and D. Downey. MARG: Multi-agent review generation for scientific papers. arXiv preprint arXiv:2401.04259, 2024

work page arXiv 2024
[36]

De Ponte

F. De Ponte. OpenDraft: 19-agent research draft generation.https://github.com/federicodeponte/opendraft, 2025

work page 2025
[37]

X. Deng, J. Da, E. Pan, Y. Y. He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, K. Sampath, M. Krishnan, S. Kundurthy, S. Hendryx, Z. Wang, V. Bharadwaj, J. Holm, R. Aluri, C. B. C. Zhang, N. Jacobson, B. Liu, and B. Kenstler. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

A. Elovic. GPT Researcher: Autonomous agent for comprehensive online research. https://github.com/ assafelovic/gpt-researcher, 2024

work page 2024
[39]

T. Fan, F. Zhang, Y. Zheng, B. Chen, X. Niu, C. Huang, J. Lin, and C. Huang. DeepInnovator: Triggering the innovative capabilities of LLMs.arXiv preprint arXiv:2602.18920, 2026

work page arXiv 2026
[40]

Y. Feng, Q. Huang, X. Xie, Z. Yang, J. Yu, W. Chen, and A. K. H. Tung. IDRBench: Interactive deep research benchmark.arXiv preprint arXiv:2601.06676, 2026. 53

work page arXiv 2026
[41]

T.-J. Fu, W. Y. Wang, D. McDuff, and Y. Song. DOC2PPT: Automatic presentation slides generation from scientific documents. InAAAI Conference on Artificial Intelligence, volume 36, pages 634–642, 2022

work page 2022
[42]

S. Gao, R. Zhu, P. Sui, Z. Kong, S. Aldogom, Y. Huang, A. Noori, R. Shamji, K. Parvataneni, T. Tsiligkaridis, and M. Zitnik. Democratizing AI scientists using ToolUniverse.arXiv preprint arXiv:2509.23426, 2025

work page arXiv 2025
[43]

X. Gao, J. Ruan, Z. Zhang, J. Gao, T. Liu, and Y. Fu. ReviewAgents: Bridging the gap between human and AI-generated paper reviews.arXiv preprint arXiv:2503.08506, 2025

work page arXiv 2025
[44]

Y. Gao, Q. Wu, and L. Zhu. Merging the citations received by arXiv-deposited e-prints and their corresponding published journal articles: Problems and perspectives.Information Processing & Management, 57(5):102267, 2020

work page 2020
[45]

Z. Gao, K. Brantley, and T. Joachims. Reviewer2: Optimizing review generation through prompt generation. arXiv preprint arXiv:2402.10886, 2024

work page arXiv 2024
[46]

K. Garg, F. Shaik, S. Bandyopadhyay, and C. Caragea. Let’s use ChatGPT to write our paper! benchmarking LLMs to write the introduction of a research paper.arXiv preprint arXiv:2508.14273, 2025

work page arXiv 2025
[47]

Garikaparthi, M

A. Garikaparthi, M. Patwardhan, L. Vig, and A. Cohan. IRIS: Interactive research ideation system for accelerating scientific discovery. InAnnual Meeting of the Association for Computational Linguistics, pages 592–603, 2025

work page 2025
[48]

J. Ge, Z. Z. Wang, X. Zhou, Y.-H. Peng, S. Subramanian, Q. Tan, M. Sap, A. Suhr, D. Fried, G. Neubig, and T. Darrell. AutoPresent: Designing structured visuals from scratch. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2902–2911, 2025

work page 2025
[49]

Ghafarollahi and M

A. Ghafarollahi and M. J. Buehler. SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning.arXiv preprint arXiv:2409.05556, 2024

work page arXiv 2024
[50]

E. Gibney. Major conference catches illicit AI use — and rejects hundreds of papers.Nature News, 652:281–282, 2026

work page 2026
[51]

G. H. T. Go, K. Ly, A. Sogaard, A. Tabatabaei, M. de Rijke, and X. Chen. LiRA: A multi-agent framework for reliable and readable literature review generation.arXiv preprint arXiv:2510.05138, 2025

work page arXiv 2025
[52]

S. Goel, R. Hazra, D. Jayalath, T. Willi, P. Jain, W. F. Shen, I. Leontiadis, F. Barbieri, Y. Bachrach, J. Geiping, and C. Whitehouse. Training AI co-scientists using rubric rewards.arXiv preprint arXiv:2512.23707, 2025

work page arXiv 2025
[53]

Goswami, P

K. Goswami, P. Mathur, R. Rossi, and F. Dernoncourt. PlotGen: Multi-agent LLM-based scientific data visualization via multimodal feedback.arXiv preprint arXiv:2502.00988, 2025

work page arXiv 2025
[54]

Towards an AI co-scientist

J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic, A. Myaskovsky, F. Weissenberger, K. Rong, R. Tanno, K. Saab, D. Popovici, J. Blum, F. Zhang, K. Chou, A. Hassidim, B. Gokturk, A. Vahdat, P. Kohli, Y. Matias, A. Carroll, K. Kulkarni, N. Tomasev, Y. Guan, V. Dhillon, E. D. Vaishnav, B. Lee, T. R. D. Costa, J. R. Penadés, G. Peltz, Y. Xu, A...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review

P. Goyal, M. Parmar, Y. Song, H. Palangi, T. Pfister, and J. Yoon. ScholarPeer: A context-aware multi-agent framework for automated peer review.arXiv preprint arXiv:2601.22638, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[56]

Greisinger and S

C. Greisinger and S. Eger. TikZilla: Scaling text-to-TikZ with high-quality data and reinforcement learning. arXiv preprint arXiv:2603.03072, 2026

work page arXiv 2026
[57]

T. Gu, J. Wang, Z. Zhang, and H. Li. LLMs can realize combinatorial creativity: Generating creative ideas via LLMs for scientific research.arXiv preprint arXiv:2412.14141, 2024

work page arXiv 2024
[58]

S. Guo, C. Deng, Y. Wen, H. Chen, Y. Chang, and J. Wang. DS-Agent: Automated data science by empowering large language models with case-based reasoning.arXiv preprint arXiv:2402.17453, 2024

work page arXiv 2024
[59]

S. Guo, A. H. Shariatmadari, G. Xiong, A. Huang, M. Kim, C. M. Williams, S. Bekiranov, and A. Zhang. IdeaBench: Benchmarking large language models for research idea generation. InACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5888–5899, 2025

work page 2025
[60]

P. Han, Y. Yu, J. Xu, and J. You. DRPG (decompose, retrieve, plan, generate): An agentic framework for academic rebuttal.arXiv preprint arXiv:2601.18081, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[61]

Q. Hao, F. Xu, Y. Li, and J. Evans. Artificial intelligence tools expand scientists’ impact but contract science’s focus.Nature, 649:1237–1243, 2026. 54

work page 2026
[62]

Y. He, G. Huang, P. Feng, Y. Lin, Y. Zhang, H. Li, and W. E. PaSa: An LLM agent for comprehensive academic paper search.arXiv preprint arXiv:2501.10120, 2025

work page arXiv 2025
[63]

Z. He, Z. Lyu, and Y. R. Fung. RebuttalAgent: Strategic persuasion in academic rebuttal via theory of mind. arXiv preprint arXiv:2601.15715, 2026

work page arXiv 2026
[64]

Paper2Slides: From paper to presentation in one click

HKU Data Intelligence Lab. Paper2Slides: From paper to presentation in one click. https://github.com/ HKUDS/Paper2Slides, 2025

work page 2025
[65]

M. Hong, D. Jiang, C. J. Zhang, Z. Guo, Y. Li, J. Chen, S. Cui, and Z. Su. CiteLLM: An agentic platform for trustworthy scientific reference discovery.arXiv preprint arXiv:2602.23075, 2026

work page arXiv 2026
[66]

Hossain, S

E. Hossain, S. K. Sinha, N. Bansal, R. A. Knipper, S. Sarkar, J. Salvador, Y. Mahajan, S. R. P. K. Guttikonda, M. Akter, M. M. Hassan, M. Freestone, M. C. W. Jr., D. Feng, and S. Karmaker. LLMs as meta-reviewers’ assis- tants: A case study. InConference of the Nations of the Americas Chapter of the Association for Computational Linguistics, pages 7763–7803, 2025

work page 2025
[67]

J. Hou, A. H. Lin, N. Chen, Y. Gong, and B. He. PaperDebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing.arXiv preprint arXiv:2512.02589, 2025

work page arXiv 2025
[68]

C.-C. Hsu, E. Bransom, J. Sparks, B. Kuehl, C. Tan, D. Wadden, L. L. Wang, and A. Naik. CHIME: LLM-assisted hierarchical organization of scientific studies for literature review support.arXiv preprint arXiv:2407.16148, 2024

work page arXiv 2024
[69]

X. Hu, H. Fu, J. Wang, Y. Wang, Z. Li, R. Xu, Y. Lu, Y. Jin, L. Pan, and Z. Lan. Nova: An iterative planning and search approach to enhance novelty and diversity of LLM generated ideas.arXiv preprint arXiv:2410.14255, 2024

work page arXiv 2024
[70]

X. Hu, Z. Zhao, S. Wei, Z. Chai, Q. Ma, G. Wang, X. Wang, J. Su, J. Xu, M. Zhu, Y. Cheng, J. Yuan, J. Li, K. Kuang, Y. Yang, H. Yang, and F. Wu. InfiAgent-DABench: Evaluating agents on data analysis tasks.arXiv preprint arXiv:2401.05507, 2024

work page arXiv 2024
[71]

T. Hua, H. Hua, V. Xiang, B. Klieger, S. T. Truong, W. Liang, F.-Y. Sun, and N. Haber. ResearchCodeBench: Benchmarking LLMs on implementing novel machine learning research code.arXiv preprint arXiv:2506.02314, 2025

work page arXiv 2025
[72]

Huang, S

K. Huang, S. Zhang, H. Wang, Y. Qu, Y. Lu, Y. Roohani, R. Li, L. Qiu, G. Li, J. Zhang, D. Yin, S. Marwaha, J. N. Carter, X. Zhou, M. Wheeler, J. A. Bernstein, M. Wang, P. He, J. Zhou, M. Snyder, L. Cong, A. Regev, and J. Leskovec. Biomni: A general-purpose biomedical AI agent.https://github.com/snap-stanford/Biomni, 2025

work page 2025
[73]

Huang, J

Q. Huang, J. Vora, P. Liang, and J. Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. InInternational Conference on Machine Learning, 2024

[74]

S. Huang, Y. Gao, J. Bai, Y. Zhou, Z. Yin, X. Liu, R. Chellappa, C. P. Lau, S. Nag, C. Peng, and S. Pramanick. SciFig: Towards automating scientific figure generation. arXiv preprint arXiv:2601.04390, 2026

[75]

M. Idahl and Z. Ahmadi. OpenReviewer: A specialized large language model for generating critical scientific paper reviews. In Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 550–562, 2025

[76]

P. Jansen, M.-A. Cote, T. Khot, E. Bransom, B. Dalvi Mishra, B. P. Majumder, O. Tafjord, and P. Clark. DiscoveryWorld: A virtual environment for developing and evaluating automated scientific discovery agents. In Advances in Neural Information Processing Systems, 2024

[77]

P. Jansen, O. Tafjord, M. Radensky, P. Siangliulue, T. Hope, B. Dalvi Mishra, B. P. Majumder, D. S. Weld, and P. Clark. CodeScientist: End-to-end semi-automated scientific discovery with code-based experimentation. In Annual Meeting of the Association for Computational Linguistics, pages 13370–13467, 2025

[78]

B. Jiang. HindSight: Evaluating LLM-generated research ideas via future impact. arXiv preprint arXiv:2603.15164, 2026

[79]

L. Jiang, Y. Chai, M. Li, M. Liu, R. Fok, N. Dziri, Y. Tsvetkov, M. Sap, A. Albalak, and Y. Choi. Artificial hivemind: The open-ended homogeneity of language models (and beyond). In Advances in Neural Information Processing Systems, 2025

[80]

Y. Jiang and A. Y. Ng. Automated scientific reviewing with agentic AI. https://paperreview.ai/tech-overview, 2025

