
arxiv: 2605.18661 · v1 · pith:RA5SONDLnew · submitted 2026-05-18 · 💻 cs.AI

AI for Auto-Research: Roadmap & User Guide

Pith reviewed 2026-05-20 10:26 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI-assisted research · research automation · LLM limitations · scientific integrity · autonomous research agents · idea generation · peer review · validation

The pith

AI excels at structured research tasks but remains fragile for novel ideas and scientific judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that AI capability across the research lifecycle has a sharp boundary. AI can reliably assist with structured, retrieval-based tasks such as literature review and basic coding, but it falters when asked to generate genuinely new ideas, conduct research-level experiments, or exercise scientific judgment. This matters because, as automation increases, so does the risk of hidden errors and weakened integrity, which suggests that fully autonomous systems are not yet dependable. For a sympathetic reader, the payoff is guidance on deploying AI tools without compromising research quality.

Core claim

The authors claim that AI for auto-research has reached a point where systems can generate papers cheaply and agents can run experiments with little input, yet a detailed review up to April 2026 shows persistent weaknesses. Specifically, AI excels in structured tasks but is unreliable for novelty and judgment, with ideas degrading after implementation and autonomous systems not yet meeting major venue standards. They conclude that more automation can obscure failures, making human-governed collaboration the best approach.

What carries the argument

The stage-dependent boundary between reliable assistance and unreliable autonomy in the four phases of research: creation, writing, validation, and dissemination.
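
One way to read the boundary operationally is as an oversight policy keyed on task type. The sketch below is an illustration only, not something the paper provides: the names `Task`, `Mode`, `STRUCTURED_TASKS`, and `deployment_mode` are hypothetical, and the task labels merely paraphrase the examples given on this page (literature review and basic coding as structured; novel ideas, research-level experiments, and scientific judgment as fragile).

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ASSIST = "assist"            # AI output is usable after routine checking
    HUMAN_GATED = "human_gated"  # a human must approve before the result is used


# Hypothetical labels for the structured tasks named on this page.
STRUCTURED_TASKS = {"literature_review", "basic_coding"}

# The four phases named in the paper's abstract.
PHASES = {"creation", "writing", "validation", "dissemination"}


@dataclass(frozen=True)
class Task:
    phase: str  # one of PHASES
    kind: str   # e.g. "literature_review", "novel_idea" (labels are assumptions)


def deployment_mode(task: Task) -> Mode:
    """Return the oversight level for a task.

    Anything not explicitly listed as structured is human-gated, so an
    unrecognized task kind falls on the cautious side of the boundary.
    """
    if task.phase not in PHASES:
        raise ValueError(f"unknown phase: {task.phase!r}")
    return Mode.ASSIST if task.kind in STRUCTURED_TASKS else Mode.HUMAN_GATED
```

Defaulting unlisted task kinds to `HUMAN_GATED` is a design choice made here to match the page's emphasis on human-governed collaboration; a real deployment would need its own task inventory.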

Load-bearing premise

The failures of frontier LLMs observed through April 2026, namely fabricating results and misjudging novelty, represent a stable boundary rather than a temporary limitation.

What would settle it

A fully autonomous AI system that consistently produces research papers accepted at major venues like NeurIPS without human intervention would falsify the central boundary claim.

Original abstract

AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Studying developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, benchmark suite, and tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript analyzes AI-assisted research across the full lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding/experiments), Writing, Validation (peer review, rebuttal), and Dissemination (posters, slides, videos, social media). Based on observations of frontier LLMs through April 2026, it claims a sharp stage-dependent boundary: AI excels at structured, retrieval-grounded, and tool-mediated tasks but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. It further argues that greater automation can obscure failure modes, advocates human-governed collaboration, and supplies a taxonomy, benchmark suite, tool inventory, cross-stage design principles, and practitioner playbook with an associated project page.

Significance. If the boundary claim holds as a stable epistemological distinction rather than a transient snapshot, the work offers a practical, practitioner-oriented roadmap that could inform responsible AI deployment in research. The emphasis on failure modes, the provision of resources at a project page, and the structured taxonomy add utility as a guide for the field, though its value depends on the durability of the observed limitations.

major comments (2)
  1. [Abstract] Abstract: The claim of a 'sharp, stage-dependent boundary' between reliable assistance and unreliable autonomy is grounded solely in qualitative observations of LLM fabrication, missed errors, and poor novelty assessment through April 2026. No quantitative benchmarks, systematic error analysis, controlled comparisons, or longitudinal data are referenced to demonstrate why this distinction reflects an enduring limit rather than a current capability gap that scaling or new methods might close.
  2. [Abstract] Abstract (phases description): The recommendation that 'human-governed collaboration [is] the most credible deployment paradigm' and that 'greater automation can obscure rather than eliminate failure modes' is presented without ablation studies, end-to-end comparisons of fully autonomous vs. hybrid systems, or evidence from the four phases showing that increased automation reliably increases (rather than decreases) undetected errors.
minor comments (1)
  1. [Abstract] Abstract: The mention of a 'structured taxonomy, benchmark suite, and tool inventory' would benefit from explicit description of construction methodology and validation criteria to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments on the abstract below, clarifying the observational basis of the work while revising the text to better qualify our claims as a current snapshot rather than a proven enduring limit.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of a 'sharp, stage-dependent boundary' between reliable assistance and unreliable autonomy is grounded solely in qualitative observations of LLM fabrication, missed errors, and poor novelty assessment through April 2026. No quantitative benchmarks, systematic error analysis, controlled comparisons, or longitudinal data are referenced to demonstrate why this distinction reflects an enduring limit rather than a current capability gap that scaling or new methods might close.

    Authors: The manuscript is framed as a roadmap and practitioner guide based on direct observations of frontier models through April 2026, not as a controlled empirical evaluation. The stage-dependent boundary is illustrated through the full-text analysis, benchmark suite, and tool inventory, which reference existing quantitative results for subtasks such as retrieval and code generation. We accept that the abstract overstates the distinction as 'sharp' without longitudinal evidence. We have revised the abstract to describe an 'observed stage-dependent pattern' in current systems and to note explicitly that scaling or new methods could narrow these gaps. The project page will be updated with future observations.

    Revision: partial

  2. Referee: [Abstract] Abstract (phases description): The recommendation that 'human-governed collaboration [is] the most credible deployment paradigm' and that 'greater automation can obscure rather than eliminate failure modes' is presented without ablation studies, end-to-end comparisons of fully autonomous vs. hybrid systems, or evidence from the four phases showing that increased automation reliably increases (rather than decreases) undetected errors.

    Authors: We agree that ablation studies and direct end-to-end comparisons would provide stronger causal evidence. Such experiments lie outside the scope of this synthesis paper. The recommendation instead rests on documented failure modes across the four phases (e.g., fabricated results in autonomous paper generation and undetected errors in AI-assisted validation), which are detailed with examples in the revised manuscript. We have expanded the cross-stage design principles section to include concrete illustrations of how greater automation can mask issues and have added guidance for hybrid workflows. The claim is presented as the most credible paradigm given present capabilities rather than a universally proven result.

    Revision: partial

standing simulated objections not resolved
  • New ablation studies or fresh quantitative benchmarks comparing fully autonomous versus hybrid systems across the full research lifecycle, which would require a separate large-scale experimental effort.

Circularity Check

0 steps flagged

No significant circularity; claims rest on external observations

Full rationale

The paper offers an end-to-end observational roadmap of AI assistance across research phases, identifying a stage-dependent boundary between reliable structured tasks and fragile performance on novel ideas or judgment. This boundary is presented as an empirical pattern drawn from developments through April 2026 and external literature rather than any derivation, equation, or fitted parameter internal to the manuscript. No self-definitional loop, fitted-input prediction, or load-bearing self-citation chain appears; the analysis remains self-contained against external benchmarks and does not reduce its central claim to inputs defined by the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that current frontier LLMs exhibit persistent fabrication and judgment failures under research pressure; no free parameters or new invented entities are introduced.

axioms (1)
  • Domain assumption: Under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably.
    Invoked in the abstract as the integrity problem motivating the entire analysis.

pith-pipeline@v0.9.0 · 5870 in / 1162 out tokens · 51812 ms · 2026-05-20T10:26:33.810273+00:00 · methodology


