XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration

Andre Lin HuiKai; Bingsheng He; Jiaying Wu; Junyi Hou; Nuo Chen; Qian Wang; Xidong Wang; Zining Zhang

arxiv: 2505.11336 · v4 · submitted 2025-05-16 · 💻 cs.CL

XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration

Nuo Chen , Andre Lin HuiKai , Jiaying Wu , Junyi Hou , Zining Zhang , Qian Wang , Xidong Wang , Bingsheng He This is my paper

Pith reviewed 2026-05-22 14:39 UTC · model grok-4.3

classification 💻 cs.CL

keywords academic paper revisionhuman-AI collaborationLLM fine-tuningcontext-aware modelingscientific writinginstruction-response pairsopen-source models

0 comments

The pith

XtraGPT fine-tunes open LLMs on 140,000 real revision pairs so they can handle context-aware, intent-guided edits to scientific papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that academic writing needs more than general text generation because it involves iterative, section-spanning revisions that keep conceptual coherence intact. It shows this by curating thousands of papers into explicit instruction-response pairs that capture how researchers actually revise drafts. The resulting models are then trained to follow specific revision goals while staying aware of surrounding sections. If the approach works, it supplies a controllable alternative to black-box prompting for the kind of detailed polishing that research communication requires.

Core claim

The central claim is that a human-AI collaboration framework built on criteria-guided intent alignment and context-aware modeling, when applied to a dataset of 140,000 instruction-response pairs drawn from 7,000 top-venue papers, produces open-source LLMs (1.5B to 14B parameters) that outperform same-scale baselines and approach proprietary model quality at improving scientific drafts, as measured by both automated preference tests and human evaluations.

What carries the argument

The criteria-guided intent alignment and context-aware modeling mechanism that turns each section-level revision instruction into a controlled edit while referencing the surrounding paper content.

If this is right

Open models become practical for iterative academic drafting without repeated calls to closed systems.
Revision quality improves when the model receives explicit intent signals tied to section context rather than generic prompts.
The same fine-tuning recipe can be applied at multiple model scales while preserving gains over baselines.
Human preference data confirm that the revisions maintain both local accuracy and cross-section coherence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar instruction sets could be built for related tasks such as responding to reviewer comments or condensing long papers.
Embedding the models in writing interfaces might let authors steer revisions at the paragraph level in real time.
If the revision patterns prove domain-general, the same data construction method could support non-academic technical writing.

Load-bearing premise

The curated set of 140,000 revision pairs from 7,000 papers accurately represents the kinds of edits that occur in real scientific writing and will transfer to unseen papers and research domains.

What would settle it

A blind side-by-side evaluation in which expert reviewers rate revisions of the same new paper sections produced by XtraGPT versus an untuned baseline of equal size, with no preference for XtraGPT.

Figures

Figures reproduced from arXiv: 2505.11336 by Andre Lin HuiKai, Bingsheng He, Jiaying Wu, Junyi Hou, Nuo Chen, Qian Wang, Xidong Wang, Zining Zhang.

**Figure 2.** Figure 2: Overview. The post-training pipelines enable controllable, section-level, fine-grained paper [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Paper quality scores and overall ratings from o1-based [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: shows the prompt for QA. The Prompt for QA Act as an expert model for improving articles **PAPER_CONTENT**. <SELECTED_CONTENT> User Selected </SELECTED_CONTENT> <QUESTION> <User Question> </QUESTION> [PITH_FULL_IMAGE:figures/full_fig_p026_4.png] view at source ↗

**Figure 5.** Figure 5: Prompts for Generating ReviseQA 31 [PITH_FULL_IMAGE:figures/full_fig_p031_5.png] view at source ↗

**Figure 6.** Figure 6: Prompts for Judging (modified from alpaca_eval_gpt4_turbo_fn). [PITH_FULL_IMAGE:figures/full_fig_p032_6.png] view at source ↗

**Figure 7.** Figure 7: Prompts for Scoring. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_7.png] view at source ↗

**Figure 8.** Figure 8: ICLR 2024 Paper Token Distribution (without Appendix) [PITH_FULL_IMAGE:figures/full_fig_p034_8.png] view at source ↗

**Figure 9.** Figure 9: Criteria Details of Section Title 34 [PITH_FULL_IMAGE:figures/full_fig_p034_9.png] view at source ↗

**Figure 10.** Figure 10: Criteria Details of Section Abstract 35 [PITH_FULL_IMAGE:figures/full_fig_p035_10.png] view at source ↗

**Figure 11.** Figure 11: Criteria Details of Section Introduction [PITH_FULL_IMAGE:figures/full_fig_p036_11.png] view at source ↗

**Figure 12.** Figure 12: Criteria Details of Section Background 37 [PITH_FULL_IMAGE:figures/full_fig_p037_12.png] view at source ↗

**Figure 13.** Figure 13: Criteria Details of Section Evaluation Criteria Details of Section Conclusion 1. Broader Impact and Future Directions: Assess the thoroughness of the paper’s conclusion or discussion sections in addressing the broader impact of the research. Does the paper provide specific and clear avenues for future work? 2. Clarity and Impact of Key Innovations and Findings: Evaluate whether the conclusion effectively … view at source ↗

**Figure 14.** Figure 14: Criteria Details of Section Conclusion 38 [PITH_FULL_IMAGE:figures/full_fig_p038_14.png] view at source ↗

**Figure 15.** Figure 15: The criteria for human instructors. 39 [PITH_FULL_IMAGE:figures/full_fig_p039_15.png] view at source ↗

**Figure 16.** Figure 16: A comparison of the paragraph before and after revision. [PITH_FULL_IMAGE:figures/full_fig_p040_16.png] view at source ↗

**Figure 17.** Figure 17: A use case on XtraGPT. 40 [PITH_FULL_IMAGE:figures/full_fig_p040_17.png] view at source ↗

**Figure 18.** Figure 18: Scaling trends of Qwen-2.5-7B/32B/72B/Max-Instruct performance. (a) MMLU-Pro [PITH_FULL_IMAGE:figures/full_fig_p041_18.png] view at source ↗

**Figure 19.** Figure 19: ICLR 2024 paper rating distribution. 41 [PITH_FULL_IMAGE:figures/full_fig_p041_19.png] view at source ↗

read the original abstract

Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited in supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, for example, maintaining conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process that is not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision, centered on criteria-guided intent alignment and context-aware modeling. To validate the framework, we curate a dataset of 7,000 research papers from top-tier venues, annotated with 140,000 instruction--response pairs that reflect realistic, section-level scientific revisions. We instantiate the framework in XtraGPT, the first suite of open-source LLMs (1.5B to 14B parameters) specifically fine-tuned for context-aware academic paper revision. Extensive experiments show that XtraGPT significantly outperforms same-scale baselines and rivals the quality of proprietary counterparts. Both automated preference assessments and human evaluations confirm the effectiveness of XtraGPT in improving scientific drafts. Our code and models are available at https://github.com/Xtra-Computing/XtraGPT and https://huggingface.co/collections/Xtra-Computing/xtragpt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

XtraGPT ships a new revision dataset and open fine-tuned models that beat same-size baselines on editing tasks, but the data realism is the part that needs checking.

read the letter

The main point is a new dataset of 140,000 instruction-response pairs drawn from 7,000 top-venue papers, plus a set of fine-tuned open models (1.5B to 14B) built for context-aware academic revisions. They report gains over same-scale baselines and results close to proprietary models on both automated preference scores and human judgments for draft improvement. The code and models are released, which makes the work immediately usable for others in the area.

Referee Report

2 major / 2 minor

Summary. The paper proposes a human-AI collaboration framework for academic paper revision that emphasizes criteria-guided intent alignment and context-aware modeling. It curates a dataset of 140,000 instruction-response pairs drawn from 7,000 top-tier papers to reflect section-level scientific revisions, then fine-tunes an open-source suite of LLMs (XtraGPT, 1.5B–14B parameters) on this data. Experiments claim that XtraGPT significantly outperforms same-scale baselines and approaches proprietary model quality, as measured by automated preference scores and human evaluations; code and models are released.

Significance. If the central claims hold after clarification of dataset construction and evaluation protocols, the work would supply the first openly available, scale-specific models for iterative, context-sensitive revision of scientific writing. The public release of code, models, and the 140k-pair dataset constitutes a concrete community resource that could accelerate research on controllable scientific text improvement.

major comments (2)

[Dataset construction] Dataset construction section: the manuscript states that the 140,000 pairs 'reflect realistic, section-level scientific revisions,' yet provides no breakdown of the fraction generated by human experts versus LLM prompting or templated rewriting. Without this information it is impossible to evaluate the risk that fine-tuned models overfit to synthetic artifacts, which would undermine the claim of transfer to unseen papers and domains.
[Experiments] Experiments section: the automated preference assessments and human evaluations are reported to show outperformance, but the text does not specify the exact data splits, baseline implementation details, or inter-annotator agreement statistics. These omissions leave open the possibility that post-hoc choices inflate the reported gains relative to same-scale open models.

minor comments (2)

[Introduction] The abstract and introduction use the term 'context-aware' without an explicit operational definition or pointer to the modeling component that implements it; a short clarifying sentence would improve readability.
[Figures] Figure captions for the human evaluation results should include the exact number of evaluators, papers sampled, and statistical test used to support the preference claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will make the necessary revisions to improve clarity and transparency.

read point-by-point responses

Referee: [Dataset construction] Dataset construction section: the manuscript states that the 140,000 pairs 'reflect realistic, section-level scientific revisions,' yet provides no breakdown of the fraction generated by human experts versus LLM prompting or templated rewriting. Without this information it is impossible to evaluate the risk that fine-tuned models overfit to synthetic artifacts, which would undermine the claim of transfer to unseen papers and domains.

Authors: We acknowledge that the current manuscript does not provide a quantitative breakdown of how the 140,000 instruction-response pairs were generated. The Dataset Construction section will be revised to include this information, specifying the proportions derived from human expert revisions extracted from paper histories versus those created via LLM prompting or templated methods, along with the quality assurance steps employed to ensure the revisions remain realistic and section-level. This addition will directly address concerns about potential overfitting to synthetic patterns and support the claims of transfer to unseen papers. revision: yes
Referee: [Experiments] Experiments section: the automated preference assessments and human evaluations are reported to show outperformance, but the text does not specify the exact data splits, baseline implementation details, or inter-annotator agreement statistics. These omissions leave open the possibility that post-hoc choices inflate the reported gains relative to same-scale open models.

Authors: We agree that the experimental reporting requires greater specificity to ensure reproducibility and to rule out any perception of post-hoc selection. The revised Experiments section and appendix will explicitly state the paper-level data splits used, provide full implementation details and hyperparameters for each baseline, and report inter-annotator agreement statistics for the human evaluations. These clarifications will strengthen the validity of the performance comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on new dataset and external benchmarks

full rationale

The paper constructs a fresh dataset of 140k instruction-response pairs from 7k papers and fine-tunes open-source models, then reports performance against same-scale baselines and proprietary models via automated and human evaluations. No equations, fitted parameters, or predictions are defined in terms of the target result itself. No self-citation chain is load-bearing for the central claim, and the derivation is self-contained against external comparisons rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions about LLM fine-tuning effectiveness and the representativeness of the new dataset; no new physical entities or ad-hoc constants are introduced.

axioms (1)

domain assumption Instruction-tuned LLMs can learn to perform context-aware text revisions when trained on section-level human revision examples.
Invoked implicitly when describing the fine-tuning process and expected generalization.

pith-pipeline@v0.9.0 · 5796 in / 1195 out tokens · 39619 ms · 2026-05-22T14:39:34.601619+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We curate ReviseQA ... 140,000 instruction–response pairs ... 20 section-level revision criteria ... Controllable Post-Training (CPT) objective L_CPT(θ) = −E log P_θ(ˆp | q, T, p)
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Context-Aware Modeling ... full document context T must be an explicit input

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI for Auto-Research: Roadmap & User Guide
cs.AI 2026-05 unverdicted novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · cited by 1 Pith paper · 14 internal anchors

[1]

From hypothesis to publication: A comprehensive survey of AI-driven research support systems

Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, and Bing Qin. From hypothesis to publication: A comprehensive survey of AI-driven research support systems. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors...

work page doi:10.18653/v1/2025.findings-emnlp.631 2025
[2]

Towards an AI co-scientist

Juraj Gottweis et al. Towards an ai co-scientist, 2025. URL https://arxiv.org/abs/2502.18864

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Weld and Doug Downey and Wen

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hannaneh Hajishirzi. Openscholar: ...

work page arXiv 2024
[4]

Cycleresearcher: Improving automated research via automated review

Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cycleresearcher: Improving automated research via automated review. In The Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=bjcsVLoHYs

work page 2025
[5]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Llm×mapreduce-v2: Entropy-driven convolutional test-time scaling for generating long-form articles from extremely long resources, 2025

Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, and Maosong Sun. Llm×mapreduce-v2: Entropy-driven convolutional test-time scaling for generating long-form articles from extremely long resources, 2025. URL https://arxiv.org/abs/2504.05732

work page arXiv 2025
[7]

Using artificial intelligence in academic writing and research: An essential productivity tool.Computer Methods and Programs in Biomedicine Update, 5:100145, 2024

Mohamed Khalifa and Mona Albadawy. Using artificial intelligence in academic writing and research: An essential productivity tool.Computer Methods and Programs in Biomedicine Update, 5:100145, 2024

work page 2024
[8]

How are researchers using ai? survey reveals pros and cons for science

Miryam Naddaf. How are researchers using ai? survey reveals pros and cons for science. Nature, 2025. doi: 10.1038/d41586-025-00343-5

work page doi:10.1038/d41586-025-00343-5 2025
[9]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

University of Chicago Press, 2017

Scott L Montgomery.The Chicago guide to communicating science. University of Chicago Press, 2017

work page 2017
[11]

OUP USA, 2012

Joshua Schimel.Writing science: how to write papers that get cited and proposals that get funded. OUP USA, 2012

work page 2012
[12]

Liveclin: A live clinical benchmark without leakage

Xidong Wang, Guo shuqi, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, and Benyou Wang. Liveclin: A live clinical benchmark without leakage. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/ forum?id=E0WSAugJ0j

work page 2026
[13]

Iclr 2025 reviewer guide

ICLR. Iclr 2025 reviewer guide. https://iclr.cc/Conferences/2025/ReviewerGuide, 2025. Accessed: 2025-04-20

work page 2025
[14]

Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning.arXiv preprint arXiv:2409.05556, 2024

Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning.arXiv preprint arXiv:2409.05556, 2024

work page arXiv 2024
[15]

DeepReview: Improving LLM-based paper review with human-like deep thinking process

Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. DeepReview: Improving LLM-based paper review with human-like deep thinking process. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29330–29355, V...

work page doi:10.18653/v1/2025.acl-long.1420 2025
[16]

Texgpt: Harness the power of chatgpt in overleaf, 2023

Hilde van Zeeland. Texgpt: Harness the power of chatgpt in overleaf, 2023. URL https: //blog.writefull.com/texgpt-harness-the-power-of-chatgpt-in-overleaf/

work page 2023
[17]

Tips for writing technical papers

Jennifer Widom. Tips for writing technical papers. https://cs.stanford.edu/people/widom/paper- writing.html, 2006. Accessed: 2025-04-20. 17

work page 2006
[18]

PaperRobot: Incremental draft generation of scientific ideas

Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, and Yi Luan. PaperRobot: Incremental draft generation of scientific ideas. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1980–1991, Florence, Italy, July 2019. Association ...

work page doi:10.18653/v1/p19-1191 1980
[19]

Tal August, Katharina Reinecke, and Noah A. Smith. Generating scientific definitions with controllable complexity. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8298–8317, Dublin, Ireland, May 2022. Association for...

work page doi:10.18653/v1/2022.acl-long.569 2022
[20]

Assisting in writing Wikipedia-like articles from scratch with large language models

Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing Wikipedia-like articles from scratch with large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T...

work page doi:10.18653/v1/2024.naacl-long.347 2024
[21]

Into the unknown unknowns: Engaged human learning through participation in language model agent conversations

Yucheng Jiang, Yijia Shao, Dekun Ma, Sina Semnani, and Monica Lam. Into the unknown unknowns: Engaged human learning through participation in language model agent conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9917– 9955, Miami, Flor...

work page doi:10.18653/v1/2024.emnlp-main.554 2024
[22]

Autonomous llm-driven research from data to human-verifiable research papers, 2024

Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404. 17605

work page 2024
[23]

Agent laboratory: Using LLM agents as research assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pag...

work page doi:10.18653/v1/2025.findings-emnlp.320 2025
[24]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Cursor - the ai code editor, 2024

Cursor. Cursor - the ai code editor, 2024. URL https://www.cursor.com/

work page 2024
[26]

Cheng Tan, Dongxin Lyu, Siyuan Li, Zhangyang Gao, Jingxuan Wei, Siqi Ma, Zicheng Liu, and Stan Z. Li. Peer review as a multi-turn and long-context dialogue with role-based interactions: Benchmarking large language models, 2025. URL https://openreview.net/forum? id=uV3Gdoq2ez. 18

work page 2025
[27]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2024. URL https://arxiv.org/abs/ 2404.04475

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Gonzalez, and Ion Stoica

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openre...

work page 2023
[29]

Alpacaeval: An automatic evaluator for instruction-following language models,

Tatsu-lab. Alpacaeval: An automatic evaluator for instruction-following language models,

work page
[30]

URL https://github.com/tatsu-lab/alpaca_eval

work page
[31]

Benchmarking cognitive biases in large language models as evaluators

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. InFindings of the Association for Computational Linguistics: ACL 2024, pages 517–545, 2024

work page 2024
[32]

Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback

Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, and Xuanjing Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2859–2873, 2023

work page 2023
[33]

Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing

Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, and Bingsheng He. Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing. In Companion Proceedings of the ACM on Web Conference 2026, WWW ’26. Association for Computing Machinery, 2026

work page 2026
[34]

Nougat: Neural optical understanding for academic documents

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fUtxNAKpdV

work page 2024
[35]

Readoc: A unified benchmark for realistic document structured extraction, 2024

Zichao Li, Aizier Abulaiti, Yaojie Lu, Xuanang Chen, Jia Zheng, Hongyu Lin, Xianpei Han, and Le Sun. Readoc: A unified benchmark for realistic document structured extraction, 2024. URL https://arxiv.org/abs/2409.05137

work page arXiv 2024
[36]

The hallucinations leaderboard - an open effort to measure hallucinations in large language models.CoRR, abs/2404.05904, 2024

Giwon Hong, Aryo Pradipta Gema, Rohit Saxena, Xiaotang Du, Ping Nie, Yu Zhao, Laura Perez-Beltrachini, Max Ryabinin, Xuanli He, Clémentine Fourrier, and Pasquale Minervini. The hallucinations leaderboard - an open effort to measure hallucinations in large language models.CoRR, abs/2404.05904, 2024. doi: 10.48550/arXiv.2404.05904. URL https://doi.org/ 10.4...

work page doi:10.48550/arxiv.2404.05904 2024
[37]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/ forum?id=nZeVKeeFYf9

work page 2022
[39]

Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY , USA, 2023. Association for 19 Computing Mac...

work page doi:10.1145/3600006.3613165 2023
[40]

Fast-detectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature

Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. Fast-detectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=Bpcgcr8E8Z

work page 2024
[41]

Spotting LLMs with binoculars: Zero- shot detection of machine-generated text

Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Spotting LLMs with binoculars: Zero- shot detection of machine-generated text. InForty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=axl3FAkpik

work page 2024
[42]

The llama 3 herd of models, 2024

Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407. 21783

work page 2024
[43]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Autogen: A personalized large language model for academic enhancement—ethics and proof of principle.The American Journal of Bioethics, 23(10):28–41, 2023

Sebastian Porsdam Mann, Brian D Earp, Nikolaj Møller, Suren Vynn, and Julian Savulescu. Autogen: A personalized large language model for academic enhancement—ethics and proof of principle.The American Journal of Bioethics, 23(10):28–41, 2023

work page 2023
[47]

ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models , booktitle =

Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human ...

work page doi:10.18653/v1/2025.naacl-long.342 2025
[48]

Chain of ideas: Revolutionizing research via novel idea development with LLM agents

Long Li et al. Chain of ideas: Revolutionizing research via novel idea development with LLM agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8971–9004, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN ...

work page doi:10.18653/v1/2025.findings-emnlp.477 2025
[49]

Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers

Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id= M23dTGWCZy

work page 2025
[50]

Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research, 2024

Tianyang Gu, Jingjin Wang, Zhihao Zhang, and HaoHong Li. Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research, 2024. URL https://arxiv. org/abs/2412.14141. 20

work page arXiv 2024
[51]

Marg: Multi-agent review generation for scientific papers

Mike D’Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers.arXiv preprint arXiv:2401.04259, 2024

work page arXiv 2024
[52]

Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8): AIoa2400196, 2024

Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas V odrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8): AIoa2400196, 2024

work page 2024
[53]

Scholarchemqa: Unveiling the power of language models in chemical research question answering.arXiv preprint arXiv:2407.16931, 2024

Xiuying Chen, Tairan Wang, Taicheng Guo, Kehan Guo, Juexiao Zhou, Haoyang Li, Mingchen Zhuge, Jürgen Schmidhuber, Xin Gao, and Xiangliang Zhang. Scholarchemqa: Unveiling the power of language models in chemical research question answering.arXiv preprint arXiv:2407.16931, 2024

work page arXiv 2024
[54]

Paperqa: Retrieval-augmented generative agent for scientific research,

Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research.arXiv preprint arXiv:2312.07559, 2023

work page arXiv 2023
[55]

CS-bench: A comprehensive benchmark for large language models towards computer science mastery

Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, and Weiran Xu. CS-bench: A comprehensive benchmark for large language models towards computer science mastery. InThe Thirteenth International Conference on Learning ...

work page 2025
[56]

Biokgbench: A knowledge graph checking benchmark of ai agent for biomedical science.arXiv preprint arXiv:2407.00466, 2024

Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z Li, and Kaicheng Yu. Biokgbench: A knowledge graph checking benchmark of ai agent for biomedical science.arXiv preprint arXiv:2407.00466, 2024

work page arXiv 2024
[57]

Cowriter - your ai platform for speeding up creative writing, 2025

CoWriter. Cowriter - your ai platform for speeding up creative writing, 2025. URL https: //cowriter.org

work page 2025
[58]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[59]

Codegen: An open large language model for code with multi-turn program synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_

work page 2023
[60]

Towards injecting medical visual knowledge into multimodal LLMs at scale

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, and Benyou Wang. Towards injecting medical visual knowledge into multimodal LLMs at scale. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Languag...

work page doi:10.18653/v1/2024.emnlp-main.418 2024
[61]

How ai ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment.arXiv preprint arXiv:2401.13481, 2024

Joshua Ashkinaze, Julia Mendelsohn, Li Qiwei, Ceren Budak, and Eric Gilbert. How ai ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment.arXiv preprint arXiv:2401.13481, 2024. 21

work page arXiv 2024
[62]

How ai processing delays foster creativity: Exploring research question co- creation with an llm-based agent

Yiren Liu, Si Chen, Haocong Cheng, Mengxia Yu, Xiao Ran, Andrew Mo, Yiliu Tang, and Yun Huang. How ai processing delays foster creativity: Exploring research question co- creation with an llm-based agent. InProceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–25, 2024

work page 2024
[63]

Does writing with language models reduce content diversity? InThe Twelfth International Conference on Learning Representations, 2024

Vishakh Padmakumar and He He. Does writing with language models reduce content diversity? InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[64]

Llms are vulnerable to malicious prompts disguised as scientific language, 2025

Yubin Ge, Neeraja Kirtane, Hao Peng, and Dilek Hakkani-Tür. Llms are vulnerable to malicious prompts disguised as scientific language, 2025. URL https://arxiv.org/abs/2501.14073

work page arXiv 2025
[65]

Position: The current ai conference model is unsustainable! diagnosing the crisis of centralized ai conference, 2025

Nuo Chen, Moming Duan, Andre Huikai Lin, Qian Wang, Jiaying Wu, and Bingsheng He. Position: The current ai conference model is unsustainable! diagnosing the crisis of centralized ai conference, 2025. URL https://arxiv.org/abs/2508.04586

work page arXiv 2025
[66]

MAIR: A massive benchmark for evaluating instructed retrieval

Weiwei Sun, Zhengliang Shi, Wu Jiu Long, Lingyong Yan, Xinyu Ma, Yiding Liu, Min Cao, Dawei Yin, and Zhaochun Ren. MAIR: A massive benchmark for evaluating instructed retrieval. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14044– 14067, Miami, Fl...

work page doi:10.18653/v1/2024.emnlp-main.778 2024
[67]

Towards a unified framework for reference retrieval and related work generation

Zhengliang Shi, Shen Gao, Zhen Zhang, Xiuying Chen, Zhumin Chen, Pengjie Ren, and Zhaochun Ren. Towards a unified framework for reference retrieval and related work generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5785–5799, Singapore, December 2023. Association ...

work page doi:10.18653/v1/2023.findings-emnlp.385 2023
[68]

Evoscientist: Towards multi-agent evolving ai scientists for end-to-end scientific discovery.arXiv preprint arXiv:2603.08127, 2026

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. Evoscientist: Towards multi-agent evolving ai scientists for end-to-end scientific discovery, 2026. URL https://arxiv.org/abs/2603.08127

work page arXiv 2026
[69]

Gupta, Neereja Sundaresan, Thomas Alexander, Christopher J

Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, James Zou, et al. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646:716–723, 2025. doi: 10.1038/s41586- 025-09442-9

work page doi:10.1038/s41586- 2025
[70]

Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller

Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools.Nature Machine Intelligence, pages 1–11, 2024

work page 2024
[71]

Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

work page 2023
[72]

LitSearch: A retrieval benchmark for scientific literature search

Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. LitSearch: A retrieval benchmark for scientific literature search. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15068–15083, Miami, Florida, USA, November 202...

work page doi:10.18653/v1/2024.emnlp- 2024
[73]

Researcharena: Benchmarking llms’ ability to collect and organize information as research agents.arXiv preprint arXiv:2406.10291, 2024

Hao Kang and Chenyan Xiong. Researcharena: Benchmarking llms’ ability to collect and organize information as research agents.arXiv preprint arXiv:2406.10291, 2024. 22

work page arXiv 2024
[74]

Citeme: Can language models accurately cite scientific claims?arXiv preprint arXiv:2407.12861, 2024

Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, and Matthias Bethge. Citeme: Can language models accurately cite scientific claims?arXiv preprint arXiv:2407.12861, 2024

work page arXiv 2024
[75]

ScilitLLM: How to adapt LLMs for scientific literature understanding

Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, and Hengxing Cai. ScilitLLM: How to adapt LLMs for scientific literature understanding. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=8dzKkeWUUb

work page 2025
[76]

Art or artifice? large language models and the false promise of creativity

Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. Art or artifice? large language models and the false promise of creativity. InProceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–34, 2024

work page 2024
[77]

Homogenization effects of large language models on human creative ideation

Barrett R Anderson, Jash Hemant Shah, and Max Kreminski. Homogenization effects of large language models on human creative ideation. InProceedings of the 16th Conference on Creativity & Cognition, pages 413–425, 2024

work page 2024
[78]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. InIn the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), UIST ’23, New York, NY , USA, 2023. Association for Computing Machinery

work page 2023
[79]

Agentsims: An open-source sandbox for large language model evaluation.arXiv preprint arXiv:2308.04026, 2023

Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. Agentsims: An open-source sandbox for large language model evaluation.arXiv preprint arXiv:2308.04026, 2023

work page arXiv 2023
[80]

PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator

Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, and Benyou Wang. PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7841–7863, Bangkok, Thailand, August 2024. Assoc...

work page doi:10.18653/v1/2024.acl-long.424 2024
[81]

Megaagent: A practical framework for autonomous cooperation in large-scale llm agent systems, 2024

Qian Wang, Tianyu Wang, Qinbin Li, Jingsheng Liang, and Bingsheng He. Megaagent: A practical framework for autonomous cooperation in large-scale llm agent systems, 2024. URL https://arxiv.org/abs/2408.09955

work page arXiv 2024

Showing first 80 references.

[1] [1]

From hypothesis to publication: A comprehensive survey of AI-driven research support systems

Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, and Bing Qin. From hypothesis to publication: A comprehensive survey of AI-driven research support systems. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors...

work page doi:10.18653/v1/2025.findings-emnlp.631 2025

[2] [2]

Towards an AI co-scientist

Juraj Gottweis et al. Towards an ai co-scientist, 2025. URL https://arxiv.org/abs/2502.18864

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Weld and Doug Downey and Wen

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hannaneh Hajishirzi. Openscholar: ...

work page arXiv 2024

[4] [4]

Cycleresearcher: Improving automated research via automated review

Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cycleresearcher: Improving automated research via automated review. In The Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=bjcsVLoHYs

work page 2025

[5] [5]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Llm×mapreduce-v2: Entropy-driven convolutional test-time scaling for generating long-form articles from extremely long resources, 2025

Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, and Maosong Sun. Llm×mapreduce-v2: Entropy-driven convolutional test-time scaling for generating long-form articles from extremely long resources, 2025. URL https://arxiv.org/abs/2504.05732

work page arXiv 2025

[7] [7]

Using artificial intelligence in academic writing and research: An essential productivity tool.Computer Methods and Programs in Biomedicine Update, 5:100145, 2024

Mohamed Khalifa and Mona Albadawy. Using artificial intelligence in academic writing and research: An essential productivity tool.Computer Methods and Programs in Biomedicine Update, 5:100145, 2024

work page 2024

[8] [8]

How are researchers using ai? survey reveals pros and cons for science

Miryam Naddaf. How are researchers using ai? survey reveals pros and cons for science. Nature, 2025. doi: 10.1038/d41586-025-00343-5

work page doi:10.1038/d41586-025-00343-5 2025

[9] [9]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

University of Chicago Press, 2017

Scott L Montgomery.The Chicago guide to communicating science. University of Chicago Press, 2017

work page 2017

[11] [11]

OUP USA, 2012

Joshua Schimel.Writing science: how to write papers that get cited and proposals that get funded. OUP USA, 2012

work page 2012

[12] [12]

Liveclin: A live clinical benchmark without leakage

Xidong Wang, Guo shuqi, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, and Benyou Wang. Liveclin: A live clinical benchmark without leakage. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/ forum?id=E0WSAugJ0j

work page 2026

[13] [13]

Iclr 2025 reviewer guide

ICLR. Iclr 2025 reviewer guide. https://iclr.cc/Conferences/2025/ReviewerGuide, 2025. Accessed: 2025-04-20

work page 2025

[14] [14]

Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning.arXiv preprint arXiv:2409.05556, 2024

Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning.arXiv preprint arXiv:2409.05556, 2024

work page arXiv 2024

[15] [15]

DeepReview: Improving LLM-based paper review with human-like deep thinking process

Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. DeepReview: Improving LLM-based paper review with human-like deep thinking process. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29330–29355, V...

work page doi:10.18653/v1/2025.acl-long.1420 2025

[16] [16]

Texgpt: Harness the power of chatgpt in overleaf, 2023

Hilde van Zeeland. Texgpt: Harness the power of chatgpt in overleaf, 2023. URL https: //blog.writefull.com/texgpt-harness-the-power-of-chatgpt-in-overleaf/

work page 2023

[17] [17]

Tips for writing technical papers

Jennifer Widom. Tips for writing technical papers. https://cs.stanford.edu/people/widom/paper- writing.html, 2006. Accessed: 2025-04-20. 17

work page 2006

[18] [18]

PaperRobot: Incremental draft generation of scientific ideas

Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, and Yi Luan. PaperRobot: Incremental draft generation of scientific ideas. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1980–1991, Florence, Italy, July 2019. Association ...

work page doi:10.18653/v1/p19-1191 1980

[19] [19]

Tal August, Katharina Reinecke, and Noah A. Smith. Generating scientific definitions with controllable complexity. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8298–8317, Dublin, Ireland, May 2022. Association for...

work page doi:10.18653/v1/2022.acl-long.569 2022

[20] [20]

Assisting in writing Wikipedia-like articles from scratch with large language models

Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing Wikipedia-like articles from scratch with large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T...

work page doi:10.18653/v1/2024.naacl-long.347 2024

[21] [21]

Into the unknown unknowns: Engaged human learning through participation in language model agent conversations

Yucheng Jiang, Yijia Shao, Dekun Ma, Sina Semnani, and Monica Lam. Into the unknown unknowns: Engaged human learning through participation in language model agent conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9917– 9955, Miami, Flor...

work page doi:10.18653/v1/2024.emnlp-main.554 2024

[22] [22]

Autonomous llm-driven research from data to human-verifiable research papers, 2024

Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404. 17605

work page 2024

[23] [23]

Agent laboratory: Using LLM agents as research assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pag...

work page doi:10.18653/v1/2025.findings-emnlp.320 2025

[24] [24]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Cursor - the ai code editor, 2024

Cursor. Cursor - the ai code editor, 2024. URL https://www.cursor.com/

work page 2024

[26] [26]

Cheng Tan, Dongxin Lyu, Siyuan Li, Zhangyang Gao, Jingxuan Wei, Siqi Ma, Zicheng Liu, and Stan Z. Li. Peer review as a multi-turn and long-context dialogue with role-based interactions: Benchmarking large language models, 2025. URL https://openreview.net/forum? id=uV3Gdoq2ez. 18

work page 2025

[27] [27]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2024. URL https://arxiv.org/abs/ 2404.04475

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Gonzalez, and Ion Stoica

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openre...

work page 2023

[29] [29]

Alpacaeval: An automatic evaluator for instruction-following language models,

Tatsu-lab. Alpacaeval: An automatic evaluator for instruction-following language models,

work page

[30] [30]

URL https://github.com/tatsu-lab/alpaca_eval

work page

[31] [31]

Benchmarking cognitive biases in large language models as evaluators

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. InFindings of the Association for Computational Linguistics: ACL 2024, pages 517–545, 2024

work page 2024

[32] [32]

Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback

Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, and Xuanjing Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2859–2873, 2023

work page 2023

[33] [33]

Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing

Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, and Bingsheng He. Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing. In Companion Proceedings of the ACM on Web Conference 2026, WWW ’26. Association for Computing Machinery, 2026

work page 2026

[34] [34]

Nougat: Neural optical understanding for academic documents

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fUtxNAKpdV

work page 2024

[35] [35]

Readoc: A unified benchmark for realistic document structured extraction, 2024

Zichao Li, Aizier Abulaiti, Yaojie Lu, Xuanang Chen, Jia Zheng, Hongyu Lin, Xianpei Han, and Le Sun. Readoc: A unified benchmark for realistic document structured extraction, 2024. URL https://arxiv.org/abs/2409.05137

work page arXiv 2024

[36] [36]

The hallucinations leaderboard - an open effort to measure hallucinations in large language models.CoRR, abs/2404.05904, 2024

Giwon Hong, Aryo Pradipta Gema, Rohit Saxena, Xiaotang Du, Ping Nie, Yu Zhao, Laura Perez-Beltrachini, Max Ryabinin, Xuanli He, Clémentine Fourrier, and Pasquale Minervini. The hallucinations leaderboard - an open effort to measure hallucinations in large language models.CoRR, abs/2404.05904, 2024. doi: 10.48550/arXiv.2404.05904. URL https://doi.org/ 10.4...

work page doi:10.48550/arxiv.2404.05904 2024

[37] [37]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/ forum?id=nZeVKeeFYf9

work page 2022

[39] [39]

Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY , USA, 2023. Association for 19 Computing Mac...

work page doi:10.1145/3600006.3613165 2023

[40] [40]

Fast-detectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature

Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. Fast-detectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=Bpcgcr8E8Z

work page 2024

[41] [41]

Spotting LLMs with binoculars: Zero- shot detection of machine-generated text

Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Spotting LLMs with binoculars: Zero- shot detection of machine-generated text. InForty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=axl3FAkpik

work page 2024

[42] [42]

The llama 3 herd of models, 2024

Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407. 21783

work page 2024

[43] [43]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [45]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [46]

Autogen: A personalized large language model for academic enhancement—ethics and proof of principle.The American Journal of Bioethics, 23(10):28–41, 2023

Sebastian Porsdam Mann, Brian D Earp, Nikolaj Møller, Suren Vynn, and Julian Savulescu. Autogen: A personalized large language model for academic enhancement—ethics and proof of principle.The American Journal of Bioethics, 23(10):28–41, 2023

work page 2023

[46] [47]

ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models , booktitle =

Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human ...

work page doi:10.18653/v1/2025.naacl-long.342 2025

[47] [48]

Chain of ideas: Revolutionizing research via novel idea development with LLM agents

Long Li et al. Chain of ideas: Revolutionizing research via novel idea development with LLM agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8971–9004, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN ...

work page doi:10.18653/v1/2025.findings-emnlp.477 2025

[48] [49]

Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers

Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id= M23dTGWCZy

work page 2025

[49] [50]

Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research, 2024

Tianyang Gu, Jingjin Wang, Zhihao Zhang, and HaoHong Li. Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research, 2024. URL https://arxiv. org/abs/2412.14141. 20

work page arXiv 2024

[50] [51]

Marg: Multi-agent review generation for scientific papers

Mike D’Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers.arXiv preprint arXiv:2401.04259, 2024

work page arXiv 2024

[51] [52]

Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8): AIoa2400196, 2024

Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas V odrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8): AIoa2400196, 2024

work page 2024

[52] [53]

Scholarchemqa: Unveiling the power of language models in chemical research question answering.arXiv preprint arXiv:2407.16931, 2024

Xiuying Chen, Tairan Wang, Taicheng Guo, Kehan Guo, Juexiao Zhou, Haoyang Li, Mingchen Zhuge, Jürgen Schmidhuber, Xin Gao, and Xiangliang Zhang. Scholarchemqa: Unveiling the power of language models in chemical research question answering.arXiv preprint arXiv:2407.16931, 2024

work page arXiv 2024

[53] [54]

Paperqa: Retrieval-augmented generative agent for scientific research,

Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research.arXiv preprint arXiv:2312.07559, 2023

work page arXiv 2023

[54] [55]

CS-bench: A comprehensive benchmark for large language models towards computer science mastery

Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, and Weiran Xu. CS-bench: A comprehensive benchmark for large language models towards computer science mastery. InThe Thirteenth International Conference on Learning ...

work page 2025

[55] [56]

Biokgbench: A knowledge graph checking benchmark of ai agent for biomedical science.arXiv preprint arXiv:2407.00466, 2024

Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z Li, and Kaicheng Yu. Biokgbench: A knowledge graph checking benchmark of ai agent for biomedical science.arXiv preprint arXiv:2407.00466, 2024

work page arXiv 2024

[56] [57]

Cowriter - your ai platform for speeding up creative writing, 2025

CoWriter. Cowriter - your ai platform for speeding up creative writing, 2025. URL https: //cowriter.org

work page 2025

[57] [58]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[58] [59]

Codegen: An open large language model for code with multi-turn program synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_

work page 2023

[59] [60]

Towards injecting medical visual knowledge into multimodal LLMs at scale

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, and Benyou Wang. Towards injecting medical visual knowledge into multimodal LLMs at scale. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Languag...

work page doi:10.18653/v1/2024.emnlp-main.418 2024

[60] [61]

How ai ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment.arXiv preprint arXiv:2401.13481, 2024

Joshua Ashkinaze, Julia Mendelsohn, Li Qiwei, Ceren Budak, and Eric Gilbert. How ai ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment.arXiv preprint arXiv:2401.13481, 2024. 21

work page arXiv 2024

[61] [62]

How ai processing delays foster creativity: Exploring research question co- creation with an llm-based agent

Yiren Liu, Si Chen, Haocong Cheng, Mengxia Yu, Xiao Ran, Andrew Mo, Yiliu Tang, and Yun Huang. How ai processing delays foster creativity: Exploring research question co- creation with an llm-based agent. InProceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–25, 2024

work page 2024

[62] [63]

Does writing with language models reduce content diversity? InThe Twelfth International Conference on Learning Representations, 2024

Vishakh Padmakumar and He He. Does writing with language models reduce content diversity? InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[63] [64]

Llms are vulnerable to malicious prompts disguised as scientific language, 2025

Yubin Ge, Neeraja Kirtane, Hao Peng, and Dilek Hakkani-Tür. Llms are vulnerable to malicious prompts disguised as scientific language, 2025. URL https://arxiv.org/abs/2501.14073

work page arXiv 2025

[64] [65]

Position: The current ai conference model is unsustainable! diagnosing the crisis of centralized ai conference, 2025

Nuo Chen, Moming Duan, Andre Huikai Lin, Qian Wang, Jiaying Wu, and Bingsheng He. Position: The current ai conference model is unsustainable! diagnosing the crisis of centralized ai conference, 2025. URL https://arxiv.org/abs/2508.04586

work page arXiv 2025

[65] [66]

MAIR: A massive benchmark for evaluating instructed retrieval

Weiwei Sun, Zhengliang Shi, Wu Jiu Long, Lingyong Yan, Xinyu Ma, Yiding Liu, Min Cao, Dawei Yin, and Zhaochun Ren. MAIR: A massive benchmark for evaluating instructed retrieval. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14044– 14067, Miami, Fl...

work page doi:10.18653/v1/2024.emnlp-main.778 2024

[66] [67]

Towards a unified framework for reference retrieval and related work generation

Zhengliang Shi, Shen Gao, Zhen Zhang, Xiuying Chen, Zhumin Chen, Pengjie Ren, and Zhaochun Ren. Towards a unified framework for reference retrieval and related work generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5785–5799, Singapore, December 2023. Association ...

work page doi:10.18653/v1/2023.findings-emnlp.385 2023

[67] [68]

Evoscientist: Towards multi-agent evolving ai scientists for end-to-end scientific discovery.arXiv preprint arXiv:2603.08127, 2026

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. Evoscientist: Towards multi-agent evolving ai scientists for end-to-end scientific discovery, 2026. URL https://arxiv.org/abs/2603.08127

work page arXiv 2026

[68] [69]

Gupta, Neereja Sundaresan, Thomas Alexander, Christopher J

Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, James Zou, et al. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646:716–723, 2025. doi: 10.1038/s41586- 025-09442-9

work page doi:10.1038/s41586- 2025

[69] [70]

Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller

Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools.Nature Machine Intelligence, pages 1–11, 2024

work page 2024

[70] [71]

Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

work page 2023

[71] [72]

LitSearch: A retrieval benchmark for scientific literature search

Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. LitSearch: A retrieval benchmark for scientific literature search. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15068–15083, Miami, Florida, USA, November 202...

work page doi:10.18653/v1/2024.emnlp- 2024

[72] [73]

Researcharena: Benchmarking llms’ ability to collect and organize information as research agents.arXiv preprint arXiv:2406.10291, 2024

Hao Kang and Chenyan Xiong. Researcharena: Benchmarking llms’ ability to collect and organize information as research agents.arXiv preprint arXiv:2406.10291, 2024. 22

work page arXiv 2024

[73] [74]

Citeme: Can language models accurately cite scientific claims?arXiv preprint arXiv:2407.12861, 2024

Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, and Matthias Bethge. Citeme: Can language models accurately cite scientific claims?arXiv preprint arXiv:2407.12861, 2024

work page arXiv 2024

[74] [75]

ScilitLLM: How to adapt LLMs for scientific literature understanding

Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, and Hengxing Cai. ScilitLLM: How to adapt LLMs for scientific literature understanding. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=8dzKkeWUUb

work page 2025

[75] [76]

Art or artifice? large language models and the false promise of creativity

Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. Art or artifice? large language models and the false promise of creativity. InProceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–34, 2024

work page 2024

[76] [77]

Homogenization effects of large language models on human creative ideation

Barrett R Anderson, Jash Hemant Shah, and Max Kreminski. Homogenization effects of large language models on human creative ideation. InProceedings of the 16th Conference on Creativity & Cognition, pages 413–425, 2024

work page 2024

[77] [78]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. InIn the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), UIST ’23, New York, NY , USA, 2023. Association for Computing Machinery

work page 2023

[78] [79]

Agentsims: An open-source sandbox for large language model evaluation.arXiv preprint arXiv:2308.04026, 2023

Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. Agentsims: An open-source sandbox for large language model evaluation.arXiv preprint arXiv:2308.04026, 2023

work page arXiv 2023

[79] [80]

PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator

Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, and Benyou Wang. PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7841–7863, Bangkok, Thailand, August 2024. Assoc...

work page doi:10.18653/v1/2024.acl-long.424 2024

[80] [81]

Megaagent: A practical framework for autonomous cooperation in large-scale llm agent systems, 2024

Qian Wang, Tianyu Wang, Qinbin Li, Jingsheng Liang, and Bingsheng He. Megaagent: A practical framework for autonomous cooperation in large-scale llm agent systems, 2024. URL https://arxiv.org/abs/2408.09955

work page arXiv 2024