XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration
Pith reviewed 2026-05-22 14:39 UTC · model grok-4.3
The pith
XtraGPT fine-tunes open LLMs on 140,000 real revision pairs so they can handle context-aware, intent-guided edits to scientific papers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a human-AI collaboration framework built on criteria-guided intent alignment and context-aware modeling, when applied to a dataset of 140,000 instruction-response pairs drawn from 7,000 top-venue papers, produces open-source LLMs (1.5B to 14B parameters) that outperform same-scale baselines and approach proprietary model quality at improving scientific drafts, as measured by both automated preference tests and human evaluations.
What carries the argument
The criteria-guided intent alignment and context-aware modeling mechanism that turns each section-level revision instruction into a controlled edit while referencing the surrounding paper content.
If this is right
- Open models become practical for iterative academic drafting without repeated calls to closed systems.
- Revision quality improves when the model receives explicit intent signals tied to section context rather than generic prompts.
- The same fine-tuning recipe can be applied at multiple model scales while preserving gains over baselines.
- Human preference data confirm that the revisions maintain both local accuracy and cross-section coherence.
Where Pith is reading between the lines
- Similar instruction sets could be built for related tasks such as responding to reviewer comments or condensing long papers.
- Embedding the models in writing interfaces might let authors steer revisions at the paragraph level in real time.
- If the revision patterns prove domain-general, the same data construction method could support non-academic technical writing.
Load-bearing premise
The curated set of 140,000 revision pairs from 7,000 papers accurately represents the kinds of edits that occur in real scientific writing and will transfer to unseen papers and research domains.
What would settle it
A blind side-by-side evaluation in which expert reviewers rate revisions of the same new paper sections produced by XtraGPT versus an untuned baseline of equal size, with no preference for XtraGPT.
Figures
read the original abstract
Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited in supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, for example, maintaining conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process that is not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision, centered on criteria-guided intent alignment and context-aware modeling. To validate the framework, we curate a dataset of 7,000 research papers from top-tier venues, annotated with 140,000 instruction--response pairs that reflect realistic, section-level scientific revisions. We instantiate the framework in XtraGPT, the first suite of open-source LLMs (1.5B to 14B parameters) specifically fine-tuned for context-aware academic paper revision. Extensive experiments show that XtraGPT significantly outperforms same-scale baselines and rivals the quality of proprietary counterparts. Both automated preference assessments and human evaluations confirm the effectiveness of XtraGPT in improving scientific drafts. Our code and models are available at https://github.com/Xtra-Computing/XtraGPT and https://huggingface.co/collections/Xtra-Computing/xtragpt.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a human-AI collaboration framework for academic paper revision that emphasizes criteria-guided intent alignment and context-aware modeling. It curates a dataset of 140,000 instruction-response pairs drawn from 7,000 top-tier papers to reflect section-level scientific revisions, then fine-tunes an open-source suite of LLMs (XtraGPT, 1.5B–14B parameters) on this data. Experiments claim that XtraGPT significantly outperforms same-scale baselines and approaches proprietary model quality, as measured by automated preference scores and human evaluations; code and models are released.
Significance. If the central claims hold after clarification of dataset construction and evaluation protocols, the work would supply the first openly available, scale-specific models for iterative, context-sensitive revision of scientific writing. The public release of code, models, and the 140k-pair dataset constitutes a concrete community resource that could accelerate research on controllable scientific text improvement.
major comments (2)
- [Dataset construction] Dataset construction section: the manuscript states that the 140,000 pairs 'reflect realistic, section-level scientific revisions,' yet provides no breakdown of the fraction generated by human experts versus LLM prompting or templated rewriting. Without this information it is impossible to evaluate the risk that fine-tuned models overfit to synthetic artifacts, which would undermine the claim of transfer to unseen papers and domains.
- [Experiments] Experiments section: the automated preference assessments and human evaluations are reported to show outperformance, but the text does not specify the exact data splits, baseline implementation details, or inter-annotator agreement statistics. These omissions leave open the possibility that post-hoc choices inflate the reported gains relative to same-scale open models.
minor comments (2)
- [Introduction] The abstract and introduction use the term 'context-aware' without an explicit operational definition or pointer to the modeling component that implements it; a short clarifying sentence would improve readability.
- [Figures] Figure captions for the human evaluation results should include the exact number of evaluators, papers sampled, and statistical test used to support the preference claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will make the necessary revisions to improve clarity and transparency.
read point-by-point responses
-
Referee: [Dataset construction] Dataset construction section: the manuscript states that the 140,000 pairs 'reflect realistic, section-level scientific revisions,' yet provides no breakdown of the fraction generated by human experts versus LLM prompting or templated rewriting. Without this information it is impossible to evaluate the risk that fine-tuned models overfit to synthetic artifacts, which would undermine the claim of transfer to unseen papers and domains.
Authors: We acknowledge that the current manuscript does not provide a quantitative breakdown of how the 140,000 instruction-response pairs were generated. The Dataset Construction section will be revised to include this information, specifying the proportions derived from human expert revisions extracted from paper histories versus those created via LLM prompting or templated methods, along with the quality assurance steps employed to ensure the revisions remain realistic and section-level. This addition will directly address concerns about potential overfitting to synthetic patterns and support the claims of transfer to unseen papers. revision: yes
-
Referee: [Experiments] Experiments section: the automated preference assessments and human evaluations are reported to show outperformance, but the text does not specify the exact data splits, baseline implementation details, or inter-annotator agreement statistics. These omissions leave open the possibility that post-hoc choices inflate the reported gains relative to same-scale open models.
Authors: We agree that the experimental reporting requires greater specificity to ensure reproducibility and to rule out any perception of post-hoc selection. The revised Experiments section and appendix will explicitly state the paper-level data splits used, provide full implementation details and hyperparameters for each baseline, and report inter-annotator agreement statistics for the human evaluations. These clarifications will strengthen the validity of the performance comparisons. revision: yes
Circularity Check
No circularity: empirical claims rest on new dataset and external benchmarks
full rationale
The paper constructs a fresh dataset of 140k instruction-response pairs from 7k papers and fine-tunes open-source models, then reports performance against same-scale baselines and proprietary models via automated and human evaluations. No equations, fitted parameters, or predictions are defined in terms of the target result itself. No self-citation chain is load-bearing for the central claim, and the derivation is self-contained against external comparisons rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Instruction-tuned LLMs can learn to perform context-aware text revisions when trained on section-level human revision examples.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We curate ReviseQA ... 140,000 instruction–response pairs ... 20 section-level revision criteria ... Controllable Post-Training (CPT) objective L_CPT(θ) = −E log P_θ(ˆp | q, T, p)
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Context-Aware Modeling ... full document context T must be an explicit input
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
AI for Auto-Research: Roadmap & User Guide
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
Reference graph
Works this paper leans on
-
[1]
From hypothesis to publication: A comprehensive survey of AI-driven research support systems
Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, and Bing Qin. From hypothesis to publication: A comprehensive survey of AI-driven research support systems. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors...
-
[2]
Juraj Gottweis et al. Towards an ai co-scientist, 2025. URL https://arxiv.org/abs/2502.18864
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hannaneh Hajishirzi. Openscholar: ...
-
[4]
Cycleresearcher: Improving automated research via automated review
Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cycleresearcher: Improving automated research via automated review. In The Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=bjcsVLoHYs
work page 2025
-
[5]
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, and Maosong Sun. Llm×mapreduce-v2: Entropy-driven convolutional test-time scaling for generating long-form articles from extremely long resources, 2025. URL https://arxiv.org/abs/2504.05732
-
[7]
Mohamed Khalifa and Mona Albadawy. Using artificial intelligence in academic writing and research: An essential productivity tool.Computer Methods and Programs in Biomedicine Update, 5:100145, 2024
work page 2024
-
[8]
How are researchers using ai? survey reveals pros and cons for science
Miryam Naddaf. How are researchers using ai? survey reveals pros and cons for science. Nature, 2025. doi: 10.1038/d41586-025-00343-5
-
[9]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
University of Chicago Press, 2017
Scott L Montgomery.The Chicago guide to communicating science. University of Chicago Press, 2017
work page 2017
-
[11]
Joshua Schimel.Writing science: how to write papers that get cited and proposals that get funded. OUP USA, 2012
work page 2012
-
[12]
Liveclin: A live clinical benchmark without leakage
Xidong Wang, Guo shuqi, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, and Benyou Wang. Liveclin: A live clinical benchmark without leakage. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/ forum?id=E0WSAugJ0j
work page 2026
-
[13]
ICLR. Iclr 2025 reviewer guide. https://iclr.cc/Conferences/2025/ReviewerGuide, 2025. Accessed: 2025-04-20
work page 2025
-
[14]
Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning.arXiv preprint arXiv:2409.05556, 2024
-
[15]
DeepReview: Improving LLM-based paper review with human-like deep thinking process
Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. DeepReview: Improving LLM-based paper review with human-like deep thinking process. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29330–29355, V...
-
[16]
Texgpt: Harness the power of chatgpt in overleaf, 2023
Hilde van Zeeland. Texgpt: Harness the power of chatgpt in overleaf, 2023. URL https: //blog.writefull.com/texgpt-harness-the-power-of-chatgpt-in-overleaf/
work page 2023
-
[17]
Tips for writing technical papers
Jennifer Widom. Tips for writing technical papers. https://cs.stanford.edu/people/widom/paper- writing.html, 2006. Accessed: 2025-04-20. 17
work page 2006
-
[18]
PaperRobot: Incremental draft generation of scientific ideas
Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, and Yi Luan. PaperRobot: Incremental draft generation of scientific ideas. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1980–1991, Florence, Italy, July 2019. Association ...
-
[19]
Tal August, Katharina Reinecke, and Noah A. Smith. Generating scientific definitions with controllable complexity. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8298–8317, Dublin, Ireland, May 2022. Association for...
-
[20]
Assisting in writing Wikipedia-like articles from scratch with large language models
Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing Wikipedia-like articles from scratch with large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T...
-
[21]
Yucheng Jiang, Yijia Shao, Dekun Ma, Sina Semnani, and Monica Lam. Into the unknown unknowns: Engaged human learning through participation in language model agent conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9917– 9955, Miami, Flor...
-
[22]
Autonomous llm-driven research from data to human-verifiable research papers, 2024
Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404. 17605
work page 2024
-
[23]
Agent laboratory: Using LLM agents as research assistants
Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pag...
-
[24]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Cursor - the ai code editor, 2024
Cursor. Cursor - the ai code editor, 2024. URL https://www.cursor.com/
work page 2024
-
[26]
Cheng Tan, Dongxin Lyu, Siyuan Li, Zhangyang Gao, Jingxuan Wei, Siqi Ma, Zicheng Liu, and Stan Z. Li. Peer review as a multi-turn and long-context dialogue with role-based interactions: Benchmarking large language models, 2025. URL https://openreview.net/forum? id=uV3Gdoq2ez. 18
work page 2025
-
[27]
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2024. URL https://arxiv.org/abs/ 2404.04475
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openre...
work page 2023
-
[29]
Alpacaeval: An automatic evaluator for instruction-following language models,
Tatsu-lab. Alpacaeval: An automatic evaluator for instruction-following language models,
-
[30]
URL https://github.com/tatsu-lab/alpaca_eval
-
[31]
Benchmarking cognitive biases in large language models as evaluators
Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. InFindings of the Association for Computational Linguistics: ACL 2024, pages 517–545, 2024
work page 2024
-
[32]
Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback
Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, and Xuanjing Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2859–2873, 2023
work page 2023
-
[33]
Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing
Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, and Bingsheng He. Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing. In Companion Proceedings of the ACM on Web Conference 2026, WWW ’26. Association for Computing Machinery, 2026
work page 2026
-
[34]
Nougat: Neural optical understanding for academic documents
Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fUtxNAKpdV
work page 2024
-
[35]
Readoc: A unified benchmark for realistic document structured extraction, 2024
Zichao Li, Aizier Abulaiti, Yaojie Lu, Xuanang Chen, Jia Zheng, Hongyu Lin, Xianpei Han, and Le Sun. Readoc: A unified benchmark for realistic document structured extraction, 2024. URL https://arxiv.org/abs/2409.05137
-
[36]
Giwon Hong, Aryo Pradipta Gema, Rohit Saxena, Xiaotang Du, Ping Nie, Yu Zhao, Laura Perez-Beltrachini, Max Ryabinin, Xuanli He, Clémentine Fourrier, and Pasquale Minervini. The hallucinations leaderboard - an open effort to measure hallucinations in large language models.CoRR, abs/2404.05904, 2024. doi: 10.48550/arXiv.2404.05904. URL https://doi.org/ 10.4...
-
[37]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
LoRA: Low-rank adaptation of large language models
Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/ forum?id=nZeVKeeFYf9
work page 2022
-
[39]
Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY , USA, 2023. Association for 19 Computing Mac...
-
[40]
Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. Fast-detectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=Bpcgcr8E8Z
work page 2024
-
[41]
Spotting LLMs with binoculars: Zero- shot detection of machine-generated text
Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Spotting LLMs with binoculars: Zero- shot detection of machine-generated text. InForty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=axl3FAkpik
work page 2024
-
[42]
The llama 3 herd of models, 2024
Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407. 21783
work page 2024
-
[43]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Sebastian Porsdam Mann, Brian D Earp, Nikolaj Møller, Suren Vynn, and Julian Savulescu. Autogen: A personalized large language model for academic enhancement—ethics and proof of principle.The American Journal of Bioethics, 23(10):28–41, 2023
work page 2023
-
[47]
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human ...
-
[48]
Chain of ideas: Revolutionizing research via novel idea development with LLM agents
Long Li et al. Chain of ideas: Revolutionizing research via novel idea development with LLM agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8971–9004, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN ...
-
[49]
Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers
Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id= M23dTGWCZy
work page 2025
-
[50]
Tianyang Gu, Jingjin Wang, Zhihao Zhang, and HaoHong Li. Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research, 2024. URL https://arxiv. org/abs/2412.14141. 20
-
[51]
Marg: Multi-agent review generation for scientific papers
Mike D’Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers.arXiv preprint arXiv:2401.04259, 2024
-
[52]
Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas V odrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8): AIoa2400196, 2024
work page 2024
-
[53]
Xiuying Chen, Tairan Wang, Taicheng Guo, Kehan Guo, Juexiao Zhou, Haoyang Li, Mingchen Zhuge, Jürgen Schmidhuber, Xin Gao, and Xiangliang Zhang. Scholarchemqa: Unveiling the power of language models in chemical research question answering.arXiv preprint arXiv:2407.16931, 2024
-
[54]
Paperqa: Retrieval-augmented generative agent for scientific research,
Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research.arXiv preprint arXiv:2312.07559, 2023
-
[55]
CS-bench: A comprehensive benchmark for large language models towards computer science mastery
Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, and Weiran Xu. CS-bench: A comprehensive benchmark for large language models towards computer science mastery. InThe Thirteenth International Conference on Learning ...
work page 2025
-
[56]
Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z Li, and Kaicheng Yu. Biokgbench: A knowledge graph checking benchmark of ai agent for biomedical science.arXiv preprint arXiv:2407.00466, 2024
-
[57]
Cowriter - your ai platform for speeding up creative writing, 2025
CoWriter. Cowriter - your ai platform for speeding up creative writing, 2025. URL https: //cowriter.org
work page 2025
-
[58]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[59]
Codegen: An open large language model for code with multi-turn program synthesis
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_
work page 2023
-
[60]
Towards injecting medical visual knowledge into multimodal LLMs at scale
Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, and Benyou Wang. Towards injecting medical visual knowledge into multimodal LLMs at scale. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Languag...
-
[61]
Joshua Ashkinaze, Julia Mendelsohn, Li Qiwei, Ceren Budak, and Eric Gilbert. How ai ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment.arXiv preprint arXiv:2401.13481, 2024. 21
-
[62]
Yiren Liu, Si Chen, Haocong Cheng, Mengxia Yu, Xiao Ran, Andrew Mo, Yiliu Tang, and Yun Huang. How ai processing delays foster creativity: Exploring research question co- creation with an llm-based agent. InProceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–25, 2024
work page 2024
-
[63]
Vishakh Padmakumar and He He. Does writing with language models reduce content diversity? InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[64]
Llms are vulnerable to malicious prompts disguised as scientific language, 2025
Yubin Ge, Neeraja Kirtane, Hao Peng, and Dilek Hakkani-Tür. Llms are vulnerable to malicious prompts disguised as scientific language, 2025. URL https://arxiv.org/abs/2501.14073
-
[65]
Nuo Chen, Moming Duan, Andre Huikai Lin, Qian Wang, Jiaying Wu, and Bingsheng He. Position: The current ai conference model is unsustainable! diagnosing the crisis of centralized ai conference, 2025. URL https://arxiv.org/abs/2508.04586
-
[66]
MAIR: A massive benchmark for evaluating instructed retrieval
Weiwei Sun, Zhengliang Shi, Wu Jiu Long, Lingyong Yan, Xinyu Ma, Yiding Liu, Min Cao, Dawei Yin, and Zhaochun Ren. MAIR: A massive benchmark for evaluating instructed retrieval. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14044– 14067, Miami, Fl...
-
[67]
Towards a unified framework for reference retrieval and related work generation
Zhengliang Shi, Shen Gao, Zhen Zhang, Xiuying Chen, Zhumin Chen, Pengjie Ren, and Zhaochun Ren. Towards a unified framework for reference retrieval and related work generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5785–5799, Singapore, December 2023. Association ...
-
[68]
Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. Evoscientist: Towards multi-agent evolving ai scientists for end-to-end scientific discovery, 2026. URL https://arxiv.org/abs/2603.08127
-
[69]
Gupta, Neereja Sundaresan, Thomas Alexander, Christopher J
Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, James Zou, et al. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646:716–723, 2025. doi: 10.1038/s41586- 025-09442-9
-
[70]
Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller
Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools.Nature Machine Intelligence, pages 1–11, 2024
work page 2024
-
[71]
Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023
Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023
work page 2023
-
[72]
LitSearch: A retrieval benchmark for scientific literature search
Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. LitSearch: A retrieval benchmark for scientific literature search. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15068–15083, Miami, Florida, USA, November 202...
-
[73]
Hao Kang and Chenyan Xiong. Researcharena: Benchmarking llms’ ability to collect and organize information as research agents.arXiv preprint arXiv:2406.10291, 2024. 22
-
[74]
Citeme: Can language models accurately cite scientific claims?arXiv preprint arXiv:2407.12861, 2024
Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, and Matthias Bethge. Citeme: Can language models accurately cite scientific claims?arXiv preprint arXiv:2407.12861, 2024
-
[75]
ScilitLLM: How to adapt LLMs for scientific literature understanding
Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, and Hengxing Cai. ScilitLLM: How to adapt LLMs for scientific literature understanding. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=8dzKkeWUUb
work page 2025
-
[76]
Art or artifice? large language models and the false promise of creativity
Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. Art or artifice? large language models and the false promise of creativity. InProceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–34, 2024
work page 2024
-
[77]
Homogenization effects of large language models on human creative ideation
Barrett R Anderson, Jash Hemant Shah, and Max Kreminski. Homogenization effects of large language models on human creative ideation. InProceedings of the 16th Conference on Creativity & Cognition, pages 413–425, 2024
work page 2024
-
[78]
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. InIn the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), UIST ’23, New York, NY , USA, 2023. Association for Computing Machinery
work page 2023
-
[79]
Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. Agentsims: An open-source sandbox for large language model evaluation.arXiv preprint arXiv:2308.04026, 2023
-
[80]
PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator
Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, and Benyou Wang. PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7841–7863, Bangkok, Thailand, August 2024. Assoc...
-
[81]
Megaagent: A practical framework for autonomous cooperation in large-scale llm agent systems, 2024
Qian Wang, Tianyu Wang, Qinbin Li, Jingsheng Liang, and Bingsheng He. Megaagent: A practical framework for autonomous cooperation in large-scale llm agent systems, 2024. URL https://arxiv.org/abs/2408.09955
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.