pith. sign in

arxiv: 2505.11336 · v4 · submitted 2025-05-16 · 💻 cs.CL

XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration

Pith reviewed 2026-05-22 14:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords academic paper revisionhuman-AI collaborationLLM fine-tuningcontext-aware modelingscientific writinginstruction-response pairsopen-source models
0
0 comments X

The pith

XtraGPT fine-tunes open LLMs on 140,000 real revision pairs so they can handle context-aware, intent-guided edits to scientific papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that academic writing needs more than general text generation because it involves iterative, section-spanning revisions that keep conceptual coherence intact. It shows this by curating thousands of papers into explicit instruction-response pairs that capture how researchers actually revise drafts. The resulting models are then trained to follow specific revision goals while staying aware of surrounding sections. If the approach works, it supplies a controllable alternative to black-box prompting for the kind of detailed polishing that research communication requires.

Core claim

The central claim is that a human-AI collaboration framework built on criteria-guided intent alignment and context-aware modeling, when applied to a dataset of 140,000 instruction-response pairs drawn from 7,000 top-venue papers, produces open-source LLMs (1.5B to 14B parameters) that outperform same-scale baselines and approach proprietary model quality at improving scientific drafts, as measured by both automated preference tests and human evaluations.

What carries the argument

The criteria-guided intent alignment and context-aware modeling mechanism that turns each section-level revision instruction into a controlled edit while referencing the surrounding paper content.

If this is right

  • Open models become practical for iterative academic drafting without repeated calls to closed systems.
  • Revision quality improves when the model receives explicit intent signals tied to section context rather than generic prompts.
  • The same fine-tuning recipe can be applied at multiple model scales while preserving gains over baselines.
  • Human preference data confirm that the revisions maintain both local accuracy and cross-section coherence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar instruction sets could be built for related tasks such as responding to reviewer comments or condensing long papers.
  • Embedding the models in writing interfaces might let authors steer revisions at the paragraph level in real time.
  • If the revision patterns prove domain-general, the same data construction method could support non-academic technical writing.

Load-bearing premise

The curated set of 140,000 revision pairs from 7,000 papers accurately represents the kinds of edits that occur in real scientific writing and will transfer to unseen papers and research domains.

What would settle it

A blind side-by-side evaluation in which expert reviewers rate revisions of the same new paper sections produced by XtraGPT versus an untuned baseline of equal size, with no preference for XtraGPT.

Figures

Figures reproduced from arXiv: 2505.11336 by Andre Lin HuiKai, Bingsheng He, Jiaying Wu, Junyi Hou, Nuo Chen, Qian Wang, Xidong Wang, Zining Zhang.

Figure 1
Figure 1. Figure 1: (Left) Overview of the academic paper revision workflow comparing proprietary LLMs [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview. The post-training pipelines enable controllable, section-level, fine-grained paper [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Paper quality scores and overall ratings from o1-based [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: shows the prompt for QA. The Prompt for QA Act as an expert model for improving articles **PAPER_CONTENT**. <SELECTED_CONTENT> User Selected </SELECTED_CONTENT> <QUESTION> <User Question> </QUESTION> [PITH_FULL_IMAGE:figures/full_fig_p026_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Prompts for Generating ReviseQA 31 [PITH_FULL_IMAGE:figures/full_fig_p031_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompts for Judging (modified from alpaca_eval_gpt4_turbo_fn). [PITH_FULL_IMAGE:figures/full_fig_p032_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Prompts for Scoring. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: ICLR 2024 Paper Token Distribution (without Appendix) [PITH_FULL_IMAGE:figures/full_fig_p034_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Criteria Details of Section Title 34 [PITH_FULL_IMAGE:figures/full_fig_p034_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Criteria Details of Section Abstract 35 [PITH_FULL_IMAGE:figures/full_fig_p035_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Criteria Details of Section Introduction [PITH_FULL_IMAGE:figures/full_fig_p036_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Criteria Details of Section Background 37 [PITH_FULL_IMAGE:figures/full_fig_p037_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Criteria Details of Section Evaluation Criteria Details of Section Conclusion 1. Broader Impact and Future Directions: Assess the thoroughness of the paper’s conclusion or discussion sections in addressing the broader impact of the research. Does the paper provide specific and clear avenues for future work? 2. Clarity and Impact of Key Innovations and Findings: Evaluate whether the conclusion effectively … view at source ↗
Figure 14
Figure 14. Figure 14: Criteria Details of Section Conclusion 38 [PITH_FULL_IMAGE:figures/full_fig_p038_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: The criteria for human instructors. 39 [PITH_FULL_IMAGE:figures/full_fig_p039_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: A comparison of the paragraph before and after revision. [PITH_FULL_IMAGE:figures/full_fig_p040_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: A use case on XtraGPT. 40 [PITH_FULL_IMAGE:figures/full_fig_p040_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Scaling trends of Qwen-2.5-7B/32B/72B/Max-Instruct performance. (a) MMLU-Pro [PITH_FULL_IMAGE:figures/full_fig_p041_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: ICLR 2024 paper rating distribution. 41 [PITH_FULL_IMAGE:figures/full_fig_p041_19.png] view at source ↗
read the original abstract

Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited in supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, for example, maintaining conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process that is not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision, centered on criteria-guided intent alignment and context-aware modeling. To validate the framework, we curate a dataset of 7,000 research papers from top-tier venues, annotated with 140,000 instruction--response pairs that reflect realistic, section-level scientific revisions. We instantiate the framework in XtraGPT, the first suite of open-source LLMs (1.5B to 14B parameters) specifically fine-tuned for context-aware academic paper revision. Extensive experiments show that XtraGPT significantly outperforms same-scale baselines and rivals the quality of proprietary counterparts. Both automated preference assessments and human evaluations confirm the effectiveness of XtraGPT in improving scientific drafts. Our code and models are available at https://github.com/Xtra-Computing/XtraGPT and https://huggingface.co/collections/Xtra-Computing/xtragpt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a human-AI collaboration framework for academic paper revision that emphasizes criteria-guided intent alignment and context-aware modeling. It curates a dataset of 140,000 instruction-response pairs drawn from 7,000 top-tier papers to reflect section-level scientific revisions, then fine-tunes an open-source suite of LLMs (XtraGPT, 1.5B–14B parameters) on this data. Experiments claim that XtraGPT significantly outperforms same-scale baselines and approaches proprietary model quality, as measured by automated preference scores and human evaluations; code and models are released.

Significance. If the central claims hold after clarification of dataset construction and evaluation protocols, the work would supply the first openly available, scale-specific models for iterative, context-sensitive revision of scientific writing. The public release of code, models, and the 140k-pair dataset constitutes a concrete community resource that could accelerate research on controllable scientific text improvement.

major comments (2)
  1. [Dataset construction] Dataset construction section: the manuscript states that the 140,000 pairs 'reflect realistic, section-level scientific revisions,' yet provides no breakdown of the fraction generated by human experts versus LLM prompting or templated rewriting. Without this information it is impossible to evaluate the risk that fine-tuned models overfit to synthetic artifacts, which would undermine the claim of transfer to unseen papers and domains.
  2. [Experiments] Experiments section: the automated preference assessments and human evaluations are reported to show outperformance, but the text does not specify the exact data splits, baseline implementation details, or inter-annotator agreement statistics. These omissions leave open the possibility that post-hoc choices inflate the reported gains relative to same-scale open models.
minor comments (2)
  1. [Introduction] The abstract and introduction use the term 'context-aware' without an explicit operational definition or pointer to the modeling component that implements it; a short clarifying sentence would improve readability.
  2. [Figures] Figure captions for the human evaluation results should include the exact number of evaluators, papers sampled, and statistical test used to support the preference claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will make the necessary revisions to improve clarity and transparency.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the manuscript states that the 140,000 pairs 'reflect realistic, section-level scientific revisions,' yet provides no breakdown of the fraction generated by human experts versus LLM prompting or templated rewriting. Without this information it is impossible to evaluate the risk that fine-tuned models overfit to synthetic artifacts, which would undermine the claim of transfer to unseen papers and domains.

    Authors: We acknowledge that the current manuscript does not provide a quantitative breakdown of how the 140,000 instruction-response pairs were generated. The Dataset Construction section will be revised to include this information, specifying the proportions derived from human expert revisions extracted from paper histories versus those created via LLM prompting or templated methods, along with the quality assurance steps employed to ensure the revisions remain realistic and section-level. This addition will directly address concerns about potential overfitting to synthetic patterns and support the claims of transfer to unseen papers. revision: yes

  2. Referee: [Experiments] Experiments section: the automated preference assessments and human evaluations are reported to show outperformance, but the text does not specify the exact data splits, baseline implementation details, or inter-annotator agreement statistics. These omissions leave open the possibility that post-hoc choices inflate the reported gains relative to same-scale open models.

    Authors: We agree that the experimental reporting requires greater specificity to ensure reproducibility and to rule out any perception of post-hoc selection. The revised Experiments section and appendix will explicitly state the paper-level data splits used, provide full implementation details and hyperparameters for each baseline, and report inter-annotator agreement statistics for the human evaluations. These clarifications will strengthen the validity of the performance comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on new dataset and external benchmarks

full rationale

The paper constructs a fresh dataset of 140k instruction-response pairs from 7k papers and fine-tunes open-source models, then reports performance against same-scale baselines and proprietary models via automated and human evaluations. No equations, fitted parameters, or predictions are defined in terms of the target result itself. No self-citation chain is load-bearing for the central claim, and the derivation is self-contained against external comparisons rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions about LLM fine-tuning effectiveness and the representativeness of the new dataset; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption Instruction-tuned LLMs can learn to perform context-aware text revisions when trained on section-level human revision examples.
    Invoked implicitly when describing the fine-tuning process and expected generalization.

pith-pipeline@v0.9.0 · 5796 in / 1195 out tokens · 39619 ms · 2026-05-22T14:39:34.601619+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AI for Auto-Research: Roadmap & User Guide

    cs.AI 2026-05 unverdicted novelty 4.0

    The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    From hypothesis to publication: A comprehensive survey of AI-driven research support systems

    Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, and Bing Qin. From hypothesis to publication: A comprehensive survey of AI-driven research support systems. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors...

  2. [2]

    Towards an AI co-scientist

    Juraj Gottweis et al. Towards an ai co-scientist, 2025. URL https://arxiv.org/abs/2502.18864

  3. [3]

    Weld and Doug Downey and Wen

    Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hannaneh Hajishirzi. Openscholar: ...

  4. [4]

    Cycleresearcher: Improving automated research via automated review

    Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cycleresearcher: Improving automated research via automated review. In The Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=bjcsVLoHYs

  5. [5]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066

  6. [6]

    Llm×mapreduce-v2: Entropy-driven convolutional test-time scaling for generating long-form articles from extremely long resources, 2025

    Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, and Maosong Sun. Llm×mapreduce-v2: Entropy-driven convolutional test-time scaling for generating long-form articles from extremely long resources, 2025. URL https://arxiv.org/abs/2504.05732

  7. [7]

    Using artificial intelligence in academic writing and research: An essential productivity tool.Computer Methods and Programs in Biomedicine Update, 5:100145, 2024

    Mohamed Khalifa and Mona Albadawy. Using artificial intelligence in academic writing and research: An essential productivity tool.Computer Methods and Programs in Biomedicine Update, 5:100145, 2024

  8. [8]

    How are researchers using ai? survey reveals pros and cons for science

    Miryam Naddaf. How are researchers using ai? survey reveals pros and cons for science. Nature, 2025. doi: 10.1038/d41586-025-00343-5

  9. [9]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  10. [10]

    University of Chicago Press, 2017

    Scott L Montgomery.The Chicago guide to communicating science. University of Chicago Press, 2017

  11. [11]

    OUP USA, 2012

    Joshua Schimel.Writing science: how to write papers that get cited and proposals that get funded. OUP USA, 2012

  12. [12]

    Liveclin: A live clinical benchmark without leakage

    Xidong Wang, Guo shuqi, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, and Benyou Wang. Liveclin: A live clinical benchmark without leakage. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/ forum?id=E0WSAugJ0j

  13. [13]

    Iclr 2025 reviewer guide

    ICLR. Iclr 2025 reviewer guide. https://iclr.cc/Conferences/2025/ReviewerGuide, 2025. Accessed: 2025-04-20

  14. [14]

    Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning.arXiv preprint arXiv:2409.05556, 2024

    Alireza Ghafarollahi and Markus J Buehler. Sciagents: Automating scientific discovery through multi-agent intelligent graph reasoning.arXiv preprint arXiv:2409.05556, 2024

  15. [15]

    DeepReview: Improving LLM-based paper review with human-like deep thinking process

    Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. DeepReview: Improving LLM-based paper review with human-like deep thinking process. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29330–29355, V...

  16. [16]

    Texgpt: Harness the power of chatgpt in overleaf, 2023

    Hilde van Zeeland. Texgpt: Harness the power of chatgpt in overleaf, 2023. URL https: //blog.writefull.com/texgpt-harness-the-power-of-chatgpt-in-overleaf/

  17. [17]

    Tips for writing technical papers

    Jennifer Widom. Tips for writing technical papers. https://cs.stanford.edu/people/widom/paper- writing.html, 2006. Accessed: 2025-04-20. 17

  18. [18]

    PaperRobot: Incremental draft generation of scientific ideas

    Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, and Yi Luan. PaperRobot: Incremental draft generation of scientific ideas. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1980–1991, Florence, Italy, July 2019. Association ...

  19. [19]

    Tal August, Katharina Reinecke, and Noah A. Smith. Generating scientific definitions with controllable complexity. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8298–8317, Dublin, Ireland, May 2022. Association for...

  20. [20]

    Assisting in writing Wikipedia-like articles from scratch with large language models

    Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. Assisting in writing Wikipedia-like articles from scratch with large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T...

  21. [21]

    Into the unknown unknowns: Engaged human learning through participation in language model agent conversations

    Yucheng Jiang, Yijia Shao, Dekun Ma, Sina Semnani, and Monica Lam. Into the unknown unknowns: Engaged human learning through participation in language model agent conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9917– 9955, Miami, Flor...

  22. [22]

    Autonomous llm-driven research from data to human-verifiable research papers, 2024

    Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404. 17605

  23. [23]

    Agent laboratory: Using LLM agents as research assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pag...

  24. [24]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

  25. [25]

    Cursor - the ai code editor, 2024

    Cursor. Cursor - the ai code editor, 2024. URL https://www.cursor.com/

  26. [26]

    Cheng Tan, Dongxin Lyu, Siyuan Li, Zhangyang Gao, Jingxuan Wei, Siqi Ma, Zicheng Liu, and Stan Z. Li. Peer review as a multi-turn and long-context dialogue with role-based interactions: Benchmarking large language models, 2025. URL https://openreview.net/forum? id=uV3Gdoq2ez. 18

  27. [27]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2024. URL https://arxiv.org/abs/ 2404.04475

  28. [28]

    Gonzalez, and Ion Stoica

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openre...

  29. [29]

    Alpacaeval: An automatic evaluator for instruction-following language models,

    Tatsu-lab. Alpacaeval: An automatic evaluator for instruction-following language models,

  30. [30]

    URL https://github.com/tatsu-lab/alpaca_eval

  31. [31]

    Benchmarking cognitive biases in large language models as evaluators

    Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. InFindings of the Association for Computational Linguistics: ACL 2024, pages 517–545, 2024

  32. [32]

    Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback

    Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, and Xuanjing Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2859–2873, 2023

  33. [33]

    Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing

    Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, and Bingsheng He. Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing. In Companion Proceedings of the ACM on Web Conference 2026, WWW ’26. Association for Computing Machinery, 2026

  34. [34]

    Nougat: Neural optical understanding for academic documents

    Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fUtxNAKpdV

  35. [35]

    Readoc: A unified benchmark for realistic document structured extraction, 2024

    Zichao Li, Aizier Abulaiti, Yaojie Lu, Xuanang Chen, Jia Zheng, Hongyu Lin, Xianpei Han, and Le Sun. Readoc: A unified benchmark for realistic document structured extraction, 2024. URL https://arxiv.org/abs/2409.05137

  36. [36]

    The hallucinations leaderboard - an open effort to measure hallucinations in large language models.CoRR, abs/2404.05904, 2024

    Giwon Hong, Aryo Pradipta Gema, Rohit Saxena, Xiaotang Du, Ping Nie, Yu Zhao, Laura Perez-Beltrachini, Max Ryabinin, Xuanli He, Clémentine Fourrier, and Pasquale Minervini. The hallucinations leaderboard - an open effort to measure hallucinations in large language models.CoRR, abs/2404.05904, 2024. doi: 10.48550/arXiv.2404.05904. URL https://doi.org/ 10.4...

  37. [37]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  38. [38]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/ forum?id=nZeVKeeFYf9

  39. [39]

    Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY , USA, 2023. Association for 19 Computing Mac...

  40. [40]

    Fast-detectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature

    Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. Fast-detectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=Bpcgcr8E8Z

  41. [41]

    Spotting LLMs with binoculars: Zero- shot detection of machine-generated text

    Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Spotting LLMs with binoculars: Zero- shot detection of machine-generated text. InForty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=axl3FAkpik

  42. [42]

    The llama 3 herd of models, 2024

    Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407. 21783

  43. [43]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  44. [45]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  45. [46]

    Autogen: A personalized large language model for academic enhancement—ethics and proof of principle.The American Journal of Bioethics, 23(10):28–41, 2023

    Sebastian Porsdam Mann, Brian D Earp, Nikolaj Møller, Suren Vynn, and Julian Savulescu. Autogen: A personalized large language model for academic enhancement—ethics and proof of principle.The American Journal of Bioethics, 23(10):28–41, 2023

  46. [47]

    ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models , booktitle =

    Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human ...

  47. [48]

    Chain of ideas: Revolutionizing research via novel idea development with LLM agents

    Long Li et al. Chain of ideas: Revolutionizing research via novel idea development with LLM agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8971–9004, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN ...

  48. [49]

    Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers

    Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id= M23dTGWCZy

  49. [50]

    Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research, 2024

    Tianyang Gu, Jingjin Wang, Zhihao Zhang, and HaoHong Li. Llms can realize combinatorial creativity: generating creative ideas via llms for scientific research, 2024. URL https://arxiv. org/abs/2412.14141. 20

  50. [51]

    Marg: Multi-agent review generation for scientific papers

    Mike D’Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers.arXiv preprint arXiv:2401.04259, 2024

  51. [52]

    Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8): AIoa2400196, 2024

    Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas V odrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8): AIoa2400196, 2024

  52. [53]

    Scholarchemqa: Unveiling the power of language models in chemical research question answering.arXiv preprint arXiv:2407.16931, 2024

    Xiuying Chen, Tairan Wang, Taicheng Guo, Kehan Guo, Juexiao Zhou, Haoyang Li, Mingchen Zhuge, Jürgen Schmidhuber, Xin Gao, and Xiangliang Zhang. Scholarchemqa: Unveiling the power of language models in chemical research question answering.arXiv preprint arXiv:2407.16931, 2024

  53. [54]

    Paperqa: Retrieval-augmented generative agent for scientific research,

    Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research.arXiv preprint arXiv:2312.07559, 2023

  54. [55]

    CS-bench: A comprehensive benchmark for large language models towards computer science mastery

    Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, and Weiran Xu. CS-bench: A comprehensive benchmark for large language models towards computer science mastery. InThe Thirteenth International Conference on Learning ...

  55. [56]

    Biokgbench: A knowledge graph checking benchmark of ai agent for biomedical science.arXiv preprint arXiv:2407.00466, 2024

    Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z Li, and Kaicheng Yu. Biokgbench: A knowledge graph checking benchmark of ai agent for biomedical science.arXiv preprint arXiv:2407.00466, 2024

  56. [57]

    Cowriter - your ai platform for speeding up creative writing, 2025

    CoWriter. Cowriter - your ai platform for speeding up creative writing, 2025. URL https: //cowriter.org

  57. [58]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  58. [59]

    Codegen: An open large language model for code with multi-turn program synthesis

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_

  59. [60]

    Towards injecting medical visual knowledge into multimodal LLMs at scale

    Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, and Benyou Wang. Towards injecting medical visual knowledge into multimodal LLMs at scale. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Languag...

  60. [61]

    How ai ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment.arXiv preprint arXiv:2401.13481, 2024

    Joshua Ashkinaze, Julia Mendelsohn, Li Qiwei, Ceren Budak, and Eric Gilbert. How ai ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment.arXiv preprint arXiv:2401.13481, 2024. 21

  61. [62]

    How ai processing delays foster creativity: Exploring research question co- creation with an llm-based agent

    Yiren Liu, Si Chen, Haocong Cheng, Mengxia Yu, Xiao Ran, Andrew Mo, Yiliu Tang, and Yun Huang. How ai processing delays foster creativity: Exploring research question co- creation with an llm-based agent. InProceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–25, 2024

  62. [63]

    Does writing with language models reduce content diversity? InThe Twelfth International Conference on Learning Representations, 2024

    Vishakh Padmakumar and He He. Does writing with language models reduce content diversity? InThe Twelfth International Conference on Learning Representations, 2024

  63. [64]

    Llms are vulnerable to malicious prompts disguised as scientific language, 2025

    Yubin Ge, Neeraja Kirtane, Hao Peng, and Dilek Hakkani-Tür. Llms are vulnerable to malicious prompts disguised as scientific language, 2025. URL https://arxiv.org/abs/2501.14073

  64. [65]

    Position: The current ai conference model is unsustainable! diagnosing the crisis of centralized ai conference, 2025

    Nuo Chen, Moming Duan, Andre Huikai Lin, Qian Wang, Jiaying Wu, and Bingsheng He. Position: The current ai conference model is unsustainable! diagnosing the crisis of centralized ai conference, 2025. URL https://arxiv.org/abs/2508.04586

  65. [66]

    MAIR: A massive benchmark for evaluating instructed retrieval

    Weiwei Sun, Zhengliang Shi, Wu Jiu Long, Lingyong Yan, Xinyu Ma, Yiding Liu, Min Cao, Dawei Yin, and Zhaochun Ren. MAIR: A massive benchmark for evaluating instructed retrieval. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14044– 14067, Miami, Fl...

  66. [67]

    Towards a unified framework for reference retrieval and related work generation

    Zhengliang Shi, Shen Gao, Zhen Zhang, Xiuying Chen, Zhumin Chen, Pengjie Ren, and Zhaochun Ren. Towards a unified framework for reference retrieval and related work generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5785–5799, Singapore, December 2023. Association ...

  67. [68]

    Evoscientist: Towards multi-agent evolving ai scientists for end-to-end scientific discovery.arXiv preprint arXiv:2603.08127, 2026

    Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. Evoscientist: Towards multi-agent evolving ai scientists for end-to-end scientific discovery, 2026. URL https://arxiv.org/abs/2603.08127

  68. [69]

    Gupta, Neereja Sundaresan, Thomas Alexander, Christopher J

    Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, James Zou, et al. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646:716–723, 2025. doi: 10.1038/s41586- 025-09442-9

  69. [70]

    Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller

    Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools.Nature Machine Intelligence, pages 1–11, 2024

  70. [71]

    Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

  71. [72]

    LitSearch: A retrieval benchmark for scientific literature search

    Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. LitSearch: A retrieval benchmark for scientific literature search. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15068–15083, Miami, Florida, USA, November 202...

  72. [73]

    Researcharena: Benchmarking llms’ ability to collect and organize information as research agents.arXiv preprint arXiv:2406.10291, 2024

    Hao Kang and Chenyan Xiong. Researcharena: Benchmarking llms’ ability to collect and organize information as research agents.arXiv preprint arXiv:2406.10291, 2024. 22

  73. [74]

    Citeme: Can language models accurately cite scientific claims?arXiv preprint arXiv:2407.12861, 2024

    Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, and Matthias Bethge. Citeme: Can language models accurately cite scientific claims?arXiv preprint arXiv:2407.12861, 2024

  74. [75]

    ScilitLLM: How to adapt LLMs for scientific literature understanding

    Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, and Hengxing Cai. ScilitLLM: How to adapt LLMs for scientific literature understanding. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=8dzKkeWUUb

  75. [76]

    Art or artifice? large language models and the false promise of creativity

    Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. Art or artifice? large language models and the false promise of creativity. InProceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–34, 2024

  76. [77]

    Homogenization effects of large language models on human creative ideation

    Barrett R Anderson, Jash Hemant Shah, and Max Kreminski. Homogenization effects of large language models on human creative ideation. InProceedings of the 16th Conference on Creativity & Cognition, pages 413–425, 2024

  77. [78]

    O’Brien, Carrie J

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. InIn the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), UIST ’23, New York, NY , USA, 2023. Association for Computing Machinery

  78. [79]

    Agentsims: An open-source sandbox for large language model evaluation.arXiv preprint arXiv:2308.04026, 2023

    Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. Agentsims: An open-source sandbox for large language model evaluation.arXiv preprint arXiv:2308.04026, 2023

  79. [80]

    PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator

    Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, and Benyou Wang. PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7841–7863, Bangkok, Thailand, August 2024. Assoc...

  80. [81]

    Megaagent: A practical framework for autonomous cooperation in large-scale llm agent systems, 2024

    Qian Wang, Tianyu Wang, Qinbin Li, Jingsheng Liang, and Bingsheng He. Megaagent: A practical framework for autonomous cooperation in large-scale llm agent systems, 2024. URL https://arxiv.org/abs/2408.09955

Showing first 80 references.