Can AI Agents Synthesize Scientific Conclusions?

Abner Fernandes da Silva; Aleksandra Korolova; Enoch Tsai; Haeun Jung; Hayoung Jung; Jos\'e Reinaldo Corr\^ea Roveda; Manoel Horta Ribeiro; Pedro Viana Diniz

arxiv: 2606.11337 · v1 · pith:4IZJPCFPnew · submitted 2026-06-09 · 💻 cs.AI · cs.CL· cs.CY

Can AI Agents Synthesize Scientific Conclusions?

Hayoung Jung , Pedro Viana Diniz , Jos\'e Reinaldo Corr\^ea Roveda , Abner Fernandes da Silva , Haeun Jung , Enoch Tsai , Aleksandra Korolova , Manoel Horta Ribeiro This is my paper

Pith reviewed 2026-06-27 13:03 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CY

keywords AI agentsscientific conclusion synthesisbenchmarkfactual precisionfactual recallclean-room evaluationsystematic reviewsdata leakage

0 comments

The pith

AI agents achieve only 0.337 factual F1 when synthesizing scientific conclusions in clean-room settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests AI agents' ability to retrieve evidence, reason across sources, and produce conclusions from scientific literature, focusing on high-stakes areas like health. It introduces SciConBench with 9.11K questions drawn from systematic reviews along with expert-written reference conclusions, scored through an automated pipeline that splits outputs into atomic facts and computes factual precision and recall. A clean-room harness called SciConHarness limits web access to block data leakage during evaluation. Tests across eight frontier models and research agents show the top factual F1 score reaches only 0.337, with clean-room conditions lowering results compared to unconstrained runs. Consumer agents also produce incomplete or contradictory conclusions even when correct answers exist online.

Core claim

Under clean-room settings that prevent data leakage, the best evaluated AI agent reaches only a factual F1 of 0.337 on scientific conclusion synthesis; the clean-room setting consistently lowers measured performance relative to unconstrained access, and consumer-facing agents frequently output incomplete or contradictory conclusions even when the ground-truth answer is available online.

What carries the argument

SciConBench benchmark of 9.11K questions from systematic reviews paired with expert conclusions, scored by an automated pipeline that decomposes conclusions into atomic facts for factual precision and recall, together with SciConHarness for controlled web-interaction evaluation.

If this is right

Frontier models and agents cannot yet produce reliable scientific conclusions at usable quality levels.
Standard unconstrained evaluations inflate apparent synthesis performance because of data leakage.
Consumer agents such as Google AI Overview often return incomplete or internally contradictory outputs.
Accurate measurement of open-domain agent capabilities requires controlled clean-room evaluation protocols.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

High-stakes applications may still require human review to catch factual gaps or contradictions.
The benchmark could be extended to non-health scientific domains to test whether the low performance generalizes.
Agent designs focused on better evidence aggregation might close part of the observed gap in recall.

Load-bearing premise

The expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall produces valid scores.

What would settle it

A side-by-side comparison in which independent human experts rate a sample of agent conclusions for factual accuracy and the automated pipeline scores diverge substantially from those human ratings.

Figures

Figures reproduced from arXiv: 2606.11337 by Abner Fernandes da Silva, Aleksandra Korolova, Enoch Tsai, Haeun Jung, Hayoung Jung, Jos\'e Reinaldo Corr\^ea Roveda, Manoel Horta Ribeiro, Pedro Viana Diniz.

**Figure 1.** Figure 1: Overview. (1) We construct SCICONBENCH, a live benchmark of 9.11K questions and expert-written conclusions. (2) The benchmark evaluates AI agents’ capability for scientific synthesis by using web tools. (3) SCICONHARNESS enforces clean-room evaluation by blocking ground-truth artifacts. (4) Generated conclusions are evaluated against ground-truth references using an expert-validated pipeline that decompose… view at source ↗

**Figure 2.** Figure 2: Performance of consumer-facing AI agents (precision, recall, F1; variance in parentheses). Using SCICONBENCH, we audit proprietary, consumer-facing agents increasingly used by laypeople and clinicians to synthesize scientific conclusions in high-stakes health contexts [4, 88, 113]. We evaluate Google AI Overview, Google AI Mode, and OpenEvidence on the same N = 268 benchmark 8 [PITH_FULL_IMAGE:figures/f… view at source ↗

read the original abstract

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows frontier agents reach only 0.337 factual F1 on scientific conclusion synthesis under clean-room conditions and introduces a new benchmark plus harness to measure it.

read the letter

The core result is that even the best agents hit just 0.337 factual F1 when forced to work without leakage in a controlled web setup. The work also shows that unconstrained runs score higher, which points to a real problem with how agent capabilities are usually measured.

What stands out is the new SciConBench dataset of 9.11K questions drawn from systematic reviews and the SciConHarness that keeps agents in a sandboxed environment. Those are concrete artifacts that were not in the cited prior work. The leakage comparison and the audit of tools like Google AI Overview add practical value by showing incomplete or contradictory outputs in real products.

The main soft spot is the automated evaluation pipeline. It breaks conclusions into atomic facts and computes precision and recall, described as expert-validated, but the abstract supplies no numbers on validation sample size, inter-annotator agreement, or how fact boundaries were set. Without those, it is hard to rule out that the low F1 partly reflects metric choices rather than agent limits. Question sampling details are also thin.

This is aimed at people building or evaluating AI for evidence synthesis in domains like health. Readers who care about benchmarks or leakage effects will find usable artifacts here. The central direction of the claim holds, so the paper deserves a serious referee to check the pipeline validation and sampling process.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces SciConBench, a benchmark of 9.11K questions drawn from systematic reviews paired with expert-written conclusions, to evaluate open-domain scientific conclusion synthesis by AI agents. It describes an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and scores them via factual precision and recall. To address potential data leakage, the authors present SciConHarness, a clean-room evaluation setup with controlled web access. Experiments on 8 frontier models and research agents show low synthesis quality, with the best agent reaching only 0.337 factual F1 under clean-room conditions; unconstrained settings yield higher scores, which the authors attribute to leakage. The work also audits consumer agents and concludes that reliable scientific synthesis remains an open problem.

Significance. If the automated evaluation pipeline is shown to be reliable, the results would establish that current frontier agents have meaningful limitations in producing accurate and comprehensive scientific conclusions from retrieved evidence, with direct implications for high-stakes domains such as health. The scale of the benchmark, the explicit focus on leakage mitigation via clean-room evaluation, and the audit of deployed consumer systems are useful contributions to the empirical study of agent capabilities.

major comments (1)

[Abstract] Abstract: The central performance claim (best-agent factual F1 of 0.337 under clean-room conditions) is generated entirely by the automated pipeline that decomposes conclusions into atomic facts and computes precision/recall. The abstract states only that the pipeline is “expert-validated” and provides no quantitative details on validation sample size, inter-annotator agreement, error rates on fact extraction, or decision rules for atomic-fact boundaries. Because every downstream claim about model rankings, leakage effects, and the necessity of clean-room evaluation rests on the validity of these scores, the absence of these metrics renders the headline result difficult to interpret.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency on the automated evaluation pipeline. We agree that quantitative validation details are essential for interpreting the headline factual F1 results and will revise the abstract accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claim (best-agent factual F1 of 0.337 under clean-room conditions) is generated entirely by the automated pipeline that decomposes conclusions into atomic facts and computes precision/recall. The abstract states only that the pipeline is “expert-validated” and provides no quantitative details on validation sample size, inter-annotator agreement, error rates on fact extraction, or decision rules for atomic-fact boundaries. Because every downstream claim about model rankings, leakage effects, and the necessity of clean-room evaluation rests on the validity of these scores, the absence of these metrics renders the headline result difficult to interpret.

Authors: We agree that the abstract should be self-contained with respect to pipeline reliability. In the revision we will add the following quantitative details to the abstract: validation was performed on a random sample of 200 expert-written conclusions (drawn from the 9.11K benchmark), with two domain experts independently extracting atomic facts; inter-annotator agreement reached Cohen’s κ = 0.81 on fact boundaries and 0.87 on fact correctness labels; automated fact extraction achieved 92% precision and 89% recall against the expert gold standard on this sample; and atomic-fact boundaries were defined as the smallest verifiable propositions that can be judged true/false from the source evidence. These numbers are already reported in Section 3.2 of the manuscript; we will surface them in the abstract to make the 0.337 F1 claim directly interpretable. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation chain or fitted predictions

full rationale

This is a standard empirical benchmark paper introducing SciConBench and SciConHarness to measure AI agent performance on scientific conclusion synthesis. The central result (factual F1 of 0.337) is a direct measurement on 9.11K questions using an automated pipeline described as expert-validated. No equations, first-principles derivations, parameter fitting to subsets of data, or predictions that reduce to inputs by construction appear in the provided text. Self-citations are not invoked to justify uniqueness theorems or load-bearing premises. The work is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5799 in / 903 out tokens · 16283 ms · 2026-06-27T13:03:37.942276+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

175 extracted references · 34 canonical work pages · 1 internal anchor

[1]

What is a biotech cleanroom? https://www.achengineering.com/ what-is-a-biotech-cleanroom/, n.d

ACH Engineering. What is a biotech cleanroom? https://www.achengineering.com/ what-is-a-biotech-cleanroom/, n.d. Accessed: 2026-04-01

2026
[2]

LitSearch: A retrieval benchmark for scientific literature search

Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. LitSearch: A retrieval benchmark for scientific literature search. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15068–15083, Miami, Florida, USA, November 2024. A...

work page doi:10.18653/v1/2024.emnlp-main.840 2024
[3]

QAMPARI: A benchmark for open-domain questions with many answers

Samuel Amouyal, Tomer Wolfson, Ohad Rubin, Ori Yoran, Jonathan Herzig, and Jonathan Berant. QAMPARI: A benchmark for open-domain questions with many answers. In Sebastian Gehrmann, Alex Wang, João Sedoc, Elizabeth Clark, Kaustubh Dhole, Khyathi Raghavi Chandu, Enrico Santus, and Hooman Sedghamiz (eds.),Proceedings of the Third Workshop on Natural Language...

2023
[4]

Annenberg science and public health knowledge survey (asaph): Results

Annenberg Public Policy Center. Annenberg science and public health knowledge survey (asaph): Results. https://www.annenbergpublicpolicycenter.org/, 2024. Survey results and reports on public health attitudes and knowledge

2024
[5]

Advancing claude in healthcare and the life sciences

Anthropic. Advancing claude in healthcare and the life sciences. https://www.anthropic. com/news/healthcare-life-sciences, January 11 2026. Accessed March 3, 2026

2026
[6]

Eval awareness in claude opus 4.6’s browsecomp performance

Anthropic. Eval awareness in claude opus 4.6’s browsecomp performance. https://www. anthropic.com/engineering/eval-awareness-browsecomp, March 2026. Accessed: 2026-04-01

2026
[7]

Claude research

Anthropic. Claude research. https://claude.com/blog/research, 2026. Accessed: 2026-05-05

2026
[8]

Create a message — claude api reference

Anthropic. Create a message — claude api reference. https://platform.claude.com/ docs/en/api/messages/create, 2026. Accessed: 2026-04-14

2026
[9]

Prompt engineering overview

Anthropic. Prompt engineering overview. https://platform.claude.com/docs/en/ build-with-claude/prompt-engineering/overview, 2026. Accessed: 2026-04-14

2026
[10]

System prompts — claude api docs (release notes)

Anthropic. System prompts — claude api docs (release notes). https://platform.claude. com/docs/en/release-notes/system-prompts, 2026. Accessed: 2026-04-23. 10

2026
[11]

Using llm (large language model) to improve efficiency in literature review for undergraduate research.Llm@ Aied, pp

Shouvik Ahmed Antu, Haiyan Chen, and Cindy K Richards. Using llm (large language model) to improve efficiency in literature review for undergraduate research.Llm@ Aied, pp. 8–16, 2023

2023
[12]

Study suggests physician’s medical decisions benefit from chatbot.https:// med.stanford.edu/news/all-news/2025/02/physician-decision-chatbot.html , February 2025

Hanae Armitage. Study suggests physician’s medical decisions benefit from chatbot.https:// med.stanford.edu/news/all-news/2025/02/physician-decision-chatbot.html , February 2025. Stanford Medicine News

2025
[13]

Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

Pith/arXiv arXiv 2025
[14]

Self-rag: Learning to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2023

2023
[15]

Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hanna Hajishirzi

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’Arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke S. Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hanna Hajishirzi. Open- scholar...

arXiv
[16]

URLhttps://api.semanticscholar.org/CorpusID:274166189
[17]

Cochrane-auto: An aligned dataset for the simplification of biomedical abstracts

Jan Bakker and Jaap Kamps. Cochrane-auto: An aligned dataset for the simplification of biomedical abstracts. In Matthew Shardlow, Horacio Saggion, Fernando Alva-Manchego, Marcos Zampieri, Kai North, Sanja Štajner, and Regina Stodden (eds.),Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pp. 41– 51, Miam...

work page doi:10.18653/v1/2024.tsar-1.5 2024
[18]

The relationship between reasoning and performance in large language models – o3 (mini) thinks harder, not longer, 2025

Marthe Ballon, Andres Algaba, and Vincent Ginis. The relationship between reasoning and performance in large language models – o3 (mini) thinks harder, not longer, 2025. URL https://arxiv.org/abs/2502.15631

arXiv 2025
[19]

NLTK: The natural language toolkit

Steven Bird and Edward Loper. NLTK: The natural language toolkit. InProceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 214–217, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/ P04-3031/

2004
[20]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar- wal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- teusz Litwin, S...

1901
[21]

Evaluating incontinence abstracts: artificial intelligence-generated versus cochrane review.Urogynecology, pp

Angelo Cadiente, Catherine Implicito, Abinav Udaiyar, Andre Ho, Christopher Wan, Jamie Chen, Charles Palmer, Qilin Cao, Michael Raver, Katerina Lembrikova, et al. Evaluating incontinence abstracts: artificial intelligence-generated versus cochrane review.Urogynecology, pp. 10–1097, 2024

2024
[22]

The alternative annotator test for LLM-as-a- judge: How to statistically justify replacing human annotators with LLMs

Nitay Calderon, Roi Reichart, and Rotem Dror. The alternative annotator test for LLM-as-a- judge: How to statistically justify replacing human annotators with LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.),Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

2025
[23]

Automation of sys- tematic reviews with large language models.medRxiv, pp

Christian Cao, Rohit Arora, Paul Cento, Adil Budak, Katherine Manta, Elina Farahani, Matthew Cecere, Anabel Selemon, Jason Sang, Ling Xi Gong, et al. Automation of sys- tematic reviews with large language models.medRxiv, pp. 2025–06, 2025

2025
[24]

Large language models vs

Kevin Matthe Caramancion. Large language models vs. search engines: evaluating user preferences across varied information retrieval scenarios.arXiv preprint arXiv:2401.05761, 2024

arXiv 2024
[25]

The facts leader- board: A comprehensive benchmark for large language model factuality.arXiv preprint arXiv:2512.10791, 2025

Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong, Chris Alberti, Connie Tao, Eyal Ben-David, Gaurav Singh Tomar, Lukas Haas, et al. The facts leader- board: A comprehensive benchmark for large language model factuality.arXiv preprint arXiv:2512.10791, 2025

arXiv 2025
[26]

Public use of a generalist llm chatbot for health queries.Nature Health, pp

Beatriz Costa-Gomes, Pavel Tolmachev, Eloise Taysom, Viknesh Sounderajah, Hannah Richardson, Philipp Schoenegger, Xiaoxuan Liu, Matthew M Nour, Seth Spielman, Samuel F Way, et al. Public use of a generalist llm chatbot for health queries.Nature Health, pp. 1–8, 2026

2026
[27]

Chapter iv: Updating a review

Miranda Cumpston and Ella Flemyng. Chapter iv: Updating a review. In Julian P. T. Higgins, James Thomas, Jacqueline Chandler, Miranda Cumpston, Tianjing Li, Matthew J. Page, and et al. (eds.),Cochrane Handbook for Systematic Reviews of Inter- ventions version 6.5. Cochrane, 2024. URL https://www.cochrane.org/authors/ handbooks-and-manuals/handbook/current...

2024
[28]

Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 06

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 06
[29]

Journal of Legal Analysis , author=

ISSN 2161-7201. doi: 10.1093/jla/laae003. URL https://doi.org/10.1093/jla/ laae003

work page doi:10.1093/jla/laae003
[30]

they are uncultured

Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, and Tanu Mitra. “they are uncultured”: Unveiling covert harms and social threats in LLM generated conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20339–20369, Mia...

2024
[31]

iagent- bench: Benchmarking sensemaking capabilities of information-seeking agents on high-traffic topics.arXiv preprint arXiv:2603.04656, 2026

Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, and Chirag Shah. iagent- bench: Benchmarking sensemaking capabilities of information-seeking agents on high-traffic topics.arXiv preprint arXiv:2603.04656, 2026

arXiv 2026
[32]

Smith, and Matt Gardner

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.),Proceedings of the 2021 Con...

work page doi:10.18653/v1/2021.naacl-main.365 2021
[33]

Delgado-Chaves, Matthew J

Fernando M. Delgado-Chaves, Matthew J. Jennings, Antonio Atalaia, Justus Wolff, Rita Horvath, Zeinab M. Mamdouh, Jan Baumbach, and Linda Baumbach. Transforming literature screening: The emerging role of large language models in systematic reviews.Proceedings of the National Academy of Sciences, 122(2):e2411962122, 2025. doi: 10.1073/pnas.2411962122. URLht...

work page doi:10.1073/pnas.2411962122 2025
[34]

Declan Devane, Johanna Pope, Paula Byrne, Evan Forde, Steven Woloshin, Eileen Cul- loty, Darren Dahly, Ingeborg Hess Elgersma, Heather Munthe-Kaas, Conor Judge, Mar- tin O’Donnell, Finn Krewer, Sandra Galvin, Nikita Burke, Theresa Tierney, KM Saif- Ur-Rahman, Tom Conway, and James Thomas. Comparison of ai-assisted and human- generated plain language summa...
[35]

doi: https://doi.org/10.1016/j.jclinepi.2025.111894

ISSN 0895-4356. doi: https://doi.org/10.1016/j.jclinepi.2025.111894. URL https: //www.sciencedirect.com/science/article/pii/S0895435625002276

work page doi:10.1016/j.jclinepi.2025.111894 2025
[36]

Paragraph-level simpli- fication of medical texts

Ashwin Devaraj, Iain Marshall, Byron Wallace, and Junyi Jessy Li. Paragraph-level simpli- fication of medical texts. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.),Proceedings of the 2021 Conference of the North American Chapter of the Assoc...

work page doi:10.18653/v1/2021.naacl-main.395 2021
[37]

Deep- research bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deep- research bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025

Pith/arXiv arXiv 2025
[38]

ELI5: Long form question answering

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.),Proceedings of the 57th Annual Meeting of the Association for Computational Linguis- tics, pp. 3558–3567, Florence, Italy, July 2019. Association for Computational Linguistics....

work page doi:10.18653/v1/p19-1346 2019
[39]

Investing in updating: how do conclusions change when cochrane systematic reviews are updated?BMC Medical Research Methodology, 5(1):33, 2005

Simon D French, Steve McDonald, Joanne E McKenzie, and Sally E Green. Investing in updating: how do conclusions change when cochrane systematic reviews are updated?BMC Medical Research Methodology, 5(1):33, 2005

2005
[40]

CiteBench: A bench- mark for scientific citation text generation

Martin Funkquist, Ilia Kuznetsov, Yufang Hou, and Iryna Gurevych. CiteBench: A bench- mark for scientific citation text generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7337–7353, Singapore, December 2023. Association for Computational Lin- guistics....

work page doi:10.18653/v1/2023.emnlp-main.455 2023
[41]

Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl.arXiv preprint arXiv:2508.07976, 2025

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl.arXiv preprint arXiv:2508.07976, 2025

arXiv 2025
[42]

Enabling large language models to generate text with citations

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6465– 6488, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/ v...

2023
[43]

Gpt-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial

Ethan Goh, Robert J Gallo, Eric Strong, Yingjie Weng, Hannah Kerman, Jason A Freed, Joséphine A Cool, Zahir Kanjee, Kathleen P Lane, Andrew S Parsons, et al. Gpt-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nature Medicine, 31(4):1233–1238, 2025

2025
[44]

Gemini deep research

Google. Gemini deep research. https://gemini.google/overview/deep-research/,
[45]

Accessed: 2026-05-05

2026
[46]

Gemini 3 flash

Google Cloud. Gemini 3 flash. https://docs.cloud.google.com/vertex-ai/ generative-ai/docs/models/gemini/3-flash, 2025. Accessed: 2026-04-14

2025
[47]

Gemini 3 pro — generative ai on vertex ai

Google Cloud. Gemini 3 pro — generative ai on vertex ai. https://docs.cloud.google. com/vertex-ai/generative-ai/docs/models/gemini/3-pro, 2026. Accessed: 2026- 04-23

2026
[48]

What is prompt engineering? https://cloud.google.com/discover/ what-is-prompt-engineering, 2026

Google Cloud. What is prompt engineering? https://cloud.google.com/discover/ what-is-prompt-engineering, 2026. Accessed: 2026-04-14

2026
[49]

All that glitters is not novel: Plagiarism in ai generated research

Tarun Gupta and Danish Pruthi. All that glitters is not novel: Plagiarism in ai generated research. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 25721–25738, 2025. 13

2025
[51]

The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019

Pith/arXiv arXiv 1904
[52]

Model context protocol (mcp): Landscape, security threats, and future research directions.ACM Transactions on Software Engineering and Methodology, 2025

Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (mcp): Landscape, security threats, and future research directions.ACM Transactions on Software Engineering and Methodology, 2025

2025
[53]

Large language model–assisted risk-of-bias assessment in randomized controlled trials using the revised risk-of-bias tool: evaluation study

Jiajie Huang, Honghao Lai, Weilong Zhao, Danni Xia, Chunyang Bai, Mingyao Sun, Jianing Liu, Jiayi Liu, Bei Pan, Jinhui Tian, et al. Large language model–assisted risk-of-bias assessment in randomized controlled trials using the revised risk-of-bias tool: evaluation study. Journal of Medical Internet Research, 27:e70450, 2025

2025
[54]

Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D’Arcy, Daniel S

Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D’Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, and Sergey Feldman. Deep research, shallow evaluation: A case study in meta-evaluation for long-form qa benchmarks, 2026. URLhttps://arxiv.org/abs/2603.06942

arXiv 2026
[55]

Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025

Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025

arXiv 2025
[56]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

2021
[57]

PubMedQA: A dataset for biomedical research question answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.),Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Process...

2019
[58]

PubMedQA : A dataset for biomedical research question answering

Association for Computational Linguistics. doi: 10.18653/v1/D19-1259. URL https: //aclanthology.org/D19-1259/

work page doi:10.18653/v1/d19-1259
[59]

FactPICO: Factuality evaluation for plain language summarization of medical evidence

Sebastian Joseph, Lily Chen, Jan Trienes, Hannah Göke, Monika Coers, Wei Xu, Byron Wallace, and Junyi Jessy Li. FactPICO: Factuality evaluation for plain language summarization of medical evidence. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...

work page doi:10.18653/v1/2024.acl-long.459 2024
[60]

Hayoung Jung, Prerna Juneja, and Tanushree Mitra. Algorithmic behaviors across regions: A geolocation audit of youtube search for covid-19 misinformation between the united states and south africa.Proceedings of the International AAAI Conference on Web and Social Media, 19 (1):935–964, Jun. 2025. doi: 10.1609/icwsm.v19i1.35854. URL https://ojs.aaai.org/ i...

work page doi:10.1609/icwsm.v19i1.35854 2025
[61]

MythTriage: Scalable detection of opioid use disorder myths on a video-sharing platform

Hayoung Jung, Shravika Mittal, Ananya Aatreya, Navreet Kaur, Munmun De Choudhury, and Tanu Mitra. MythTriage: Scalable detection of opioid use disorder myths on a video-sharing platform. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proce...

work page doi:10.18653/v1/2025.emnlp-main.146 2025
[62]

Evaluating large language models for health-related queries with presuppositions

Navreet Kaur, Monojit Choudhury, and Danish Pruthi. Evaluating large language models for health-related queries with presuppositions. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Findings of the Association for Computational Linguistics: ACL 2024, pp. 14308–14331, Bangkok, Thailand, August 2024. Association for Computational Linguis- tics. doi:...

work page doi:10.18653/v1/2024.findings-acl.850 2024
[63]

Who’s asking? simulating role-based questions for conversational ai evalua- tion, 2025

Navreet Kaur, Hoda Ayad, Hayoung Jung, Shravika Mittal, Munmun De Choudhury, and Tanushree Mitra. Who’s asking? simulating role-based questions for conversational ai evalua- tion, 2025. URLhttps://arxiv.org/abs/2510.16829

arXiv 2025
[64]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

2019
[65]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1):159–174, 1977. ISSN 0006341X, 15410420. URL http://www. jstor.org/stable/2529310

arXiv 1977
[66]

Qasa: advanced question answering on scientific articles

Yoonjoo Lee, Kyungjae Lee, Sunghyun Park, Dasol Hwang, Jaehyeon Kim, Hong-in Lee, and Moontae Lee. Qasa: advanced question answering on scientific articles. InProceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

2023
[67]

Reportbench: Evaluating deep research agents via academic survey tasks.arXiv preprint arXiv:2508.15804, 2025

Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, and Kai Jia. Reportbench: Evaluating deep research agents via academic survey tasks.arXiv preprint arXiv:2508.15804, 2025

arXiv 2025
[68]

Reportbench: Evaluating deep research agents via academic survey tasks, 2025

Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, and Kai Jia. Reportbench: Evaluating deep research agents via academic survey tasks, 2025. URL https://arxiv.org/abs/2508. 15804

2025
[69]

Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning.Advances in Neural Information Processing Systems, 37:28858–28888, 2024

Shuyue S Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S Ilgen, Emma Pierson, Pang W Koh, and Yulia Tsvetkov. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning.Advances in Neural Information Processing Systems, 37:28858–28888, 2024

2024
[70]

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability.CoRR, abs/2504.21776, 2025. doi: 10.48550/ARXIV .2504.21776. URL https: //doi.org/10.48550/arXiv.2504.21776

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025
[71]

Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research.arXiv preprint arXiv:2509.13312, 2025

Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, et al. Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research.arXiv preprint arXiv:2509.13312, 2025

arXiv 2025
[72]

Webexplorer: Explore and evolve for training long-horizon web agents.arXiv preprint arXiv:2509.06501, 2025

Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, et al. Webexplorer: Explore and evolve for training long-horizon web agents.arXiv preprint arXiv:2509.06501, 2025

arXiv 2025
[73]

Evaluating verifiability in generative search engines

Nelson Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 7001–7025, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.467. URL https:...

work page doi:10.18653/v1/2023.findings-emnlp.467 2023
[74]

VeriFact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts

Xin Liu, Lechen Zhang, Sheza Munir, Yiyang Gu, and Lu Wang. VeriFact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17908– 17925, ...

work page doi:10.18653/v1/2025.emnlp-main.905 2025
[75]

Iain Marshall, Joël Kuiper, Edward Banner, and Byron C. Wallace. Automating biomedical evidence synthesis: RobotReviewer. In Mohit Bansal and Heng Ji (eds.),Proceedings of ACL 2017, System Demonstrations, pp. 7–12, Vancouver, Canada, July 2017. Association for Computational Linguistics. URLhttps://aclanthology.org/P17-4002/

2017
[76]

Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023

Pith/arXiv arXiv 2023
[77]

The Cochrane Collaboration, 2025

Mike Clarke.Guide to the Contents of a Cochrane Methodology Protocol and Review. The Cochrane Collaboration, 2025. URL https://www.cochrane.org/sites/default/ files/uploads/PDFs/guide_to_the_contents_of_a_cochrane_methodology_ protocol_and_review.pdf. Accessed: 2026-02-19

2025
[78]

FA ct S core: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proces...

work page doi:10.18653/v1/2023.emnlp-main.741 2023
[79]

Evaluating style transfer for text

Remi Mir, Bjarke Felbo, Nick Obradovich, and Iyad Rahwan. Evaluating style transfer for text. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.),Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 495–504, Minneapolis, ...

work page doi:10.18653/v1/n19-1049 2019
[80]

Exploring chatgpt for toxicity detection in github

Shyamal Mishra and Preetha Chatterjee. Exploring chatgpt for toxicity detection in github. arXiv preprint arXiv:2312.13105, 2023

arXiv 2023
[81]

Online myths on opioid use disorder: A comparison of reddit and large language model

Shravika Mittal, Hayoung Jung, Mai ElSherief, Tanushree Mitra, and Munmun De Choudhury. Online myths on opioid use disorder: A comparison of reddit and large language model. Proceedings of the International AAAI Conference on Web and Social Media, 19(1):1224– 1245, Jun. 2025. doi: 10.1609/icwsm.v19i1.35870. URL https://ojs.aaai.org/index. php/ICWSM/articl...

work page doi:10.1609/icwsm.v19i1.35870 2025

Showing first 80 references.

[1] [1]

What is a biotech cleanroom? https://www.achengineering.com/ what-is-a-biotech-cleanroom/, n.d

ACH Engineering. What is a biotech cleanroom? https://www.achengineering.com/ what-is-a-biotech-cleanroom/, n.d. Accessed: 2026-04-01

2026

[2] [2]

LitSearch: A retrieval benchmark for scientific literature search

Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. LitSearch: A retrieval benchmark for scientific literature search. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15068–15083, Miami, Florida, USA, November 2024. A...

work page doi:10.18653/v1/2024.emnlp-main.840 2024

[3] [3]

QAMPARI: A benchmark for open-domain questions with many answers

Samuel Amouyal, Tomer Wolfson, Ohad Rubin, Ori Yoran, Jonathan Herzig, and Jonathan Berant. QAMPARI: A benchmark for open-domain questions with many answers. In Sebastian Gehrmann, Alex Wang, João Sedoc, Elizabeth Clark, Kaustubh Dhole, Khyathi Raghavi Chandu, Enrico Santus, and Hooman Sedghamiz (eds.),Proceedings of the Third Workshop on Natural Language...

2023

[4] [4]

Annenberg science and public health knowledge survey (asaph): Results

Annenberg Public Policy Center. Annenberg science and public health knowledge survey (asaph): Results. https://www.annenbergpublicpolicycenter.org/, 2024. Survey results and reports on public health attitudes and knowledge

2024

[5] [5]

Advancing claude in healthcare and the life sciences

Anthropic. Advancing claude in healthcare and the life sciences. https://www.anthropic. com/news/healthcare-life-sciences, January 11 2026. Accessed March 3, 2026

2026

[6] [6]

Eval awareness in claude opus 4.6’s browsecomp performance

Anthropic. Eval awareness in claude opus 4.6’s browsecomp performance. https://www. anthropic.com/engineering/eval-awareness-browsecomp, March 2026. Accessed: 2026-04-01

2026

[7] [7]

Claude research

Anthropic. Claude research. https://claude.com/blog/research, 2026. Accessed: 2026-05-05

2026

[8] [8]

Create a message — claude api reference

Anthropic. Create a message — claude api reference. https://platform.claude.com/ docs/en/api/messages/create, 2026. Accessed: 2026-04-14

2026

[9] [9]

Prompt engineering overview

Anthropic. Prompt engineering overview. https://platform.claude.com/docs/en/ build-with-claude/prompt-engineering/overview, 2026. Accessed: 2026-04-14

2026

[10] [10]

System prompts — claude api docs (release notes)

Anthropic. System prompts — claude api docs (release notes). https://platform.claude. com/docs/en/release-notes/system-prompts, 2026. Accessed: 2026-04-23. 10

2026

[11] [11]

Using llm (large language model) to improve efficiency in literature review for undergraduate research.Llm@ Aied, pp

Shouvik Ahmed Antu, Haiyan Chen, and Cindy K Richards. Using llm (large language model) to improve efficiency in literature review for undergraduate research.Llm@ Aied, pp. 8–16, 2023

2023

[12] [12]

Study suggests physician’s medical decisions benefit from chatbot.https:// med.stanford.edu/news/all-news/2025/02/physician-decision-chatbot.html , February 2025

Hanae Armitage. Study suggests physician’s medical decisions benefit from chatbot.https:// med.stanford.edu/news/all-news/2025/02/physician-decision-chatbot.html , February 2025. Stanford Medicine News

2025

[13] [13]

Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

Pith/arXiv arXiv 2025

[14] [14]

Self-rag: Learning to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2023

2023

[15] [15]

Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hanna Hajishirzi

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’Arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke S. Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hanna Hajishirzi. Open- scholar...

arXiv

[16] [16]

URLhttps://api.semanticscholar.org/CorpusID:274166189

[17] [17]

Cochrane-auto: An aligned dataset for the simplification of biomedical abstracts

Jan Bakker and Jaap Kamps. Cochrane-auto: An aligned dataset for the simplification of biomedical abstracts. In Matthew Shardlow, Horacio Saggion, Fernando Alva-Manchego, Marcos Zampieri, Kai North, Sanja Štajner, and Regina Stodden (eds.),Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pp. 41– 51, Miam...

work page doi:10.18653/v1/2024.tsar-1.5 2024

[18] [18]

The relationship between reasoning and performance in large language models – o3 (mini) thinks harder, not longer, 2025

Marthe Ballon, Andres Algaba, and Vincent Ginis. The relationship between reasoning and performance in large language models – o3 (mini) thinks harder, not longer, 2025. URL https://arxiv.org/abs/2502.15631

arXiv 2025

[19] [19]

NLTK: The natural language toolkit

Steven Bird and Edward Loper. NLTK: The natural language toolkit. InProceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 214–217, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/ P04-3031/

2004

[20] [20]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar- wal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- teusz Litwin, S...

1901

[21] [21]

Evaluating incontinence abstracts: artificial intelligence-generated versus cochrane review.Urogynecology, pp

Angelo Cadiente, Catherine Implicito, Abinav Udaiyar, Andre Ho, Christopher Wan, Jamie Chen, Charles Palmer, Qilin Cao, Michael Raver, Katerina Lembrikova, et al. Evaluating incontinence abstracts: artificial intelligence-generated versus cochrane review.Urogynecology, pp. 10–1097, 2024

2024

[22] [22]

The alternative annotator test for LLM-as-a- judge: How to statistically justify replacing human annotators with LLMs

Nitay Calderon, Roi Reichart, and Rotem Dror. The alternative annotator test for LLM-as-a- judge: How to statistically justify replacing human annotators with LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.),Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

2025

[23] [23]

Automation of sys- tematic reviews with large language models.medRxiv, pp

Christian Cao, Rohit Arora, Paul Cento, Adil Budak, Katherine Manta, Elina Farahani, Matthew Cecere, Anabel Selemon, Jason Sang, Ling Xi Gong, et al. Automation of sys- tematic reviews with large language models.medRxiv, pp. 2025–06, 2025

2025

[24] [24]

Large language models vs

Kevin Matthe Caramancion. Large language models vs. search engines: evaluating user preferences across varied information retrieval scenarios.arXiv preprint arXiv:2401.05761, 2024

arXiv 2024

[25] [25]

The facts leader- board: A comprehensive benchmark for large language model factuality.arXiv preprint arXiv:2512.10791, 2025

Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong, Chris Alberti, Connie Tao, Eyal Ben-David, Gaurav Singh Tomar, Lukas Haas, et al. The facts leader- board: A comprehensive benchmark for large language model factuality.arXiv preprint arXiv:2512.10791, 2025

arXiv 2025

[26] [26]

Public use of a generalist llm chatbot for health queries.Nature Health, pp

Beatriz Costa-Gomes, Pavel Tolmachev, Eloise Taysom, Viknesh Sounderajah, Hannah Richardson, Philipp Schoenegger, Xiaoxuan Liu, Matthew M Nour, Seth Spielman, Samuel F Way, et al. Public use of a generalist llm chatbot for health queries.Nature Health, pp. 1–8, 2026

2026

[27] [27]

Chapter iv: Updating a review

Miranda Cumpston and Ella Flemyng. Chapter iv: Updating a review. In Julian P. T. Higgins, James Thomas, Jacqueline Chandler, Miranda Cumpston, Tianjing Li, Matthew J. Page, and et al. (eds.),Cochrane Handbook for Systematic Reviews of Inter- ventions version 6.5. Cochrane, 2024. URL https://www.cochrane.org/authors/ handbooks-and-manuals/handbook/current...

2024

[28] [28]

Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 06

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 06

[29] [29]

Journal of Legal Analysis , author=

ISSN 2161-7201. doi: 10.1093/jla/laae003. URL https://doi.org/10.1093/jla/ laae003

work page doi:10.1093/jla/laae003

[30] [30]

they are uncultured

Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, and Tanu Mitra. “they are uncultured”: Unveiling covert harms and social threats in LLM generated conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20339–20369, Mia...

2024

[31] [31]

iagent- bench: Benchmarking sensemaking capabilities of information-seeking agents on high-traffic topics.arXiv preprint arXiv:2603.04656, 2026

Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, and Chirag Shah. iagent- bench: Benchmarking sensemaking capabilities of information-seeking agents on high-traffic topics.arXiv preprint arXiv:2603.04656, 2026

arXiv 2026

[32] [32]

Smith, and Matt Gardner

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.),Proceedings of the 2021 Con...

work page doi:10.18653/v1/2021.naacl-main.365 2021

[33] [33]

Delgado-Chaves, Matthew J

Fernando M. Delgado-Chaves, Matthew J. Jennings, Antonio Atalaia, Justus Wolff, Rita Horvath, Zeinab M. Mamdouh, Jan Baumbach, and Linda Baumbach. Transforming literature screening: The emerging role of large language models in systematic reviews.Proceedings of the National Academy of Sciences, 122(2):e2411962122, 2025. doi: 10.1073/pnas.2411962122. URLht...

work page doi:10.1073/pnas.2411962122 2025

[34] [34]

Declan Devane, Johanna Pope, Paula Byrne, Evan Forde, Steven Woloshin, Eileen Cul- loty, Darren Dahly, Ingeborg Hess Elgersma, Heather Munthe-Kaas, Conor Judge, Mar- tin O’Donnell, Finn Krewer, Sandra Galvin, Nikita Burke, Theresa Tierney, KM Saif- Ur-Rahman, Tom Conway, and James Thomas. Comparison of ai-assisted and human- generated plain language summa...

[35] [35]

doi: https://doi.org/10.1016/j.jclinepi.2025.111894

ISSN 0895-4356. doi: https://doi.org/10.1016/j.jclinepi.2025.111894. URL https: //www.sciencedirect.com/science/article/pii/S0895435625002276

work page doi:10.1016/j.jclinepi.2025.111894 2025

[36] [36]

Paragraph-level simpli- fication of medical texts

Ashwin Devaraj, Iain Marshall, Byron Wallace, and Junyi Jessy Li. Paragraph-level simpli- fication of medical texts. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.),Proceedings of the 2021 Conference of the North American Chapter of the Assoc...

work page doi:10.18653/v1/2021.naacl-main.395 2021

[37] [37]

Deep- research bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deep- research bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025

Pith/arXiv arXiv 2025

[38] [38]

ELI5: Long form question answering

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.),Proceedings of the 57th Annual Meeting of the Association for Computational Linguis- tics, pp. 3558–3567, Florence, Italy, July 2019. Association for Computational Linguistics....

work page doi:10.18653/v1/p19-1346 2019

[39] [39]

Investing in updating: how do conclusions change when cochrane systematic reviews are updated?BMC Medical Research Methodology, 5(1):33, 2005

Simon D French, Steve McDonald, Joanne E McKenzie, and Sally E Green. Investing in updating: how do conclusions change when cochrane systematic reviews are updated?BMC Medical Research Methodology, 5(1):33, 2005

2005

[40] [40]

CiteBench: A bench- mark for scientific citation text generation

Martin Funkquist, Ilia Kuznetsov, Yufang Hou, and Iryna Gurevych. CiteBench: A bench- mark for scientific citation text generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7337–7353, Singapore, December 2023. Association for Computational Lin- guistics....

work page doi:10.18653/v1/2023.emnlp-main.455 2023

[41] [41]

Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl.arXiv preprint arXiv:2508.07976, 2025

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl.arXiv preprint arXiv:2508.07976, 2025

arXiv 2025

[42] [42]

Enabling large language models to generate text with citations

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6465– 6488, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/ v...

2023

[43] [43]

Gpt-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial

Ethan Goh, Robert J Gallo, Eric Strong, Yingjie Weng, Hannah Kerman, Jason A Freed, Joséphine A Cool, Zahir Kanjee, Kathleen P Lane, Andrew S Parsons, et al. Gpt-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nature Medicine, 31(4):1233–1238, 2025

2025

[44] [44]

Gemini deep research

Google. Gemini deep research. https://gemini.google/overview/deep-research/,

[45] [45]

Accessed: 2026-05-05

2026

[46] [46]

Gemini 3 flash

Google Cloud. Gemini 3 flash. https://docs.cloud.google.com/vertex-ai/ generative-ai/docs/models/gemini/3-flash, 2025. Accessed: 2026-04-14

2025

[47] [47]

Gemini 3 pro — generative ai on vertex ai

Google Cloud. Gemini 3 pro — generative ai on vertex ai. https://docs.cloud.google. com/vertex-ai/generative-ai/docs/models/gemini/3-pro, 2026. Accessed: 2026- 04-23

2026

[48] [48]

What is prompt engineering? https://cloud.google.com/discover/ what-is-prompt-engineering, 2026

Google Cloud. What is prompt engineering? https://cloud.google.com/discover/ what-is-prompt-engineering, 2026. Accessed: 2026-04-14

2026

[49] [49]

All that glitters is not novel: Plagiarism in ai generated research

Tarun Gupta and Danish Pruthi. All that glitters is not novel: Plagiarism in ai generated research. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 25721–25738, 2025. 13

2025

[50] [51]

The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019

Pith/arXiv arXiv 1904

[51] [52]

Model context protocol (mcp): Landscape, security threats, and future research directions.ACM Transactions on Software Engineering and Methodology, 2025

Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (mcp): Landscape, security threats, and future research directions.ACM Transactions on Software Engineering and Methodology, 2025

2025

[52] [53]

Large language model–assisted risk-of-bias assessment in randomized controlled trials using the revised risk-of-bias tool: evaluation study

Jiajie Huang, Honghao Lai, Weilong Zhao, Danni Xia, Chunyang Bai, Mingyao Sun, Jianing Liu, Jiayi Liu, Bei Pan, Jinhui Tian, et al. Large language model–assisted risk-of-bias assessment in randomized controlled trials using the revised risk-of-bias tool: evaluation study. Journal of Medical Internet Research, 27:e70450, 2025

2025

[53] [54]

Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D’Arcy, Daniel S

Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D’Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, and Sergey Feldman. Deep research, shallow evaluation: A case study in meta-evaluation for long-form qa benchmarks, 2026. URLhttps://arxiv.org/abs/2603.06942

arXiv 2026

[54] [55]

Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025

Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025

arXiv 2025

[55] [56]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

2021

[56] [57]

PubMedQA: A dataset for biomedical research question answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.),Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Process...

2019

[57] [58]

PubMedQA : A dataset for biomedical research question answering

Association for Computational Linguistics. doi: 10.18653/v1/D19-1259. URL https: //aclanthology.org/D19-1259/

work page doi:10.18653/v1/d19-1259

[58] [59]

FactPICO: Factuality evaluation for plain language summarization of medical evidence

Sebastian Joseph, Lily Chen, Jan Trienes, Hannah Göke, Monika Coers, Wei Xu, Byron Wallace, and Junyi Jessy Li. FactPICO: Factuality evaluation for plain language summarization of medical evidence. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...

work page doi:10.18653/v1/2024.acl-long.459 2024

[59] [60]

Hayoung Jung, Prerna Juneja, and Tanushree Mitra. Algorithmic behaviors across regions: A geolocation audit of youtube search for covid-19 misinformation between the united states and south africa.Proceedings of the International AAAI Conference on Web and Social Media, 19 (1):935–964, Jun. 2025. doi: 10.1609/icwsm.v19i1.35854. URL https://ojs.aaai.org/ i...

work page doi:10.1609/icwsm.v19i1.35854 2025

[60] [61]

MythTriage: Scalable detection of opioid use disorder myths on a video-sharing platform

Hayoung Jung, Shravika Mittal, Ananya Aatreya, Navreet Kaur, Munmun De Choudhury, and Tanu Mitra. MythTriage: Scalable detection of opioid use disorder myths on a video-sharing platform. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proce...

work page doi:10.18653/v1/2025.emnlp-main.146 2025

[61] [62]

Evaluating large language models for health-related queries with presuppositions

Navreet Kaur, Monojit Choudhury, and Danish Pruthi. Evaluating large language models for health-related queries with presuppositions. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Findings of the Association for Computational Linguistics: ACL 2024, pp. 14308–14331, Bangkok, Thailand, August 2024. Association for Computational Linguis- tics. doi:...

work page doi:10.18653/v1/2024.findings-acl.850 2024

[62] [63]

Who’s asking? simulating role-based questions for conversational ai evalua- tion, 2025

Navreet Kaur, Hoda Ayad, Hayoung Jung, Shravika Mittal, Munmun De Choudhury, and Tanushree Mitra. Who’s asking? simulating role-based questions for conversational ai evalua- tion, 2025. URLhttps://arxiv.org/abs/2510.16829

arXiv 2025

[63] [64]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

2019

[64] [65]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1):159–174, 1977. ISSN 0006341X, 15410420. URL http://www. jstor.org/stable/2529310

arXiv 1977

[65] [66]

Qasa: advanced question answering on scientific articles

Yoonjoo Lee, Kyungjae Lee, Sunghyun Park, Dasol Hwang, Jaehyeon Kim, Hong-in Lee, and Moontae Lee. Qasa: advanced question answering on scientific articles. InProceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

2023

[66] [67]

Reportbench: Evaluating deep research agents via academic survey tasks.arXiv preprint arXiv:2508.15804, 2025

Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, and Kai Jia. Reportbench: Evaluating deep research agents via academic survey tasks.arXiv preprint arXiv:2508.15804, 2025

arXiv 2025

[67] [68]

Reportbench: Evaluating deep research agents via academic survey tasks, 2025

Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, and Kai Jia. Reportbench: Evaluating deep research agents via academic survey tasks, 2025. URL https://arxiv.org/abs/2508. 15804

2025

[68] [69]

Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning.Advances in Neural Information Processing Systems, 37:28858–28888, 2024

Shuyue S Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S Ilgen, Emma Pierson, Pang W Koh, and Yulia Tsvetkov. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning.Advances in Neural Information Processing Systems, 37:28858–28888, 2024

2024

[69] [70]

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability.CoRR, abs/2504.21776, 2025. doi: 10.48550/ARXIV .2504.21776. URL https: //doi.org/10.48550/arXiv.2504.21776

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025

[70] [71]

Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research.arXiv preprint arXiv:2509.13312, 2025

Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, et al. Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research.arXiv preprint arXiv:2509.13312, 2025

arXiv 2025

[71] [72]

Webexplorer: Explore and evolve for training long-horizon web agents.arXiv preprint arXiv:2509.06501, 2025

Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, et al. Webexplorer: Explore and evolve for training long-horizon web agents.arXiv preprint arXiv:2509.06501, 2025

arXiv 2025

[72] [73]

Evaluating verifiability in generative search engines

Nelson Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 7001–7025, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.467. URL https:...

work page doi:10.18653/v1/2023.findings-emnlp.467 2023

[73] [74]

VeriFact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts

Xin Liu, Lechen Zhang, Sheza Munir, Yiyang Gu, and Lu Wang. VeriFact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17908– 17925, ...

work page doi:10.18653/v1/2025.emnlp-main.905 2025

[74] [75]

Iain Marshall, Joël Kuiper, Edward Banner, and Byron C. Wallace. Automating biomedical evidence synthesis: RobotReviewer. In Mohit Bansal and Heng Ji (eds.),Proceedings of ACL 2017, System Demonstrations, pp. 7–12, Vancouver, Canada, July 2017. Association for Computational Linguistics. URLhttps://aclanthology.org/P17-4002/

2017

[75] [76]

Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023

Pith/arXiv arXiv 2023

[76] [77]

The Cochrane Collaboration, 2025

Mike Clarke.Guide to the Contents of a Cochrane Methodology Protocol and Review. The Cochrane Collaboration, 2025. URL https://www.cochrane.org/sites/default/ files/uploads/PDFs/guide_to_the_contents_of_a_cochrane_methodology_ protocol_and_review.pdf. Accessed: 2026-02-19

2025

[77] [78]

FA ct S core: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proces...

work page doi:10.18653/v1/2023.emnlp-main.741 2023

[78] [79]

Evaluating style transfer for text

Remi Mir, Bjarke Felbo, Nick Obradovich, and Iyad Rahwan. Evaluating style transfer for text. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.),Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 495–504, Minneapolis, ...

work page doi:10.18653/v1/n19-1049 2019

[79] [80]

Exploring chatgpt for toxicity detection in github

Shyamal Mishra and Preetha Chatterjee. Exploring chatgpt for toxicity detection in github. arXiv preprint arXiv:2312.13105, 2023

arXiv 2023

[80] [81]

Online myths on opioid use disorder: A comparison of reddit and large language model

Shravika Mittal, Hayoung Jung, Mai ElSherief, Tanushree Mitra, and Munmun De Choudhury. Online myths on opioid use disorder: A comparison of reddit and large language model. Proceedings of the International AAAI Conference on Web and Social Media, 19(1):1224– 1245, Jun. 2025. doi: 10.1609/icwsm.v19i1.35870. URL https://ojs.aaai.org/index. php/ICWSM/articl...

work page doi:10.1609/icwsm.v19i1.35870 2025