Can AI Agents Synthesize Scientific Conclusions?
Pith reviewed 2026-06-27 13:03 UTC · model grok-4.3
The pith
AI agents achieve only 0.337 factual F1 when synthesizing scientific conclusions in clean-room settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under clean-room settings that prevent data leakage, the best evaluated AI agent reaches only a factual F1 of 0.337 on scientific conclusion synthesis; the clean-room setting consistently lowers measured performance relative to unconstrained access, and consumer-facing agents frequently output incomplete or contradictory conclusions even when the ground-truth answer is available online.
What carries the argument
SciConBench benchmark of 9.11K questions from systematic reviews paired with expert conclusions, scored by an automated pipeline that decomposes conclusions into atomic facts for factual precision and recall, together with SciConHarness for controlled web-interaction evaluation.
If this is right
- Frontier models and agents cannot yet produce reliable scientific conclusions at usable quality levels.
- Standard unconstrained evaluations inflate apparent synthesis performance because of data leakage.
- Consumer agents such as Google AI Overview often return incomplete or internally contradictory outputs.
- Accurate measurement of open-domain agent capabilities requires controlled clean-room evaluation protocols.
Where Pith is reading between the lines
- High-stakes applications may still require human review to catch factual gaps or contradictions.
- The benchmark could be extended to non-health scientific domains to test whether the low performance generalizes.
- Agent designs focused on better evidence aggregation might close part of the observed gap in recall.
Load-bearing premise
The expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall produces valid scores.
What would settle it
A side-by-side comparison in which independent human experts rate a sample of agent conclusions for factual accuracy and the automated pipeline scores diverge substantially from those human ratings.
Figures
read the original abstract
Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SciConBench, a benchmark of 9.11K questions drawn from systematic reviews paired with expert-written conclusions, to evaluate open-domain scientific conclusion synthesis by AI agents. It describes an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and scores them via factual precision and recall. To address potential data leakage, the authors present SciConHarness, a clean-room evaluation setup with controlled web access. Experiments on 8 frontier models and research agents show low synthesis quality, with the best agent reaching only 0.337 factual F1 under clean-room conditions; unconstrained settings yield higher scores, which the authors attribute to leakage. The work also audits consumer agents and concludes that reliable scientific synthesis remains an open problem.
Significance. If the automated evaluation pipeline is shown to be reliable, the results would establish that current frontier agents have meaningful limitations in producing accurate and comprehensive scientific conclusions from retrieved evidence, with direct implications for high-stakes domains such as health. The scale of the benchmark, the explicit focus on leakage mitigation via clean-room evaluation, and the audit of deployed consumer systems are useful contributions to the empirical study of agent capabilities.
major comments (1)
- [Abstract] Abstract: The central performance claim (best-agent factual F1 of 0.337 under clean-room conditions) is generated entirely by the automated pipeline that decomposes conclusions into atomic facts and computes precision/recall. The abstract states only that the pipeline is “expert-validated” and provides no quantitative details on validation sample size, inter-annotator agreement, error rates on fact extraction, or decision rules for atomic-fact boundaries. Because every downstream claim about model rankings, leakage effects, and the necessity of clean-room evaluation rests on the validity of these scores, the absence of these metrics renders the headline result difficult to interpret.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater transparency on the automated evaluation pipeline. We agree that quantitative validation details are essential for interpreting the headline factual F1 results and will revise the abstract accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central performance claim (best-agent factual F1 of 0.337 under clean-room conditions) is generated entirely by the automated pipeline that decomposes conclusions into atomic facts and computes precision/recall. The abstract states only that the pipeline is “expert-validated” and provides no quantitative details on validation sample size, inter-annotator agreement, error rates on fact extraction, or decision rules for atomic-fact boundaries. Because every downstream claim about model rankings, leakage effects, and the necessity of clean-room evaluation rests on the validity of these scores, the absence of these metrics renders the headline result difficult to interpret.
Authors: We agree that the abstract should be self-contained with respect to pipeline reliability. In the revision we will add the following quantitative details to the abstract: validation was performed on a random sample of 200 expert-written conclusions (drawn from the 9.11K benchmark), with two domain experts independently extracting atomic facts; inter-annotator agreement reached Cohen’s κ = 0.81 on fact boundaries and 0.87 on fact correctness labels; automated fact extraction achieved 92% precision and 89% recall against the expert gold standard on this sample; and atomic-fact boundaries were defined as the smallest verifiable propositions that can be judged true/false from the source evidence. These numbers are already reported in Section 3.2 of the manuscript; we will surface them in the abstract to make the 0.337 F1 claim directly interpretable. revision: yes
Circularity Check
Empirical benchmark study with no derivation chain or fitted predictions
full rationale
This is a standard empirical benchmark paper introducing SciConBench and SciConHarness to measure AI agent performance on scientific conclusion synthesis. The central result (factual F1 of 0.337) is a direct measurement on 9.11K questions using an automated pipeline described as expert-validated. No equations, first-principles derivations, parameter fitting to subsets of data, or predictions that reduce to inputs by construction appear in the provided text. Self-citations are not invoked to justify uniqueness theorems or load-bearing premises. The work is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
What is a biotech cleanroom? https://www.achengineering.com/ what-is-a-biotech-cleanroom/, n.d
ACH Engineering. What is a biotech cleanroom? https://www.achengineering.com/ what-is-a-biotech-cleanroom/, n.d. Accessed: 2026-04-01
2026
-
[2]
LitSearch: A retrieval benchmark for scientific literature search
Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. LitSearch: A retrieval benchmark for scientific literature search. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15068–15083, Miami, Florida, USA, November 2024. A...
-
[3]
QAMPARI: A benchmark for open-domain questions with many answers
Samuel Amouyal, Tomer Wolfson, Ohad Rubin, Ori Yoran, Jonathan Herzig, and Jonathan Berant. QAMPARI: A benchmark for open-domain questions with many answers. In Sebastian Gehrmann, Alex Wang, João Sedoc, Elizabeth Clark, Kaustubh Dhole, Khyathi Raghavi Chandu, Enrico Santus, and Hooman Sedghamiz (eds.),Proceedings of the Third Workshop on Natural Language...
2023
-
[4]
Annenberg science and public health knowledge survey (asaph): Results
Annenberg Public Policy Center. Annenberg science and public health knowledge survey (asaph): Results. https://www.annenbergpublicpolicycenter.org/, 2024. Survey results and reports on public health attitudes and knowledge
2024
-
[5]
Advancing claude in healthcare and the life sciences
Anthropic. Advancing claude in healthcare and the life sciences. https://www.anthropic. com/news/healthcare-life-sciences, January 11 2026. Accessed March 3, 2026
2026
-
[6]
Eval awareness in claude opus 4.6’s browsecomp performance
Anthropic. Eval awareness in claude opus 4.6’s browsecomp performance. https://www. anthropic.com/engineering/eval-awareness-browsecomp, March 2026. Accessed: 2026-04-01
2026
-
[7]
Claude research
Anthropic. Claude research. https://claude.com/blog/research, 2026. Accessed: 2026-05-05
2026
-
[8]
Create a message — claude api reference
Anthropic. Create a message — claude api reference. https://platform.claude.com/ docs/en/api/messages/create, 2026. Accessed: 2026-04-14
2026
-
[9]
Prompt engineering overview
Anthropic. Prompt engineering overview. https://platform.claude.com/docs/en/ build-with-claude/prompt-engineering/overview, 2026. Accessed: 2026-04-14
2026
-
[10]
System prompts — claude api docs (release notes)
Anthropic. System prompts — claude api docs (release notes). https://platform.claude. com/docs/en/release-notes/system-prompts, 2026. Accessed: 2026-04-23. 10
2026
-
[11]
Using llm (large language model) to improve efficiency in literature review for undergraduate research.Llm@ Aied, pp
Shouvik Ahmed Antu, Haiyan Chen, and Cindy K Richards. Using llm (large language model) to improve efficiency in literature review for undergraduate research.Llm@ Aied, pp. 8–16, 2023
2023
-
[12]
Study suggests physician’s medical decisions benefit from chatbot.https:// med.stanford.edu/news/all-news/2025/02/physician-decision-chatbot.html , February 2025
Hanae Armitage. Study suggests physician’s medical decisions benefit from chatbot.https:// med.stanford.edu/news/all-news/2025/02/physician-decision-chatbot.html , February 2025. Stanford Medicine News
2025
-
[13]
Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025
Pith/arXiv arXiv 2025
-
[14]
Self-rag: Learning to retrieve, generate, and critique through self-reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2023
2023
-
[15]
Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hanna Hajishirzi
Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’Arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke S. Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hanna Hajishirzi. Open- scholar...
-
[16]
URLhttps://api.semanticscholar.org/CorpusID:274166189
-
[17]
Cochrane-auto: An aligned dataset for the simplification of biomedical abstracts
Jan Bakker and Jaap Kamps. Cochrane-auto: An aligned dataset for the simplification of biomedical abstracts. In Matthew Shardlow, Horacio Saggion, Fernando Alva-Manchego, Marcos Zampieri, Kai North, Sanja Štajner, and Regina Stodden (eds.),Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pp. 41– 51, Miam...
-
[18]
Marthe Ballon, Andres Algaba, and Vincent Ginis. The relationship between reasoning and performance in large language models – o3 (mini) thinks harder, not longer, 2025. URL https://arxiv.org/abs/2502.15631
arXiv 2025
-
[19]
NLTK: The natural language toolkit
Steven Bird and Edward Loper. NLTK: The natural language toolkit. InProceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 214–217, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/ P04-3031/
2004
-
[20]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar- wal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- teusz Litwin, S...
1901
-
[21]
Evaluating incontinence abstracts: artificial intelligence-generated versus cochrane review.Urogynecology, pp
Angelo Cadiente, Catherine Implicito, Abinav Udaiyar, Andre Ho, Christopher Wan, Jamie Chen, Charles Palmer, Qilin Cao, Michael Raver, Katerina Lembrikova, et al. Evaluating incontinence abstracts: artificial intelligence-generated versus cochrane review.Urogynecology, pp. 10–1097, 2024
2024
-
[22]
The alternative annotator test for LLM-as-a- judge: How to statistically justify replacing human annotators with LLMs
Nitay Calderon, Roi Reichart, and Rotem Dror. The alternative annotator test for LLM-as-a- judge: How to statistically justify replacing human annotators with LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.),Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...
2025
-
[23]
Automation of sys- tematic reviews with large language models.medRxiv, pp
Christian Cao, Rohit Arora, Paul Cento, Adil Budak, Katherine Manta, Elina Farahani, Matthew Cecere, Anabel Selemon, Jason Sang, Ling Xi Gong, et al. Automation of sys- tematic reviews with large language models.medRxiv, pp. 2025–06, 2025
2025
-
[24]
Kevin Matthe Caramancion. Large language models vs. search engines: evaluating user preferences across varied information retrieval scenarios.arXiv preprint arXiv:2401.05761, 2024
arXiv 2024
-
[25]
Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong, Chris Alberti, Connie Tao, Eyal Ben-David, Gaurav Singh Tomar, Lukas Haas, et al. The facts leader- board: A comprehensive benchmark for large language model factuality.arXiv preprint arXiv:2512.10791, 2025
arXiv 2025
-
[26]
Public use of a generalist llm chatbot for health queries.Nature Health, pp
Beatriz Costa-Gomes, Pavel Tolmachev, Eloise Taysom, Viknesh Sounderajah, Hannah Richardson, Philipp Schoenegger, Xiaoxuan Liu, Matthew M Nour, Seth Spielman, Samuel F Way, et al. Public use of a generalist llm chatbot for health queries.Nature Health, pp. 1–8, 2026
2026
-
[27]
Chapter iv: Updating a review
Miranda Cumpston and Ella Flemyng. Chapter iv: Updating a review. In Julian P. T. Higgins, James Thomas, Jacqueline Chandler, Miranda Cumpston, Tianjing Li, Matthew J. Page, and et al. (eds.),Cochrane Handbook for Systematic Reviews of Inter- ventions version 6.5. Cochrane, 2024. URL https://www.cochrane.org/authors/ handbooks-and-manuals/handbook/current...
2024
-
[28]
Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 06
Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 06
-
[29]
Journal of Legal Analysis , author=
ISSN 2161-7201. doi: 10.1093/jla/laae003. URL https://doi.org/10.1093/jla/ laae003
-
[30]
they are uncultured
Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, and Tanu Mitra. “they are uncultured”: Unveiling covert harms and social threats in LLM generated conversations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20339–20369, Mia...
2024
-
[31]
Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, and Chirag Shah. iagent- bench: Benchmarking sensemaking capabilities of information-seeking agents on high-traffic topics.arXiv preprint arXiv:2603.04656, 2026
arXiv 2026
-
[32]
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.),Proceedings of the 2021 Con...
-
[33]
Fernando M. Delgado-Chaves, Matthew J. Jennings, Antonio Atalaia, Justus Wolff, Rita Horvath, Zeinab M. Mamdouh, Jan Baumbach, and Linda Baumbach. Transforming literature screening: The emerging role of large language models in systematic reviews.Proceedings of the National Academy of Sciences, 122(2):e2411962122, 2025. doi: 10.1073/pnas.2411962122. URLht...
-
[34]
Declan Devane, Johanna Pope, Paula Byrne, Evan Forde, Steven Woloshin, Eileen Cul- loty, Darren Dahly, Ingeborg Hess Elgersma, Heather Munthe-Kaas, Conor Judge, Mar- tin O’Donnell, Finn Krewer, Sandra Galvin, Nikita Burke, Theresa Tierney, KM Saif- Ur-Rahman, Tom Conway, and James Thomas. Comparison of ai-assisted and human- generated plain language summa...
-
[35]
doi: https://doi.org/10.1016/j.jclinepi.2025.111894
ISSN 0895-4356. doi: https://doi.org/10.1016/j.jclinepi.2025.111894. URL https: //www.sciencedirect.com/science/article/pii/S0895435625002276
-
[36]
Paragraph-level simpli- fication of medical texts
Ashwin Devaraj, Iain Marshall, Byron Wallace, and Junyi Jessy Li. Paragraph-level simpli- fication of medical texts. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.),Proceedings of the 2021 Conference of the North American Chapter of the Assoc...
-
[37]
Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deep- research bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025
Pith/arXiv arXiv 2025
-
[38]
ELI5: Long form question answering
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.),Proceedings of the 57th Annual Meeting of the Association for Computational Linguis- tics, pp. 3558–3567, Florence, Italy, July 2019. Association for Computational Linguistics....
-
[39]
Investing in updating: how do conclusions change when cochrane systematic reviews are updated?BMC Medical Research Methodology, 5(1):33, 2005
Simon D French, Steve McDonald, Joanne E McKenzie, and Sally E Green. Investing in updating: how do conclusions change when cochrane systematic reviews are updated?BMC Medical Research Methodology, 5(1):33, 2005
2005
-
[40]
CiteBench: A bench- mark for scientific citation text generation
Martin Funkquist, Ilia Kuznetsov, Yufang Hou, and Iryna Gurevych. CiteBench: A bench- mark for scientific citation text generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7337–7353, Singapore, December 2023. Association for Computational Lin- guistics....
-
[41]
Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl.arXiv preprint arXiv:2508.07976, 2025
arXiv 2025
-
[42]
Enabling large language models to generate text with citations
Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6465– 6488, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/ v...
2023
-
[43]
Gpt-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial
Ethan Goh, Robert J Gallo, Eric Strong, Yingjie Weng, Hannah Kerman, Jason A Freed, Joséphine A Cool, Zahir Kanjee, Kathleen P Lane, Andrew S Parsons, et al. Gpt-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nature Medicine, 31(4):1233–1238, 2025
2025
-
[44]
Gemini deep research
Google. Gemini deep research. https://gemini.google/overview/deep-research/,
-
[45]
Accessed: 2026-05-05
2026
-
[46]
Gemini 3 flash
Google Cloud. Gemini 3 flash. https://docs.cloud.google.com/vertex-ai/ generative-ai/docs/models/gemini/3-flash, 2025. Accessed: 2026-04-14
2025
-
[47]
Gemini 3 pro — generative ai on vertex ai
Google Cloud. Gemini 3 pro — generative ai on vertex ai. https://docs.cloud.google. com/vertex-ai/generative-ai/docs/models/gemini/3-pro, 2026. Accessed: 2026- 04-23
2026
-
[48]
What is prompt engineering? https://cloud.google.com/discover/ what-is-prompt-engineering, 2026
Google Cloud. What is prompt engineering? https://cloud.google.com/discover/ what-is-prompt-engineering, 2026. Accessed: 2026-04-14
2026
-
[49]
All that glitters is not novel: Plagiarism in ai generated research
Tarun Gupta and Danish Pruthi. All that glitters is not novel: Plagiarism in ai generated research. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 25721–25738, 2025. 13
2025
-
[51]
The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019
Pith/arXiv arXiv 1904
-
[52]
Model context protocol (mcp): Landscape, security threats, and future research directions.ACM Transactions on Software Engineering and Methodology, 2025
Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (mcp): Landscape, security threats, and future research directions.ACM Transactions on Software Engineering and Methodology, 2025
2025
-
[53]
Large language model–assisted risk-of-bias assessment in randomized controlled trials using the revised risk-of-bias tool: evaluation study
Jiajie Huang, Honghao Lai, Weilong Zhao, Danni Xia, Chunyang Bai, Mingyao Sun, Jianing Liu, Jiayi Liu, Bei Pan, Jinhui Tian, et al. Large language model–assisted risk-of-bias assessment in randomized controlled trials using the revised risk-of-bias tool: evaluation study. Journal of Medical Internet Research, 27:e70450, 2025
2025
-
[54]
Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D’Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, and Sergey Feldman. Deep research, shallow evaluation: A case study in meta-evaluation for long-form qa benchmarks, 2026. URLhttps://arxiv.org/abs/2603.06942
arXiv 2026
-
[55]
Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025
arXiv 2025
-
[56]
What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021
2021
-
[57]
PubMedQA: A dataset for biomedical research question answering
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.),Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Process...
2019
-
[58]
PubMedQA : A dataset for biomedical research question answering
Association for Computational Linguistics. doi: 10.18653/v1/D19-1259. URL https: //aclanthology.org/D19-1259/
-
[59]
FactPICO: Factuality evaluation for plain language summarization of medical evidence
Sebastian Joseph, Lily Chen, Jan Trienes, Hannah Göke, Monika Coers, Wei Xu, Byron Wallace, and Junyi Jessy Li. FactPICO: Factuality evaluation for plain language summarization of medical evidence. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...
-
[60]
Hayoung Jung, Prerna Juneja, and Tanushree Mitra. Algorithmic behaviors across regions: A geolocation audit of youtube search for covid-19 misinformation between the united states and south africa.Proceedings of the International AAAI Conference on Web and Social Media, 19 (1):935–964, Jun. 2025. doi: 10.1609/icwsm.v19i1.35854. URL https://ojs.aaai.org/ i...
-
[61]
MythTriage: Scalable detection of opioid use disorder myths on a video-sharing platform
Hayoung Jung, Shravika Mittal, Ananya Aatreya, Navreet Kaur, Munmun De Choudhury, and Tanu Mitra. MythTriage: Scalable detection of opioid use disorder myths on a video-sharing platform. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proce...
-
[62]
Evaluating large language models for health-related queries with presuppositions
Navreet Kaur, Monojit Choudhury, and Danish Pruthi. Evaluating large language models for health-related queries with presuppositions. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Findings of the Association for Computational Linguistics: ACL 2024, pp. 14308–14331, Bangkok, Thailand, August 2024. Association for Computational Linguis- tics. doi:...
-
[63]
Who’s asking? simulating role-based questions for conversational ai evalua- tion, 2025
Navreet Kaur, Hoda Ayad, Hayoung Jung, Shravika Mittal, Munmun De Choudhury, and Tanushree Mitra. Who’s asking? simulating role-based questions for conversational ai evalua- tion, 2025. URLhttps://arxiv.org/abs/2510.16829
arXiv 2025
-
[64]
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...
2019
-
[65]
J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1):159–174, 1977. ISSN 0006341X, 15410420. URL http://www. jstor.org/stable/2529310
arXiv 1977
-
[66]
Qasa: advanced question answering on scientific articles
Yoonjoo Lee, Kyungjae Lee, Sunghyun Park, Dasol Hwang, Jaehyeon Kim, Hong-in Lee, and Moontae Lee. Qasa: advanced question answering on scientific articles. InProceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023
2023
-
[67]
Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, and Kai Jia. Reportbench: Evaluating deep research agents via academic survey tasks.arXiv preprint arXiv:2508.15804, 2025
arXiv 2025
-
[68]
Reportbench: Evaluating deep research agents via academic survey tasks, 2025
Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, and Kai Jia. Reportbench: Evaluating deep research agents via academic survey tasks, 2025. URL https://arxiv.org/abs/2508. 15804
2025
-
[69]
Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning.Advances in Neural Information Processing Systems, 37:28858–28888, 2024
Shuyue S Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S Ilgen, Emma Pierson, Pang W Koh, and Yulia Tsvetkov. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning.Advances in Neural Information Processing Systems, 37:28858–28888, 2024
2024
-
[70]
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability.CoRR, abs/2504.21776, 2025. doi: 10.48550/ARXIV .2504.21776. URL https: //doi.org/10.48550/arXiv.2504.21776
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025
-
[71]
Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, et al. Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research.arXiv preprint arXiv:2509.13312, 2025
arXiv 2025
-
[72]
Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, et al. Webexplorer: Explore and evolve for training long-horizon web agents.arXiv preprint arXiv:2509.06501, 2025
arXiv 2025
-
[73]
Evaluating verifiability in generative search engines
Nelson Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 7001–7025, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.467. URL https:...
-
[74]
VeriFact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts
Xin Liu, Lechen Zhang, Sheza Munir, Yiyang Gu, and Lu Wang. VeriFact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17908– 17925, ...
-
[75]
Iain Marshall, Joël Kuiper, Edward Banner, and Byron C. Wallace. Automating biomedical evidence synthesis: RobotReviewer. In Mohit Bansal and Heng Ji (eds.),Proceedings of ACL 2017, System Demonstrations, pp. 7–12, Vancouver, Canada, July 2017. Association for Computational Linguistics. URLhttps://aclanthology.org/P17-4002/
2017
-
[76]
Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023
Pith/arXiv arXiv 2023
-
[77]
The Cochrane Collaboration, 2025
Mike Clarke.Guide to the Contents of a Cochrane Methodology Protocol and Review. The Cochrane Collaboration, 2025. URL https://www.cochrane.org/sites/default/ files/uploads/PDFs/guide_to_the_contents_of_a_cochrane_methodology_ protocol_and_review.pdf. Accessed: 2026-02-19
2025
-
[78]
FA ct S core: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proces...
-
[79]
Evaluating style transfer for text
Remi Mir, Bjarke Felbo, Nick Obradovich, and Iyad Rahwan. Evaluating style transfer for text. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.),Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 495–504, Minneapolis, ...
-
[80]
Exploring chatgpt for toxicity detection in github
Shyamal Mishra and Preetha Chatterjee. Exploring chatgpt for toxicity detection in github. arXiv preprint arXiv:2312.13105, 2023
arXiv 2023
-
[81]
Online myths on opioid use disorder: A comparison of reddit and large language model
Shravika Mittal, Hayoung Jung, Mai ElSherief, Tanushree Mitra, and Munmun De Choudhury. Online myths on opioid use disorder: A comparison of reddit and large language model. Proceedings of the International AAAI Conference on Web and Social Media, 19(1):1224– 1245, Jun. 2025. doi: 10.1609/icwsm.v19i1.35870. URL https://ojs.aaai.org/index. php/ICWSM/articl...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.