WisPaper: Your AI Scholar Search Engine

arXiv: 2512.06879 · v3 · submitted 2025-12-07 · cs.IR · cs.AI

Li Ju, Jun Zhao, Mingxu Chai, Ziyu Shen, Xiangyang Wang, Yage Geng, Chunchun Ma, Hao Peng, Guangbin Li, Tao Li, Chengyong Liao, Fu Wang, Xiaolong Wang, Junshen Chen, Rui Gong, Shijia Liang, Feiyan Li, Ming Zhang, Kexin Tan, Junjie Ye, Zhiheng Xi, Shihan Dou, Tao Gui, Yuankai Ying, Yang Shi, Yue Zhang, Qi Zhang

Pith reviewed 2026-05-17 00:42 UTC · model grok-4.3

keywords: AI academic search, agentic validation, scholar search engine, literature discovery, retrieval hallucinations, TaxoBench, research workflow

The pith

WisPaper adds agentic validation to academic search so results match complex questions rather than just keywords.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WisPaper as an end-to-end system that pairs keyword retrieval with an agentic model, WisModel, which uses structured reasoning to validate whether each candidate paper actually addresses the user's research query. This targets the gap where standard search engines return papers that share the query's words but miss its intent. Discovered papers move directly into a Library module that builds a user profile; the profile powers AI Feeds, which surface new publications and guide further searches. The paper reports 22.26 percent recall on TaxoBench, above the O3 baseline's 20.92 percent, and 93.70 percent accuracy for the validation step, which is meant to cut down on irrelevant results.

Core claim

WisPaper is built from three linked modules: Scholar Search, which runs rapid keyword retrieval and then Deep Search, where WisModel applies structured reasoning to validate candidates; a Library that organizes saved papers and builds a user profile; and AI Feeds, which monitor new publications and feed recommendations back into exploration. The paper claims this integration addresses both semantic mismatch and the need to stitch separate tools together. It reports 22.26 percent recall versus 20.92 percent for O3, plus 93.70 percent validation accuracy, which it credits with reducing retrieval hallucinations.

What carries the argument

WisModel, the agentic model that performs structured reasoning to decide whether a candidate paper truly addresses the user's complex research question.
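
The two-stage flow described in the core claim (rapid keyword retrieval, then validation of each candidate) can be sketched as follows. This is a minimal illustration and not WisModel itself: the `Paper` type, the functions `keyword_retrieve`, `validate`, and `deep_search`, and the rule that every criterion term must appear in the abstract are all hypothetical stand-ins for the structured reasoning the paper describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    abstract: str


def keyword_retrieve(query_terms, corpus):
    """Stage 1 (hypothetical): keep papers that share at least one query term."""
    terms = {t.lower() for t in query_terms}
    return [p for p in corpus
            if terms & set((p.title + " " + p.abstract).lower().split())]


def validate(paper, criteria):
    """Stage 2 (hypothetical stand-in for WisModel): every criterion must hold.

    Each criterion is a list of terms that must all occur in the abstract.
    A real validator would reason over the text instead of matching strings.
    """
    words = set(paper.abstract.lower().split())
    return all(all(term in words for term in criterion) for criterion in criteria)


def deep_search(query_terms, criteria, corpus):
    """Run keyword retrieval, then keep only candidates that pass validation."""
    candidates = keyword_retrieve(query_terms, corpus)
    return [p for p in candidates if validate(p, criteria)]


# Toy corpus (made up for illustration).
corpus = [
    Paper("Dense retrieval survey", "a survey of dense retrieval methods"),
    Paper("Retrieval for chemistry", "retrieval benchmark for chemistry agents"),
    Paper("Protein folding", "structure prediction with attention"),
]
candidates = keyword_retrieve(["retrieval"], corpus)
validated = deep_search(["retrieval"], [["retrieval", "benchmark"]], corpus)
print(len(candidates), [p.title for p in validated])
```

In this toy run the keyword stage returns two papers that mention "retrieval", and the validation stage keeps only the one that also satisfies the "benchmark" criterion, which is the kind of filtering the validation layer is meant to provide.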

If this is right

Papers flow from validation directly into the Library with one click for systematic organization.
Library profiles progressively sharpen the relevance of recommendations in AI Feeds.
AI Feeds continuously surface new publications and guide the next round of exploration.
The validation layer reduces cases where keyword-matched papers fail to address the actual query.
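
One way to picture how a Library profile might sharpen AI Feeds is a term-count profile scored against new papers with cosine similarity. The paper does not say how profiles or feed ranking are implemented, so the functions `build_profile`, `cosine`, and `rank_feed`, and the bag-of-words representation, are assumptions chosen purely for illustration.

```python
import math
from collections import Counter


def build_profile(saved_abstracts):
    """Hypothetical profile: summed term counts over the user's saved papers."""
    profile = Counter()
    for text in saved_abstracts:
        profile.update(text.lower().split())
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_feed(profile, new_papers):
    """Order new (title, abstract) pairs by similarity to the profile."""
    scored = [(cosine(profile, Counter(text.lower().split())), title)
              for title, text in new_papers]
    return [title for _, title in sorted(scored, reverse=True)]


# Toy saved papers and incoming publications (made up for illustration).
profile = build_profile(["agentic retrieval for literature search",
                         "retrieval benchmark for scientific search"])
feed = rank_feed(profile, [("A", "protein structure prediction"),
                           ("B", "agentic literature search benchmark")])
print(feed)
```

As more papers are saved, the profile counts shift and the ranking changes with them, which is the feedback loop the list above describes in prose.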

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Researchers could maintain a living, personalized view of their field without constant manual searching.
The same validation pattern might extend to filtering preprints or conference submissions before full reading.
If the closed discovery-to-feed loop holds, long-term awareness of relevant work could improve without extra effort.

Load-bearing premise

The agentic WisModel performs structured reasoning that reliably determines whether a paper addresses a user's complex research question without the validation step introducing new errors or selection biases.

What would settle it

A side-by-side human review of the papers WisModel accepted and rejected on a held-out set of complex queries, checking whether the human agreement rate matches the reported 93.70 percent accuracy.
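
The check described above reduces to comparing model decisions with human labels on the same papers. A minimal sketch, assuming binary accept/reject labels: the function names and the toy labels are hypothetical, and the Wilson score interval is a standard choice for a proportion rather than anything the paper specifies.

```python
import math


def agreement_rate(model_labels, human_labels):
    """Fraction of papers where the model decision matches the human one."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(model_labels)


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy labels (made up): True means "paper addresses the query".
model = [True, True, False, True, False, False, True, True, False, True]
human = [True, True, False, False, False, False, True, True, True, True]

rate = agreement_rate(model, human)
low, high = wilson_interval(rate, len(model))
print(f"agreement={rate:.2f}, interval=({low:.2f}, {high:.2f})")
```

With only ten labelled papers the interval is wide, which illustrates why the size of the validation set matters when interpreting a single accuracy figure.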

Original abstract

We present WisPaper, an end-to-end agent system that transforms how researchers discover, organize, and track academic literature. The system addresses two fundamental challenges. (1) Semantic search limitations: existing academic search engines match keywords but cannot verify whether papers truly address complex research questions; and (2) Workflow fragmentation: researchers must manually stitch together separate tools for discovery, organization, and monitoring. WisPaper tackles these through three integrated modules. Scholar Search combines rapid keyword retrieval with Deep Search, in which an agentic model, WisModel, validates candidate papers against user queries through structured reasoning. Discovered papers flow seamlessly into Library with one click, where systematic organization progressively builds a user profile that sharpens the recommendations of AI Feeds, which continuously surfaces relevant new publications and in turn guides subsequent exploration, closing the loop from discovery to long-term awareness. On TaxoBench, WisPaper achieves 22.26% recall, surpassing the O3 baseline (20.92%). Furthermore, WisModel attains 93.70% validation accuracy, effectively mitigating retrieval hallucinations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

WisPaper is a practical integration of keyword retrieval with agentic validation and profile-based feeds, but the small recall gain and high accuracy figure rest on evaluation details that are missing from the description.

WisPaper combines standard keyword search with an agent called WisModel that does structured reasoning to check whether candidate papers actually match complex user questions. Papers then move into a library that builds a user profile to improve recommendations in the AI Feeds module. The loop from discovery to ongoing monitoring is the main contribution here, and it targets real workflow pain points like fragmented tools and retrieval that ignores query intent.

Referee Report

2 major / 1 minor

Summary. The manuscript presents WisPaper, an end-to-end agentic system for academic literature discovery, organization, and tracking. It integrates rapid keyword retrieval with Deep Search using the agentic WisModel for structured validation of papers against complex user queries, seamless flow into a Library module for organization and profile building, and AI Feeds for continuous recommendation of new publications. On TaxoBench, WisPaper reports 22.26% recall (surpassing the O3 baseline at 20.92%), and WisModel achieves 93.70% validation accuracy to mitigate retrieval hallucinations.

Significance. If the empirical claims hold under rigorous validation, the integrated workflow could meaningfully address semantic search limitations and workflow fragmentation in scholarly tools by using agentic reasoning to filter for relevance on complex questions. The closed-loop design from discovery to long-term awareness is a conceptual strength. However, the modest recall delta and lack of experimental details reduce the assessed significance; the work would benefit from clearer evidence that the validation step reliably handles the claimed query complexity without introducing new biases.

major comments (2)

[Abstract] The reported 93.70% validation accuracy for WisModel is presented as evidence of effective hallucination mitigation, but no information is given on validation-set size, whether ground-truth labels are human-annotated or model-derived, inter-annotator agreement, or stratification by query difficulty. This directly undermines the central claim that the agent reliably determines whether papers address complex research questions.
[Abstract] The 22.26% recall on TaxoBench (vs. 20.92% O3 baseline) is offered as a performance improvement, yet the manuscript provides no variance estimates, statistical significance tests, dataset splits, or confirmation that benchmark queries match the complexity level the system targets. Without these, the 1.34-point delta cannot be interpreted as support for the system's advantages.

minor comments (1)

[Abstract] The abstract and system description introduce several named components (WisModel, Deep Search, AI Feeds) without consistent cross-referencing to later sections that would define their internal mechanisms or interfaces.
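
To make the concern about the 1.34-point recall delta concrete, a paired bootstrap over queries is one common way to attach an uncertainty estimate to a difference between two systems. The sketch below uses made-up per-query recall values, because the paper's per-query results are not available here; `paired_bootstrap_delta` is a hypothetical helper, not part of any WisPaper tooling.

```python
import random


def paired_bootstrap_delta(system_a, system_b, n_resamples=2000, seed=0):
    """Bootstrap the mean per-query difference (a - b) by resampling queries.

    Returns the observed mean difference and a 95% percentile interval.
    """
    if len(system_a) != len(system_b):
        raise ValueError("both systems need one score per query")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(system_a, system_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    low = means[n_resamples * 25 // 1000]
    high = means[n_resamples * 975 // 1000 - 1]
    return observed, (low, high)


# The headline gap reported on TaxoBench, in percentage points.
headline_delta = round(22.26 - 20.92, 2)

# Made-up per-query recall values for two systems (illustration only).
recall_a = [0.30, 0.10, 0.25, 0.40, 0.05, 0.20, 0.35, 0.15]
recall_b = [0.28, 0.12, 0.20, 0.41, 0.05, 0.18, 0.30, 0.16]

observed, (low, high) = paired_bootstrap_delta(recall_a, recall_b)
print(headline_delta, round(observed, 4), (round(low, 4), round(high, 4)))
```

If the resulting interval includes zero, the data do not separate the two systems; reporting such an interval alongside the headline recall would let readers judge the delta directly.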

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and agree that additional experimental details are required to strengthen the presentation of our results.

Referee: [Abstract] The reported 93.70% validation accuracy for WisModel is presented as evidence of effective hallucination mitigation, but no information is given on validation-set size, whether ground-truth labels are human-annotated or model-derived, inter-annotator agreement, or stratification by query difficulty. This directly undermines the central claim that the agent reliably determines whether papers address complex research questions.

Authors: We agree that the abstract omits critical details needed to substantiate the validation accuracy claim. The current manuscript does not provide this information in the abstract. We will revise the abstract to include the validation-set size, confirm that ground-truth labels are human-annotated, report inter-annotator agreement, and describe stratification by query difficulty. These changes will directly support the reliability of the agent for complex queries.
revision: yes

Referee: [Abstract] The 22.26% recall on TaxoBench (vs. 20.92% O3 baseline) is offered as a performance improvement, yet the manuscript provides no variance estimates, statistical significance tests, dataset splits, or confirmation that benchmark queries match the complexity level the system targets. Without these, the 1.34-point delta cannot be interpreted as support for the system's advantages.

Authors: We concur that the reported recall improvement requires supporting statistical and methodological details for proper interpretation. The manuscript currently lacks variance estimates, significance tests, dataset split information, and explicit confirmation of benchmark query complexity alignment. In the revision we will add these elements, including variance across runs, statistical test results, split details, and discussion of query complexity, to enable readers to assess the delta more rigorously.
revision: yes

Circularity Check

0 steps flagged

No circularity: performance figures are direct empirical measurements on an external benchmark.

The manuscript describes an agentic system (Scholar Search + WisModel + Library + AI Feeds) and then reports two concrete evaluation numbers: 22.26% recall on TaxoBench versus an O3 baseline, and 93.70% validation accuracy for WisModel. These quantities are obtained by running the implemented system on a fixed external test collection; they are not obtained by fitting parameters inside the paper's own equations, by renaming an input as a prediction, or by any self-referential definition. No load-bearing uniqueness theorem, ansatz, or self-citation chain is invoked to derive the reported metrics. The derivation chain therefore terminates in independent, externally verifiable measurements rather than reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central performance claims rest on the empirical behavior of the integrated system and the named WisModel component; no free parameters, mathematical axioms, or new physical entities are introduced beyond standard assumptions about LLM reasoning capabilities.

invented entities (1)

WisModel · no independent evidence
purpose: Agentic model that performs structured reasoning to validate whether candidate papers address user queries
Introduced as a named component of the Scholar Search module without reference to prior independent publications or external validation of its specific capabilities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: WisModel ... decomposes queries into verification criteria and validates papers through structured reasoning ... Multi-Dimensional Shaped Reward ... Faithful Grounding ... Logical Entailment

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: On TaxoBench, WisPaper achieves 22.26% recall ... WisModel attains 93.70% validation accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Faithfulness-QA: A Counterfactual Entity Substitution Dataset for Training Context-Faithful RAG Models
cs.CL · 2026-04 · accept · novelty 6.0
Faithfulness-QA is a 99k-sample dataset created via counterfactual entity substitution on existing QA benchmarks to train and evaluate context-faithful RAG models.

SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
cs.AI · 2026-05 · unverdicted · novelty 4.0
SciAtlas builds a large-scale multi-disciplinary academic knowledge graph and a neuro-symbolic retrieval system to support automated scientific research tasks such as literature review and idea positioning.
