Towards Self-Evolving Agentic Literature Retrieval
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-15 02:33 UTC · model grok-4.3
The pith
PaSaMaster turns literature retrieval into a self-evolving process that ranks papers by relevance rather than generating sources, outperforming GPT-5.2 by 30% at 1% of the cost with zero source hallucination.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaSaMaster is a self-evolving agentic literature retrieval system that produces relevance-scored paper rankings with evidence-grounded recommendations through iterative intent analysis, retrieval, and ranking. It outperforms GPT-5.2 by 30.0% at 1% computational cost while ensuring zero source hallucination and improves F1-score by 15.6X over traditional keyword retrieval on the PaSaMaster Benchmark across 38 disciplines.
What carries the argument
The iterative loop that turns retrieval into an evolving search: ranked evidence reveals gaps, gaps refine the intent, and the refined intent guides follow-up searches. This is combined with a separation of planning from retrieval, in which a frontier LLM is used only for intent understanding.
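As a concrete illustration, here is a minimal sketch of that loop in Python. All names here (parse_intent, find_gaps, refine_intent, the corpus and scorer objects) are hypothetical stand-ins for this review, not PaSaMaster's actual API.

```python
# Hypothetical sketch of the self-evolving loop described above;
# every name is illustrative, not the paper's interface.
def self_evolving_retrieval(query, corpus, llm, scorer, max_rounds=3, top_k=50):
    intent = llm.parse_intent(query)               # frontier LLM: planning only
    ranked = []
    for _ in range(max_rounds):
        candidates = corpus.search(intent)         # retrieval from a verified corpus
        scored = [(p, scorer.relevance(intent, p)) for p in candidates]
        ranked = sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
        gaps = llm.find_gaps(intent, [p for p, _ in ranked])
        if not gaps:                               # intent fully covered: stop
            break
        intent = llm.refine_intent(intent, gaps)   # gaps guide the follow-up search
    return ranked                                  # only corpus papers can appear
```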
Load-bearing premise
The PaSaMaster Benchmark faithfully represents real-world scientific search intents without introducing selection bias in the iterative ranking process.
What would settle it
Running PaSaMaster on an independent benchmark created without author involvement and measuring whether the 15.6X F1 improvement and the zero hallucination rate hold.
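Both headline quantities are straightforward to recompute. A sketch, assuming each benchmark item supplies a gold set of relevant paper IDs and the verified corpus's ID set is known (names are illustrative):

```python
def f1(retrieved: set, gold: set) -> float:
    """Set-based F1 between retrieved and gold paper IDs."""
    tp = len(retrieved & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def hallucination_rate(retrieved: set, corpus_ids: set) -> float:
    """Fraction of returned IDs that do not exist in the verified corpus."""
    return len(retrieved - corpus_ids) / len(retrieved) if retrieved else 0.0
```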
Original abstract
As large language models reshape scientific research, literature retrieval faces a twofold challenge: ensuring source authenticity while maintaining a deep comprehension of academic search intents. While reliable, traditional keyword-centric search fails to capture complex research intents. Frontier LLMs can handle complex research intents, but their high cost and tendency to hallucinate remain key limitations. Here we introduce PaSaMaster, a self-evolving agentic literature retrieval system that produces relevance-scored paper rankings with evidence-grounded recommendations through iterative intent analysis, retrieval, and ranking. It is built on three key designs. First, it transforms literature retrieval from a one-shot query–document matching problem into a search process that evolves over time, using ranked evidence to reveal gaps, refine intents, and guide follow-up searches. Second, it prevents hallucinated sources by treating retrieval as intent–paper relevance ranking rather than generation. Finally, PaSaMaster improves cost efficiency by separating planning from retrieval: a frontier LLM is used only for intent understanding, while large-scale retrieval and relevance scoring are delegated to customized corpora and lightweight models. Evaluated on the PaSaMaster Benchmark across 38 scientific disciplines, our system exposes the severe inaccuracy and incompleteness of traditional keyword retrieval (improving F1-score by 15.6X) and the unreliability of generative LLMs (which exhibit hallucination rates up to 37.79%). Remarkably, PaSaMaster outperforms GPT-5.2 by 30.0% at a mere 1% of the computational cost while ensuring zero source hallucination. Code: https://github.com/sjtu-sai-agents/PaSaMaster
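The second design is the structural basis for the zero-hallucination claim: a ranker can only reorder papers that already exist in the corpus, whereas a generator can emit citations that exist nowhere. A minimal sketch of that constraint (illustrative names, not the paper's code):

```python
def rank_not_generate(intent, corpus: dict, score, top_k=10):
    """Rank verified corpus entries; corpus maps paper_id -> metadata.

    Every returned ID is a key of `corpus`, so a nonexistent source
    cannot appear in the output by construction.
    """
    ranked = sorted(corpus, key=lambda pid: score(intent, corpus[pid]), reverse=True)
    return ranked[:top_k]
```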
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaSaMaster, a self-evolving agentic literature retrieval system that converts one-shot query-document matching into an iterative process of intent analysis, retrieval, evidence-based gap revelation, and re-ranking. It claims to deliver relevance-scored paper lists with zero source hallucination, achieving a 15.6X F1-score improvement over traditional keyword retrieval and 30% higher performance than GPT-5.2 at 1% of the computational cost, evaluated on a self-constructed PaSaMaster Benchmark spanning 38 disciplines.
Significance. If the benchmark and results prove robust, the work offers a practical advance in agentic retrieval for scientific literature by addressing hallucination through ranking rather than generation and by delegating heavy retrieval to lightweight models while reserving frontier LLMs for planning. The iterative gap-revelation mechanism is a conceptually sound response to complex research intents, and the reported cost reduction could make high-quality literature search more accessible.
Major comments (3)
- [Evaluation section / PaSaMaster Benchmark] The description of the PaSaMaster Benchmark provides no information on query sourcing across the 38 disciplines, the protocol for constructing ground-truth relevant paper sets, or the criteria and inter-annotator agreement for relevance judgments. Without these details the headline claims of 15.6X F1 improvement and 30% outperformance over GPT-5.2 cannot be independently verified and may reflect benchmark-specific artifacts rather than general superiority.
- [Results and evaluation] No statistical significance tests, confidence intervals, error bars, or ablation studies are reported for any quantitative result, including the zero-hallucination rate, cost ratio, or cross-discipline consistency. These omissions leave the central empirical claims unsupported at the level required for a methods paper.
- [System design / Iterative ranking] The iterative evidence-ranking process is described at a high level, but the manuscript does not analyze or bound potential selection bias introduced when ranked evidence is used to 'reveal gaps' and trigger follow-up searches. This is load-bearing for the self-evolving claim and requires either a formal argument or controlled experiments showing that the process does not systematically favor certain paper types.
Minor comments (2)
- [Abstract] The abstract sets 'query--document' with a double hyphen; a standard hyphen or en dash would improve readability.
- [Abstract] The GitHub link is given without a commit hash or release tag, making exact reproduction of the reported numbers difficult.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major concern point by point below and will revise the manuscript to strengthen the evaluation and analysis sections.
Point-by-point responses
Referee: [Evaluation section / PaSaMaster Benchmark] The description of the PaSaMaster Benchmark provides no information on query sourcing across the 38 disciplines, the protocol for constructing ground-truth relevant paper sets, or the criteria and inter-annotator agreement for relevance judgments. Without these details the headline claims of 15.6X F1 improvement and 30% outperformance over GPT-5.2 cannot be independently verified and may reflect benchmark-specific artifacts rather than general superiority.
Authors: We agree that additional details on benchmark construction are required for reproducibility and independent verification. In the revised manuscript we will add a dedicated subsection describing the query sourcing process across the 38 disciplines, the exact protocol for assembling ground-truth relevant paper sets, the relevance judgment criteria, and inter-annotator agreement statistics (including Cohen’s kappa). These additions will directly support the reported F1 and performance gains. Revision: yes.
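For reference, the promised agreement statistic is standard; a sketch with scikit-learn over two annotators' binary relevance labels (the toy labels are illustrative):

```python
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1]  # 1 = relevant, 0 = not relevant
annotator_b = [1, 0, 1, 0, 0, 1]
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.67 for these toy labels
```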
Referee: [Results and evaluation] No statistical significance tests, confidence intervals, error bars, or ablation studies are reported for any quantitative result, including the zero-hallucination rate, cost ratio, or cross-discipline consistency. These omissions leave the central empirical claims unsupported at the level required for a methods paper.
Authors: We acknowledge the absence of statistical rigor and ablations in the current draft. The revised version will report paired statistical significance tests, confidence intervals, and error bars for all headline metrics (F1 improvement, hallucination rate, cost ratio, and cross-discipline results). We will also include ablation studies isolating the contribution of the iterative gap-revelation step and the lightweight-model delegation, thereby providing the quantitative support expected for a methods paper. Revision: yes.
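One standard instantiation of the promised tests is a paired bootstrap over per-query scores; a sketch assuming per-query F1 arrays for the two systems are available (names and data layout are illustrative):

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: p-value that system A does not beat B."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)  # per-query differences
    resampled = rng.choice(diffs, size=(n_resamples, len(diffs)), replace=True)
    p_value = float(np.mean(resampled.mean(axis=1) <= 0))
    return diffs.mean(), p_value
```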
Referee: [System design / Iterative ranking] The iterative evidence-ranking process is described at a high level, but the manuscript does not analyze or bound potential selection bias introduced when ranked evidence is used to 'reveal gaps' and trigger follow-up searches. This is load-bearing for the self-evolving claim and requires either a formal argument or controlled experiments showing that the process does not systematically favor certain paper types.
Authors: We agree that a formal treatment of selection bias is necessary to substantiate the self-evolving claim. In the revision we will add a dedicated analysis that bounds the bias introduced by using ranked evidence for gap revelation, leveraging properties of the relevance scoring function. We will also include controlled experiments contrasting the full iterative pipeline against a non-iterative baseline to demonstrate that retrieved paper distributions do not systematically favor particular types or sources. Revision: yes.
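A concrete form of such a controlled experiment: tabulate retrieved papers by a covariate such as venue type for the iterative pipeline and a one-shot baseline, then test whether the distributions differ. A sketch with SciPy (the counts are placeholders):

```python
from scipy.stats import chi2_contingency

# Retrieved-paper counts by venue type: [journal, conference, preprint].
iterative = [120, 340, 540]   # full iterative pipeline (placeholder counts)
one_shot  = [150, 330, 520]   # non-iterative baseline (placeholder counts)

chi2, p_value, dof, _ = chi2_contingency([iterative, one_shot])
print(f"chi2={chi2:.2f}, p={p_value:.3f}")  # small p would indicate a bias shift
```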
Circularity Check
No circularity: empirical results on external benchmark
Full rationale
The paper describes an agentic retrieval system (PaSaMaster) and reports performance metrics as direct empirical measurements on the PaSaMaster Benchmark across 38 disciplines. No equations, fitted parameters, or derivation steps are present that reduce any claimed prediction or result to the system's own inputs by construction. The F1-score improvements, hallucination rates, and cost comparisons are presented as observed outcomes rather than quantities defined in terms of the method itself. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. The evaluation is therefore self-contained against the stated benchmark.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "self-evolving retrieval: transforms literature retrieval from one-shot query–document matching into an adaptive search process... using ranked evidence to identify coverage gaps, refine the research intent"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "hallucination-free intent–paper relevance ranking... every candidate paper must be retrieved from a verified scientific corpus D"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] PaSa: An LLM Agent for Comprehensive Academic Paper Search. 2025.
- [2] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021.
- [3] Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. 2025.
- [4] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019.
- [5] OpenAI. GPT-4 Technical Report. arXiv:2303.08774.
- [6] Dense Passage Retrieval for Open-Domain Question Answering. 2020.
- [7] Large Language Model based Multi-Agents: A Survey of Progress and Challenges. 2024.
- [8] The Gerontocratization of Science: How hypergrowth reshapes knowledge circulation. 2025.
- [9] Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S. Weld. S2ORC: The Semantic Scholar Open Research Corpus. arXiv:1911.02782.
- [10]
- [11]
- [12] Scientific literature: Information overload. Nature, 2016.
- [13] The vocabulary problem in human-system communication. Communications of the ACM, 1987.
- [14] Lost in the Middle: How Language Models Use Long Contexts. 2023.
- [15] Detecting hallucinations in large language models using semantic entropy. Nature, 2024.
- [16]
- [17] OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs. 2024.
- [18] Bohrium + SciMaster: Building the Infrastructure and Ecosystem for Agentic Science at Scale. 2025.
- [19] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. 2025.
- [20]
- [21] MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. 2025.
- [22]
- [23]
- [24] Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 2015.
- [25]
- [26] What every researcher should know about searching: clarified concepts, search advice, and an agenda to improve finding in academia. Research Synthesis Methods, 2021.
- [27] Exploratory Search: Beyond the Query-Response Paradigm. 2009.
- [28] Scientific discovery in the age of artificial intelligence. Nature, 2023.
- [29] LitSearch: A Retrieval Benchmark for Scientific Literature Search. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.
- [30] Exploring the role of large language models in the scientific method: from hypothesis to discovery. npj Artificial Intelligence, 2025.
- [31]