pith. machine review for the scientific record.

arxiv: 2605.10848 · v1 · submitted 2026-05-11 · 💻 cs.IR · cs.AI · cs.CL

Recognition: 1 theorem link · Lean Theorem

Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

Jheng-Hong Yang, Jimmy Lin, Tz-Huan Hsu

Pith reviewed 2026-05-12 04:10 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords agentic search · lexical retrieval · BM25 · LLM agents · information retrieval · BrowseComp-Plus · deep research systems

The pith

Lexical BM25 retrieval suffices for effective deep research when paired with capable LLMs in agentic search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether lexical retrievers can still work well once LLMs improve at reasoning and tool use inside agent loops for deep research. It builds Pi-Serini, an agent that uses BM25 for initial retrieval together with browsing and reading tools. On BrowseComp-Plus, the tuned BM25 version with a frontier LLM reaches 83.1 percent answer accuracy and 94.7 percent evidence recall, beating other released agents that rely on dense retrievers. Controlled tests show that BM25 parameter tuning adds 18.0 percent accuracy and 11.1 percent recall over the default setting, while greater retrieval depth adds another 25.3 percent recall over shallow retrieval.

Core claim

By equipping frontier LLMs with a lexical BM25 retriever in an agentic loop via Pi-Serini, the system achieves 83.1% answer accuracy and 94.7% surfaced evidence recall on BrowseComp-Plus, outperforming released search agents that use dense retrievers.

What carries the argument

Pi-Serini, an agent with three tools (BM25 retrieval, document browsing, and reading) that places lexical search inside the LLM agentic loop.
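To make the shape of that loop concrete, here is a minimal sketch of a three-tool interface of the kind described: one call to rank documents for a raw query, one to browse the current ranking, and one to read a document. The toy corpus, the term-overlap scoring that stands in for BM25, and every signature below are assumptions made for illustration, not the released Pi-Serini code.

```python
from dataclasses import dataclass, field

@dataclass
class ToyToolbox:
    """Three tools an agent loop could call: search, browse the ranking, read a document.
    Scoring here is naive term overlap; the real system backs `search` with BM25."""
    corpus: dict                                    # docid -> full text
    ranking: list = field(default_factory=list)     # local state: last ranked docids

    def search(self, query: str, k: int = 10) -> list:
        """Rank documents for a raw lexical query and cache the ranking."""
        terms = set(query.lower().split())
        scored = sorted(self.corpus,
                        key=lambda d: -len(terms & set(self.corpus[d].lower().split())))
        self.ranking = scored[:k]
        return self.ranking

    def read_search_results(self, offset: int = 0, limit: int = 5) -> list:
        """Browse a slice of the cached ranking as (docid, snippet) pairs."""
        return [(d, self.corpus[d][:80]) for d in self.ranking[offset:offset + limit]]

    def read_document(self, docid: str, offset: int = 0, limit: int = 400) -> str:
        """Read a window of one document's text."""
        return self.corpus[docid][offset:offset + limit]

# Usage: the agent queries, inspects the ranking, then reads a promising hit.
tools = ToyToolbox({"d1": "bm25 lexical retrieval for agent loops",
                    "d2": "dense retrieval with neural embeddings"})
tools.search("lexical bm25 retrieval")
print(tools.read_search_results())
print(tools.read_document("d1"))
```

In the paper's system the LLM decides which of these tools to call at each step; the sketch only fixes the surface the agent sees.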

Load-bearing premise

That the BrowseComp-Plus benchmark and the specific LLM configurations produce results that generalize to other deep-research tasks and are not explained by differences in tool implementation or prompting.

What would settle it

A controlled comparison on BrowseComp-Plus showing that a dense-retriever agent with matched LLM and prompting details achieves higher accuracy and recall than the tuned BM25 version of Pi-Serini.
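One way to picture that experiment in code: hold the agent loop, tools, prompting, and LLM fixed and make the retriever the single interchangeable part. The class names and stub bodies below are hypothetical; they only illustrate what "matched" means here.

```python
from typing import Protocol

class Retriever(Protocol):
    """The only component that differs between experimental arms."""
    def search(self, query: str, k: int) -> list: ...

class TunedBM25Retriever:
    """Stand-in for the tuned lexical arm (e.g. BM25 with tuned k1 and b)."""
    def search(self, query: str, k: int) -> list:
        raise NotImplementedError("wire up the lexical backend here")

class DenseRetriever:
    """Stand-in for the dense arm, exposing the identical interface."""
    def search(self, query: str, k: int) -> list:
        raise NotImplementedError("wire up the embedding backend here")

def run_agent(retriever: Retriever, question: str) -> str:
    """Fixed agent loop, tools, prompts, and LLM; only `retriever` varies."""
    raise NotImplementedError

def controlled_comparison(questions: dict, arms: dict) -> dict:
    """Run every question through every arm under otherwise identical conditions."""
    return {name: {qid: run_agent(r, q) for qid, q in questions.items()}
            for name, r in arms.items()}
```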

Figures

Figures reproduced from arXiv: 2605.10848 by Jheng-Hong Yang, Jimmy Lin, Tz-Huan Hsu.

Figure 1. Accuracy vs. Cost trade-off on BrowseComp
Figure 3. Grid search results of BM25 tuning over dif…
Figure 4. System architecture of PI-SERINI. The system includes the retrieval workflow prompt, time-budget steering policy, and retrieval controller. The LLM agent interacts with ANSERINI only through the controller, which exposes a constrained tool API and maintains local retrieval state for controlled, paginated BM25 retrieval.
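The Figure 4 caption's "local retrieval state for controlled, paginated BM25 retrieval" suggests a controller that runs one deep query and then hands the agent pages of the cached ranking, rather than letting it touch the index directly. A hedged sketch of that idea follows; the class, method names, and defaults are assumptions, not the PI-SERINI implementation.

```python
class RetrievalController:
    """Mediates between the agent and the search backend: one deep BM25 pass per
    query, cached locally, then served to the agent in fixed-size pages."""

    def __init__(self, searcher, depth: int = 1000, page_size: int = 10):
        self.searcher = searcher     # any backend with .search(query, k) -> ranked hits
        self.depth = depth           # how far down the ranking the single pass goes
        self.page_size = page_size
        self._hits: list = []        # local retrieval state
        self._cursor = 0             # position of the next unserved page

    def new_query(self, query: str) -> list:
        """Issue one deep retrieval pass, reset pagination, return the first page."""
        self._hits = list(self.searcher.search(query, k=self.depth))
        self._cursor = 0
        return self.next_page()

    def next_page(self) -> list:
        """Return the next page of cached hits; empty once the ranking is exhausted."""
        page = self._hits[self._cursor:self._cursor + self.page_size]
        self._cursor += len(page)
        return page
```

Keeping the cursor on the controller side means deeper evidence stays reachable through further page requests without the agent re-issuing or rewriting the query.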
read the original abstract

Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lexical retriever with sufficient retrieval depth can support effective deep research when paired with more capable LLMs. Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, while increasing retrieval depth further improves surfaced evidence recall by 25.3% over the shallow-retrieval setting. Source code is available at https://github.com/justram/pi-serini.
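The tuning and depth ablations quoted above amount to sweeping BM25's k1 and b parameters and the retrieval cut-off, then scoring each configuration. The sketch below uses Pyserini's LuceneSearcher to show the shape of such a sweep; the index path, parameter grid, and the recall proxy are placeholders rather than the paper's actual setup.

```python
from itertools import product
from pyserini.search.lucene import LuceneSearcher

# Hypothetical local index; the paper's corpus and index configuration may differ.
searcher = LuceneSearcher("indexes/browsecomp-plus-bm25")

K1_GRID = [0.6, 0.9, 1.2, 1.5]     # illustrative values only
B_GRID = [0.3, 0.4, 0.6, 0.75]
DEPTHS = [10, 100, 1000]           # shallow vs. deep retrieval cut-offs

def evidence_recall(run: dict, qrels: dict) -> float:
    """Fraction of gold evidence docids surfaced anywhere in the retrieved lists."""
    found = sum(len(set(run[qid]) & set(gold)) for qid, gold in qrels.items())
    total = sum(len(gold) for gold in qrels.values())
    return found / total if total else 0.0

def sweep(queries: dict, qrels: dict):
    """Grid-search (k1, b, depth) and keep the configuration with the best recall."""
    best = None
    for k1, b, depth in product(K1_GRID, B_GRID, DEPTHS):
        searcher.set_bm25(k1=k1, b=b)
        run = {qid: [hit.docid for hit in searcher.search(q, k=depth)]
               for qid, q in queries.items()}
        score = evidence_recall(run, qrels)
        if best is None or score > best[0]:
            best = (score, k1, b, depth)
    return best
```

In the agentic setting the queries come from the LLM rather than a fixed query file, so the paper's ablation runs through the full agent; this sketch only shows the retrieval-side sweep.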

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Pi-Serini, a three-tool agent (retrieve, browse, read) built on BM25 lexical retrieval, and reports that pairing it with frontier LLMs such as gpt-5.5 yields 83.1% answer accuracy and 94.7% surfaced evidence recall on BrowseComp-Plus, outperforming released dense-retriever agents. Controlled ablations within the system show that BM25 tuning improves accuracy by 18.0% and recall by 11.1%, while greater retrieval depth adds 25.3% to recall.

Significance. If the performance advantage can be shown to arise specifically from lexical retrieval rather than from the agent framework or prompting, the result would indicate that well-tuned BM25 remains competitive for deep research tasks when paired with capable LLMs, offering a simpler, tunable, and fully open-source alternative to dense pipelines. The public GitHub release strengthens the contribution by enabling direct replication and extension.

major comments (2)
  1. [Abstract] Abstract: The claim that Pi-Serini 'outperforms released search agents that use dense retrievers' is not supported by matched experiments. The baselines are described only as previously released systems; no evidence is given that they were re-run with the same LLM (gpt-5.5), the same three-tool interface, or the same agent loop, so the reported deltas (83.1% accuracy, 94.7% recall) cannot be unambiguously attributed to lexical retrieval.
  2. [Ablations] Ablation results (as summarized in the abstract): The reported gains from BM25 tuning (+18.0% accuracy) and increased retrieval depth (+25.3% recall) are measured only inside Pi-Serini. Without a parallel ablation that replaces BM25 with a dense retriever while holding the agent framework, tools, and prompting fixed, the experiments do not isolate retrieval type from other design choices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental design and attribution. We address each major comment below, agreeing where the manuscript requires clarification and proposing targeted revisions.

read point-by-point responses
  1. Referee: [Abstract] The claim that Pi-Serini 'outperforms released search agents that use dense retrievers' is not supported by matched experiments. The baselines are described only as previously released systems; no evidence is given that they were re-run with the same LLM (gpt-5.5), the same three-tool interface, or the same agent loop, so the reported deltas (83.1% accuracy, 94.7% recall) cannot be unambiguously attributed to lexical retrieval.

    Authors: We agree that the reported outperformance is based on comparisons to the published results of released dense-retriever agents rather than re-executions under identical conditions (same LLM, tool set, and agent loop). The manuscript's intent is to demonstrate that a well-tuned lexical retriever can achieve strong results on BrowseComp-Plus when paired with frontier LLMs, exceeding the figures reported for existing dense systems. To prevent misinterpretation, we will revise the abstract and introduction to explicitly note that these are comparisons to published baseline results and add a brief discussion of potential confounding factors in the limitations section. revision: yes

  2. Referee: [Ablations] The reported gains from BM25 tuning (+18.0% accuracy) and increased retrieval depth (+25.3% recall) are measured only inside Pi-Serini. Without a parallel ablation that replaces BM25 with a dense retriever while holding the agent framework, tools, and prompting fixed, the experiments do not isolate retrieval type from other design choices.

    Authors: The ablations were conducted internally to quantify the benefits of BM25 parameter tuning and retrieval depth within the fixed Pi-Serini agent framework. We concur that this design does not isolate the effect of lexical versus dense retrieval while holding the agent, tools, and prompting constant. Performing the complementary dense-retriever ablation would require substantial additional implementation effort to integrate a dense index into the same three-tool loop. We will add text in the discussion and limitations sections to clarify the scope of the current ablations and identify a full cross-retrieval ablation as valuable future work. revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on a fixed benchmark

full rationale

The paper reports measured answer accuracy (83.1%) and evidence recall (94.7%) for Pi-Serini (BM25 + gpt-5.5) on BrowseComp-Plus, plus internal ablations quantifying gains from BM25 tuning (+18.0% accuracy) and retrieval depth (+25.3% recall). These are observed experimental outcomes under controlled conditions, not quantities derived from equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing derivations, uniqueness theorems, or ansatzes appear in the provided text; the results stand as independent benchmark data.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 1 invented entity

The performance claims rest on empirical tuning of BM25 parameters and the choice of retrieval depth, both adjusted to the benchmark; no mathematical axioms or newly invented physical entities are invoked.

free parameters (2)
  • BM25 tuning parameters = tuned
    Tuning is reported to improve answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default setting.
  • retrieval depth = increased
    Increasing retrieval depth is reported to improve surfaced evidence recall by 25.3% over the shallow-retrieval setting.
invented entities (1)
  • Pi-Serini (no independent evidence)
    purpose: Agent equipped with retrieve, browse, and read tools to test lexical retrieval in an agentic loop
    New system introduced to conduct the experiments

pith-pipeline@v0.9.0 · 5510 in / 1458 out tokens · 82466 ms · 2026-05-12T04:10:22.367398+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
