Towards Retrieving Interaction Spaces for Agentic Search
Pith reviewed 2026-06-27 21:04 UTC · model grok-4.3
The pith
Retrieval for agentic search should construct bounded interaction spaces that agents explore with tools instead of only ranking documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an interaction space constructed by a first-stage retriever enables an agent to reach the same task accuracy as unbounded direct corpus interaction while incurring roughly one quarter the per-query cost and maintaining performance as corpus size grows to one million documents.
What carries the argument
RISE (Retrieving Interaction SpacE), which uses BM25 to supply the boundary of the interaction space and processes documents at indexing time to support shell-style navigation inside that space.
If this is right
- On BrowseComp-Plus, RISE matches the pure-shell DCI baseline at 78 percent accuracy with gpt-5.4-mini at roughly one quarter of the per-query cost.
- At one million documents, RISE-BM25 reaches 81 percent accuracy on gpt-5.4-mini while DCI on gpt-5.4-nano degrades to 60 percent with 33 of 100 wall-clock failures.
- Pre-processing documents for shell-style navigation during indexing allows the agent to use grep and file reads without scanning the whole corpus on every query.
Where Pith is reading between the lines
- The same boundary-plus-navigation design could be tested on agent tasks that currently rely on repeated retrieval rounds.
- Replacing BM25 with a stronger first-stage retriever might tighten the interaction space further without losing coverage.
- The approach suggests a general pattern in which retrieval hands the agent a navigable subspace rather than a static list of passages.
Load-bearing premise
A standard first-stage retriever such as BM25 can reliably supply a boundary that still contains all information the agent needs to solve the task without the need for iterative re-retrieval or expansion of the space.
What would settle it
A set of queries where the information required to reach the correct answer lies outside the BM25-retrieved interaction space, causing the agent to fail even when given unlimited exploration time inside the space.
Figures
read the original abstract
Retrieval for search agents is still inherited from non-agentic information retrieval: a retriever ranks the corpus and the agent reads a small set of returned documents. Recent direct corpus interaction (DCI) work shows that agents can instead interact with the raw corpus through shell tools such as grep and file reads. But unbounded interaction does not scale: every broad shell command is a scan over the whole corpus, and latency degrades sharply as the corpus grows. We argue that the role of retrieval for agentic search is not just to select documents that fit in the LLM context window, but to construct an interaction space: a bounded subset of the corpus the agent can explore with associated tools. Two design consequences follow. The space needs a boundary supplied by retrieval, and the objects within it should be processed for interaction. As a proof of concept, we propose RISE (Retrieving Interaction SpacE): we use BM25 to construct the interaction space; meanwhile, its documents are processed during indexing for shell-style navigation. On BrowseComp-Plus, RISE matches the pure-shell DCI baseline at 78% accuracy with gpt-5.4-mini at roughly one quarter of the per-query cost. At 1M documents, RISE-BM25 reaches 81% on gpt-5.4-mini, whereas DCI on gpt-5.4-nano degrades to 60% with 33 of 100 wall-clock failures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RISE (Retrieving Interaction SpacE) as a proof-of-concept approach to agentic search that uses BM25 to construct a bounded interaction space from the corpus; documents in this space are pre-processed for shell-style navigation tools. It claims that on BrowseComp-Plus, RISE matches the accuracy of unbounded direct corpus interaction (DCI) at 78% using gpt-5.4-mini while incurring roughly one quarter the per-query cost, and that at a 1M-document scale RISE-BM25 reaches 81% accuracy on gpt-5.4-mini whereas DCI on gpt-5.4-nano falls to 60% with frequent wall-clock failures.
Significance. If the central empirical result holds, the work would be significant for demonstrating that retrieval can usefully bound the interaction space for agents rather than merely supplying a context window, thereby improving scalability of direct corpus interaction methods. The concrete accuracy and cost numbers at 1M scale constitute a useful data point for the emerging agentic-IR literature.
major comments (3)
- [Abstract] Abstract: the headline claim that RISE 'matches the pure-shell DCI baseline at 78% accuracy' is reported without error bars, statistical significance tests, or any description of how many documents are typically included in the BM25-constructed space or how the space boundary is exactly enforced; this leaves the central empirical comparison only weakly supported.
- [Scaling experiment] Scaling experiment (1M-document setting): the comparison pits RISE-BM25 on gpt-5.4-mini (81%) against DCI on gpt-5.4-nano (60%); because model capability differs, the result does not isolate the contribution of the bounded interaction space and therefore does not yet establish the claimed scaling advantage.
- [Evaluation] The central claim that a single BM25-retrieved set supplies every document the agent must read or grep requires that no gold document lies outside the top-k; the manuscript provides no analysis or failure-case enumeration of queries where term overlap is low, leaving the reliability of the boundary assumption untested and load-bearing for the reported 78% match.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim that RISE 'matches the pure-shell DCI baseline at 78% accuracy' is reported without error bars, statistical significance tests, or any description of how many documents are typically included in the BM25-constructed space or how the space boundary is exactly enforced; this leaves the central empirical comparison only weakly supported.
Authors: We agree that additional details would strengthen the claim. In the revised manuscript, we will include a description of the BM25 interaction space construction, including typical sizes (top-200 documents in our experiments) and the enforcement mechanism (the agent is provided only with tools scoped to the retrieved set). Regarding error bars and significance, the main results are from single runs due to computational cost; we will add a note on variance from pilot runs and, if feasible, report results over 3 seeds for key comparisons. This addresses the support for the comparison. revision: partial
-
Referee: [Scaling experiment] Scaling experiment (1M-document setting): the comparison pits RISE-BM25 on gpt-5.4-mini (81%) against DCI on gpt-5.4-nano (60%); because model capability differs, the result does not isolate the contribution of the bounded interaction space and therefore does not yet establish the claimed scaling advantage.
Authors: The referee correctly identifies that differing model sizes confound direct attribution to the interaction space. Our intent was to demonstrate practical scalability: RISE enables a more capable model to operate effectively at 1M scale, while DCI forces use of a weaker model with failures. However, to better isolate the effect, we will add a note clarifying this and, where possible, include a comparison using the same model (e.g., attempting DCI with mini at smaller scales or noting cost barriers). We will revise the text to avoid overstating the isolation of the bounded space contribution. revision: yes
-
Referee: [Evaluation] The central claim that a single BM25-retrieved set supplies every document the agent must read or grep requires that no gold document lies outside the top-k; the manuscript provides no analysis or failure-case enumeration of queries where term overlap is low, leaving the reliability of the boundary assumption untested and load-bearing for the reported 78% match.
Authors: This is an important point. The 78% match to DCI implies that for the queries DCI could solve, the BM25 space included the necessary documents, but we did not explicitly verify recall of gold documents. In revision, we will add an analysis of the fraction of queries where gold documents fall outside the BM25 top-k, and enumerate or characterize cases with low term overlap (e.g., using query-document term statistics). This will test the boundary assumption more rigorously. revision: yes
Circularity Check
Empirical proof-of-concept with no derivation chain
full rationale
The paper is a systems/empirical work presenting RISE as a proof-of-concept that uses BM25 to bound an interaction space for agentic search, then reports accuracy and cost numbers on BrowseComp-Plus. No equations, derivations, fitted parameters, or predictions are claimed. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the method. The central result (RISE matching DCI accuracy at lower cost) is an experimental observation, not a reduction of any claimed prediction to its own inputs. This is the normal non-circular case for an empirical retrieval paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Coding agents are effective long-context processors.arXiv preprint arXiv:2603.20432,
Weili Cao, Xunjian Yin, Bhuwan Dhingra, and Shuyan Zhou. Coding agents are effective long-context processors.arXiv preprint arXiv:2603.20432,
-
[2]
BrowseComp-Plus: A more fair and transparent evaluation benchmark of deep-research agent
Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, and Jimmy Lin. BrowseComp-Plus: A more fair and transparent evaluation benchmark of deep-research agent. arXiv ...
-
[3]
AgentIR: Reasoning-aware retrieval for deep research agents.arXiv preprint arXiv:2603.04384,
Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, and Victor Zhong. AgentIR: Reasoning-aware retrieval for deep research agents.arXiv preprint arXiv:2603.04384,
-
[4]
Tz-Huan Hsu, Jheng-Hong Yang, and Jimmy Lin. Rethinking agentic search with PI-SERINI: Is lexical retrieval sufficient?arXiv preprint arXiv:2605.10848,
-
[5]
SAGE: Benchmarking and improving retrieval for deep research agents.arXiv preprint arXiv:2602.05975,
Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, and Chen Zhao. SAGE: Benchmarking and improving retrieval for deep research agents.arXiv preprint arXiv:2602.05975,
-
[6]
Curran Associates Inc. ISBN 9781713829546. Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. WebThinker: Empowering large reasoning models with deep research capability. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026a. URL https://openreview.net/forum?id=7LKKHBAM...
-
[7]
Xing Han Lù. BM25S: Orders of magnitude faster lexical search via eager sparse scoring.arXiv preprint arXiv:2407.03618,
-
[8]
Revisiting text ranking in deep research
Chuan Meng, Litu Ou, Sean MacAvaney, and Jeff Dalton. Revisiting text ranking in deep research. arXiv preprint arXiv:2602.21456,
-
[9]
WebGPT: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332,
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback.arXiv preprint arXiv...
-
[10]
Is grep all you need? how agent harnesses reshape agentic search.arXiv preprint arXiv:2605.15184,
Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, and Vamse Kumar Subbiah. Is grep all you need? how agent harnesses reshape agentic search.arXiv preprint arXiv:2605.15184,
-
[11]
BrowseComp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516,
10 Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516,
-
[12]
The user prompt is the BrowseComp-Plus query under a one-lineQUESTION:header, shared across RISE and RISE-BM25
11 A RISE-BM25 System Prompt The RISE-BM25 runs use the following system prompt. The user prompt is the BrowseComp-Plus query under a one-lineQUESTION:header, shared across RISE and RISE-BM25. You answer research questions over a large document corpus you can’t see directly. You have three tools: - search(queries): search the corpus with one or more queri...
2000
-
[13]
sections
You restructure a plain-text document so a search agent (using bash tools ‘cat‘, ‘ sed‘, ‘grep‘) can navigate it. You will NOT rewrite the document. You only propose: - section boundaries (where each section begins), - a short heading for each section, and - a one-sentence description of what each section covers. A downstream script will use your output t...
2000
-
[14]
list of films released in 2020
- CATALOG / LIST documents are different and SHOULD be sectioned aggressively – one section per item is often correct: * A wildlife-identification guide listing 24 bird species→24 sections (one per species). * A "list of films released in 2020"→either one big "Films" section, or one section per studio/country/letter grouping. Not 0 sections. * A Wikipedia...
2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.