Towards Retrieving Interaction Spaces for Agentic Search

Hengxin Fun; Jimmy Lin; Shengyao Zhuang; Xueguang Ma; Yuansheng Ni

arxiv: 2606.06880 · v1 · pith:MGRC4OZMnew · submitted 2026-06-05 · 💻 cs.IR

Shengyao Zhuang, Yuansheng Ni, Hengxin Fun, Jimmy Lin, Xueguang Ma

Pith reviewed 2026-06-27 21:04 UTC · model grok-4.3

classification 💻 cs.IR

keywords agentic search · interaction space · retrieval · BM25 · direct corpus interaction · shell tools · BrowseComp-Plus

The pith

Retrieval for agentic search should construct bounded interaction spaces that agents explore with tools instead of only ranking documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that retrieval methods inherited from non-agentic search only select documents for an LLM context window, while direct corpus interaction via shell commands fails to scale because broad commands scan the entire corpus. It argues that retrieval's proper role is to define an interaction space: a bounded subset of the corpus that the agent can navigate with associated tools. RISE implements this idea by applying BM25 to set the space boundary and pre-processing the documents inside it for shell-style navigation during indexing. On the evaluated task this yields accuracy matching the unbounded baseline while cutting per-query cost substantially and avoiding the latency and failure spikes that appear as the corpus reaches one million documents.

Core claim

The central claim is that an interaction space constructed by a first-stage retriever enables an agent to reach the same task accuracy as unbounded direct corpus interaction while incurring roughly one quarter the per-query cost and maintaining performance as corpus size grows to one million documents.

What carries the argument

RISE (Retrieving Interaction SpacE), which uses BM25 to supply the boundary of the interaction space and processes documents at indexing time to support shell-style navigation inside that space.
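
A minimal sketch of the boundary step, assuming a plain Okapi BM25 scorer written from scratch. The tokenizer, the `k1` and `b` defaults, the function names, and the toy corpus are all illustrative assumptions; the paper's actual BM25 implementation and settings are not described here.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (id -> text) against `query` with Okapi BM25."""
    if not docs:
        return {}
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def interaction_space(query, docs, k):
    """Ids of the top-k documents with a positive score; this set acts as the boundary."""
    scores = bm25_scores(query, docs)
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [d for d in ranked[:k] if scores[d] > 0]


# Toy corpus (illustrative only).
corpus = {
    "a.txt": "bm25 ranks documents by term overlap",
    "b.txt": "shell tools such as grep read files",
    "c.txt": "agents explore a bounded space with grep",
}
space = interaction_space("grep shell tools", corpus, k=2)
```

Here the agent would only ever see `b.txt` and `c.txt`; `a.txt` shares no query term and stays outside the space.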

If this is right

On BrowseComp-Plus, RISE matches the pure-shell DCI baseline at 78 percent accuracy with gpt-5.4-mini at roughly one quarter of the per-query cost.
At one million documents, RISE-BM25 reaches 81 percent accuracy on gpt-5.4-mini while DCI on gpt-5.4-nano degrades to 60 percent with 33 of 100 wall-clock failures.
Pre-processing documents for shell-style navigation during indexing allows the agent to use grep and file reads without scanning the whole corpus on every query.
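
The indexing-time processing can be pictured with a small sketch. It assumes section boundaries have already been proposed (the paper's prompt excerpts suggest an LLM does this offline) and only shows how a line-numbered table of contents lets a reader jump to a span without a full read. The function names and the toy document are hypothetical.

```python
def build_toc(lines, sections):
    """Turn section starts into line ranges.

    sections: list of (start_line, heading), 1-indexed and sorted by start_line.
    Returns (start, end, heading) entries; the last section runs to end of file.
    """
    toc = []
    for i, (start, heading) in enumerate(sections):
        end = sections[i + 1][0] - 1 if i + 1 < len(sections) else len(lines)
        toc.append((start, end, heading))
    return toc


def render_toc(toc):
    # One line per section, in a form an agent can scan cheaply.
    return [f"L{start}-{end}: {heading}" for start, end, heading in toc]


def read_span(lines, start, end):
    # Similar in spirit to `sed -n 'START,ENDp' file`.
    return lines[start - 1:end]


# Toy document (illustrative only).
doc = [
    "Intro line",
    "More intro",
    "Method line",
    "Acknowledgements",
    "Thanks to the reviewers",
]
toc = build_toc(doc, [(1, "Introduction"), (3, "Method"), (4, "Acknowledgements and credits")])
```

An agent that matches a clue to the last TOC entry can call `read_span(doc, 4, 5)` and read two lines instead of the whole file.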

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same boundary-plus-navigation design could be tested on agent tasks that currently rely on repeated retrieval rounds.
Replacing BM25 with a stronger first-stage retriever might tighten the interaction space further without losing coverage.
The approach suggests a general pattern in which retrieval hands the agent a navigable subspace rather than a static list of passages.

Load-bearing premise

A standard first-stage retriever such as BM25 can reliably supply a boundary that still contains all information the agent needs to solve the task without the need for iterative re-retrieval or expansion of the space.

What would settle it

A set of queries where the information required to reach the correct answer lies outside the BM25-retrieved interaction space, causing the agent to fail even when given unlimited exploration time inside the space.
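
One way to look for such queries, sketched under the assumption that gold document ids are available per query (the data layout and names are hypothetical, not the paper's evaluation code):

```python
def boundary_coverage(spaces, gold):
    """Check whether each query's gold documents lie inside its interaction space.

    spaces: query id -> set of doc ids inside the space.
    gold:   query id -> set of doc ids needed to answer.
    Returns (fraction of queries fully covered, ids of queries with a gold doc outside).
    """
    missed = sorted(
        q for q, needed in gold.items() if not needed <= spaces.get(q, set())
    )
    covered = 1 - len(missed) / len(gold) if gold else 0.0
    return covered, missed


# Toy data (illustrative only).
spaces = {"q1": {"d1", "d2"}, "q2": {"d3"}}
gold = {"q1": {"d1"}, "q2": {"d4"}}
covered, missed = boundary_coverage(spaces, gold)
```

Queries listed in `missed` are the ones where exploration inside the space cannot succeed, however long the agent runs.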

Figures

Figures reproduced from arXiv: 2606.06880 by Hengxin Fun, Jimmy Lin, Shengyao Zhuang, Xueguang Ma, Yuansheng Ni.

**Figure 1.** Accuracy vs. per-query cost on BrowseComp-Plus. Each point is one model–architecture pair; the dashed line is the RISE frontier. At 100k, RISE matches DCI at 78% accuracy on mini while reducing cost from $1.10 to $0.28/query. At 1M, RISE-BM25 (the bounding mechanism alone) remains stable after adding 900k FineWeb-Edu distractors; the 1M DCI run is reported for nano, where accuracy drops to 60% with 33/100 …

**Figure 2.** RISE workflow. BM25 supplies the boundary (consequence i): candidate files are imported from the full corpus into a per-query interaction space that lives outside the context window. The imported files are TOC-indexed by an offline processing step (consequence ii): each file exposes a line-numbered table of contents so shell tools can land on relevant spans without a full read. The agent then iterates betw…

**Figure 3.** Two case-study trajectories (gpt-5.4-mini, 100k corpus). (a) RISE-BM25: an obfuscated question is decomposed into many BM25 sub-queries whose union forms the workspace, after which bash/read verify the answer in the workspace. (b) RISE: the agent reads a document’s frontmatter and line-numbered table of contents, matches the “Acknowledgements” clue to the TOC entry L259–265: Acknowledgements and credits, …
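
The "roughly one quarter" cost figure can be checked against the two per-query dollar amounts quoted in the Figure 1 caption. The variable names are illustrative.

```python
dci_cost = 1.10   # dollars per query, DCI on mini at 100k (Figure 1 caption)
rise_cost = 0.28  # dollars per query, RISE on mini at 100k (Figure 1 caption)

ratio = rise_cost / dci_cost  # share of the DCI cost that RISE still pays
saving = 1 - ratio            # share of the DCI cost that RISE removes

print(f"RISE costs {ratio:.0%} of DCI per query, a {saving:.0%} reduction")
```

This prints a ratio of 25% and a reduction of 75%, consistent with both the "one quarter" and "three-quarters" phrasings used on this page.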

Original abstract

Retrieval for search agents is still inherited from non-agentic information retrieval: a retriever ranks the corpus and the agent reads a small set of returned documents. Recent direct corpus interaction (DCI) work shows that agents can instead interact with the raw corpus through shell tools such as grep and file reads. But unbounded interaction does not scale: every broad shell command is a scan over the whole corpus, and latency degrades sharply as the corpus grows. We argue that the role of retrieval for agentic search is not just to select documents that fit in the LLM context window, but to construct an interaction space: a bounded subset of the corpus the agent can explore with associated tools. Two design consequences follow. The space needs a boundary supplied by retrieval, and the objects within it should be processed for interaction. As a proof of concept, we propose RISE (Retrieving Interaction SpacE): we use BM25 to construct the interaction space; meanwhile, its documents are processed during indexing for shell-style navigation. On BrowseComp-Plus, RISE matches the pure-shell DCI baseline at 78% accuracy with gpt-5.4-mini at roughly one quarter of the per-query cost. At 1M documents, RISE-BM25 reaches 81% on gpt-5.4-mini, whereas DCI on gpt-5.4-nano degrades to 60% with 33 of 100 wall-clock failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RISE gives a practical way to bound agentic search with BM25 so shell tools stay cheap, but the results rest on an untested assumption that the boundary always contains what the agent needs.

The core point is that retrieval for agents should create a bounded interaction space the model can explore with tools, rather than just handing over a few documents or letting the agent scan everything. RISE implements this by using BM25 for the boundary and pre-processing documents for shell-style navigation.

The paper does a clean job naming the scaling problem with pure direct corpus interaction and showing a workable alternative. On BrowseComp-Plus, RISE matches the unbounded DCI baseline at 78% accuracy with gpt-5.4-mini while cutting per-query cost by roughly three-quarters. At one million documents, RISE-BM25 reaches 81% on gpt-5.4-mini, while DCI, run on gpt-5.4-nano, drops to 60% with 33 of 100 wall-clock failures.

The main weakness is that everything hinges on the first-stage retriever never missing a document the agent will need. If a gold file has low term overlap with the query, it sits outside the space and the agent cannot reach it. The abstract gives no error bars, no breakdown by query difficulty, and no count of how large the typical space ends up being, so the 78% match only shows it worked for this particular test distribution.

The work is aimed at people building tool-using agents that must operate over growing corpora. Readers who care about deployment cost and latency will find the numbers and the framing useful even if they want tighter validation.

It deserves a serious referee. The scaling issue is real and the proposed fix is simple enough to test further; the authors should be asked for failure-case analysis and variance numbers, but the idea itself is worth the review time.

Referee Report

3 major / 0 minor

Summary. The paper proposes RISE (Retrieving Interaction SpacE) as a proof-of-concept approach to agentic search that uses BM25 to construct a bounded interaction space from the corpus; documents in this space are pre-processed for shell-style navigation tools. It claims that on BrowseComp-Plus, RISE matches the accuracy of unbounded direct corpus interaction (DCI) at 78% using gpt-5.4-mini while incurring roughly one quarter the per-query cost, and that at a 1M-document scale RISE-BM25 reaches 81% accuracy on gpt-5.4-mini whereas DCI on gpt-5.4-nano falls to 60% with frequent wall-clock failures.

Significance. If the central empirical result holds, the work would be significant for demonstrating that retrieval can usefully bound the interaction space for agents rather than merely supplying a context window, thereby improving scalability of direct corpus interaction methods. The concrete accuracy and cost numbers at 1M scale constitute a useful data point for the emerging agentic-IR literature.

major comments (3)

[Abstract] The headline claim that RISE 'matches the pure-shell DCI baseline at 78% accuracy' is reported without error bars, statistical significance tests, or any description of how many documents are typically included in the BM25-constructed space or how the space boundary is exactly enforced; this leaves the central empirical comparison only weakly supported.
[Scaling experiment] In the 1M-document setting, the comparison pits RISE-BM25 on gpt-5.4-mini (81%) against DCI on gpt-5.4-nano (60%); because model capability differs, the result does not isolate the contribution of the bounded interaction space and therefore does not yet establish the claimed scaling advantage.
[Evaluation] The central claim that a single BM25-retrieved set supplies every document the agent must read or grep requires that no gold document lies outside the top-k; the manuscript provides no analysis or failure-case enumeration of queries where term overlap is low, leaving the reliability of the boundary assumption untested and load-bearing for the reported 78% match.
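
To make the first comment concrete, a Wilson score interval gives a sense of how much uncertainty a single accuracy figure carries. This is a generic statistical sketch, not an analysis from the paper. The sample size `n = 100` is an assumption borrowed from the "33 of 100" denominator reported for the 1M nano run; the query count behind the 78% figure is not stated on this page.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed: 78 correct out of 100 queries (n is illustrative, see lead-in).
low, high = wilson_interval(78, 100)
```

Under that assumed `n`, the interval spans roughly 0.69 to 0.85, wide enough that a "match" between two single-run accuracies says little without repeated runs.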

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.

Referee: [Abstract] The headline claim that RISE 'matches the pure-shell DCI baseline at 78% accuracy' is reported without error bars, statistical significance tests, or any description of how many documents are typically included in the BM25-constructed space or how the space boundary is exactly enforced; this leaves the central empirical comparison only weakly supported.

Authors: We agree that additional details would strengthen the claim. In the revised manuscript, we will include a description of the BM25 interaction space construction, including typical sizes (top-200 documents in our experiments) and the enforcement mechanism (the agent is provided only with tools scoped to the retrieved set). Regarding error bars and significance, the main results are from single runs due to computational cost; we will add a note on variance from pilot runs and, if feasible, report results over 3 seeds for key comparisons. This addresses the support for the comparison.

revision: partial

Referee: [Scaling experiment] In the 1M-document setting, the comparison pits RISE-BM25 on gpt-5.4-mini (81%) against DCI on gpt-5.4-nano (60%); because model capability differs, the result does not isolate the contribution of the bounded interaction space and therefore does not yet establish the claimed scaling advantage.

Authors: The referee correctly identifies that differing model sizes confound direct attribution to the interaction space. Our intent was to demonstrate practical scalability: RISE enables a more capable model to operate effectively at 1M scale, while DCI forces use of a weaker model with failures. However, to better isolate the effect, we will add a note clarifying this and, where possible, include a comparison using the same model (e.g., attempting DCI with mini at smaller scales or noting cost barriers). We will revise the text to avoid overstating the isolation of the bounded space contribution.

revision: yes

Referee: [Evaluation] The central claim that a single BM25-retrieved set supplies every document the agent must read or grep requires that no gold document lies outside the top-k; the manuscript provides no analysis or failure-case enumeration of queries where term overlap is low, leaving the reliability of the boundary assumption untested and load-bearing for the reported 78% match.

Authors: This is an important point. The 78% match to DCI implies that for the queries DCI could solve, the BM25 space included the necessary documents, but we did not explicitly verify recall of gold documents. In revision, we will add an analysis of the fraction of queries where gold documents fall outside the BM25 top-k, and enumerate or characterize cases with low term overlap (e.g., using query-document term statistics). This will test the boundary assumption more rigorously.

revision: yes
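
The enforcement mechanism mentioned in the first response (tools scoped to the retrieved set) could look something like the sketch below. It is an in-memory stand-in, not the paper's harness: the `ScopedCorpus` class, its method names, and the toy documents are all hypothetical.

```python
import re


class ScopedCorpus:
    """Expose grep/read over only the documents inside the interaction space."""

    def __init__(self, docs, space):
        # Keep only documents whose ids are in the space; the rest are unreachable.
        self._docs = {d: docs[d] for d in space if d in docs}

    def grep(self, pattern):
        """Return (doc_id, line_number, line) for every matching line in the space."""
        rx = re.compile(pattern)
        hits = []
        for doc_id in sorted(self._docs):
            for n, line in enumerate(self._docs[doc_id].splitlines(), start=1):
                if rx.search(line):
                    hits.append((doc_id, n, line))
        return hits

    def read(self, doc_id, start, end):
        """Return lines start..end (1-indexed, inclusive) of a document in the space."""
        if doc_id not in self._docs:
            raise KeyError(f"{doc_id} is outside the interaction space")
        return self._docs[doc_id].splitlines()[start - 1:end]


# Toy data (illustrative only): "b" matches the pattern but is outside the space.
docs = {"a": "alpha\nbeta grep", "b": "grep here"}
scoped = ScopedCorpus(docs, space=["a"])
hits = scoped.grep("grep")
```

Because every scan iterates over the space rather than the corpus, the cost of a broad pattern depends on the size of the space, which is the scaling property the paper argues for.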

Circularity Check

0 steps flagged

Empirical proof-of-concept with no derivation chain

The paper is a systems/empirical work presenting RISE as a proof-of-concept that uses BM25 to bound an interaction space for agentic search, then reports accuracy and cost numbers on BrowseComp-Plus. No equations, derivations, fitted parameters, or predictions are claimed. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the method. The central result (RISE matching DCI accuracy at lower cost) is an experimental observation, not a reduction of any claimed prediction to its own inputs. This is the normal non-circular case for an empirical retrieval paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that BM25 can serve as an effective boundary without further justification.

