arxiv: 2605.05242 · v1 · submitted 2026-05-03 · 💻 cs.IR · cs.AI

Recognition: 1 theorem link

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Cong Wei, Dongfu Jiang, Hangxiao Zhu, Haoxiang Zhang, James Zou, Jianwen Xie, Jiawei Han, Jimmy Lin, Ming Zhong, Pan Lu, Ping Nie, Shangbin Feng, Wenhu Chen, Yejin Choi, Yi Lu, Yuyang Bai, Yuyu Zhang, Yu Zhang, Zhuofeng Li

Pith reviewed 2026-05-08 19:06 UTC · model grok-4.3

keywords direct corpus interaction · agentic search · information retrieval · language agents · retrieval interfaces · terminal tools · multi-hop QA

The pith

Agents retrieve information more effectively by directly probing raw text with tools like grep than by relying on semantic similarity interfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that conventional retrieval systems force agents into a single top-k similarity step that discards exact matches, local context, and weak clues too early for multi-step tasks. It proposes direct corpus interaction instead, where agents apply general terminal tools straight to unindexed text files to perform searches, checks, and refinements as needed. Experiments across IR benchmarks and agentic tasks show this method beats sparse, dense, and reranking baselines while delivering strong results on BrowseComp-Plus and multi-hop QA without any embedding models or APIs. Readers would care because it reframes retrieval quality as a matter of interface resolution that grows with agent capability rather than model scale alone.

Core claim

Direct corpus interaction lets an agent search the raw corpus using general-purpose terminal tools such as grep, file reads, and lightweight scripts without embedding models, vector indexes, or retrieval APIs; this setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets and reaches strong accuracy on BrowseComp-Plus and multi-hop QA.

What carries the argument

Direct corpus interaction (DCI), the approach of letting agents apply everyday terminal commands and scripts directly to raw text files to enforce exact lexical constraints, combine sparse clues, and revise plans from partial evidence.
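
To make the mechanism concrete, here is a minimal sketch of one DCI-style step: scanning raw `.txt` files, keeping only those that satisfy every lexical constraint, and returning nearby lines as local context. It is a pure-Python stand-in for chaining `grep` calls, not the paper's implementation; the function name `grep_conjunction`, the `context` parameter, and the two toy documents are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path


def grep_conjunction(root, terms, context=1):
    """Return (file name, line number, snippet) for files containing ALL terms.

    Hypothetical helper for illustration only.
    """
    patterns = [re.compile(re.escape(t), re.IGNORECASE) for t in terms]
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Exact lexical constraint: every term must occur somewhere in the file.
        if not all(p.search(text) for p in patterns):
            continue
        lines = text.splitlines()
        for i, line in enumerate(lines):
            # Report context around each line that matches the first term.
            if patterns[0].search(line):
                lo, hi = max(0, i - context), min(len(lines), i + context + 1)
                hits.append((path.name, i + 1, " | ".join(lines[lo:hi])))
    return hits


def demo():
    # Invented toy documents, written to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.txt").write_text(
            "Match report\nPlatt scored in Bologna\nEnd\n", encoding="utf-8"
        )
        Path(tmp, "b.txt").write_text(
            "Platt interview\nNo venue given\n", encoding="utf-8"
        )
        return grep_conjunction(tmp, ["Platt", "Bologna"])
```

Calling `demo()` keeps only `a.txt`, because `b.txt` satisfies one constraint but not both. An agent could tighten or relax `terms` after reading the snippet, which is the kind of refinement loop the section describes.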

If this is right

Agents no longer lose evidence filtered out in an early similarity step and can recover it through later direct checks.
Exact lexical constraints and multi-step hypothesis refinement become straightforward to implement without calling an off-the-shelf retriever.
Systems adapt to evolving local corpora with no offline indexing required.
Retrieval quality for agents depends on the resolution of the interaction interface as much as on reasoning ability.
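
The first point can be illustrated with a deliberately contrived toy example: a crude word-overlap ranker stands in for a similarity retriever, and a substring scan stands in for a direct check. The corpus, the query, and the scoring function are all invented for illustration, so this shows only the shape of the failure mode and is not evidence about any real retriever.

```python
def overlap_score(query, doc):
    """Jaccard overlap of lowercase word sets (toy stand-in for a similarity model)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def top_k(query, corpus, k):
    """Return the k document ids with the highest overlap score."""
    ranked = sorted(corpus, key=lambda name: overlap_score(query, corpus[name]), reverse=True)
    return ranked[:k]


def exact_scan(phrase, corpus):
    """Return every document id whose text contains the exact phrase."""
    return [name for name, text in corpus.items() if phrase in text]


# Contrived data: the only document with the identifier is long and off-topic.
CORPUS = {
    "d1": "report on regional football attendance figures",
    "d2": "football attendance report summary",
    "d3": "long archive page with many unrelated words and one line citing ZX-4471 in passing",
}
QUERY = "football attendance report mentioning ZX-4471"

print(top_k(QUERY, CORPUS, k=2))      # the similarity step keeps d2 and d1
print(exact_scan("ZX-4471", CORPUS))  # the direct check still finds d3
```

In this toy setup the document carrying the exact identifier falls outside the top 2, yet a later exact-match scan recovers it.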

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Agent architectures may treat terminal tool access as the primary retrieval layer for knowledge tasks rather than an add-on.
Efficiency of general tools could replace embedding quality as the main engineering target when corpora change frequently.
The same direct-interaction pattern might apply to other structured data sources where fixed APIs currently limit flexibility.

Load-bearing premise

Agents can efficiently and effectively use general-purpose terminal tools to explore and extract information from raw corpora of varying sizes without specialized indexing or retrieval infrastructure.

What would settle it

The performance advantage would be falsified by a controlled test in which direct tool calls recover fewer relevant passages than a standard top-k retriever on tasks that require conjoining multiple weak clues or checking local context.
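
A sketch of how such a test could be scored, assuming each method produces a list of document ids per question and a gold set is available. The function names and the example ids are placeholders, not data from the paper.

```python
def recall(retrieved, gold):
    """Fraction of gold documents that appear in the retrieved list."""
    gold = set(gold)
    if not gold:
        raise ValueError("gold set must be non-empty")
    return len(gold & set(retrieved)) / len(gold)


def compare(runs, gold):
    """Map each method name in runs ({name: [doc ids]}) to its recall."""
    return {name: recall(docs, gold) for name, docs in runs.items()}


# Placeholder ids only: a falsifying outcome would look like this,
# with the direct-interaction run recovering fewer gold passages.
example = compare(
    {"direct": ["a", "b"], "top_k": ["a", "c", "d"]},
    gold=["a", "b", "c", "d"],
)
print(example)
```

Averaging this per-question recall over a task set restricted to clue-conjunction or local-context questions would give the comparison the section describes.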

Figures

Figures reproduced from arXiv: 2605.05242 by Cong Wei, Dongfu Jiang, Hangxiao Zhu, Haoxiang Zhang, James Zou, Jianwen Xie, Jiawei Han, Jimmy Lin, Ming Zhong, Pan Lu, Ping Nie, Shangbin Feng, Wenhu Chen, Yejin Choi, Yi Lu, Yuyang Bai, Yuyu Zhang, Yu Zhang, Zhuofeng Li.

**Figure 1.** Pareto frontier of performance vs. cost on BrowseComp-Plus, comparing two paradigms: …

**Figure 2.** Two retrieval interfaces for agentic search. (Left) Retriever-mediated retrieval relies on offline indexing over a corpus and a retriever: the agent queries the retriever and reasons over the returned top-k candidates. (Right) In contrast, direct corpus interaction bypasses preprocessing and any separate retriever: the agent searches the raw corpus directly using general-purpose terminal tools such as grep…

**Figure 3.** Visualization of runtime context-management strategies for long-horizon DCI. We use …

**Figure 4.** Left: Results on all 830 BrowseComp-Plus questions with the Sonnet 4.6 backbone, comparing DCI-Agent-CC to the retrieval agent using Qwen3-Embedding-8B as the retriever. Right: Distribution of tool calls and Bash intents across all DCI-Agent-CC runs, illustrating how the dominant Bash tool decomposes into ten concrete command intents.

**Figure 5.** Corpus-scaling results for DCI-Agent-CC on a BrowseComp-Plus subset (n = 100)

**Figure 6.** Distribution of Bash command patterns in DCI-Agent-Lite trajectories across 100 cases

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search, it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To tackle the limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus, with which DCI opens a broader interface-design space for agentic search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims agents using raw terminal tools on corpora beat retrievers on BRIGHT/BEIR and agent tasks, but the abstract supplies no numbers or controls to check it.

The core pitch is that conventional retrievers create a bottleneck for agents because they force everything through a single top-k similarity step. Instead, the authors let the agent use plain shell tools like grep and file reads to explore the raw corpus directly, with no embeddings or indexes. They report this direct corpus interaction wins on several BRIGHT and BEIR sets plus BrowseComp-Plus and multi-hop QA.

That framing of the interface as the variable worth changing is the clearest new angle here, and it makes sense for settings where corpora change or stay local. The no-index requirement is also practical for some use cases. Credit to them for testing the idea end-to-end rather than stopping at retrieval metrics alone.

The main weakness is that the abstract states outperformance without any tables, protocols, dataset sizes, or error breakdowns, so the size of the gains and the controls used stay invisible. The stress-test point about full-scale corpora is worth taking seriously: if the agent has to scan millions of documents repeatedly or if the scripts start to look like ad-hoc indexes, the claimed advantage over retrievers becomes harder to attribute to interface resolution. Without seeing the actual experimental details it is difficult to know whether the results hold on the full collections or only on easier subsets.

This is the kind of work that would interest people building agentic search systems or rethinking RAG pipelines. A reader already thinking about tool use and multi-step reasoning could extract the perspective even if the numbers need more scrutiny. I would send it to peer review so the experiments can be checked properly rather than desk-rejecting on the abstract alone.

Referee Report

2 major / 2 minor

Summary. The paper proposes Direct Corpus Interaction (DCI) as an alternative retrieval paradigm for agentic search. Instead of exposing corpora through fixed lexical or semantic similarity interfaces that return top-k results, DCI lets LLM agents interact directly with raw corpus files using general-purpose terminal tools (grep, file reads, shell commands, lightweight scripts). The authors claim this higher-resolution interface enables better handling of exact constraints, clue conjunctions, and multi-step reasoning, yielding substantial outperformance over strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets plus strong accuracy on BrowseComp-Plus and multi-hop QA, all without any conventional semantic retriever or offline index.

Significance. If the empirical claims hold under the stated conditions, the work would demonstrate that retrieval quality for capable agents depends as much on interface resolution as on model reasoning, opening a broader design space beyond similarity-based abstractions. The no-indexing property is practically attractive for dynamic or local corpora. The absence of quantitative results, protocols, or controls in the abstract, however, leaves the magnitude and robustness of the reported gains unevaluable at present.

major comments (2)

[Abstract] The central claim of substantial outperformance on BRIGHT and BEIR (and strong accuracy on BrowseComp-Plus/multi-hop QA) is asserted without any numerical results, experimental protocols, error analysis, or controls. This prevents assessment of whether the gains are large enough, statistically reliable, or attributable to DCI rather than other factors.
[Experimental Evaluation] Methods and results: the headline attribution of gains to the 'higher-resolution DCI interface' rather than conventional retrievers rests on the assumption that agents used only general-purpose terminal tools on full-scale raw corpora (tens of thousands to >1M documents). The manuscript must explicitly report corpus sizes actually used, total tool-call budgets, confirmation that no custom indexing scripts or down-sampling occurred, and evidence that exhaustive or narrow-command searches remained tractable; otherwise the interface-resolution argument does not follow from the results.
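
As an illustration of the reporting the second comment asks for, the sketch below counts the documents and bytes actually present in a corpus directory and logs tool calls against a fixed budget. `corpus_stats`, `ToolCallLog`, and the budget value are hypothetical names and settings, not part of the paper's tooling.

```python
import tempfile
from collections import Counter
from pathlib import Path


def corpus_stats(root):
    """Count files and total bytes under root, i.e. the corpus actually searched."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return {"documents": len(files), "bytes": sum(p.stat().st_size for p in files)}


class ToolCallLog:
    """Hypothetical per-question log that enforces a fixed tool-call budget."""

    def __init__(self, budget):
        self.budget = budget
        self.calls = Counter()

    def record(self, tool):
        # Refuse the call once the budget has been used up.
        if sum(self.calls.values()) >= self.budget:
            raise RuntimeError("tool-call budget exhausted")
        self.calls[tool] += 1

    def summary(self):
        return {"total": sum(self.calls.values()), "by_tool": dict(self.calls)}


def demo():
    # Two invented files of 3 and 5 bytes stand in for a corpus.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "doc1.txt").write_text("abc", encoding="utf-8")
        Path(tmp, "doc2.txt").write_text("defgh", encoding="utf-8")
        stats = corpus_stats(tmp)
    log = ToolCallLog(budget=2)
    log.record("grep")
    log.record("cat")
    return stats, log.summary()
```

Publishing the output of something like `corpus_stats` per benchmark, together with per-question `summary()` totals, would let a reader check whether full-scale corpora were searched and at what cost.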

minor comments (2)

[Methods] Clarify the precise boundary between 'general-purpose terminal tools' and any agent-written scripts that might implicitly re-implement indexing or caching; this distinction is essential for reproducibility.
[Abstract] The abstract and introduction would benefit from a short table or bullet list of the key quantitative improvements (e.g., nDCG or accuracy deltas) to make the claims immediately evaluable.
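
For readers who want to reproduce an nDCG delta of the kind requested, here is a minimal sketch of nDCG@k with linear gains and a log2 rank discount. This is one common variant; the paper's exact evaluation script is not shown in the source, so treat the function names and the example ids as assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_ids, gold_relevance, k):
    """nDCG@k for one query; gold_relevance maps doc id to graded relevance."""
    gains = [gold_relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(gold_relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Placeholder ids: two relevant documents, one run ranks an irrelevant one second.
gold = {"a": 1, "b": 1}
perfect = ndcg_at_k(["a", "b", "x"], gold, k=3)
imperfect = ndcg_at_k(["a", "x", "b"], gold, k=3)
delta = perfect - imperfect
```

Because irrelevant documents occupy ranks and push relevant ones down, this metric penalizes both missing gold documents and padding the list, so a reported delta reflects precision as well as recall.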

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to enhance transparency and evaluability of our results.

Referee: [Abstract] The central claim of substantial outperformance on BRIGHT and BEIR (and strong accuracy on BrowseComp-Plus/multi-hop QA) is asserted without any numerical results, experimental protocols, error analysis, or controls. This prevents assessment of whether the gains are large enough, statistically reliable, or attributable to DCI rather than other factors.

Authors: We agree that the abstract would benefit from quantitative support to make the claims more immediately assessable. In the revised manuscript we will incorporate key numerical results (e.g., relative gains on the relevant BRIGHT and BEIR subsets together with absolute accuracies on BrowseComp-Plus and the multi-hop QA tasks) while remaining within length limits. A brief reference to the evaluation protocol will also be added so readers can contextualize the reported figures.

revision: yes

Referee: [Experimental Evaluation] Methods and results: the headline attribution of gains to the 'higher-resolution DCI interface' rather than conventional retrievers rests on the assumption that agents used only general-purpose terminal tools on full-scale raw corpora (tens of thousands to >1M documents). The manuscript must explicitly report corpus sizes actually used, total tool-call budgets, confirmation that no custom indexing scripts or down-sampling occurred, and evidence that exhaustive or narrow-command searches remained tractable; otherwise the interface-resolution argument does not follow from the results.

Authors: We concur that these experimental controls are essential for substantiating the interface-resolution claim. The current manuscript states that DCI operates directly on raw corpora with general-purpose tools and no indexing, but we will expand the experimental sections to include: (i) explicit corpus sizes for every benchmark (confirming full-scale usage without down-sampling), (ii) statistics on tool-call budgets, (iii) explicit confirmation that no custom indexing scripts were employed, and (iv) discussion of search tractability, including any observed limits and the strategies used to keep exhaustive or narrow-command searches feasible. These additions will be placed in a new or expanded subsection so the attribution to DCI follows directly from the reported conditions.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential fits

The paper advances an empirical claim that direct corpus interaction via general terminal tools outperforms conventional retrievers on BRIGHT/BEIR and agentic tasks. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text or abstract. The central argument rests on benchmark results that are externally falsifiable and do not reduce to any definitional or constructional equivalence with the proposed interface. This is the expected non-finding for an engineering/empirical IR paper without mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the premise that general terminal tools suffice for effective corpus interaction in agentic workflows.

axioms (1)

domain assumption: General-purpose terminal tools enable agents to perform effective multi-step exploration and evidence gathering on raw corpora.
This assumption is required for DCI to be viable and superior to indexed retrieval systems.

invented entities (1)

Direct Corpus Interaction (DCI) · no independent evidence
purpose: An interface paradigm allowing agents to search raw corpora directly with terminal tools instead of similarity-based retrievers.
DCI is introduced as the core alternative method but has no independent falsifiable evidence outside the paper's claimed experiments.

pith-pipeline@v0.9.0 · 5633 in / 1233 out tokens · 38942 ms · 2026-05-08T19:06:07.235769+00:00 · methodology
