
arxiv: 2605.29307 · v1 · pith:MORKEZUXnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

GrepSeek: Training Search Agents for Direct Corpus Interaction

Pith reviewed 2026-06-29 07:53 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords search agents · direct corpus interaction · reinforcement learning · open-domain question answering · shell commands · policy optimization · information retrieval · GRPO

The pith

An LLM search agent trained to issue shell commands directly on the raw corpus outperforms retriever-based systems on seven open-domain QA benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that search agents can treat the full text corpus as an executable environment and locate evidence by running commands such as grep rather than querying a pre-built index. A two-stage process first generates stable training trajectories with an answer-aware Tutor and answer-blind Planner, then refines the policy with Group Relative Policy Optimization so the agent learns task-oriented search behavior through direct interaction. If this approach holds, it supplies a scalable alternative that removes the need for separate retriever training and index maintenance while still delivering higher token-level F1 and Exact Match scores than prior methods. The work also supplies a sharded-parallel executor that speeds up command execution by up to 7.6 times without changing results. This framing positions direct corpus interaction as a practical complement to existing retrieval pipelines rather than a replacement.

Core claim

GrepSeek trains a compact policy to find, filter, and compose evidence from large text corpora by issuing executable shell commands. A cold-start dataset is built from verified trajectories produced by an answer-aware Tutor and an answer-blind Planner; the policy is then refined with Group Relative Policy Optimization on the live corpus. A semantics-preserving sharded-parallel execution engine accelerates retrieval up to 7.6 times. Across seven open-domain question-answering benchmarks the resulting agent records the highest overall token-level F1 and Exact Match.
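
The equivalence property claimed for the sharded-parallel engine can be illustrated with a small sketch. This is not the paper's engine: it assumes a line-local search (each line is matched independently, as with a plain `grep PATTERN`), contiguous shards cut at line boundaries, and results concatenated in shard order. The function names `search_lines`, `make_shards`, and `sharded_search` are hypothetical, and Python threads are used only to show the ordering invariant, not to reproduce the reported speedup.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def search_lines(lines, pattern):
    """Sequential reference: every matching line, in corpus order."""
    rx = re.compile(pattern)
    return [ln for ln in lines if rx.search(ln)]


def make_shards(lines, num_shards):
    """Split the corpus into contiguous shards at line boundaries."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def sharded_search(lines, pattern, num_shards=4):
    """Run the same search on each shard, then concatenate in shard order."""
    shards = make_shards(lines, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # pool.map returns results in input order, so output order is stable.
        parts = list(pool.map(lambda s: search_lines(s, pattern), shards))
    return [ln for part in parts for ln in part]


# Made-up corpus lines, for illustration only.
corpus = [
    "Marie Curie was born in Warsaw.",
    "The Vistula flows through Warsaw.",
    "Paris is the capital of France.",
    "Pierre Curie was born in Paris.",
    "Grep searches text line by line.",
]

sequential = search_lines(corpus, r"born")
parallel = sharded_search(corpus, r"born", num_shards=3)
print(sequential == parallel)
```

Because the match decision for a line does not depend on any other line, splitting at line boundaries and concatenating in order reproduces the sequential output. Commands whose output depends on global position or global counts would need extra handling.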

What carries the argument

Two-stage training pipeline that first builds a cold-start dataset via Tutor/Planner trajectory generation and then applies Group Relative Policy Optimization directly on the corpus.
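
The "group relative" part of Group Relative Policy Optimization can be sketched briefly. The version below follows the commonly described formulation in which several trajectories are sampled for the same question and each reward is standardized against its own group. The reward values, the use of population standard deviation, and the `eps` constant are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some variants differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled trajectories on one question,
# e.g. an answer-quality score in [0, 1].
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Trajectories that score above their group's average receive a positive advantage and those below receive a negative one, so no separate learned value function is needed in this formulation. When every trajectory in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.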

If this is right

  • Direct corpus interaction via shell commands can achieve stronger token-level F1 and Exact Match than retriever-based agents on standard open-domain QA tasks.
  • The sharded-parallel executor makes byte-exact shell-based search practical at corpus scale by delivering up to 7.6 times speedup.
  • Purely lexical commands show clear limits on queries with substantial surface-form variation, indicating that DCI works best when surface matches are reliable.
  • DCI supplies a complementary retrieval method that can be combined with existing index-based systems in deployed search agents.
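
The surface-form limitation in the third point can be made concrete with a small sketch. It compares an exact fixed-string match (similar in spirit to `rg -F`) with a match after Unicode folding. The folding step is an illustrative assumption, not something the paper proposes, and the corpus lines are made up.

```python
import unicodedata


def fold(text):
    """Strip diacritics and lowercase (illustrative normalization only)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def exact_search(lines, needle):
    """Fixed-string match on the raw bytes of each line."""
    return [ln for ln in lines if needle in ln]


def folded_search(lines, needle):
    """Fixed-string match after folding both the query and the line."""
    key = fold(needle)
    return [ln for ln in lines if key in fold(ln)]


# Made-up corpus lines: the name appears without its accent.
corpus = [
    "Edouard Vaillant appears here without the accent.",
    "An unrelated line about something else.",
]

print(exact_search(corpus, "Édouard Vaillant"))   # no hit: surface forms differ
print(folded_search(corpus, "Édouard Vaillant"))  # hit after folding
```

Folding handles only one narrow kind of variation (accents and case). Paraphrases, aliases, and transliterations would still be missed by purely lexical matching, which is consistent with the limitation noted above.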

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage recipe could be tested on corpora an order of magnitude larger to check whether the stability benefit persists.
  • Replacing the lexical command set with a small set of semantic operators might reduce the surface-form limitation without losing the direct-interaction advantage.
  • Because the agent never builds an index, the approach may be especially useful for rapidly changing or permission-restricted document collections.
  • The Tutor/Planner cold-start technique might transfer to other agent domains that currently suffer from unstable early reinforcement learning.

Load-bearing premise

The two-stage pipeline of cold-start trajectory collection followed by GRPO refinement is enough to stabilize reinforcement learning on a large raw corpus.

What would settle it

Training the same base model with pure reinforcement learning from scratch on the same corpora and observing whether performance collapses or remains unstable on the seven benchmarks.

Figures

Figures reproduced from arXiv: 2605.29307 by Alireza Salemi, Atharva Nijasure, Chang Zeng, Fernando Diaz, Hamed Zamani, Jui-Hui Chung, Razieh Rahimi.

Figure 1: Comparison of retrieval-augmented agentic search and direct corpus interaction.
Figure 2: Workflow of GrepSeek: iterative interaction with corpus with shell commands.
To evaluate GrepSeek, we conduct experiments across seven knowledge-intensive question answering benchmarks spanning both single- and multi-hop questions. The single-hop benchmarks include Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), and PopQA (Mallen et al., 2023). The multi-hop benchmarks …
Figure 3: Efficiency and cost analysis of GrepSeek compared to dense retrieval baselines (E5 and Qwen3-4B). (a) Inference latency per query, broken down into LLM generation and tool execution time. (b) Memory footprint (RAM) required for the retrieval index. (c) Offline indexing cost measured in A100-hours. (d) Search tool latency of GrepSeek scaling with the number of shards.
Figure 4: Effect of the number of SFT trajectories on F1 scores (EM is in …
Figure 5: Training dynamics over 200 steps comparing …
Figure 6: System prompt for GrepSeek.
  • Bamboogle (Press et al., 2023): A smaller but highly challenging dataset consisting of questions manually authored to defeat standard search engines. It requires deep, multi-step evidence gathering that cannot be resolved using surface-level web snippets or simple entity linking.
We use the 2018 Wikipedia dump (Karpukhin et al., 2020) of 21M documents as the corpus.
Figure 7: System prompt describing the corpus and allowed shell tools.
Figure 8: Tutor prompt for decomposing multi-hop questions into single-hop steps.
Figure 9: Prompt for judging if the retrieved document entails the target answer.
Figure 10: Tutor prompt for refining a failed retrieval command.
Figure 11: Prompt for extracting the bridging entity required for backward chaining.
Figure 12: System prompt for the Planner agent used during forward assembly.
Figure 13: Standard user prompt for the Planner agent during trajectory generation.
Figure 14: User prompt for the final answer formulation step.
Figure 15: Tutor prompt for steering agent reasoning toward verified actions.
Figure 16: Prompt for the final quality gate checking for information leakage.
Figure 17: Effect of the number of Supervised Fine-Tuning (SFT) trajectories on the Exact Match …
Figure 18: System instructions for the backward retrieval task, emphasizing the anti-leak rule.
Figure 19: Tutor prompt for the initial attempt at backward evidence retrieval.
Original abstract

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces GrepSeek, a compact LLM-based search agent that performs direct corpus interaction (DCI) by issuing executable shell commands (e.g., grep) to locate, filter, and compose evidence from large text corpora, rather than relying on pre-indexed retrievers. To stabilize training, it uses a two-stage pipeline: (1) cold-start trajectory generation via an answer-aware Tutor and answer-blind Planner, followed by (2) refinement with Group Relative Policy Optimization (GRPO). A semantics-preserving sharded-parallel execution engine is proposed to accelerate shell-based retrieval by up to 7.6×. Experiments on seven open-domain QA benchmarks are reported to yield the strongest overall token-level F1 and Exact Match.

Significance. If the headline results hold after proper controls, the work supplies a practical alternative paradigm for search agents that can complement retrieval-based systems, especially where byte-exact or lexical precision is valuable. The sharded execution engine is a concrete engineering contribution with measurable speedup while preserving equivalence. The limitation analysis on surface-form variation is a useful caveat. However, the absence of ablations for the core stabilization claim limits attribution of gains to the proposed method.

major comments (1)
  1. [Abstract and §4 (Experiments)] The central claim that GrepSeek achieves the strongest F1/EM rests on the two-stage Tutor/Planner + GRPO pipeline being required to stabilize direct RL on large corpora. No comparison is reported against (a) GRPO initialized from scratch or (b) a single-stage supervised baseline using the same corpus, command vocabulary, and sharded engine. This ablation is load-bearing for the methodological contribution.
minor comments (2)
  1. [Abstract] The performance claim is stated without naming the competing systems, reporting statistical tests, or indicating whether gains are consistent across all seven benchmarks.
  2. [Method (execution engine)] The description of the sharded execution engine would benefit from an explicit statement of the equivalence invariant (byte-exact output) and any edge cases where sharding could alter command semantics.
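
One edge case of the kind the second minor comment asks about can be sketched as follows. The example assumes output that carries global line numbers (in the style of `grep -n`): running the command per shard and concatenating the results is no longer byte-exact unless each shard's numbers are shifted by the lines that precede it. The functions are hypothetical and run sequentially for clarity. Whether the paper's engine handles this case, or how, is not stated in the abstract.

```python
import re


def numbered_search(lines, pattern, start=1):
    """Emulate line-numbered output: 'lineno:line' for each matching line."""
    rx = re.compile(pattern)
    return [f"{i}:{ln}" for i, ln in enumerate(lines, start=start) if rx.search(ln)]


def naive_sharded(shards, pattern):
    """Concatenate per-shard output unchanged (numbers restart in each shard)."""
    out = []
    for shard in shards:
        out.extend(numbered_search(shard, pattern))
    return out


def offset_sharded(shards, pattern):
    """Shift each shard's numbering by the number of lines before it."""
    out, offset = [], 1
    for shard in shards:
        out.extend(numbered_search(shard, pattern, start=offset))
        offset += len(shard)
    return out


# Made-up corpus, split into two contiguous shards.
corpus = ["alpha", "beta grep", "gamma", "delta grep", "epsilon grep"]
shards = [corpus[:2], corpus[2:]]

sequential = numbered_search(corpus, "grep")
print(sequential)                      # ['2:beta grep', '4:delta grep', '5:epsilon grep']
print(naive_sharded(shards, "grep"))   # ['2:beta grep', '2:delta grep', '3:epsilon grep']
print(offset_sharded(shards, "grep") == sequential)
```

Similar care would presumably be needed for any output that depends on global state, such as match counts or "first N results" truncation, which is why an explicit statement of the invariant and its scope would help readers.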

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We agree that the requested ablations are important for strengthening the attribution of gains to the two-stage pipeline and will incorporate them in the revised manuscript.

point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim that GrepSeek achieves the strongest F1/EM rests on the two-stage Tutor/Planner + GRPO pipeline being required to stabilize direct RL on large corpora. No comparison is reported against (a) GRPO initialized from scratch or (b) a single-stage supervised baseline using the same corpus, command vocabulary, and sharded engine. This ablation is load-bearing for the methodological contribution.

    Authors: We agree that direct comparisons to GRPO initialized from scratch and to a single-stage supervised baseline (using identical corpus, command vocabulary, and sharded engine) are necessary to substantiate the claim that the two-stage Tutor/Planner + GRPO pipeline is required for stable training. In the revised manuscript we will add these ablations in §4, reporting the resulting F1/EM scores and training dynamics under the same experimental conditions. This will allow readers to assess the contribution of the proposed stabilization method more precisely.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; results on external benchmarks

full rationale

The paper evaluates GrepSeek on seven independent open-domain QA benchmarks using token-level F1 and Exact Match, quantities defined externally rather than by the method. The two-stage pipeline (cold-start via Tutor/Planner then GRPO) is motivated by an assumption about pure RL instability on large corpora, but this assumption is not derived from or equivalent to the reported performance metrics. No self-citations, fitted parameters renamed as predictions, or equations reducing outputs to inputs by construction appear in the abstract or described claims. The sharded execution engine is presented as an engineering optimization preserving byte-exact equivalence, which is independently verifiable.
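
For readers unfamiliar with the two metrics, the sketch below shows how token-level F1 and Exact Match are commonly computed for open-domain QA. It assumes SQuAD-style answer normalization (lowercasing, removing punctuation and English articles, collapsing whitespace). The paper's exact evaluation script is not shown in the abstract, so the details here are an assumption.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Warsaw.", "warsaw"))       # 1.0
print(token_f1("the city of Warsaw", "Warsaw"))   # 0.5
```

Both functions depend only on the predicted and gold answer strings, which is what makes them external to the search method being evaluated.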

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities; no equations or modeling choices are shown.

pith-pipeline@v0.9.1-grok · 5855 in / 1102 out tokens · 24622 ms · 2026-06-29T07:53:59.250140+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

    cs.LG 2026-06 conditional novelty 7.0

    KernelPro combines LLM code generation, roofline-guided tool orchestration, and domain-adapted MCTS to produce GPU kernels that outperform prior automated and some hand-tuned baselines on KernelBench and VeOmni workloads.

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

    Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147/. Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 conference on empirical methods i...

  2. [2]

    In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

    URL https://aclanthology.org/2023.acl-long.557/. Baoyi Wang, Xingliang Wang, Guochang Li, Chen Zhi, Junxiao Han, Xinkui Zhao, Nan Wang, Shuiguang Deng, and Jianwei Yin. Greprag: An empirical study and optimization of grep-like retrieval for code completion, 2026. URL https://arxiv.org/abs/2601.23254. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linju...

  3. [3]

    born" to quickly locate the specific biographical details. $ rg -F

    was a world-renowned scholar and translator of Russian, German and Polish. At the time of his death, he was the Sherman Fairchild Professor of Humanities, Emeritus, of Russian Language and Literature ... answer:(no answer) Why:Exact full-name match pins the right entity; dense similarity conflates two near-identically named people, and the agent loops to ...

  4. [4]

    Edouard Vaillant et la création de l’unité socialiste en Fr ... [think]The previous searches for

    Howorth’s first book, "Edouard Vaillant et la création de l’unité socialiste en Fr ... [think]The previous searches for "Édouard Vaillant" in the corpus did not return a direct biographical entry, but the question implies a specific fact is available. Since the corpus likely contains the answer in a different context or the entry was not captured by the i...