Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
Pith reviewed 2026-05-15 03:03 UTC · model grok-4.3
The pith
Grep retrieval often beats vector search for accuracy in LLM agent workflows, though harness and tool-calling style drive most of the performance difference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in the paper's Experiment 1 comparisons; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.
What carries the argument
Head-to-head accuracy comparison of grep versus vector retrieval inside multiple agent harnesses, with explicit variation in inline tool results versus separate file-based results.
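To make the compared tool-calling styles concrete, here is a minimal mock-up of the two result presentations being varied: inline results placed directly in the tool message versus results written to a file the model must read in a separate call. The message shapes and function names are illustrative assumptions, not the paper's actual harness code.

```python
import json
import tempfile

def search(query: str) -> str:
    # Stand-in for any retrieval call (grep or vector); returns matched text.
    return f"[matches for {query!r} would appear here]"

def tool_result_inline(query: str) -> dict:
    # Inline style: the full result text travels in the tool message itself,
    # so the model sees it immediately on its next turn.
    return {"role": "tool", "content": search(query)}

def tool_result_file(query: str) -> dict:
    # File-based style: results are written to disk and only a path is
    # returned; the model must issue a separate read call to see them.
    f = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
    f.write(search(query))
    f.close()
    return {"role": "tool", "content": json.dumps({"results_path": f.name})}
```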
If this is right
- Grep can serve as a stronger default retrieval method than vector search when agents operate over conversation transcripts.
- Changes in harness architecture or tool-result formatting can shift accuracy by larger margins than the choice between grep and vector retrieval.
- Performance gaps widen when each query is surrounded by increasing amounts of irrelevant history, making retrieval choice more consequential in noisier settings.
- Agent builders should measure harness effects separately from retrieval effects rather than assuming one retrieval method will dominate across all designs.
Where Pith is reading between the lines
- Engineers could reduce system complexity by testing grep first before investing in embedding infrastructure for agent search tasks (see the sketch after this list).
- The results point to harness-level optimizations as a higher-leverage research target than further refinements to retrieval algorithms alone.
- Similar head-to-head tests on codebases or document collections outside conversation history would clarify whether the grep advantage is domain-specific.
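As referenced above, a grep-first baseline can be stood up in a few lines before any embedding pipeline exists. The sketch below assumes a hypothetical layout of one conversation transcript per .txt file and a grep that supports --include (GNU or BSD); it illustrates the kind of test suggested, not the paper's tooling.

```python
import subprocess

def grep_search(query: str, transcripts_dir: str, max_hits: int = 20) -> list[str]:
    """Case-insensitive fixed-string search over conversation transcripts.

    Returns up to max_hits lines formatted as 'path:line_no:text', which an
    agent can cite directly or follow up on with a targeted file read.
    """
    proc = subprocess.run(
        ["grep", "-rinF", "--include=*.txt", query, transcripts_dir],
        capture_output=True, text=True,
    )
    return proc.stdout.splitlines()[:max_hits]

# Example: find where the user mentioned a flight booking.
# print(grep_search("flight", "./transcripts"))
```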
Load-bearing premise
The 116-question sample and the four chosen harness implementations are representative of typical agentic search behavior.
What would settle it
Re-running the identical experiment 1 protocol on a fresh sample of several hundred questions drawn from a different long-context corpus and finding that vector retrieval matches or exceeds grep accuracy in the majority of harnesses.
original abstract
Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports two controlled experiments comparing grep versus vector retrieval in LLM agentic search. Experiment 1 evaluates both methods on a 116-question subset of LongMemEval using a custom harness (Chronos) and three provider CLI harnesses (Claude Code, Codex, Gemini CLI), testing inline versus file-based tool-result presentation. Experiment 2 adds increasing amounts of unrelated conversation history as distractors while holding retrieval method fixed. The central empirical claim is that grep produces higher accuracy than vector retrieval across harnesses in Experiment 1, yet absolute performance remains strongly modulated by harness and tool-calling style even on identical underlying data.
Significance. If the observed gap is robust, the work supplies concrete evidence that retrieval strategy interacts with agent architecture and output formatting in ways that current RAG-centric designs may overlook. The controlled distractor experiment further quantifies robustness to noise, which is practically relevant for long-context agent deployments. The study is purely empirical with no derivations or fitted parameters, so its value rests on the reproducibility and fairness of the baselines.
major comments (3)
- [Experiment 1] Experiment 1 (Methods and Results): the vector-retrieval baseline is described only as 'vector retrieval' without specifying the embedding model, chunking strategy, top-k value, similarity metric, or indexing method. Because grep is deterministic exact-match while vector performance is highly sensitive to these choices, the reported accuracy advantage cannot be interpreted as intrinsic to grep versus a properly tuned vector system.
- [Results] Results section (both experiments): no statistical significance tests, error bars, or confidence intervals are reported for the accuracy differences between grep and vector retrieval. With a fixed 116-question sample and multiple harnesses, it is impossible to assess whether the observed gaps exceed sampling variability.
- [Experiment 2] Experiment 2: the procedure for mixing unrelated conversation history is not fully specified (e.g., how distractors are selected, whether they are appended before or after relevant passages, and how the total context length is controlled). This detail is load-bearing for the claim that performance degrades differently under the two retrieval regimes.
minor comments (2)
- [Abstract] The abstract states 'grep generally yields higher accuracy' but the full results tables should include per-harness breakdowns with exact percentages to allow readers to verify the qualifier 'generally'.
- [Methods] Clarify whether the same embedding model and hyperparameters were used for both Chronos and the provider CLIs, or whether each harness employed its native retrieval implementation.
Simulated Author's Rebuttal
We thank the referee for their detailed comments, which have helped us improve the manuscript. Below we provide point-by-point responses to the major comments. We have revised the paper to address the concerns regarding methodological details and statistical reporting.
point-by-point responses
Referee: [Experiment 1] Experiment 1 (Methods and Results): the vector-retrieval baseline is described only as 'vector retrieval' without specifying the embedding model, chunking strategy, top-k value, similarity metric, or indexing method. Because grep is deterministic exact-match while vector performance is highly sensitive to these choices, the reported accuracy advantage cannot be interpreted as intrinsic to grep versus a properly tuned vector system.
Authors: We agree that additional details on the vector retrieval implementation are necessary for reproducibility and fair comparison. In the revised manuscript, we specify the embedding model as sentence-transformers/all-MiniLM-L6-v2, chunk size of 512 tokens with 50-token overlap, top-k=5, cosine similarity, and FAISS for indexing. These choices represent a standard configuration, and we discuss why this provides a reasonable baseline. revision: yes
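Read as a pipeline, the configuration stated in this response corresponds roughly to the sketch below (MiniLM embeddings, 512-token chunks with 50-token overlap, top-k=5, cosine similarity, FAISS flat index). This is a generic reconstruction assembled from those stated settings, with whitespace tokens standing in for a real tokenizer; it is not the authors' code.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk(tokens: list[str], size: int = 512, overlap: int = 50) -> list[str]:
    """Split a token list into overlapping chunks, rejoined as strings.

    Whitespace 'tokens' here only approximate the stated 512-token chunks.
    """
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    # Normalized embeddings + inner product == cosine similarity.
    vecs = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def retrieve(query: str, index: faiss.IndexFlatIP,
             chunks: list[str], k: int = 5) -> list[str]:
    qv = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(qv, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]
```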
Referee: [Results] Results section (both experiments): no statistical significance tests, error bars, or confidence intervals are reported for the accuracy differences between grep and vector retrieval. With a fixed 116-question sample and multiple harnesses, it is impossible to assess whether the observed gaps exceed sampling variability.
Authors: We acknowledge this omission. The revised results section now includes 95% bootstrap confidence intervals for all accuracy figures and applies McNemar's test to assess the statistical significance of differences between grep and vector retrieval within each harness. This allows better evaluation of whether the gaps are meaningful beyond sampling variability. revision: yes
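For the mechanics, the snippet below shows one standard way to compute both statistics from paired per-question 0/1 correctness vectors on the 116-question sample: a percentile bootstrap CI for accuracy and an exact McNemar p-value over discordant pairs. It is a textbook implementation for illustration; the example outcomes at the bottom are randomly fabricated, not the paper's data.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

def bootstrap_ci(correct: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """95% percentile bootstrap CI for accuracy over a 0/1 correctness vector."""
    n = len(correct)
    accs = correct[rng.integers(0, n, size=(n_boot, n))].mean(axis=1)
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])

def mcnemar_exact(grep: np.ndarray, vector: np.ndarray) -> float:
    """Exact McNemar p-value from paired 0/1 outcomes on the same questions.

    Only discordant pairs matter: questions grep got right and vector got
    wrong (b), and vice versa (c). Under H0, b ~ Binomial(b + c, 0.5).
    """
    b = int(np.sum((grep == 1) & (vector == 0)))
    c = int(np.sum((grep == 0) & (vector == 1)))
    if b + c == 0:
        return 1.0  # no discordant pairs: methods indistinguishable
    return binomtest(b, b + c, 0.5).pvalue

# Example with fabricated outcomes for 116 questions:
grep_ok = rng.integers(0, 2, 116)
vec_ok = rng.integers(0, 2, 116)
print(bootstrap_ci(grep_ok), mcnemar_exact(grep_ok, vec_ok))
```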
Referee: [Experiment 2] Experiment 2: the procedure for mixing unrelated conversation history is not fully specified (e.g., how distractors are selected, whether they are appended before or after relevant passages, and how the total context length is controlled). This detail is load-bearing for the claim that performance degrades differently under the two retrieval regimes.
Authors: We have expanded the description of Experiment 2 in the methods section. Distractors are randomly sampled from other conversations in the LongMemEval dataset (ensuring no overlap with the target query's relevant passages), appended after the retrieved passages, and the total context length is maintained at or below 32k tokens by limiting the number of distractors added. This setup is designed to simulate increasing levels of irrelevant context. revision: yes
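Taken literally, that procedure could be implemented along these lines: sample distractor conversations from the rest of the dataset, exclude anything overlapping the target's relevant passages, append distractors after the retrieved passages, and stop before a 32k-token budget. The token counter and the overlap check are simplified assumptions for illustration, not the authors' implementation.

```python
import random

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; counts whitespace tokens only.
    return len(text.split())

def build_context(relevant: list[str], other_conversations: list[str],
                  target_id: str, budget: int = 32_000, seed: int = 0) -> str:
    """Append randomly sampled distractor conversations after the relevant
    passages until the total context would exceed the token budget."""
    rng = random.Random(seed)
    # Exclude anything tagged with the target query's conversation id
    # (a simplified proxy for the paper's no-overlap guarantee).
    pool = [c for c in other_conversations if target_id not in c]
    rng.shuffle(pool)
    parts = list(relevant)
    used = sum(count_tokens(p) for p in parts)
    for conv in pool:
        cost = count_tokens(conv)
        if used + cost > budget:
            break
        parts.append(conv)
        used += cost
    return "\n\n".join(parts)
```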
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential steps
full rationale
The paper reports direct experimental results from two experiments on a fixed 116-question sample, comparing grep versus vector retrieval across harnesses and tool-calling styles. No equations, fitted parameters, predictions derived from inputs, or self-citations appear as load-bearing elements. All claims reduce to measured accuracies on the described setups rather than any definitional or self-referential reduction. The central finding (grep generally higher accuracy, but harness-dependent) is an observed outcome, not a constructed one.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 116-question LongMemEval sample is representative of agentic search tasks.
Reference graph
Works this paper leans on
- [1] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In Proceedings of ICLR.
- [2] Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021).
- [3] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of SIGIR. 758–759.
- [4]
- [5] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. arXiv preprint arXiv:2109.10086 (2021).
- [6] Yunfan Gao, Yun Xiong, Anurag Velingker, et al. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997 (2024).
- [7]
- [8] Zhengbao Jiang, Frank F. Xu, Luyu Gao, et al. 2023. Active Retrieval Augmented Generation. In Proceedings of EMNLP.
- [9] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of EMNLP.
- [10] Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of SIGIR.
- [11] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems.
- [12] Jimmy Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum 53, 2 (2019).
- [13] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the ACL (2024).
- [14] Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2021. Sparse, Dense, and Attentional Representations for Text Retrieval. Transactions of the ACL (2021).
- [15] Elias Lumer, Anmol Gulati, Faheem Nizar, Dzmitry Hedroits, Atharva Mehta, Henry Hwangbo, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, and James A. Burke. 2025. Tool and Agent Selection for Large Language Model Agents in Production: A Survey. Preprints (2025). doi:10.20944/preprints202512.1050.v1.
- [16]
- [17] Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560 (2023).
- [18] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint arXiv:2305.15334 (2023).
- [19]
- [20] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2024. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In Proceedings of ICLR.
- [21] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems.
- [22]
- [23] Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. 2023. Cognitive Architectures for Language Agents. arXiv preprint arXiv:2309.02427 (2023).
- [24] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. In Advances in Neural Information Processing Systems (Datasets and Benchmarks).
- [25]
- [26] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In Proceedings of ACL.
- [27] Lei Wang, Chen Ma, Xueyang Feng, et al. 2023. A Survey on Large Language Model based Autonomous Agents. arXiv preprint arXiv:2308.11432 (2023).
- [28]
- [29]
- [30] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. 2025. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. In Proceedings of ICLR.
- [31] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv preprint arXiv:2405.15793 (2024).
- [32] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of ICLR.
discussion (0)