Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
Pith reviewed 2026-05-15 03:03 UTC · model grok-4.3
The pith
Grep retrieval often beats vector search for accuracy in LLM agent workflows, though harness and tool-calling style drive most of the performance difference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in the paper's Experiment 1 comparisons; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.
What carries the argument
Head-to-head accuracy comparison of grep versus vector retrieval inside multiple agent harnesses, with explicit variation in inline tool results versus separate file-based results.
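To make the compared tool-calling styles concrete, here is a minimal mock-up of the two result presentations being varied: inline results placed directly in the tool message versus results written to a file the model must read in a separate call. The message shapes and function names are illustrative assumptions, not the paper's actual harness code.

```python
import json
import tempfile

def search(query: str) -> str:
    # Stand-in for any retrieval call (grep or vector); returns matched text.
    return f"[matches for {query!r} would appear here]"

def tool_result_inline(query: str) -> dict:
    # Inline style: the full result text travels in the tool message itself,
    # so the model sees it immediately on its next turn.
    return {"role": "tool", "content": search(query)}

def tool_result_file(query: str) -> dict:
    # File-based style: results are written to disk and only a path is
    # returned; the model must issue a separate read call to see them.
    f = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
    f.write(search(query))
    f.close()
    return {"role": "tool", "content": json.dumps({"results_path": f.name})}
```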
If this is right
- Grep can serve as a stronger default retrieval method than vector search when agents operate over conversation transcripts.
- Changes in harness architecture or tool-result formatting can shift accuracy by larger margins than the choice between grep and vector retrieval.
- Performance gaps widen when each query is surrounded by increasing amounts of irrelevant history, making retrieval choice more consequential in noisier settings.
- Agent builders should measure harness effects separately from retrieval effects rather than assuming one retrieval method will dominate across all designs.
Where Pith is reading between the lines
- Engineers could reduce system complexity by testing grep first before investing in embedding infrastructure for agent search tasks (see the sketch after this list).
- The results point to harness-level optimizations as a higher-leverage research target than further refinements to retrieval algorithms alone.
- Similar head-to-head tests on codebases or document collections outside conversation history would clarify whether the grep advantage is domain-specific.
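As referenced above, a grep-first baseline can be stood up in a few lines before any embedding pipeline exists. The sketch below assumes a hypothetical layout of one conversation transcript per .txt file and a grep that supports --include (GNU or BSD); it illustrates the kind of test suggested, not the paper's tooling.

```python
import subprocess

def grep_search(query: str, transcripts_dir: str, max_hits: int = 20) -> list[str]:
    """Case-insensitive fixed-string search over conversation transcripts.

    Returns up to max_hits lines formatted as 'path:line_no:text', which an
    agent can cite directly or follow up on with a targeted file read.
    """
    proc = subprocess.run(
        ["grep", "-rinF", "--include=*.txt", query, transcripts_dir],
        capture_output=True, text=True,
    )
    return proc.stdout.splitlines()[:max_hits]

# Example: find where the user mentioned a flight booking.
# print(grep_search("flight", "./transcripts"))
```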
Load-bearing premise
The 116-question sample and the four chosen harness implementations are representative of typical agentic search behavior.
What would settle it
Re-running the identical experiment 1 protocol on a fresh sample of several hundred questions drawn from a different long-context corpus and finding that vector retrieval matches or exceeds grep accuracy in the majority of harnesses.
original abstract
Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports two controlled experiments comparing grep versus vector retrieval in LLM agentic search. Experiment 1 evaluates both methods on a 116-question subset of LongMemEval using a custom harness (Chronos) and three provider CLI harnesses (Claude Code, Codex, Gemini CLI), testing inline versus file-based tool-result presentation. Experiment 2 adds increasing amounts of unrelated conversation history as distractors while holding retrieval method fixed. The central empirical claim is that grep produces higher accuracy than vector retrieval across harnesses in Experiment 1, yet absolute performance remains strongly modulated by harness and tool-calling style even on identical underlying data.
Significance. If the observed gap is robust, the work supplies concrete evidence that retrieval strategy interacts with agent architecture and output formatting in ways that current RAG-centric designs may overlook. The controlled distractor experiment further quantifies robustness to noise, which is practically relevant for long-context agent deployments. The study is purely empirical with no derivations or fitted parameters, so its value rests on the reproducibility and fairness of the baselines.
major comments (3)
- [Experiment 1] Experiment 1 (Methods and Results): the vector-retrieval baseline is described only as 'vector retrieval' without specifying the embedding model, chunking strategy, top-k value, similarity metric, or indexing method. Because grep is deterministic exact-match while vector performance is highly sensitive to these choices, the reported accuracy advantage cannot be interpreted as intrinsic to grep versus a properly tuned vector system.
- [Results] Results section (both experiments): no statistical significance tests, error bars, or confidence intervals are reported for the accuracy differences between grep and vector retrieval. With a fixed 116-question sample and multiple harnesses, it is impossible to assess whether the observed gaps exceed sampling variability.
- [Experiment 2] Experiment 2: the procedure for mixing unrelated conversation history is not fully specified (e.g., how distractors are selected, whether they are appended before or after relevant passages, and how the total context length is controlled). This detail is load-bearing for the claim that performance degrades differently under the two retrieval regimes.
minor comments (2)
- [Abstract] The abstract states 'grep generally yields higher accuracy' but the full results tables should include per-harness breakdowns with exact percentages to allow readers to verify the qualifier 'generally'.
- [Methods] Clarify whether the same embedding model and hyperparameters were used for both Chronos and the provider CLIs, or whether each harness employed its native retrieval implementation.
Simulated Author's Rebuttal
We thank the referee for their detailed comments, which have helped us improve the manuscript. Below we provide point-by-point responses to the major comments. We have revised the paper to address the concerns regarding methodological details and statistical reporting.
point-by-point responses
Referee: [Experiment 1] Experiment 1 (Methods and Results): the vector-retrieval baseline is described only as 'vector retrieval' without specifying the embedding model, chunking strategy, top-k value, similarity metric, or indexing method. Because grep is deterministic exact-match while vector performance is highly sensitive to these choices, the reported accuracy advantage cannot be interpreted as intrinsic to grep versus a properly tuned vector system.
Authors: We agree that additional details on the vector retrieval implementation are necessary for reproducibility and fair comparison. In the revised manuscript, we specify the embedding model as sentence-transformers/all-MiniLM-L6-v2, chunk size of 512 tokens with 50-token overlap, top-k=5, cosine similarity, and FAISS for indexing. These choices represent a standard configuration, and we discuss why this provides a reasonable baseline. revision: yes
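Read as a pipeline, the configuration stated in this response corresponds roughly to the sketch below (MiniLM embeddings, 512-token chunks with 50-token overlap, top-k=5, cosine similarity, FAISS flat index). This is a generic reconstruction assembled from those stated settings, with whitespace tokens standing in for a real tokenizer; it is not the authors' code.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk(tokens: list[str], size: int = 512, overlap: int = 50) -> list[str]:
    """Split a token list into overlapping chunks, rejoined as strings.

    Whitespace 'tokens' here only approximate the stated 512-token chunks.
    """
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    # Normalized embeddings + inner product == cosine similarity.
    vecs = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def retrieve(query: str, index: faiss.IndexFlatIP,
             chunks: list[str], k: int = 5) -> list[str]:
    qv = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(qv, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]
```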
Referee: [Results] Results section (both experiments): no statistical significance tests, error bars, or confidence intervals are reported for the accuracy differences between grep and vector retrieval. With a fixed 116-question sample and multiple harnesses, it is impossible to assess whether the observed gaps exceed sampling variability.
Authors: We acknowledge this omission. The revised results section now includes 95% bootstrap confidence intervals for all accuracy figures and applies McNemar's test to assess the statistical significance of differences between grep and vector retrieval within each harness. This allows better evaluation of whether the gaps are meaningful beyond sampling variability. revision: yes
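For the mechanics, the snippet below shows one standard way to compute both statistics from paired per-question 0/1 correctness vectors on the 116-question sample: a percentile bootstrap CI for accuracy and an exact McNemar p-value over discordant pairs. It is a textbook implementation for illustration; the example outcomes at the bottom are randomly fabricated, not the paper's data.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

def bootstrap_ci(correct: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """95% percentile bootstrap CI for accuracy over a 0/1 correctness vector."""
    n = len(correct)
    accs = correct[rng.integers(0, n, size=(n_boot, n))].mean(axis=1)
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])

def mcnemar_exact(grep: np.ndarray, vector: np.ndarray) -> float:
    """Exact McNemar p-value from paired 0/1 outcomes on the same questions.

    Only discordant pairs matter: questions grep got right and vector got
    wrong (b), and vice versa (c). Under H0, b ~ Binomial(b + c, 0.5).
    """
    b = int(np.sum((grep == 1) & (vector == 0)))
    c = int(np.sum((grep == 0) & (vector == 1)))
    if b + c == 0:
        return 1.0  # no discordant pairs: methods indistinguishable
    return binomtest(b, b + c, 0.5).pvalue

# Example with fabricated outcomes for 116 questions:
grep_ok = rng.integers(0, 2, 116)
vec_ok = rng.integers(0, 2, 116)
print(bootstrap_ci(grep_ok), mcnemar_exact(grep_ok, vec_ok))
```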
Referee: [Experiment 2] Experiment 2: the procedure for mixing unrelated conversation history is not fully specified (e.g., how distractors are selected, whether they are appended before or after relevant passages, and how the total context length is controlled). This detail is load-bearing for the claim that performance degrades differently under the two retrieval regimes.
Authors: We have expanded the description of Experiment 2 in the methods section. Distractors are randomly sampled from other conversations in the LongMemEval dataset (ensuring no overlap with the target query's relevant passages), appended after the retrieved passages, and the total context length is maintained at or below 32k tokens by limiting the number of distractors added. This setup is designed to simulate increasing levels of irrelevant context. revision: yes
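Taken literally, that procedure could be implemented along these lines: sample distractor conversations from the rest of the dataset, exclude anything overlapping the target's relevant passages, append distractors after the retrieved passages, and stop before a 32k-token budget. The token counter and the overlap check are simplified assumptions for illustration, not the authors' implementation.

```python
import random

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; counts whitespace tokens only.
    return len(text.split())

def build_context(relevant: list[str], other_conversations: list[str],
                  target_id: str, budget: int = 32_000, seed: int = 0) -> str:
    """Append randomly sampled distractor conversations after the relevant
    passages until the total context would exceed the token budget."""
    rng = random.Random(seed)
    # Exclude anything tagged with the target query's conversation id
    # (a simplified proxy for the paper's no-overlap guarantee).
    pool = [c for c in other_conversations if target_id not in c]
    rng.shuffle(pool)
    parts = list(relevant)
    used = sum(count_tokens(p) for p in parts)
    for conv in pool:
        cost = count_tokens(conv)
        if used + cost > budget:
            break
        parts.append(conv)
        used += cost
    return "\n\n".join(parts)
```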
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential steps
full rationale
The paper reports direct experimental results from two experiments on a fixed 116-question sample, comparing grep versus vector retrieval across harnesses and tool-calling styles. No equations, fitted parameters, predictions derived from inputs, or self-citations appear as load-bearing elements. All claims reduce to measured accuracies on the described setups rather than any definitional or self-referential reduction. The central finding (grep generally higher accuracy, but harness-dependent) is an observed outcome, not a constructed one.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 116-question LongMemEval sample is representative of agentic search tasks.
Reference graph
Works this paper leans on
- [1] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In Proceedings of ICLR.
- [2] Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021).
- [3] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of SIGIR. 758–759.
- [4]
- [5] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. arXiv preprint arXiv:2109.10086 (2021).
- [6] Yunfan Gao, Yun Xiong, Anurag Velingker, et al. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997 (2024).
- [7]
- [8] Zhengbao Jiang, Frank F. Xu, Luyu Gao, et al. 2023. Active Retrieval Augmented Generation. In Proceedings of EMNLP.
- [9] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of EMNLP.
- [10] Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of SIGIR.
- [11] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems.
- [12] Jimmy Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum 53, 2 (2019).
- [13] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the ACL (2024).
- [14] Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2021. Sparse, Dense, and Attentional Representations for Text Retrieval. Transactions of the ACL (2021).
- [15] Elias Lumer, Anmol Gulati, Faheem Nizar, Dzmitry Hedroits, Atharva Mehta, Henry Hwangbo, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, and James A. Burke. 2025. Tool and Agent Selection for Large Language Model Agents in Production: A Survey. Preprints (2025). doi:10.20944/preprints202512.1050.v1.
- [16]
- [17] Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560 (2023).
- [18] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint arXiv:2305.15334 (2023).
- [19]
- [20] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2024. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In Proceedings of ICLR.
- [21] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems.
- [22]
- [23] Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. 2023. Cognitive Architectures for Language Agents. arXiv preprint arXiv:2309.02427 (2023).
- [24] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. In Advances in Neural Information Processing Systems (Datasets and Benchmarks).
- [25]
- [26] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In Proceedings of ACL.
- [27] Lei Wang, Chen Ma, Xueyang Feng, et al. 2023. A Survey on Large Language Model based Autonomous Agents. arXiv preprint arXiv:2308.11432 (2023).
- [28]
- [29]
- [30] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. 2025. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. In Proceedings of ICLR.
- [31] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv preprint arXiv:2405.15793 (2024).
- [32] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of ICLR.
discussion (0)