AgentIR: A Workload-Adaptive Cascade Retrieval Substrate for Long-Term Conversational Memory

Aojie Yuan; Haiyue Zhang; Shahin Nazarian

arxiv: 2605.25092 · v1 · pith:XK5J7XGQnew · submitted 2026-05-24 · 💻 cs.IR · cs.CL· cs.DB

AgentIR: A Workload-Adaptive Cascade Retrieval Substrate for Long-Term Conversational Memory

Aojie Yuan , Haiyue Zhang , Shahin Nazarian This is my paper

Pith reviewed 2026-06-29 23:39 UTC · model grok-4.3

classification 💻 cs.IR cs.CLcs.DB

keywords conversational memorycascade retrievalBM25dense retrievallong-term retrievalinformation retrievalagent systemsworkload adaptive

0 comments

The pith

A BM25-margin cascade router skips dense retrieval on most conversational queries without losing accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that retrieval for long-term conversational memory can adapt per query to skip expensive dense search when it adds no value. It introduces a router that uses only the margin between the top two BM25 scores to decide whether to run the dense channel or which fusion method to apply. This router auto-tunes to different workloads, skipping 63 percent of queries on LongMemEval at the same judged accuracy and all queries on LoCoMo. A time-partitioned index underneath keeps the work logarithmic in the inverse error tolerance and independent of total corpus size. If correct, systems could support far more concurrent agents under tight latency limits as conversation histories grow.

Core claim

The central claim is that a confidence-triggered cascade router, driven solely by the BM25 top-k margin, can decide per query whether the dense retrieval channel is worth running, and that a time-partitioned index performs O(log 1/epsilon) work independent of corpus size, together yielding large speedups at parity quality on conversational and BEIR benchmarks.

What carries the argument

The confidence-triggered cascade router that uses the BM25 top-k margin as the sole signal to decide whether to invoke the dense channel or apply a particular fusion method.

If this is right

On LongMemEval the cascade skips 63% of queries at parity accuracy for 2.67x speedup.
On LoCoMo it reaches 100% skip rate for 132x speedup and higher Hit@5.
The time-partitioned index sustains sub-100us latency even as the corpus grows 1234x.
It achieves 10-11x geo-mean speedup over Pyserini and PISA at parity quality on BEIR datasets.
Capacity rises from 154 to 1400 concurrent agents on the same hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could allow agent systems to maintain much longer conversation histories without proportional increases in retrieval cost.
The router's workload-adaptive behavior suggests it may generalize to other hybrid retrieval settings where one channel is significantly more expensive.
Fixing the documented BM25/GPU pitfalls enables reliable comparison between CPU and GPU implementations at high precision.
If the margin statistic proves robust, it removes the need for learned routers in many cascade setups.

Load-bearing premise

The BM25 top-k margin alone provides enough information to decide whether the dense channel will improve retrieval quality for that query.

What would settle it

Measuring LLM-judged accuracy on a held-out conversational workload where the cascade's skip decisions produce lower accuracy than always running both channels.

Figures

Figures reproduced from arXiv: 2605.25092 by Aojie Yuan, Haiyue Zhang, Shahin Nazarian.

**Figure 1.** Figure 1: AgentIR pipeline and two-axis adaptive control surface. Three substrate stages run concurrently: a SIMD-vectorized BM25 posting list (CPU, 0.4 ms), a Dense / BGE-small channel (CPU, 52 ms), and a time-partitioned temporal index. The cascade trigger (Δ=𝑠1−𝑠2≥𝜏𝑐 , §5.9) inspects the BM25 top-𝑘 margin: confident queries early-exit in 0.9 ms; ambiguous queries escalate to Dense+RRF+recency for 53 ms. Same trig… view at source ↗

**Figure 2.** Figure 2: Two-axis adaptive control surface. The reachable [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Per-query latency vs. corpus size, log-log (Jet [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: BM25 retrieval Pareto frontier: 6 systems [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: LongMemEval LLM-judged strict accuracy per ques [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Per-query latency breakdown. The BGE query en [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 8.** Figure 8: Same cascade trigger sweeps to two different op [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Multi-tenant scaling on 8-core Jetstream2 with [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: CSR memory layout for GPU index upload. The [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

read the original abstract

Long-term conversational memory is a retrieval workload classical IR was not built for: the index grows during the query stream, query types shift intra-session, and the latency budget per retrieval is sub-10 ms. Lucene-class engines treat the index as static and the query as stateless, leaving the workload's structure unexploited. AgentIR treats fusion as a per-query decision along two axes: which fusion to apply (BM25, Dense, RRF, or agent-aware RRF), and whether the ~52 ms dense channel is worth running at all. The second axis is a confidence-triggered cascade router that decides from the BM25 top-k margin alone and re-tunes across workloads without retraining. On LongMemEval (n=500), where the dense channel does add information, the cascade skips 63% of queries at parity LLM-judged accuracy (2.67x faster under two judges, paired bootstrap p>=0.88); per-qtype thresholds extend this to 5.76x under 5-fold cross-validation. On LoCoMo (n=1,982), where BM25 alone is already the strongest single system, the same trigger auto-tunes to a 100% skip rate (132x faster, +0.089 Hit@5). Capacity on a shared 8-core VM rises from ~154 to ~1,400 concurrent agents (9x). Underneath the cascade, a time-partitioned index does O(log 1/epsilon) work independent of corpus size: 1234x corpus growth costs only 3.6x latency, ending in 1769x over sequential at sub-100 us p50 on 5M records. At parity quality with Lucene on 9 BEIR datasets up to 8.8M docs, the substrate runs 10x geo-mean over Pyserini 8T and 11x over PISA-1T BlockMax-WAND; an A100 reaches 1.8-39x over Pyserini 8T; chunked index build sustains 56.8K docs/sec on MS MARCO. Three subtle BM25/GPU correctness pitfalls that silently regress nDCG@10 by 6-8x are documented and fixed; post-fix CPU and GPU agree within 0.0002 nDCG@10 on all eight datasets that fit a single A100.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The BM25-margin cascade and time-partitioned index give measurable speedups on conversational workloads, but the router's core assumption still needs stronger checks.

read the letter

The paper's main contribution is a cascade that routes per query using only the BM25 top-k margin to skip the dense channel, plus a time-partitioned index that keeps latency roughly logarithmic in corpus growth. On LongMemEval it skips 63% of queries at parity LLM accuracy and 2.67x speed; on LoCoMo it skips everything for 132x. The index shows 10-11x geo-mean gains over Pyserini and PISA at parity on BEIR, and the authors document and fix three BM25/GPU correctness issues that had been silently hurting nDCG.

Those numbers are the concrete part. The router auto-tunes thresholds across the two datasets without retraining, which is presented as workload-adaptive. The partitioned index really does appear to decouple latency from total size in the reported scaling test.

The soft spot is exactly the one the stress-test flags. The claim that BM25 margin alone is enough to predict when dense retrieval adds value rests on correlation that may break when queries shift inside a long conversation or when the memory contains more semantic than lexical signal. The abstract gives skip rates and paired bootstrap p-values but no error bars on the margin signal itself, no ablation removing the margin feature, and no cross-workload hold-out beyond the two reported sets. If that correlation is weaker than assumed, the 5-fold CV thresholds will not transfer and the reported skip rates will not hold on new streams.

The work is aimed at people building retrieval layers for long-running agents or conversational memory systems. It is not foundational theory, but it tackles a practical scaling problem with runnable numbers and some engineering fixes. The evidence is empirical rather than circular, and the authors engage the prior systems they compare against.

I would send it to peer review. The workload framing and the cascade design are worth a full referee look even if the margin assumption needs more ablations.

Referee Report

2 major / 2 minor

Summary. The paper presents AgentIR, a retrieval substrate for long-term conversational memory that treats fusion as a per-query decision via a confidence-triggered cascade router using only the BM25 top-k margin to decide whether to run the dense channel (skipping it when possible) and a time-partitioned index achieving O(log 1/epsilon) work independent of corpus size. It reports empirical results including 63% query skips at LLM-judged accuracy parity (2.67x faster) on LongMemEval, 100% skips (132x faster) on LoCoMo, 10-11x geo-mean speedup over Pyserini/PISA at parity on BEIR, capacity gains to 1400 concurrent agents, and fixes for three BM25/GPU correctness issues that previously regressed nDCG@10 by 6-8x.

Significance. If the central routing claim holds, the work would be significant for IR in dynamic conversational settings by exploiting workload structure (growing index, intra-session shifts) without retraining. The explicit documentation and correction of BM25 implementation pitfalls, plus reproducible speedups on named datasets (LongMemEval, LoCoMo, BEIR), are strengths. The time-partitioned index scaling result is also noteworthy if the independence from corpus size is rigorously shown.

major comments (2)

[Abstract / cascade router description] Abstract and cascade router section: The claim that the BM25 top-k margin alone is a sufficient statistic for the router (enabling workload-adaptive thresholds without retraining or additional features) is load-bearing for the 63%/100% skip rates at parity, yet the manuscript provides no ablation correlating margin with dense-channel accuracy gain or testing generalization beyond the reported 5-fold CV on these fixed datasets.
[Abstract / evaluation sections] Results on LongMemEval and LoCoMo: The parity accuracy claims and skip-rate speedups lack error bars, explicit threshold selection procedure details, or sensitivity analysis on the margin threshold; this directly affects assessment of whether the reported 2.67x/132x factors and p>=0.88 are robust.

minor comments (2)

[Index construction section] The time-partitioned index claim of O(log 1/epsilon) work should include the explicit definition of epsilon and the partitioning scheme to support the 1234x growth to 3.6x latency result.
[Throughout] Minor notation inconsistency: ensure consistent use of 'margin' vs. 'top-k margin' when describing the router input across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the potential significance of the workload-adaptive routing and scaling results. We address each major comment below and will incorporate revisions to strengthen the empirical support for the central claims.

read point-by-point responses

Referee: [Abstract / cascade router description] Abstract and cascade router section: The claim that the BM25 top-k margin alone is a sufficient statistic for the router (enabling workload-adaptive thresholds without retraining or additional features) is load-bearing for the 63%/100% skip rates at parity, yet the manuscript provides no ablation correlating margin with dense-channel accuracy gain or testing generalization beyond the reported 5-fold CV on these fixed datasets.

Authors: We agree that an explicit ablation correlating the BM25 top-k margin with the incremental accuracy gain from the dense channel would provide stronger evidence for the sufficiency claim. In the revised manuscript we will add this analysis on both LongMemEval and LoCoMo, including quantitative correlation measures and visualizations of margin versus dense-channel contribution. The existing 5-fold CV already tests generalization across query types within each dataset, but the new ablation will directly address the referee's concern about the margin as a standalone statistic. revision: yes
Referee: [Abstract / evaluation sections] Results on LongMemEval and LoCoMo: The parity accuracy claims and skip-rate speedups lack error bars, explicit threshold selection procedure details, or sensitivity analysis on the margin threshold; this directly affects assessment of whether the reported 2.67x/132x factors and p>=0.88 are robust.

Authors: We will revise the evaluation sections to include error bars obtained from the 5-fold cross-validation and paired bootstrap resampling. Explicit details on the threshold selection procedure (including how per-query-type thresholds are derived from the CV folds) will be added to the cascade router section. A sensitivity analysis sweeping the margin threshold will also be included to demonstrate the robustness of the reported skip rates, speedups, and accuracy parity (p>=0.88). revision: yes

Circularity Check

0 steps flagged

No circularity; all central claims are empirical measurements on fixed external datasets

full rationale

The paper reports measured skip rates (63% on LongMemEval, 100% on LoCoMo), speedups (2.67x–132x), and quality parity via LLM judges and nDCG@10 on BEIR, all obtained by running the system on held-out query streams and corpora. The cascade router is tuned via 5-fold CV on the same fixed datasets and the time-partitioned index latency scaling is measured directly; none of these quantities are derived from parameters fitted inside the same equations or reduced by construction to the inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. The derivation chain consists of workload-specific empirical evaluation against external baselines (Pyserini, PISA) and is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claims rest on an empirical correlation between BM25 margin and dense-channel value plus workload-specific threshold tuning; no new physical entities or unproven mathematical axioms are introduced.

free parameters (1)

BM25-margin confidence threshold
The decision threshold that triggers skipping the dense channel is chosen per workload or per query type and is not derived from first principles.

axioms (1)

domain assumption BM25 top-k margin is a reliable proxy for whether dense retrieval adds information on the target workloads
Invoked to justify the cascade router that decides from BM25 alone.

pith-pipeline@v0.9.1-grok · 5997 in / 1510 out tokens · 31981 ms · 2026-06-29T23:39:52.590204+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Sebastian Bruch, Siyu Gai, and Amir Ingber. 2023. An Analysis of Fusion Func- tions for Hybrid Retrieval.ACM Transactions on Information Systems42, 1 (2023), 1–35

2023
[2]

Jorge, and Adam Jatowt

Ricardo Campos, Gaël Dias, Alípio M. Jorge, and Adam Jatowt. 2014. Survey of Temporal Information Retrieval and Related Applications.Comput. Surveys47, 2 (2014)

2014
[3]

Shane Culpepper

Ruey-Cheng Chen, Luke Gallagher, Roi Blanco, and J. Shane Culpepper. 2017. Efficient cost-aware cascade ranking in multi-stage retrieval. InProceedings of SIGIR. 445–454

2017
[4]

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav
[5]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.arXiv preprint arXiv:2504.19413(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Cormack, Charles L

Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of SIGIR. 758–759

2009
[7]

Shuai Ding, Jinru He, Hao Yan, and Torsten Suel. 2009. Using Graphics Processors for High Performance IR Query Processing. InProceedings of WWW. 421–430

2009
[8]

Shuai Ding and Torsten Suel. 2011. Faster Top- 𝑘 Document Retrieval Using Block-Max Indexes. InProceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 993–1002

2011
[9]

Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. InProceedings of SIGIR

2021
[10]

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-Scale Similarity Search with GPUs.IEEE Transactions on Big Data7, 3 (2021), 535–547

2021
[11]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open- Domain Question Answering. InProceedings of EMNLP. 6769–6781

2020
[12]

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. InProceedings of SIGIR

2020
[13]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. InAdvances in Neural Informa- tion Processing Systems, Vol. 33

2020
[14]

Bruce Croft

Xiaoyan Li and W. Bruce Croft. 2003. Time-Based Language Models. InProceed- ings of CIKM. 469–475

2003
[15]

Jimmy Lin and Xueguang Ma. 2021. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.arXiv preprint arXiv:2106.14807(2021)

work page arXiv 2021
[16]

Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python Toolkit for Reproducible Infor- mation Retrieval Research with Sparse and Dense Representations. InProceedings of SIGIR

2021
[17]

Yang Liu, Jianguo Wang, and Steven Swanson. 2018. Griffin: Uniting CPU and GPU in Information Retrieval Systems for Intra-Query Parallelism. InProceedings of the IEEE International Conference on Data Engineering (ICDE)

2018
[18]

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating Very Long-Term Conversational Memory of LLM Agents. InProceedings of ACL

2024
[19]

Malkov and Dmitry A

Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.IEEE Transactions on Pattern Analysis and Machine Intelligence42, 4 (2020), 824–836

2020
[20]

Antonio Mallia, Michal Siedlaczek, Joel Mackenzie, and Torsten Suel. 2019. PISA: Performant Indexes and Search for Academia. InProceedings of OSIRRC@SIGIR

2019
[21]

Antonio Mallia, Michal Siedlaczek, Torsten Suel, and Mohamed Zahran. 2019. GPU-Accelerated Decoding of Integer Lists. InProceedings of CIKM. 2193–2196

2019
[22]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. InProceedings of the Workshop on Cognitive Computation

2016
[23]

1993.Usability Engineering

Jakob Nielsen. 1993.Usability Engineering. Morgan Kaufmann. Defines the 100 ms (instant) and 1 s (interactive) latency thresholds for human-perceived responsiveness

1993
[24]

Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. 2024. CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs. InProceedings of ICDE

2024
[25]

Patil, Ion Stoica, and Joseph E

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. MemGPT: Towards LLMs as Operating Systems. InProceedings of ICLR

2024
[26]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. InProceedings of UIST

2023
[27]

Pinecone Systems. 2026. Pinecone Serverless Pricing. https://www.pinecone.io/ pricing/. List price for read units as of 2026-Q1; accessed 2026-05

2026
[28]

Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Frame- work: BM25 and Beyond.Foundations and Trends in Information Retrieval3, 4 (2009), 333–389

2009
[29]

Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. InProceedings of the Third Text REtrieval Conference (TREC-3)

1994
[30]

Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. InProceedings of NAACL

2022
[31]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. InAdvances in Neural Information Processing Systems, Vol. 36

2023
[32]

Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L

Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths
[33]

Cognitive Architectures for Language Agents.Transactions on Machine Learning Research(2024)

2024
[34]

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. InProceedings of NeurIPS Datasets and Bench- marks

2021
[35]

Howard Turtle and James Flood. 1995. Query evaluation: strategies and optimiza- tions.Information Processing & Management31, 6 (1995), 831–850. Introduces the MaxScore dynamic pruning strategy used by later WAND / BlockMax-WAND systems. AgentIR: A Workload-Adaptive Cascade Retrieval Substrate for Long-Term Conversational Memory

1995
[36]

Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xi- angyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al. 2021. Milvus: A Purpose-Built Vector Data Management System. InProceedings of SIGMOD. 2614–2627. Reports millisecond-scale p99 latency on hundred-million-vector corpora

2021
[37]

Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. InProceedings of SIGIR. 105–114

2011
[38]

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu
[39]

InProceedings of ICLR

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. InProceedings of ICLR
[40]

Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024. C-Pack: Packed Resources For General Chinese Embeddings. InProceedings of SIGIR

2024
[41]

Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. InProceedings of SIGIR. 1253–1256

2017
[42]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InProceedings of ICLR

2023
[43]

cat”, “dog

Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. 2024. A Survey on the Memory Mechanism of Large Language Model based Agents.ACM Transactions on Information Systems (2024). Aojie Yuan, Haiyue Zhang, and Shahin Nazarian A Detailed Experimental Results This appendix provides complete numerical results ...

work page arXiv 2024
[44]

Collect ∼50 labeled deployment questions (Q, A, gold- session-id)
[45]

Run all 4 base systems (BM25, Dense, RRF, agent_rrf) on each Q; compute per-qtype LLM-Acc with your judge of choice
[46]

Pick the per-qtype best system 𝑠∗ (𝑞𝑡) ; this is your per- type routing table (Table 18-style)
[47]

For each qtype, sweep 𝜏𝑐 ∈ [ 0, 1] and pick the max value with within-noise per-qtype LLM-Acc; this gives 𝜏𝑐 [𝑞𝑡]
[48]

Train a TF-IDF (or TF-IDF +BGE) classifier over the 50 labels using sklearn.linear_model.LogisticRegression
[49]

AAPL”) or percentages (“401(k)

Plug all four into CascadeRouter::Config; you now have a per-tenant tuned cascade. Re-tune quarterly or on drift detection. Aojie Yuan, Haiyue Zhang, and Shahin Nazarian J.12 Reproduction CLI # Build the C++ benchmark (CPU only, 8-core x86) cmake -B build -DENABLE_AVX2=ON -DENABLE_CUDA=OFF cmake --build build -j # Run cascade on LongMemEval (Python orches...

[1] [1]

Sebastian Bruch, Siyu Gai, and Amir Ingber. 2023. An Analysis of Fusion Func- tions for Hybrid Retrieval.ACM Transactions on Information Systems42, 1 (2023), 1–35

2023

[2] [2]

Jorge, and Adam Jatowt

Ricardo Campos, Gaël Dias, Alípio M. Jorge, and Adam Jatowt. 2014. Survey of Temporal Information Retrieval and Related Applications.Comput. Surveys47, 2 (2014)

2014

[3] [3]

Shane Culpepper

Ruey-Cheng Chen, Luke Gallagher, Roi Blanco, and J. Shane Culpepper. 2017. Efficient cost-aware cascade ranking in multi-stage retrieval. InProceedings of SIGIR. 445–454

2017

[4] [4]

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav

[5] [5]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.arXiv preprint arXiv:2504.19413(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Cormack, Charles L

Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of SIGIR. 758–759

2009

[7] [7]

Shuai Ding, Jinru He, Hao Yan, and Torsten Suel. 2009. Using Graphics Processors for High Performance IR Query Processing. InProceedings of WWW. 421–430

2009

[8] [8]

Shuai Ding and Torsten Suel. 2011. Faster Top- 𝑘 Document Retrieval Using Block-Max Indexes. InProceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 993–1002

2011

[9] [9]

Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. InProceedings of SIGIR

2021

[10] [10]

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-Scale Similarity Search with GPUs.IEEE Transactions on Big Data7, 3 (2021), 535–547

2021

[11] [11]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open- Domain Question Answering. InProceedings of EMNLP. 6769–6781

2020

[12] [12]

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. InProceedings of SIGIR

2020

[13] [13]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. InAdvances in Neural Informa- tion Processing Systems, Vol. 33

2020

[14] [14]

Bruce Croft

Xiaoyan Li and W. Bruce Croft. 2003. Time-Based Language Models. InProceed- ings of CIKM. 469–475

2003

[15] [15]

Jimmy Lin and Xueguang Ma. 2021. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.arXiv preprint arXiv:2106.14807(2021)

work page arXiv 2021

[16] [16]

Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python Toolkit for Reproducible Infor- mation Retrieval Research with Sparse and Dense Representations. InProceedings of SIGIR

2021

[17] [17]

Yang Liu, Jianguo Wang, and Steven Swanson. 2018. Griffin: Uniting CPU and GPU in Information Retrieval Systems for Intra-Query Parallelism. InProceedings of the IEEE International Conference on Data Engineering (ICDE)

2018

[18] [18]

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating Very Long-Term Conversational Memory of LLM Agents. InProceedings of ACL

2024

[19] [19]

Malkov and Dmitry A

Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.IEEE Transactions on Pattern Analysis and Machine Intelligence42, 4 (2020), 824–836

2020

[20] [20]

Antonio Mallia, Michal Siedlaczek, Joel Mackenzie, and Torsten Suel. 2019. PISA: Performant Indexes and Search for Academia. InProceedings of OSIRRC@SIGIR

2019

[21] [21]

Antonio Mallia, Michal Siedlaczek, Torsten Suel, and Mohamed Zahran. 2019. GPU-Accelerated Decoding of Integer Lists. InProceedings of CIKM. 2193–2196

2019

[22] [22]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. InProceedings of the Workshop on Cognitive Computation

2016

[23] [23]

1993.Usability Engineering

Jakob Nielsen. 1993.Usability Engineering. Morgan Kaufmann. Defines the 100 ms (instant) and 1 s (interactive) latency thresholds for human-perceived responsiveness

1993

[24] [24]

Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. 2024. CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs. InProceedings of ICDE

2024

[25] [25]

Patil, Ion Stoica, and Joseph E

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. MemGPT: Towards LLMs as Operating Systems. InProceedings of ICLR

2024

[26] [26]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. InProceedings of UIST

2023

[27] [27]

Pinecone Systems. 2026. Pinecone Serverless Pricing. https://www.pinecone.io/ pricing/. List price for read units as of 2026-Q1; accessed 2026-05

2026

[28] [28]

Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Frame- work: BM25 and Beyond.Foundations and Trends in Information Retrieval3, 4 (2009), 333–389

2009

[29] [29]

Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. InProceedings of the Third Text REtrieval Conference (TREC-3)

1994

[30] [30]

Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. InProceedings of NAACL

2022

[31] [31]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. InAdvances in Neural Information Processing Systems, Vol. 36

2023

[32] [32]

Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L

Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths

[33] [33]

Cognitive Architectures for Language Agents.Transactions on Machine Learning Research(2024)

2024

[34] [34]

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. InProceedings of NeurIPS Datasets and Bench- marks

2021

[35] [35]

Howard Turtle and James Flood. 1995. Query evaluation: strategies and optimiza- tions.Information Processing & Management31, 6 (1995), 831–850. Introduces the MaxScore dynamic pruning strategy used by later WAND / BlockMax-WAND systems. AgentIR: A Workload-Adaptive Cascade Retrieval Substrate for Long-Term Conversational Memory

1995

[36] [36]

Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xi- angyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al. 2021. Milvus: A Purpose-Built Vector Data Management System. InProceedings of SIGMOD. 2614–2627. Reports millisecond-scale p99 latency on hundred-million-vector corpora

2021

[37] [37]

Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. InProceedings of SIGIR. 105–114

2011

[38] [38]

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu

[39] [39]

InProceedings of ICLR

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. InProceedings of ICLR

[40] [40]

Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024. C-Pack: Packed Resources For General Chinese Embeddings. InProceedings of SIGIR

2024

[41] [41]

Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. InProceedings of SIGIR. 1253–1256

2017

[42] [42]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InProceedings of ICLR

2023

[43] [43]

cat”, “dog

Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. 2024. A Survey on the Memory Mechanism of Large Language Model based Agents.ACM Transactions on Information Systems (2024). Aojie Yuan, Haiyue Zhang, and Shahin Nazarian A Detailed Experimental Results This appendix provides complete numerical results ...

work page arXiv 2024

[44] [44]

Collect ∼50 labeled deployment questions (Q, A, gold- session-id)

[45] [45]

Run all 4 base systems (BM25, Dense, RRF, agent_rrf) on each Q; compute per-qtype LLM-Acc with your judge of choice

[46] [46]

Pick the per-qtype best system 𝑠∗ (𝑞𝑡) ; this is your per- type routing table (Table 18-style)

[47] [47]

For each qtype, sweep 𝜏𝑐 ∈ [ 0, 1] and pick the max value with within-noise per-qtype LLM-Acc; this gives 𝜏𝑐 [𝑞𝑡]

[48] [48]

Train a TF-IDF (or TF-IDF +BGE) classifier over the 50 labels using sklearn.linear_model.LogisticRegression

[49] [49]

AAPL”) or percentages (“401(k)

Plug all four into CascadeRouter::Config; you now have a per-tenant tuned cascade. Re-tune quarterly or on drift detection. Aojie Yuan, Haiyue Zhang, and Shahin Nazarian J.12 Reproduction CLI # Build the C++ benchmark (CPU only, 8-core x86) cmake -B build -DENABLE_AVX2=ON -DENABLE_CUDA=OFF cmake --build build -j # Run cascade on LongMemEval (Python orches...