Pith · machine review for the scientific record

arxiv: 2605.03312 · v1 · submitted 2026-05-05 · 💻 cs.MA

Recognition: unknown

MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

Guiling Wang, Jiayi Chen, Yingcong Li

Pith reviewed 2026-05-09 16:50 UTC · model grok-4.3

classification 💻 cs.MA
keywords memory orchestration · small language models · intent routing · long-horizon agents · SLM agents · memory management · agentic memory · context compression

The pith

MemFlow routes queries by intent to one of three fixed memory tiers, letting small language models achieve nearly twice the accuracy of full-context baselines on long-horizon tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that small language models fail on extended histories mainly because they cannot reliably choose and prepare the right memory operations themselves. By moving that choice to a separate router that classifies intent and hands off to one of three preset tiers, the system assembles compact, relevant evidence under a controlled token budget, avoiding both context overflow and noisy or hallucinated retrieval steps. A reader should care because most practical agents will run on limited models rather than frontier-scale ones, and full-context prompting already breaks at realistic conversation lengths. The approach suggests that external structure can substitute for missing reasoning capacity.

Core claim

The central claim is that a training-free route-then-compile architecture produces nearly 2x higher accuracy than full-context prompting when the same frozen Qwen3-1.7B model is used on LongMemEval, LoCoMo, and LongBench. The architecture consists of a Router Agent that assigns each query to Profile Lookup, Targeted Retrieval, or Deep Reasoning, a Memory Agent that assembles evidence under a tier-specific token budget, and an optional Validator retry.

What carries the argument

MemFlow's three-tier memory orchestration: a Router Agent classifies query intent, a Memory Agent executes one specialized tier and prepares evidence under a dynamic token limit, and an Answer Agent generates the final response from that compact context.
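
The orchestration just described can be sketched as a minimal control loop. Everything here is an editorial illustration: the function names, the tier labels as strings, and the specific budget numbers are assumptions, not the paper's API (the paper only says budgets are tier-aware).

```python
# Hypothetical sketch of MemFlow's route-then-compile control flow.
# All names and budget values are illustrative, not from the paper.
TIERS = ("profile_lookup", "targeted_retrieval", "deep_reasoning")

# Assumed per-tier token budgets; the paper states only that budgets
# are dynamic and tier-aware, not their values.
TIER_BUDGETS = {"profile_lookup": 256,
                "targeted_retrieval": 1024,
                "deep_reasoning": 4096}

def compile_evidence(snippets, budget):
    """Memory Agent step: pack evidence snippets until the tier budget
    is exhausted (token count crudely approximated by whitespace split)."""
    packed, used = [], 0
    for s in snippets:
        cost = len(s.split())
        if used + cost > budget:
            break
        packed.append(s)
        used += cost
    return packed

def route_then_compile(query, history, classify_intent, run_tier, answer):
    """Route once, execute exactly one fixed tier, answer from a compact context."""
    tier = classify_intent(query)               # Router Agent: intent -> tier
    snippets = run_tier(tier, query, history)   # Memory Agent: one tier only
    context = compile_evidence(snippets, TIER_BUDGETS[tier])
    return answer(query, context)               # Answer Agent
```

The point of the sketch is the shape of the control flow: a single routing decision replaces an open-ended agentic loop, and the budget cap is enforced deterministically rather than left to the model.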

If this is right

  • SLMs can sustain multi-turn performance on histories that exceed their native context window without increasing model size.
  • Memory preparation becomes deterministic rather than dependent on the main model's open-ended reasoning.
  • Token usage stays bounded per tier instead of growing with full history length.
  • The same backbone model can be reused across tasks by swapping only the router and tier definitions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design could be extended by adding a fourth tier for very recent context or by making tier selection depend on observed token cost rather than fixed categories.
  • If the three tiers prove insufficient on new domains, the framework would require either more tiers or a fallback to self-orchestration, testing the coverage assumption directly.
  • The separation of router and memory execution suggests similar intent-driven orchestration could apply to tool-use or planning agents that currently rely on open-ended loops.

Load-bearing premise

The router can correctly classify every query into one of the three fixed tiers and those tiers plus the validator cover the memory needs that arise in long-horizon tasks.

What would settle it

Replace the trained router with random tier assignment on the same benchmarks and measure whether accuracy falls back to the full-context baseline level.
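
The proposed ablation can be sketched as a small harness. `run_pipeline` and `score` are hypothetical stand-ins for the benchmark evaluation code; only the random-assignment control condition is concrete here.

```python
# Sketch of the settling experiment: swap the trained router for uniform-random
# tier assignment and compare accuracy against the intent-driven router.
import random

TIERS = ("profile_lookup", "targeted_retrieval", "deep_reasoning")

def random_router(seed=0):
    """Control condition: tier chosen uniformly at random, ignoring intent."""
    rng = random.Random(seed)
    return lambda query: rng.choice(TIERS)

def evaluate(queries, router, run_pipeline, score):
    """Fraction of queries answered correctly when `router` picks the tier.
    `run_pipeline` and `score` are placeholders for the benchmark harness."""
    correct = sum(bool(score(run_pipeline(q, router(q)))) for q in queries)
    return correct / len(queries)
```

If accuracy under `random_router` collapses toward the full-context baseline while the learned router does not, the routing step, not the budgeting and compilation machinery alone, is carrying the gains.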

Figures

Figures reproduced from arXiv: 2605.03312 by Guiling Wang, Jiayi Chen, Yingcong Li.

Figure 1: Comparison of existing SLM memory approaches.
Figure 2: Overview of the MemFlow pipeline. SLM chip icons denote SLM inference points.
Figure 3: (a) Per-question-type accuracy (lines, right axis) and mean answer-context tokens (bars, left axis).
Figure 4: Snippets of MemFlow’s internal states and actions on representative queries.
Figure 5: Per-question-type accuracy (%, lines, right axis) and average context tokens (hatched bars).
original abstract

Modern language agents must operate over long-horizon, multi-turn histories, yet deploying such agents with Small Language Models (SLMs) remains fundamentally difficult. Full-context prompting causes context overflow, flat retrieval exposes the model to noisy evidence, and open-ended agentic loops are unreliable under limited reasoning capacity. We argue that a substantial portion of SLM memory failure arises from mismatched memory operations: different query types demand categorically different retrieval strategies, evidence transformations, and context budgets that SLMs cannot reliably self-orchestrate through open-ended reasoning. We introduce MemFlow, a training-free memory orchestration framework that externalizes memory planning from the SLM. A Router Agent classifies each query by intent and dispatches it to the Memory Agent, which executes one of three specialized tiers (Profile Lookup, Targeted Retrieval, or Deep Reasoning) and assembles the resulting evidence under a dynamic, tier-aware token budget. An Answer Agent then generates a response from this compact context, and a Validator Agent optionally retries with a heavier memory tier when the response is not supported by the provided evidence. This route-then-compile design avoids tool-selection hallucination and reasoning loops while keeping the answer context compact. Evaluated on a frozen Qwen3-1.7B backbone across long-horizon memory benchmarks - LongMemEval, LoCoMo, and LongBench - MemFlow improves accuracy by nearly 2x over full-context SLM baselines. These results suggest that structured intent routing and deterministic evidence preparation can make limited-capacity models substantially more effective in resource-constrained long-horizon agents.
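
The Validator retry mentioned in the abstract can be sketched from the appendix fragments captured lower on this page (the ESCALATE_REQUIRED sentinel, the six-word passthrough, and the token-overlap fallback with τground = 0.07). The three checks are from those fragments; function boundaries and names are editorial guesses.

```python
# Editorial reconstruction of the Validator Agent's gating logic from
# appendix fragments; function names and structure are guesses.
import re

TAU_GROUND = 0.07  # token-overlap threshold for the heuristic fallback

def needs_escalation(answer: str) -> bool:
    """Hard-failure detection: empty answers, ESCALATE_REQUIRED variants,
    and 'not found' patterns trigger a retry with a heavier tier, no LLM call."""
    if not answer.strip():
        return True
    return bool(re.search(r"ESCALATE_REQUIRED|not found", answer, re.I))

def passthrough(answer: str) -> bool:
    """Short-answer passthrough: answers of <=6 words, purely numeric answers,
    or yes/no answers bypass the grounding check entirely."""
    a = answer.strip()
    return (len(a.split()) <= 6 or a.isdigit()
            or a.lower() in ("yes", "no"))

def overlap_grounded(answer: str, evidence: str) -> bool:
    """Fallback heuristic used when the LLM yes/no grounding check
    cannot be parsed: require token overlap of at least TAU_GROUND."""
    a = set(answer.lower().split())
    e = set(evidence.lower().split())
    return len(a & e) / max(len(a), 1) >= TAU_GROUND
```

In the paper's pipeline the middle case (neither hard failure nor passthrough) goes to a Qwen3-1.7B grounding prompt first; the overlap heuristic above is only the parse-failure fallback.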

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MemFlow, a training-free memory orchestration framework for small language model (SLM) agents operating over long-horizon histories. It externalizes memory planning via a Router Agent that classifies query intent and dispatches to a Memory Agent executing one of three fixed tiers (Profile Lookup, Targeted Retrieval, or Deep Reasoning), assembles evidence under a tier-aware token budget, and uses an Answer Agent plus optional Validator Agent for response generation and retry. Evaluated on a frozen Qwen3-1.7B backbone, the framework is claimed to nearly double accuracy over full-context SLM baselines on LongMemEval, LoCoMo, and LongBench by avoiding context overflow, noisy retrieval, and unreliable open-ended self-orchestration.

Significance. If the reported gains are substantiated with rigorous experiments, MemFlow would represent a meaningful advance for resource-constrained long-horizon agents. The core idea of structured intent routing plus deterministic evidence preparation offers a practical alternative to both flat retrieval and fully agentic loops, potentially improving reliability for SLMs without additional training. This could influence design patterns in multi-agent systems where capacity limits make self-orchestration fragile.

major comments (3)
  1. Abstract: The central performance claim of 'nearly 2x' accuracy improvement over full-context baselines on LongMemEval, LoCoMo, and LongBench is stated without any quantitative metrics, exact baseline scores, error bars, statistical significance tests, or ablation results. This absence prevents assessment of whether the gains are robust or attributable to the proposed design.
  2. The manuscript provides no measurement or ablation of Router Agent intent classification accuracy, nor any breakdown of tier invocation frequencies across the benchmarks. Without these data, it remains unclear whether the reported improvements derive from reliable intent-to-tier mapping or from ancillary factors such as dynamic token budgeting and evidence compilation.
  3. No analysis is given of query types that fall outside the three fixed tiers or of performance degradation when the Router misclassifies intent. This leaves the sufficiency of the tier set untested and risks overstating the framework's generality for arbitrary long-horizon memory needs.
minor comments (2)
  1. The description of the three memory tiers would be strengthened by concrete examples of query intents assigned to each tier and the precise evidence transformations performed by the Memory Agent.
  2. Implementation details for the Router, Memory, Answer, and Validator agents (e.g., prompting templates, decision criteria for validator retry) are referenced but not provided, hindering reproducibility.
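
To make the second minor comment concrete, here is the kind of router prompt template such an implementation appendix might disclose. This is purely an editorial illustration of what "prompting templates" would mean here; nothing in it is from the paper.

```python
# Hypothetical router prompt template; the tier descriptions are invented
# examples of intent-to-tier criteria, not the authors' actual prompt.
ROUTER_PROMPT = """You are a router for a memory system.
Classify the user query into exactly one tier:
- profile_lookup: stable facts about the user (name, preferences)
- targeted_retrieval: a specific past event or statement
- deep_reasoning: aggregation or comparison across many sessions
Query: {query}
Answer with the tier name only."""

def render(query: str) -> str:
    """Fill the template with a concrete query."""
    return ROUTER_PROMPT.format(query=query)
```

Disclosing a template like this, plus the validator's retry criteria, would address the reproducibility concern directly.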

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and describe the revisions planned for the manuscript.

point-by-point responses
  1. Referee: Abstract: The central performance claim of 'nearly 2x' accuracy improvement over full-context baselines on LongMemEval, LoCoMo, and LongBench is stated without any quantitative metrics, exact baseline scores, error bars, statistical significance tests, or ablation results. This absence prevents assessment of whether the gains are robust or attributable to the proposed design.

    Authors: We agree that the abstract would be strengthened by including concrete numbers. In the revised version we will replace the qualitative 'nearly 2x' phrasing with the exact accuracy figures for MemFlow and the full-context baseline on each of the three benchmarks, drawn directly from the results tables in Section 4. We will also note the presence of ablation studies and any error bars or significance tests reported in the experimental section. revision: yes

  2. Referee: The manuscript provides no measurement or ablation of Router Agent intent classification accuracy, nor any breakdown of tier invocation frequencies across the benchmarks. Without these data, it remains unclear whether the reported improvements derive from reliable intent-to-tier mapping or from ancillary factors such as dynamic token budgeting and evidence compilation.

    Authors: Direct accuracy measurement of the Router is difficult because the benchmarks lack ground-truth intent labels. We will nevertheless add a breakdown of tier invocation frequencies (which can be extracted from the existing experimental logs) to the revised manuscript. This will make the usage patterns explicit and allow readers to assess how often each tier is selected. The end-to-end gains and tier-specific ablations already presented in Section 4 provide supporting evidence that the structured routing contributes to the observed improvements. revision: partial

  3. Referee: No analysis is given of query types that fall outside the three fixed tiers or of performance degradation when the Router misclassifies intent. This leaves the sufficiency of the tier set untested and risks overstating the framework's generality for arbitrary long-horizon memory needs.

    Authors: We accept that an explicit discussion of out-of-tier queries and misclassification effects would improve the paper. The three tiers were derived from the dominant query patterns in the evaluated benchmarks. In the revision we will add a limitations paragraph that enumerates query types observed to fall outside the current tier set and reports any performance degradation noted during error analysis of misrouted examples. This will clarify the scope of the framework without overstating generality. revision: yes

Circularity Check

0 steps flagged

No circularity: procedural framework with empirical claims only

full rationale

The paper describes a training-free procedural architecture (Router Agent classifies intent and routes to one of three fixed tiers executed by Memory Agent, followed by Answer Agent and optional Validator) without any equations, derivations, fitted parameters, or mathematical predictions. Central performance claims rest on benchmark evaluations (LongMemEval, LoCoMo, LongBench) with a frozen backbone rather than any reduction of results to self-defined quantities or self-citation chains. No load-bearing step equates outputs to inputs by construction, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The central claim rests on the domain assumption that SLMs cannot self-orchestrate memory operations and on the introduction of new agent components whose effectiveness is asserted via benchmark results.

axioms (1)
  • domain assumption A substantial portion of SLM memory failure arises from mismatched memory operations that SLMs cannot reliably self-orchestrate through open-ended reasoning.
    This premise is explicitly stated as the core argument motivating the externalized framework.
invented entities (3)
  • Router Agent no independent evidence
    purpose: Classifies each query by intent and dispatches to memory tiers
    New component introduced to externalize planning from the SLM.
  • Memory Agent no independent evidence
    purpose: Executes one of three specialized tiers and assembles evidence under tier-aware token budget
    New component for deterministic memory operations.
  • Validator Agent no independent evidence
    purpose: Optionally retries with heavier memory tier when response is unsupported
    New optional component for quality control.

pith-pipeline@v0.9.0 · 5581 in / 1567 out tokens · 60579 ms · 2026-05-09T16:50:47.032926+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 29 canonical work pages · 22 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219, 2024

  2. [2]

    Introducing the Next Generation of Claude

    Anthropic. Introducing the next generation of Claude.Anthropic Blog, 2024

  3. [3]

    Self-RAG: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2024

  4. [4]

    LongBench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

  5. [5]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  6. [6]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, et al. SmolLM2: When smol goes big – data-centric training of a small language model.arXiv preprint arXiv:2502.02737, 2025

  7. [7]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance.arXiv preprint arXiv:2305.05176, 2023

  8. [8]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

  9. [9]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  10. [10]

    MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents

    Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, and Kam-Fai Wong. MemGuide: Intent-driven memory selection for goal-oriented multi-session LLM agents. arXiv preprint arXiv:2505.20231, 2025

  11. [11]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024

  12. [12]

    Tool preferences in agentic LLMs are unreliable

    Kazem Faghih, Wenxiao Wang, Yize Cheng, Siddhant Bharti, Gaurang Sriramanan, Sriram Balasubramanian, Parsa Hosseini, and Soheil Feizi. Tool preferences in agentic LLMs are unreliable. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20954–20969, 2025

  13. [13]

    Gemma 3 Technical Report

    Gemma Team. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  14. [14]

    Gemini: A Family of Highly Capable Multimodal Models

    Google DeepMind. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2024

  15. [15]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The LLaMA 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  16. [16]

    REALM: Retrieval-augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. InProceedings of the 37th International Conference on Machine Learning, pages 3929–3938, 2020

  17. [17]

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2021

  18. [18]

    Leveraging passage retrieval with generative models for open domain question answering

    Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, 2021

  19. [19]

    LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

  20. [20]

    Active Retrieval Augmented Generation

    Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of EMNLP, 2023

  21. [21]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas O ˘guz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020

  22. [22]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks.arXiv preprint arXiv:2005.11401, 2020

  23. [23]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  24. [24]

    MobileLLM: Optimizing Sub-Billion Parameter Language Models for On-Device Use Cases

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, and Vikas Chandra. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases.arXiv preprint arXiv:2402.14905, 2024

  25. [25]

    Evaluating very long-term conversational memory of LLM agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

  26. [26]

    Query routing for homogeneous tools: An instantiation in the RAG scenario

    Feiteng Mu, Yong Jiang, Liwen Zhang, Chu Liu, Wenjie Li, Pengjun Xie, and Fei Huang. Query routing for homogeneous tools: An instantiation in the RAG scenario. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 10225–10230, Miami, Florida, USA, 2024. Association for Computational Linguistics

  27. [27]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665, 2024

  28. [28]

    GPT-4o mini: Advancing Cost-Efficient Intelligence

    OpenAI. GPT-4o mini: Advancing cost-efficient intelligence.OpenAI Blog, 2024

  29. [29]

    GPT-4o System Card

    OpenAI. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024

  30. [30]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards llms as operating systems.arXiv preprint arXiv:2310.08560, 2023

  31. [31]

    LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024, pages 963–981, 2024

  32. [32]

    Generative Agents: Interactive Simulacra of Human Behavior

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023

  33. [33]

    ENGRAM: Effective, Lightweight Memory Orchestration for Conversational Agents

    Daivik Patel and Shrenik Patel. ENGRAM: Effective, lightweight memory orchestration for conversational agents.arXiv preprint arXiv:2511.12960, 2025

  34. [34]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334, 2023

  35. [35]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  36. [36]

    Zep: A Temporal Knowledge Graph Architecture for Agent Memory

    Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956, 2025

  37. [37]

    System-1.x: Learning to balance fast and slow planning with language models

    Swarnadeep Saha, Archiki Prasad, Justin Chih-Yao Chen, Peter Hase, Elias Stengel-Eskin, and Mohit Bansal. System-1.x: Learning to balance fast and slow planning with language models. arXiv preprint arXiv:2407.14414, 2024

  38. [38]

    ColBERTv2: Effective and efficient retrieval via lightweight late interaction

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. InProceedings of NAACL, 2022

  39. [39]

    RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations (ICLR), 2024

  40. [40]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.arXiv preprint arXiv:2302.04761, 2023

  41. [41]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.arXiv preprint arXiv:2303.11366, 2023

  42. [42]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10014–10037, 2023

  43. [43]

    LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Long- MemEval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

  44. [44]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155, 2024

  45. [45]

    Corrective Retrieval Augmented Generation

    Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation.arXiv preprint arXiv:2401.15884, 2024

  46. [46]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

  47. [47]

    Rankrag: Unifying context ranking with retrieval-augmented generation in llms

    Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. RankRAG: Unifying context ranking with retrieval-augmented generation in LLMs.arXiv preprint arXiv:2407.02485, 2024

  48. [48]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences. InAdvances in Neural Information Processing Systems, 2020

  49. [49]

    RAFT: Adapting Language Model to Domain Specific RAG

    Tianhao Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E. Gonzalez. RAFT: Adapting language model to domain specific RAG. arXiv preprint arXiv:2403.10131, 2024

  50. [50]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  51. [51]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR), 2024

  52. [52]

    Revisiting Pruning vs Quantization for Small Language Models

    Zihan Zhou, Simon Kurz, and Zhixue Zhao. Revisiting pruning vs quantization for small language models.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 12055–12070, 2025

  53. [53]

    Mem1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841, 2025

  54. [54] · internal anchor

    Hard-failure detection. Empty answers, the exact string ESCALATE_REQUIRED (or variants matched by a regex), and “not found” patterns immediately trigger escalation. No LLM call is made.

  55. [55] · internal anchor

    Short-answer passthrough. Answers of ≤6 words, purely numeric answers (e.g. “5”, “21 days”), and boolean answers (“yes”/“no”) bypass the grounding check entirely. Single-number or short factual extractions are almost always correct when they appear in the answer agent’s output.

  56. [56] · internal anchor

    LLM grounding check. For all remaining answers, the validator calls Qwen3-1.7B with a grounding prompt. If the response cannot be parsed as yes/no, the validator falls back to a token-overlap heuristic (τground = 0.07).