CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents
Pith reviewed 2026-05-15 10:11 UTC · model grok-4.3
The pith
CLAG has an SLM agent cluster its own memories and generate profiles for each group, letting retrieval first filter relevant clusters to cut interference and raise answer quality on QA tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLAG is a clustering-based agentic memory system in which an SLM-driven router assigns each incoming memory to a semantically coherent cluster and then autonomously produces a profile for that cluster, containing a topic summary and descriptive tags. Clusters evolve locally, and retrieval first selects matching clusters via their profiles before searching inside the selected groups, which limits cross-topic interference and raises memory density for small language models.
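The routing step can be reduced to a few lines. This is a minimal sketch, not the paper's implementation: tag overlap stands in for the SLM router's semantic judgment, and the field names (`cluster_id`, `summary`, `tags`, `memories`) are assumed for illustration.

```python
def route_memory(new_tags, clusters):
    """Assign a new memory to the best-matching cluster, or open a new one.

    Tag overlap is a cheap stand-in for the SLM router's semantic choice;
    CLAG itself prompts the SLM to return exactly one cluster_id as JSON.
    """
    best = max(clusters,
               key=lambda c: len(set(c["tags"]) & set(new_tags)),
               default=None)
    if best is None or not set(best["tags"]) & set(new_tags):
        # No coherent home for this memory: start a fresh cluster whose
        # profile (summary + tags) would be written by the SLM.
        best = {"cluster_id": f"cluster_{len(clusters) + 1}",
                "summary": "", "tags": list(new_tags), "memories": []}
        clusters.append(best)
    best["memories"].append({"tags": list(new_tags)})
    return best["cluster_id"]
```

In this toy version, two memories sharing any tag land in the same cluster, while a memory with no overlap spawns a new one, mirroring the router's assign-or-create decision.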
What carries the argument
An SLM-driven router that assigns memories to clusters and writes self-contained profiles, combined with two-stage, profile-filtered retrieval.
Where Pith is reading between the lines
- The same router-plus-profile pattern could be applied to tool-call histories or multi-turn plans to keep agent state organized over long sessions.
- Clusters created this way may make it easier to prune or compress old memories without losing entire topics.
- The method suggests running controlled tests that track how cluster purity changes when the underlying SLM is swapped for a weaker or stronger backbone.
Load-bearing premise
The same small language model can reliably decide which memories belong together and write useful cluster profiles without frequent assignment mistakes or extra tuning.
What would settle it
A direct test would be to measure whether cluster assignments frequently mix unrelated memories and whether removing the profile-filter stage erases the reported QA gains on the same datasets.
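The first half of that test, checking whether clusters mix unrelated memories, has a standard metric: cluster purity. A minimal sketch, assuming gold topic labels per memory (an annotation the paper's datasets do not necessarily provide):

```python
from collections import Counter

def cluster_purity(assignments, gold_topics):
    """Fraction of memories sharing their cluster's majority gold topic.

    assignments: memory_id -> cluster_id, as produced by the router.
    gold_topics: memory_id -> hand-labelled topic (an assumed annotation).
    Purity near 1.0 means clusters rarely mix unrelated memories.
    """
    by_cluster = {}
    for mem_id, cluster_id in assignments.items():
        by_cluster.setdefault(cluster_id, []).append(gold_topics[mem_id])
    majority_hits = sum(Counter(topics).most_common(1)[0][1]
                        for topics in by_cluster.values())
    return majority_hits / len(assignments)
```

The second half, ablating the profile-filter stage, needs no new metric: rerun the same QA evaluation with stage-one filtering disabled and compare.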
Original abstract
Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small language models (SLMs), which are highly vulnerable to irrelevant context. We introduce CLAG, a CLustering-based AGentic memory framework where an SLM agent actively organizes memory by clustering. CLAG employs an SLM-driven router to assign incoming memories to semantically coherent clusters and autonomously generates cluster-specific profiles, including topic summaries and descriptive tags, to establish each cluster as a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG effectively reduces cross-topic interference and enhances internal memory density. During retrieval, the framework utilizes a two-stage process that first filters relevant clusters via their profiles, thereby excluding distractors and reducing the search space. Experiments on multiple QA datasets with three SLM backbones show that CLAG consistently improves answer quality and robustness over prior memory systems for agents, remaining lightweight and efficient.
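The two-stage retrieval described in the abstract can be sketched as follows. This is an illustrative reduction under assumed data shapes (clusters carry profile tags, memories carry their own tags); the real system asks the SLM to judge profile relevance rather than counting tag overlap.

```python
def two_stage_retrieve(query_tags, clusters, top_n=2, k=3):
    """Stage 1: keep clusters whose profile tags overlap the query tags.
    Stage 2: rank memories only inside the surviving clusters.

    Distractor clusters never enter stage 2, which is the mechanism
    CLAG credits for reduced cross-topic interference.
    """
    def overlap(tags):
        return len(set(tags) & set(query_tags))

    selected = sorted((c for c in clusters if overlap(c["tags"]) > 0),
                      key=lambda c: -overlap(c["tags"]))[:top_n]
    pool = [m for c in selected for m in c["memories"]]
    return sorted(pool, key=lambda m: -overlap(m["tags"]))[:k]
```

Note that a memory in an unselected cluster can never be retrieved, however well it matches the query, which is exactly the failure mode a router misassignment would cause.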
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CLAG, a clustering-based agentic memory framework for small language model (SLM) agents. It employs an SLM-driven router to assign incoming memories to semantically coherent clusters, autonomously generates cluster-specific profiles (topic summaries and descriptive tags), performs localized evolution within clusters to reduce cross-topic interference, and uses a two-stage retrieval process that first filters relevant clusters via profiles before retrieving within them. Experiments on multiple QA datasets with three SLM backbones report consistent improvements in answer quality and robustness over prior memory systems for agents while remaining lightweight and efficient.
Significance. If the empirical results hold under closer scrutiny, CLAG would provide a practical, agent-driven method for structuring external memory in SLM agents to mitigate knowledge dilution and interference, addressing a key vulnerability of smaller models in long-context or multi-topic settings. The emphasis on autonomous profile generation and profile-guided filtering offers a lightweight alternative to global retrieval pools, with potential applicability to other resource-constrained agent architectures.
Major comments (2)
- [§4 (Experiments)] The central claim of consistent improvements in answer quality and robustness is asserted without details on baselines, statistical significance tests, error bars, or exact data splits, leaving the empirical support only moderately strong and making it difficult to assess the magnitude or reliability of gains across the three SLM backbones.
- [§3 (Method) and §4 (Experiments)] The framework's core mechanism relies on the SLM router producing semantically coherent clusters and useful profiles with low error rates to enable localized evolution and effective two-stage retrieval. No quantification of router assignment accuracy, cluster coherence (e.g., purity or silhouette scores), or profile utility is provided, nor is there failure-case analysis; without these, it remains unclear whether observed benefits derive from the clustering structure or simply from retrieval filtering.
Minor comments (1)
- [Abstract] The statement that CLAG 'remains lightweight and efficient' would be strengthened by including at least one concrete metric (e.g., additional tokens or latency overhead relative to baselines) rather than a qualitative claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the empirical presentation and analysis.
Point-by-point responses
Referee: [§4 (Experiments)] The central claim of consistent improvements in answer quality and robustness is asserted without details on baselines, statistical significance tests, error bars, or exact data splits, leaving the empirical support only moderately strong and making it difficult to assess the magnitude or reliability of gains across the three SLM backbones.
Authors: We agree that the current presentation of results can be strengthened. In the revised manuscript we will expand §4 to explicitly list all baselines with implementation details, report mean performance with standard deviation error bars across multiple random seeds, include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values), and specify the exact train/validation/test splits used for each QA dataset. These additions will allow readers to better evaluate the magnitude and reliability of gains across the three SLM backbones.
Revision: yes
Referee: [§3 (Method) and §4 (Experiments)] The framework's core mechanism relies on the SLM router producing semantically coherent clusters and useful profiles with low error rates to enable localized evolution and effective two-stage retrieval. No quantification of router assignment accuracy, cluster coherence (e.g., purity or silhouette scores), or profile utility is provided, nor is there failure-case analysis; without these, it remains unclear whether observed benefits derive from the clustering structure or simply from retrieval filtering.
Authors: We acknowledge the value of direct internal metrics. In the revision we will add to §3 and §4: (i) router assignment accuracy measured on a held-out sample via human annotation or proxy labels, (ii) cluster coherence via purity and silhouette scores computed on the formed clusters, (iii) ablation results isolating profile utility, and (iv) a dedicated failure-case subsection discussing router misassignments and their impact. While end-to-end QA gains versus flat-retrieval baselines already indicate that structured clustering contributes beyond simple filtering, these new analyses will make the source of the gains explicit.
Revision: yes
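Of the coherence metrics the authors promise, silhouette is the one computable without gold labels. A dependency-free sketch over scalar "embeddings" with distance |x − y| (real memories would use high-dimensional embeddings and cosine distance; the 1-D setup here is only for illustration):

```python
def silhouette_score(points, labels):
    """Mean silhouette over all points, using |x - y| on scalar embeddings.

    For each point: a = mean distance to its own cluster, b = lowest mean
    distance to any other cluster; silhouette = (b - a) / max(a, b).
    Values near 1 indicate tight, well-separated clusters.
    """
    n = len(points)

    def mean_dist(i, lab, exclude_self):
        ds = [abs(points[i] - points[j]) for j in range(n)
              if labels[j] == lab and (not exclude_self or j != i)]
        return sum(ds) / len(ds) if ds else 0.0  # singleton cluster -> 0

    scores = []
    for i in range(n):
        a = mean_dist(i, labels[i], exclude_self=True)
        b = min(mean_dist(i, lab, exclude_self=False)
                for lab in set(labels) if lab != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n
```

Running this on the router's clusters across backbones would directly support the backbone-swap experiment suggested earlier in the review.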
Circularity Check
No significant circularity; the framework is a self-contained empirical description.
Full rationale
The paper introduces CLAG as an engineering framework for agent memory organization, relying on an SLM-driven router for clustering and two-stage retrieval. No equations, derivations, or mathematical claims appear in the provided text. Central assertions rest on experimental outcomes across QA datasets rather than on any chain of reasoning that reduces, by construction, to fitted inputs, self-definitions, or load-bearing self-citations. The description of cluster profiles and localized evolution stands on its own as a proposed system, with performance evaluated externally via benchmarks. No steps match the enumerated circularity patterns.