From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
Pith reviewed 2026-05-17 01:30 UTC · model grok-4.3
The pith
HippoRAG 2 enhances Personalized PageRank with deeper passage integration and online LLM use to outperform standard RAG on factual, sense-making, and associative memory tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HippoRAG 2 builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities.
What carries the argument
An enhanced Personalized PageRank algorithm that pairs deeper passage integration with more effective online LLM usage, organizing retrieved information in a way that better reflects the dynamic, interconnected structure of human memory.
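The graph-walk backbone here is standard Personalized PageRank: a random walk over the knowledge graph that restarts at query-linked seed nodes rather than uniformly, so scores concentrate in the query's neighborhood. A minimal power-iteration sketch (vanilla PPR, not the HippoRAG 2 implementation; the graph, damping factor, and seeding below are illustrative assumptions):

```python
import numpy as np

def personalized_pagerank(adj, seeds, damping=0.85, tol=1e-8, max_iter=100):
    """Power iteration for Personalized PageRank.

    adj   : (n, n) adjacency matrix, adj[i, j] = edge weight from node j to i
    seeds : indices of query-linked nodes; the restart vector is concentrated
            on these nodes instead of being uniform, biasing scores
            toward the query's neighborhood.
    """
    n = adj.shape[0]
    # Column-normalize so each column is a transition distribution
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    M = adj / col_sums
    # Personalization (restart) vector concentrated on the seed nodes
    p = np.zeros(n)
    p[seeds] = 1.0 / len(seeds)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # With probability `damping` follow an edge, else restart at a seed
        r_next = damping * (M @ r) + (1 - damping) * p
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r
```

In HippoRAG-style retrieval, the converged scores rank graph nodes (and, via them, passages) for the query; the paper's contribution concerns how passages and online LLM calls feed into this walk, not the walk itself.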
If this is right
- Outperforms standard RAG across factual, sense-making, and associative memory tasks.
- Delivers a 7% gain over state-of-the-art embedding models specifically on associative memory.
- Enables non-parametric continual learning for large language models by organizing new knowledge without parameter updates.
- Avoids the drop in basic factual performance that earlier graph-augmented RAG methods showed.
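The continual-learning point above can be made concrete: in a non-parametric setup, "learning" a new fact means writing it into an external store that the retriever searches, with no gradient step on the model. A toy sketch (the class name and the lexical-overlap scorer are hypothetical stand-ins for HippoRAG 2's graph-based retrieval):

```python
class NonParametricMemory:
    """Toy external memory: knowledge lives in an index, not in weights."""

    def __init__(self):
        self.passages = []  # external store; model parameters are untouched

    def learn(self, passage: str):
        # "Learning" is just indexing: no parameter update, so no
        # catastrophic forgetting of previously stored knowledge.
        self.passages.append(passage)

    def retrieve(self, query: str, k: int = 3):
        # Crude lexical-overlap score, standing in for PPR-based retrieval
        q_words = set(query.lower().split())
        def score(p):
            return len(q_words & set(p.lower().split()))
        return sorted(self.passages, key=score, reverse=True)[:k]
```

The design point is that adding or revising knowledge is an index write, which is what lets such a system keep absorbing new information after deployment.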
Where Pith is reading between the lines
- The same integration techniques might improve other graph-based retrieval systems that currently rely on simpler vector search.
- Real-time knowledge updating scenarios could benefit if the online LLM component scales efficiently.
- Further benchmarks focused on long-term knowledge retention over weeks or months could test how well the approach approximates human memory dynamics.
Load-bearing premise
The specific improvements in passage integration depth and online LLM usage within Personalized PageRank are responsible for the gains, and the selected benchmarks reflect the dynamic, interconnected nature of human long-term memory.
What would settle it
An ablation experiment that removes either the deeper passage integration or the online LLM component from HippoRAG 2 and checks whether the reported advantages on factual, sense-making, and associative tasks disappear.
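The settling experiment described above amounts to a 2x2 ablation grid: each proposed mechanism toggled independently while everything else is held fixed. A hypothetical sketch of enumerating those configurations (field names and the fixed-budget values are illustrative, not from the paper):

```python
from itertools import product

# Controls that must be identical across all ablation runs for the
# causal claim to hold (values here are placeholders).
FIXED = {"retrieval_budget": 20, "graph": "shared", "stopping": "default"}

def ablation_configs():
    """Full system, each single ablation, and the double ablation."""
    configs = []
    for deep_integration, online_llm in product([True, False], repeat=2):
        configs.append({
            **FIXED,
            "deeper_passage_integration": deep_integration,
            "online_llm_in_ppr": online_llm,
        })
    return configs
```

If the gains on factual, sense-making, and associative tasks vanish only when a specific toggle is off, that toggle carries the causal weight; if they survive all four configurations, something else (budget, hyperparameters) is doing the work.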
Original abstract
Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose HippoRAG 2, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs. Code and data are available at https://github.com/OSU-NLP-Group/HippoRAG.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HippoRAG 2, an extension of the original HippoRAG framework that augments the Personalized PageRank retrieval algorithm with deeper passage integration and more effective online LLM usage. It claims this yields comprehensive outperformance over standard RAG and state-of-the-art embedding models across factual knowledge, sense-making, and associative memory tasks, including a quantified 7% gain on associative tasks, thereby advancing non-parametric continual learning for LLMs that better approximates human long-term memory. Code and data are released.
Significance. If the reported gains prove robust and attributable to the specified modifications, the work would offer a concrete step toward retrieval systems that capture the interconnected, dynamic structure of human memory rather than relying solely on vector similarity. The empirical comparisons on held-out tasks and public release of code support reproducibility and incremental progress in non-parametric continual learning.
major comments (1)
- [Experimental evaluation] Experimental evaluation (associative memory tasks results): The manuscript attributes the headline 7% improvement over SOTA embeddings to the two targeted changes—deeper passage integration and more effective online LLM usage within Personalized PageRank. No ablation is reported that isolates either change (e.g., HippoRAG 2 minus deeper integration, or minus online LLM calls in PPR) while holding retrieval budget, graph construction, and stopping criteria fixed. Without these controls, alternative explanations such as incidental increases in retrieval budget or hyperparameter differences cannot be ruled out, weakening the causal claim for the proposed mechanisms.
minor comments (2)
- [Abstract] The abstract claims comprehensive outperformance and 'superior factual knowledge and sense-making memory capabilities' without referencing the precise baselines, number of runs, or statistical tests used to support these claims.
- [Method] Notation for the enhanced Personalized PageRank procedure could be clarified with an explicit equation or pseudocode distinguishing the new integration depth and online LLM components from the original HippoRAG formulation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the experimental evaluation of HippoRAG 2. We address the concern regarding attribution of performance gains point by point below.
Point-by-point responses
- Referee: [Experimental evaluation] Experimental evaluation (associative memory tasks results): The manuscript attributes the headline 7% improvement over SOTA embeddings to the two targeted changes—deeper passage integration and more effective online LLM usage within Personalized PageRank. No ablation is reported that isolates either change (e.g., HippoRAG 2 minus deeper integration, or minus online LLM calls in PPR) while holding retrieval budget, graph construction, and stopping criteria fixed. Without these controls, alternative explanations such as incidental increases in retrieval budget or hyperparameter differences cannot be ruled out, weakening the causal claim for the proposed mechanisms.
Authors: We agree that explicit ablations isolating deeper passage integration and online LLM usage within Personalized PageRank, while strictly holding retrieval budget, graph construction, and stopping criteria fixed, would strengthen the causal attribution of the reported gains. The current manuscript demonstrates the overall superiority of HippoRAG 2 through comparisons against the original HippoRAG, standard RAG, and SOTA embedding models, with retrieval budgets matched across systems. However, to directly address this concern and rule out alternative explanations, we will incorporate the requested ablation studies in the revised version. revision: yes
Circularity Check
No circularity: empirical framework with held-out task comparisons
full rationale
The manuscript presents HippoRAG 2 as an engineering extension of the prior HippoRAG Personalized PageRank procedure, adding deeper passage integration and online LLM calls. All reported gains (including the 7% associative-memory lift) are shown via direct empirical comparisons against baselines on separate factual, sense-making, and associative benchmarks. No equations, fitted parameters, or first-principles derivations are invoked whose outputs are definitionally identical to their inputs; the central claims therefore remain independent of any self-referential reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- PageRank damping factor and integration depth
axioms (1)
- Domain assumption: LLMs can be used effectively for online retrieval decisions without introducing systematic bias on the evaluated tasks.
Forward citations
Cited by 19 Pith papers
- MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare. MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
- DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning. DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
- Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory. MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
- Controlling Authority Retrieval: A Missing Retrieval Objective for Authority-Governed Knowledge. CAR is a new retrieval objective that targets the currently active authority set rather than most-similar documents, with theorems on coverage conditions and evaluations showing two-stage methods outperform dense retr...
- Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems. Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.
- AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation. AtomicRAG replaces chunk-based and triple-based GraphRAG with atom-entity graphs that store facts as atomic units and use personalized PageRank plus relevance filtering to achieve higher retrieval accuracy and reasoni...
- Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods ...
- Cognifold: Always-On Proactive Memory via Cognitive Folding. Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...
- SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory. SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
- HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution. HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
- MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search. MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
- Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA. Two-hop QA retrieval performance depends on whether the hop-2 entity is in the question or bridge passage, and a simple predicate-based router trained on one dataset transfers to improve R@5 on others.
- HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues. HingeMem segments dialogue memory via boundary-triggered hyperedges over four elements and applies query-adaptive retrieval, yielding ~20% relative gains and 68% lower QA token cost versus baselines on LOCOMO.
- BridgeRAG: Training-Free Bridge-Conditioned Retrieval for Multi-Hop Question Answering. BridgeRAG improves training-free multi-hop retrieval by using a bridge-conditioned LLM scorer to rank evidence chains, achieving new best R@5 scores on MuSiQue, 2WikiMultiHopQA, and HotpotQA.
- MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens. MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
- MemOS: A Memory OS for AI System. MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.
- From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs. The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
- Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation. A minimalist retrieval-and-generation framework using turn isolation and query-driven pruning outperforms complex memory systems by directly addressing signal sparsity and dual-level redundancy in dialogues.