MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, do not scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose GraphRAG, a graph-based approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text. Our approach uses an LLM to build a graph index in two stages: first, to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that GraphRAG leads to substantial improvements over a conventional RAG baseline for both the comprehensiveness and diversity of generated answers.
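The pipeline described in the abstract (entity graph, community summaries, then a map-reduce pass over those summaries) can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: `extract_entities` and `summarize_community` are hypothetical stand-ins for LLM calls, connected components are swapped in for whatever community detection the real system uses, and `llm` is any caller-supplied function that maps a prompt string to a response string.

```python
from itertools import combinations


def extract_entities(chunk: str) -> set[str]:
    """Hypothetical stand-in for LLM entity extraction: capitalized words."""
    return {w.strip(".,;:!?") for w in chunk.split() if w[:1].isupper()}


def build_graph(chunks: list[str]) -> dict[str, set[str]]:
    """Stage 1: link entities that co-occur in the same text chunk."""
    graph: dict[str, set[str]] = {}
    for chunk in chunks:
        entities = extract_entities(chunk)
        for entity in entities:
            graph.setdefault(entity, set())
        for a, b in combinations(sorted(entities), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_communities(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group related entities. Assumption: connected components are used
    here as a simple stand-in for a real community detection algorithm."""
    seen: set[str] = set()
    communities: list[list[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


def summarize_community(members: list[str], chunks: list[str]) -> str:
    """Stage 2 (hypothetical stand-in for an LLM summary): concatenate
    the chunks that mention any entity in the community."""
    wanted = set(members)
    return " ".join(c for c in chunks if extract_entities(c) & wanted)


def answer_global(question: str, summaries: list[str], llm) -> str:
    """Query time: one partial answer per community summary (map),
    then one final call that summarizes the partial answers (reduce)."""
    partials = [
        llm(f"Question: {question}\nContext: {summary}") for summary in summaries
    ]
    combined = "\n".join(partials)
    return llm(f"Question: {question}\nPartial answers:\n{combined}")


if __name__ == "__main__":
    docs = [
        "Alice founded Acme with Bob.",
        "Bob later joined Initech.",
        "Carol studies glaciers in Iceland.",
    ]
    groups = find_communities(build_graph(docs))
    summaries = [summarize_community(g, docs) for g in groups]
    # Toy "LLM" that just reports the prompt length; replace with a real client.
    print(answer_global("What are the main themes?", summaries, lambda p: f"{len(p)} chars"))
```

The point of the sketch is the shape of the computation: indexing cost is paid once up front, and a global question touches every community summary rather than a handful of retrieved chunks.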
representative citing papers
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
ContextNest formalizes context governance for AI agents using hash-chained documents and deterministic selectors, with experiments showing higher answer quality and perfect determinism versus standard retrieval.
The paper develops a theoretical perspective showing that no hard rule can perfectly reject false unsupported trajectories while retaining true-but-unobserved ones under incomplete graph evidence, and it characterizes soft grounding as a KL-regularized deformation of the LLM prior.
A fixed-iteration spreading activation with per-step cosine similarity gating enables query-aware KG retrieval as one database query, matching QAFD-RAG on MuSiQue while cutting latency.
Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.
On heterogeneous document collections, only query expansion and a newly introduced per-source calibrated corrector (SSCC) deliver reliable gains beyond a strong cross-encoder reranker; other common retrieval enhancements do not.
PROBE is a generalized rank-based KGC evaluation framework with adjustable sharpness and bias-robustness components that satisfies six claimed key properties where prior metrics fall short.
SARDI uses lookahead tokens from low-confidence predictions in discrete diffusion language models to dynamically guide retrieval during denoising, outperforming training-free baselines on five multi-hop QA benchmarks at up to 8x higher throughput.
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
PersonaTree is a new hierarchical memory framework for persistent LLM agents that structures evidence into persona claims via support paths and outperforms baselines on six person-understanding benchmarks.
LifeSide is a new benchmark that evaluates AI agents on multi-session Memory-Emotion-Environment loops via simulated user profiles and event trajectories, revealing that models saturating existing memory tests fail at long-horizon user understanding.
QO-Bench shows RAG systems retrieve relevant text but often discard typed values required for query operators, with paradigm performance inverting across operators and execution remaining a bottleneck even with gold evidence.
HyperPatch reformulates sequential n-ary knowledge editing as hypergraph manifold stability, using HGNN initialization, SimHash alignment plus Topological LoRA, and fused reasoning to achieve large H-Acc gains on MQuAKE benchmarks.
SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.
AuditFlow combines a graph-grounded symbolic environment with a multi-agent LLM setup to reach 82.09% joint audit accuracy on structured financial reports, 14.93 points above the strongest baseline.
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
LLM-Wiki structures external knowledge as compilable wiki pages with links and persistent self-correction, achieving SOTA results on HotpotQA, MuSiQue, and 2WikiMultiHopQA by 2.0-8.1 F1 points over prior RAG systems.
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.
GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.
PGR expands user queries into plausible future steps via Tree-of-Thought or chains and uses them as retrieval probes, delivering nearly 3x recall gains on the new MemoryQuest benchmark for low-similarity memory retrieval.
citing papers explorer
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
-
ContextNest: Verifiable Context Governance for Autonomous AI Agent
ContextNest formalizes context governance for AI agents using hash-chained documents and deterministic selectors, with experiments showing higher answer quality and perfect determinism versus standard retrieval.
-
Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
-
SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale
SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.
-
AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification
AuditFlow combines a graph-grounded symbolic environment with a multi-agent LLM setup to reach 82.09% joint audit accuracy on structured financial reports, 14.93 points above the strongest baseline.
-
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
-
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
-
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
The Context Gathering Decision Process: A POMDP Framework for Agentic Search
Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no performance loss.
-
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
-
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.
-
STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering
STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.
-
ROZA Graphs: Self-Improving Near-Deterministic RAG through Evidence-Centric Feedback
ROZA graphs enable self-improving RAG by storing evidence-specific reasoning chains, yielding up to 10.6pp accuracy gains and 46% lower cost through graph traversal feedback.
-
GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning
GraphScout trains LLMs to autonomously synthesize structured training data from knowledge graphs via flexible exploration tools, enabling a 4B model to outperform larger LLMs by 16.7% on average with fewer inference tokens and strong cross-domain transfer.
-
Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval
ARK adaptively retrieves from knowledge graphs using global lexical search and one-hop neighborhood exploration, reaching 59.1% Hit@1 on STaRK with up to 31.4% gains over training-free baselines and enabling distillation to 8B models.
-
Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs
The paper specifies the SAT-Graph API, a canonical primitive interface that enables auditable, deterministic reasoning over temporal knowledge graphs by isolating uncertainty to intent translation and narrative synthesis.
-
MetaPS: Adaptive Programmatic Strategy Selection for Market Agents
MetaPS trains models via simulation rollouts to select from programmatic strategy libraries for market agents, yielding better performance than fixed or direct LLM baselines across model sizes.
-
A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation
HyGRAG is a hierarchical graph RAG framework that constructs LLM summaries over hybrid chunk-entity graphs, retrieves via context and relation awareness across levels, and enables dynamic updates, reporting a 9.7% average accuracy gain on multi-hop reasoning tasks.
-
Agents-K1: Towards Agent-native Knowledge Orchestration
Agents-K1 is an end-to-end pipeline with a multimodal parser, 4B GRPO-trained extractor, and agent CLI that builds scientific knowledge graphs from full papers and was run on 2.46 million documents to produce Scholar-KG.
-
TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management
TokenMizer builds a knowledge graph of LLM sessions and serializes it into 78-token resume blocks that retain more task, decision, and file information than flat-text baselines at roughly half the token cost.
-
Beyond Similarity: Trustworthy Memory Search for Personal AI Agents
MemGate is a 9M-parameter neural gate inserted between vector memory and LLM that converts similarity search into task-conditioned admission, reducing memory-induced threats across agent frameworks while preserving utility.
-
Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches
A survey of RLM use in 28 disciplines reveals uneven adoption and introduces a maturity assessment framework showing larger gaps when limited to public resources.
-
Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering
DualGraph combines semantic textual KGs with symbolic KGs for semi-structured QA and introduces the SpecsQA benchmark, outperforming baselines on both open and specification questions.
-
Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables
An empirical 2x2 factorial study on six statistical datasets shows that format and schema constraints in LLM-based KG construction from CSV tables produce a super-additive fidelity loss of up to +1.180, with mismatched pairs falling below baseline. The authors also release CSVFidelity-Bench.
-
GraphMind: From Operational Traces to Self-Evolving Workflow Automation
GraphMind builds and evolves action-centric workflow graphs from traces, navigates them via multi-agent LLM reasoning, and adapts via ATR, outperforming baselines on 93 incidents with 8x less context and 26% lower hallucination in production deployment.
-
IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation
IdeaForge combines multiple innovation methodologies through specialist agents on a persistent knowledge graph, using cross-methodology convergent claim linkages to rank and draft patent claims with higher traceability than single-method baselines.
-
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
Goal-Mem decomposes user goals into subgoals for targeted memory retrieval using Natural Language Logic, improving performance on multi-hop reasoning tasks in conversational agents.
-
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and long-term agent benchmarks.
-
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
-
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
ScrapMem reports SOTA 51.0% Joint@10 on ATM-Bench with up to 93% memory reduction and 70.3% Recall@10 via optical forgetting and EM-Graph.
-
Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text
Frontier LLMs solve single-needle retrieval at 1M tokens on classical Chinese but show three distinct accuracy-decay patterns in three-hop reasoning between 256K and 1M tokens.
-
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
-
ObjectGraph: From Document Injection to Knowledge Traversal -- A Native File Format for the Agentic Era
ObjectGraph is a Markdown superset file format that represents documents as traversable knowledge graphs, achieving up to 95.3% token reduction for agents with no significant accuracy loss.
-
Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations
Grounding LLMs via node-wise anchors in a traffic scenario taxonomy improves law-scenario matching by 29.1% and derived requirement accuracy by 36.9-38.2% on Chinese laws and 5,897 scenarios, enabling a compliance layer and real-time monitor for AVs.
-
DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning
DW-Bench shows tool-augmented LLMs outperform static ones on data warehouse graph reasoning but plateau on hard compositional question subtypes.
-
EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval
EHRAG constructs structural hyperedges from sentence co-occurrence and semantic hyperedges from entity embedding clusters, then applies hybrid diffusion plus topic-aware PPR to retrieve top-k documents, outperforming baselines on four datasets with linear indexing cost and zero token overhead.
-
GAM: Hierarchical Graph-based Agentic Memory for LLM Agents
GAM decouples event-level memory encoding from topic-level consolidation in LLM agents using hierarchical graphs to reduce interference and improve long-term coherence and retrieval.
-
Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
-
HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling
HyMem introduces dual-granular memory storage with a lightweight summary module for fast responses and selective activation of a deep LLM module for complex queries, outperforming full-context baselines at 92.6% lower computational cost on the LOCOMO and LongMemEval benchmarks.
-
ARIA: A Causal-Aware Framework for Rescuing LLM Reasoning in Trustworthy Materials Discovery
ARIA is a three-tier causal framework that conditions LLM knowledge use on mechanistic completeness for forward prediction and inverse design of 2D materials, producing auditable traces.
-
What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory
Geometry-led weighting outperforms blended memory recall for spatial queries, and a DDA-based visibility predicate correctly flags occluded targets while recall remains occlusion-blind.
-
Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs
An empirical comparison on a small industrial KG finds that vector retrieval fails on structural queries, while an LLM planner with typed graph operators achieves higher F1 and generalizes to unseen queries.
-
Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.
-
Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering
The paper presents the RegOps-Bench benchmark and the RefWalk framework for citation-closure retrieval and per-rule attribution in regulatory compliance QA, reporting substantial gains in recall and citation accuracy over baselines.
-
CogniFold: Always-On Proactive Memory via Cognitive Folding
CogniFold extends Complementary Learning Systems theory to three layers with a prefrontal intent layer and uses graph self-organization to build proactive agent memory from continuous event streams.
-
SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making
SKG-VLA models each complaint as a structured scene via a Scene Knowledge Graph to improve policy-grounded multimodal reasoning and decision accuracy.
-
Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs
Grokers architecture performs bottom-up inductive comprehension over typed KGs at write time via LM agents, with three claimed formal theorems on byte-identity, accumulation monotonicity, and dual-traversal ordering, plus a deterministic synonym-caching search alternative.
-
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
-
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.