ROZA Graphs: Self-Improving Near-Deterministic RAG through Evidence-Centric Feedback
Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3
The pith
ROZA graphs let RAG systems reuse prior evidence judgments to raise accuracy and cut variance while keeping the base model frozen.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a self-improving ROZA graph emerges from combining two structures: reasoning graphs, which persist per-evidence chain-of-thought evaluations as traversable edges for evidence-centric feedback, and retrieval graphs, which prune consistently rejected candidates. This structure is claimed to produce monotonic accuracy gains that scale with evidence-profile coverage, reaching +10.6 percentage points over vanilla RAG at 50 percent or higher coverage on identical questions (a 47 percent error reduction), with larger gains on multi-hop questions and substantially higher decision consistency across runs. Throughout, the base model remains frozen; all gains derive solely from graph traversal.
What carries the argument
ROZA graph, a dual structure of reasoning graphs that link evaluation edges to specific evidence for feedback traversal and retrieval graphs that prune poor candidates over runs, which drives accuracy scaling with gold-passage reuse and efficiency scaling with candidate overlap.
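The dual structure can be sketched as a small data model. This is an illustrative reconstruction, not the paper's actual schema: the field names, the verdict vocabulary, and the fixed rejection threshold are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Judgment:
    """One persisted per-evidence evaluation from a prior run (hypothetical schema)."""
    run_id: str
    verdict: str     # e.g. "supports" or "irrelevant" (assumed vocabulary)
    rationale: str   # the chain-of-thought text attached to this evidence item

class RozaGraph:
    """Minimal sketch of the dual structure: a reasoning graph keyed by
    evidence id, plus a retrieval-graph rejection counter used for pruning."""

    def __init__(self, prune_threshold: int = 3):
        self.reasoning_edges: dict[str, list[Judgment]] = defaultdict(list)
        self.reject_counts: dict[str, int] = defaultdict(int)
        self.prune_threshold = prune_threshold

    def record(self, evidence_id: str, j: Judgment) -> None:
        """Persist one evaluation edge; track consistent rejections."""
        self.reasoning_edges[evidence_id].append(j)
        if j.verdict == "irrelevant":
            self.reject_counts[evidence_id] += 1

    def feedback(self, evidence_id: str) -> list[Judgment]:
        """Evidence-centric feedback: all incoming evaluation edges
        for this specific item, across every prior run."""
        return list(self.reasoning_edges[evidence_id])

    def prune(self, candidates: list[str]) -> list[str]:
        """Retrieval-graph pruning: drop candidates rejected at least
        `prune_threshold` times in prior runs."""
        return [c for c in candidates
                if self.reject_counts[c] < self.prune_threshold]
```

The point of the sketch is the access pattern: feedback is keyed by the evidence item, not by query similarity, which is the distinction the paper draws against distilled-strategy memories.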
If this is right
- Accuracy rises monotonically with increasing evidence-profile coverage on the same questions.
- 4-hop question accuracy improves by 11.0 percentage points.
- Gains at the cluster level are predicted by the density of gold-passage reuse.
- High-reuse deployments achieve top accuracy together with 46 percent lower cost and latency.
- Per-passage decision consistency across repeated runs increases by 8 to 21 percentage points.
Where Pith is reading between the lines
- The same graph structure could accumulate useful memory in other agent loops that repeat similar subtasks.
- Graphs built on one model family might transfer partial value when swapping to another family without full retraining.
- Explicit edge-based reuse of reasoning could substitute for some increases in model size on repeated tasks.
- Extending the graphs to non-QA tasks like code generation or planning would test whether the reuse benefit generalizes.
Load-bearing premise
The measured accuracy and consistency improvements are produced by the evidence-centric graph traversal and pruning rather than by longer contexts, changed prompts, or dataset quirks.
What would settle it
An experiment that matches total context length and prompt text exactly but disables graph traversal and pruning, then checks whether the accuracy, error reduction, and consistency gains vanish.
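A minimal sketch of the matched-budget control described above: both conditions receive prior-run text trimmed to the same token budget, so any accuracy gap cannot be explained by context volume alone. The function name, its arguments, and the whitespace token count (standing in for a real tokenizer) are illustrative assumptions.

```python
def matched_length_context(structured_notes: list[str],
                           unstructured_notes: list[str],
                           token_budget: int) -> tuple[str, str]:
    """Build the graph-traversal context and the ablation context to the
    same token budget, so a comparison isolates structure from volume.

    structured_notes:   judgments surfaced by graph traversal (treatment)
    unstructured_notes: raw prior-run CoT text, no selection (control)
    """
    def fit(notes: list[str]) -> str:
        # Whitespace tokenization is a stand-in for the model tokenizer.
        tokens: list[str] = []
        for note in notes:
            tokens.extend(note.split())
        return " ".join(tokens[:token_budget])

    return fit(structured_notes), fit(unstructured_notes)
```

With prompt template and passage count also held fixed, the remaining difference between the two arms is exactly the traversal-and-pruning logic the causal claim rests on.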
Original abstract
Language model agents reason from scratch on every query, discarding their chain of thought after each run. The result is lower accuracy and high run-to-run variance. We introduce reasoning graphs, which persist the per-evidence chain of thought as structured edges. Unlike prior memory that retrieves distilled strategies by query similarity, reasoning graphs enable evidence-centric feedback: for every candidate item, the system traverses all incoming evaluation edges across prior runs to surface how that specific item has been judged before. We further introduce retrieval graphs, which feed a planner that prunes consistently-rejected candidates over successive runs. Together they form a ROZA graph: a self-improving feedback loop in which accuracy gains scale with gold-passage reuse (reasoning graph) and efficiency gains scale with candidate-pool overlap (retrieval graph). The base model remains frozen; all gains come from context engineering via graph traversal. We evaluate on MuSiQue and HotpotQA, plus a high-reuse deployment subset. Four findings stand out. (1) Dose-response: accuracy improves monotonically with evidence-profile coverage, reaching +10.6pp over Vanilla RAG at 50%+ coverage on the same questions (47% error reduction, $p<0.0001$; per-question Spearman $\rho=+0.144$, $p<10^{-6}$, $n=1{,}100$). (2) Multi-hop scaling: 4-hop accuracy improves by +11.0pp ($p=0.0001$). (3) Cross-cluster prediction: the cluster-level gain is predicted by gold-passage reuse density ($r=0.604$, $p=0.001$, $n=26$ clusters). (4) High-reuse Pareto dominance: highest or tied-for-highest accuracy alongside 46% lower cost and 46% lower latency. Per-passage decision consistency across repeated runs ($N=73$ paired probes, $K=10$ runs each, two model families, three temperatures) rises by +8 to +13pp on a fixed 20-passage context and by +12 to +21pp when the retrieval graph also prunes (all $p<0.005$).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ROZA graphs, which combine reasoning graphs (persisting per-evidence chain-of-thought judgments as structured edges for evidence-centric feedback) and retrieval graphs (pruning consistently rejected candidates via a planner). This forms a self-improving feedback loop for RAG where accuracy scales with gold-passage reuse and efficiency with candidate-pool overlap. All gains are attributed to context engineering via graph traversal with a frozen base model. Evaluations on MuSiQue, HotpotQA, and a high-reuse subset report dose-response accuracy gains (+10.6pp at 50%+ coverage, 47% error reduction, p<0.0001), multi-hop scaling (+11.0pp), cross-cluster predictions (r=0.604), Pareto dominance in cost/latency, and consistency improvements (+8 to +21pp across runs).
Significance. If the accuracy, consistency, and efficiency gains are causally due to the evidence-centric traversal and pruning mechanics rather than incidental factors, the work provides a structured, reusable memory mechanism that could meaningfully improve deterministic behavior in RAG and agent systems. The dose-response curves, Spearman correlations, cluster-level analysis, and high-reuse Pareto results offer falsifiable empirical patterns on standard multi-hop QA benchmarks. The parameter-free framing (no model updates) and focus on persisted judgments distinguish it from similarity-based memory approaches.
major comments (2)
- [Abstract] Abstract: The central attribution that 'all gains come from context engineering via graph traversal' (with monotonic accuracy scaling by evidence-profile coverage) is load-bearing but not isolated from confounds. Higher coverage inherently supplies more historical CoT text to the prompt; without an ablation that holds total context length, number of passages, and prompt template fixed while removing only the traversal/pruning logic (e.g., unstructured concatenation of prior runs), it remains possible that gains arise from increased context volume or prompt changes rather than the graph structure itself. This directly affects the causal claim for the +10.6pp and consistency results.
- [Results] Results (dose-response and consistency sections): The reported per-question Spearman correlation (rho=+0.144) and consistency gains (+8 to +21pp, p<0.005) across N=73 probes do not rule out the context-length confound, as the manuscript provides no matched-length baseline that substitutes non-graph prior-run text. This weakens the claim that gains scale specifically with gold-passage reuse density and retrieval-graph pruning.
minor comments (2)
- [Methods] The definition of 'evidence-profile coverage' and how exactly incoming evaluation edges are traversed and surfaced in the prompt should be formalized with pseudocode or an equation to improve reproducibility.
- [Figures/Tables] Figure captions and table legends could more explicitly state the exact prompt templates and token budgets used for each condition to allow direct comparison with Vanilla RAG.
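For intuition on the first minor comment, one plausible formalization of "evidence-profile coverage", assuming it means the fraction of a question's gold passages that already carry at least one evaluation edge in the reasoning graph, could read as follows. The definition is a guess at the paper's intent, not the paper's own.

```python
def evidence_profile_coverage(gold_passage_ids: set[str],
                              judged_ids: set[str]) -> float:
    """Fraction of this question's gold passages that already have at
    least one incoming evaluation edge from prior runs (assumed definition).

    gold_passage_ids: gold evidence for the question
    judged_ids:       passages with persisted judgments in the graph
    """
    if not gold_passage_ids:
        return 0.0
    return len(gold_passage_ids & judged_ids) / len(gold_passage_ids)
```

Under this reading, the paper's 50%+ coverage bucket would contain questions for which at least half the gold passages were judged in some earlier run.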
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the causal attribution in our work. We address each major comment below and indicate planned revisions to the manuscript.
Point-by-point responses
Referee: [Abstract] Abstract: The central attribution that 'all gains come from context engineering via graph traversal' (with monotonic accuracy scaling by evidence-profile coverage) is load-bearing but not isolated from confounds. Higher coverage inherently supplies more historical CoT text to the prompt; without an ablation that holds total context length, number of passages, and prompt template fixed while removing only the traversal/pruning logic (e.g., unstructured concatenation of prior runs), it remains possible that gains arise from increased context volume or prompt changes rather than the graph structure itself. This directly affects the causal claim for the +10.6pp and consistency results.
Authors: We agree that an ablation isolating the graph traversal from raw context volume is necessary to strengthen the causal claim. The ROZA mechanism selectively traverses and includes only evidence-specific prior judgments via the graph edges, rather than indiscriminately adding all prior CoT text. This targeted inclusion is a key distinction from unstructured concatenation. However, we acknowledge the current results do not fully control for total token count. In the revised manuscript, we will add a baseline condition that concatenates prior-run CoTs without graph-based selection or pruning, while matching the average context length to the ROZA condition. This will allow direct comparison of the structured traversal effect. revision: yes
Referee: [Results] Results (dose-response and consistency sections): The reported per-question Spearman correlation (rho=+0.144) and consistency gains (+8 to +21pp, p<0.005) across N=73 probes do not rule out the context-length confound, as the manuscript provides no matched-length baseline that substitutes non-graph prior-run text. This weakens the claim that gains scale specifically with gold-passage reuse density and retrieval-graph pruning.
Authors: The per-question Spearman correlation measures the relationship between gold-passage reuse density and accuracy improvement on the same questions, which is designed to link gains to evidence reuse rather than overall context size. Nevertheless, we recognize that without a matched-length control, alternative explanations remain possible. We will incorporate the proposed ablation in the revision, ensuring that the non-graph baseline uses equivalent total context length by sampling or truncating prior text as needed. This should clarify whether the structured feedback and pruning provide benefits beyond volume. revision: yes
Circularity Check
No circularity in claimed derivations
Full rationale
The paper reports empirical dose-response relationships (accuracy vs. evidence-profile coverage, gold-passage reuse density correlations) on standard MuSiQue and HotpotQA benchmarks, with all gains attributed to graph traversal on persisted judgments while keeping the base model frozen. No equations, definitions, or self-citations reduce any reported lift, prediction, or uniqueness claim to a quantity defined by the same fitted inputs or prior author work; the central results remain independent statistical observations rather than tautological renamings or self-referential fits.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Persisting per-evidence chain-of-thought as graph edges enables useful feedback that improves future retrieval and reasoning
invented entities (1)
- ROZA graph (no independent evidence)
Reference graph
Works this paper leans on
- [4] Agnar Aamodt and Enric Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7: 39--59, 1994.
- [5] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
- [6] Darren Edge, Ha Trinh, Newman Larson, and Cody Truitt. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.
- [7] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
- [8] Janet Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993.
- [9] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459--9474, 2020.
- [10] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3--4): 293--321, 1992.
- [11] Aman Mehta. When agents disagree with themselves: Measuring behavioral consistency in LLM-based agents. arXiv preprint arXiv:2602.11619, 2026a.
- [12] Aman Mehta. Consistency amplifies: How behavioral variance shapes agent accuracy. arXiv preprint arXiv:2603.25764, 2026b.
- [13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529--533, 2015.
- [14] Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. ReasoningBank: Scaling agent self-evolving with reasoning memory. In International Conference on Learning Representations, 2026.
- [15] Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.
- [16] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Yan, and Youzhi Li. Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921, 2024.
- [17] Vishwa Shah, Vishruth Veerendranath, Graham Neubig, Daniel Fried, and Zora Zhiruo Wang. Exploring the pre-conditions for memory-learning agents. In ICLR 2025 Workshop on Self-Improving Foundation Models, 2025.
- [18] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [19] Jiangming Shu, Yuxiang Zhang, Ye Ma, Xueyuan Lin, and Jitao Sang. Evaluate-as-action: Self-evaluated process rewards for retrieval-augmented agents. arXiv preprint arXiv:2603.09203, 2026.
- [20] Theodore R Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L Griffiths. Cognitive architectures for language agents. Transactions on Machine Learning Research, 2024.
- [21] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10: 539--554, 2022.
- [22] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
- [23] Yifan Wang, Mingxuan Jiang, Zhihao Sun, Yixin Cao, Yicun Liu, Keyang Chen, Guangnan Ye, and Hongfeng Chai. GAM-RAG: Gain-adaptive memory for evolving retrieval in retrieval-augmented generation. arXiv preprint arXiv:2603.01783, 2026.
- [24] Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, and Jinsong Su. When to use graphs in RAG: A comprehensive analysis for graph retrieval-augmented generation. In International Conference on Learning Representations, 2026.
- [25] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380, 2018.
- [26] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.
- [27] Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. In International Conference on Learning Representations, 2026.
- [28] Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, and Xiao Huang. LinearRAG: Linear graph retrieval augmented generation on large-scale corpora. In International Conference on Learning Representations, 2026.