arxiv: 2511.07328 · v2 · submitted 2025-11-10 · 💻 cs.LG · cs.IR

Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training

Artyom Sorokin , Nazar Buzun , Alexander Anokhin , Oleg Inozemcev , Egor Vedernikov , Petr Anokhin , Mikhail Burtsev , Trushkov Alexey

show 2 more authors

Yin Wenshuai Evgeny Burnaev

This is my paper

Pith reviewed 2026-05-17 23:24 UTC · model grok-4.3

classification 💻 cs.LG cs.IR

keywords Retrieval-Augmented GenerationMulti-step RetrievalReinforcement LearningEmbedder TrainingLong ContextQuestion AnsweringBabiLongRULER

0 comments

The pith

Q-RAG trains only the embedder with reinforcement learning to enable multi-step retrieval for long-context question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Q-RAG as a method that applies reinforcement learning directly to the embedder model rather than to a language model. This produces a system capable of multi-step retrieval for complex questions while keeping the main LLM unchanged. Prior multi-step retrieval techniques require expensive fine-tuning of small LLMs, which limits their use with larger models and increases overall cost. Q-RAG reports state-of-the-art results on the BabiLong and RULER benchmarks for contexts up to 10 million tokens. A reader would care because the approach promises to make chained retrieval practical without the heavy training overhead of existing alternatives.

Core claim

Q-RAG fine-tunes the Embedder model for multi-step retrieval using reinforcement learning. This yields a competitive and resource-efficient alternative to existing multi-step retrieval methods that fine-tune small LLMs, and it achieves state-of-the-art results on the long-context benchmarks BabiLong and RULER for contexts up to 10M tokens.

What carries the argument

Value-based reinforcement learning applied to the embedder, which learns to select and chain relevant passages across retrieval steps for a given query.

If this is right

Multi-step retrieval becomes feasible without updating the parameters of the main language model.
Training resource requirements drop because only the embedder is updated instead of a full LLM.
Larger, more capable LLMs can be used directly in RAG pipelines without custom fine-tuning.
Effective context lengths up to 10 million tokens become practical on standard long-context QA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same RL training recipe could be applied to improve accuracy even in single-step retrieval settings.
Q-RAG might combine with other long-context compression or attention techniques to extend usable context further.
Real-world performance would depend on whether the learned retrieval policy transfers to noisy or domain-shifted queries outside the benchmark distributions.
Different reward formulations or policy optimization methods for the embedder could be tested to strengthen the multi-step behavior.

Load-bearing premise

Reinforcement learning applied to the embedder will produce reliable multi-step retrieval behavior that generalizes from the reported benchmarks to arbitrary open-domain questions without additional LLM fine-tuning.

What would settle it

A new benchmark of complex multi-hop questions in 1M+ token contexts where Q-RAG retrieval accuracy falls below that of single-step baselines or LLM-fine-tuned multi-step systems.

Figures

Figures reproduced from arXiv: 2511.07328 by Alexander Anokhin, Artyom Sorokin, Egor Vedernikov, Evgeny Burnaev, Mikhail Burtsev, Nazar Buzun, Oleg Inozemcev, Petr Anokhin, Trushkov Alexey, Yin Wenshuai.

**Figure 2.** Figure 2: Comparison of answer accuracy on the long-context benchmark Babilong. Solid lines de [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Ablation for (a) policy entropy coefficient ( [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

read the original abstract

Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context for LLMs, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search. Recently, multi-step retrieval approaches have emerged, typically involving the fine-tuning of small LLMs to perform multi-step retrieval. This type of fine-tuning is highly resource-intensive and does not enable the use of larger LLMs. In this work, we propose Q-RAG, a novel approach that fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL). Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks BabiLong and RULER for contexts up to 10M tokens. Code is available at https://github.com/griver/Q-RAG

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Q-RAG, a method that applies value-based reinforcement learning to fine-tune only the embedder model for multi-step retrieval in RAG pipelines. It positions the approach as a resource-efficient alternative to fine-tuning small LLMs for handling complex open-domain questions and reports state-of-the-art results on the BabiLong and RULER long-context benchmarks for contexts up to 10M tokens.

Significance. If the central performance claims are substantiated, the work would be significant for enabling multi-step retrieval without LLM fine-tuning, thereby supporting larger base models and scaling to extremely long contexts. The public code release at the cited GitHub repository is a clear strength that aids reproducibility and follow-up work.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: The manuscript asserts SOTA results on BabiLong and RULER but provides no description of baselines, number of runs, error bars, training curves, or ablation studies on the RL components. This information is load-bearing for verifying whether the reported gains arise from genuine multi-step behavior induced by embedder-only RL.
[Method] Method section (RL objective): The reward signal and value estimation procedure are not shown to explicitly incentivize iterative query chaining across 10M-token contexts rather than single-step relevance; without this, it remains unclear whether the embedder learns reliable multi-step planning or whether results depend on benchmark-specific structure or implicit LLM capabilities.

minor comments (2)

[Abstract] The abstract would benefit from a brief mention of the key performance metrics or efficiency gains to better contextualize the SOTA claim for readers.
[Method] Notation for the value function and state representation in the RL formulation could be introduced with a small diagram or explicit equation for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have revised the manuscript to address the concerns regarding experimental details and the clarity of the RL objective, improving the substantiation of our claims without altering the core contributions.

read point-by-point responses

Referee: [Abstract / Experiments] Abstract and Experiments section: The manuscript asserts SOTA results on BabiLong and RULER but provides no description of baselines, number of runs, error bars, training curves, or ablation studies on the RL components. This information is load-bearing for verifying whether the reported gains arise from genuine multi-step behavior induced by embedder-only RL.

Authors: We agree that these experimental details are necessary to fully substantiate the SOTA claims. In the revised manuscript, we have expanded the Experiments section to describe the full set of baselines (including single-step RAG variants and prior multi-step methods), report results averaged over 5 independent runs with standard error bars, include training curves for the embedder RL process in the appendix, and add ablation studies on the RL components (e.g., value estimation and multi-step reward). These additions directly address whether the gains reflect genuine multi-step behavior from embedder-only training. revision: yes
Referee: [Method] Method section (RL objective): The reward signal and value estimation procedure are not shown to explicitly incentivize iterative query chaining across 10M-token contexts rather than single-step relevance; without this, it remains unclear whether the embedder learns reliable multi-step planning or whether results depend on benchmark-specific structure or implicit LLM capabilities.

Authors: We have substantially revised the Method section to provide a clearer exposition of the reward signal (cumulative retrieval utility across steps) and value estimation (temporal-difference updates on embedder actions). We include a step-by-step derivation illustrating how the objective favors query selections that enable subsequent retrievals in long contexts, along with qualitative analysis of retrieval trajectories on BabiLong showing iterative chaining behavior. While the original formulation was designed for this purpose, the expanded presentation makes the incentive structure explicit. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL training and benchmark evaluation form independent pipeline

full rationale

The paper describes Q-RAG as a reinforcement-learning procedure that fine-tunes an embedder model to produce multi-step retrieval queries, then evaluates the resulting system on the external BabiLong and RULER benchmarks. No equations, fitted parameters, or first-principles derivations are presented that would reduce the reported performance to a self-referential definition or to a quantity already fixed by the training objective. The central claim therefore rests on observable training dynamics and held-out benchmark scores rather than on any load-bearing self-citation, ansatz smuggling, or renaming of known results; the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)

domain assumption Reinforcement learning applied to embedder training can produce effective multi-step retrieval policies.
Implicit in the proposal that RL on the embedder is a viable substitute for LLM fine-tuning.

pith-pipeline@v0.9.0 · 5509 in / 1279 out tokens · 61231 ms · 2026-05-17T23:24:10.815948+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a new method for training a multi-step retrieval agent using temporal difference reinforcement learning... Q function is approximated using two embedders... Qθ(s, ai) = ⟨Es(s;θ1), Ea(ai, i;θ2)⟩
IndisputableMonolith/Foundation/ArrowOfTime.lean arrow_from_z unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

relative positional mapping ρt(i) ... partitions the document into k+1 disjoint intervals

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 12 internal anchors

[1]

Arigraph: Learning knowledge graph world models with episodic memory for llm agents.arXiv preprint arXiv:2407.04363,

Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burt- sev, and Evgeny Burnaev. Arigraph: Learning knowledge graph world models with episodic memory for llm agents.arXiv preprint arXiv:2407.04363,

work page arXiv
[2]

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Atlas: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025a

Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Raza- viyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time. arXiv preprint arXiv:2505.23735,

work page arXiv
[4]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. Learning to reason with search for llms via reinforce- ment learning.arXiv preprint arXiv:2503.19470,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

URLhttps://arxiv.org/abs/2312.00752. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning

Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Hao Peng, Julia Hockenmaier, and Tong Zhang. Rag-rl: Advancing retrieval-augmented generation via rl and curriculum learning.arXiv preprint arXiv:2503.12759,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Graphreader: Building graph-based agent to enhance long- context abilities of large language models

10 Under review as a conference paper Shilong Li, Yancheng He, Hangyu Guo, Xingyuan Bu, Ge Bai, Jie Liu, Jiaheng Liu, Xingwei Qu, Yangguang Li, Wanli Ouyang, et al. Graphreader: Building graph-based agent to enhance long- context abilities of large language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 12758–12786,

work page 2024
[10]

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models.arXiv preprint arXiv:2501.05366,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

A comprehensive survey on long context lan- guage modeling.arXiv preprint arXiv:2503.17407,

Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, et al. A comprehensive survey on long context lan- guage modeling.arXiv preprint arXiv:2503.17407,

work page arXiv
[12]

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle- mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al

URL https://arxiv.org/abs/2505.20099. V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle- mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533,

work page arXiv
[13]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngˆan V ˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Associative recurrent memory transformer.arXiv preprint arXiv:2407.04841,

Ivan Rodkin, Yuri Kuratov, Aydar Bulatov, and Mikhail Burtsev. Associative recurrent memory transformer.arXiv preprint arXiv:2407.04841,

work page arXiv
[15]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Longrope2: Near-lossless llm context window scaling.arXiv preprint arXiv:2502.20082,

Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, and Mao Yang. Longrope2: Near-lossless llm context window scaling.arXiv preprint arXiv:2502.20082,

work page arXiv
[17]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Replug: Retrieval-augmented black-box language models

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Replug: Retrieval-augmented black-box language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8364–8377,

work page 2024
[19]

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

URLhttps://arxiv.org/abs/2501.09136. Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

An Empirical Study of Mamba-based Language Models

Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of mamba- based language models.arXiv preprint arXiv:2406.07887,

work page internal anchor Pith review arXiv
[21]

Survey of specialized large language model

11 Under review as a conference paper Chenghan Yang, Ruiyu Zhao, Yang Liu, and Ling Jiang. Survey of specialized large language model. arXiv preprint arXiv:2508.19667,

work page arXiv
[22]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380,

work page 2018
[23]

In defense of rag in the era of long-context language models.arXiv preprint arXiv:2409.01666,

Tan Yu, Anbang Xu, and Rama Akkiraju. In defense of rag in the era of long-context language models.arXiv preprint arXiv:2409.01666,

work page arXiv
[24]

End-to-end beam retrieval for multi-hop question answering

Jiahao Zhang, Haiyang Zhang, Dongmei Zhang, Liu Yong, and Shen Huang. End-to-end beam retrieval for multi-hop question answering. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies (Volume 1: Long Papers), pp. 1718–1731,

work page 2024
[25]

ThenF(x, y, t) =⟨w, R tv⟩

Chooseh(x) =wandg(y) =v. ThenF(x, y, t) =⟨w, R tv⟩. Sincet7→R tv is injective (forv̸= 0and non-zero frequencies),R t1 v̸=R t2 v. Choosewnot orthogonal to Rt1 v−R t2 v, soF(x 1, y1, t1)̸=F(x 2, y2, t2). Thus, by the Stone-Weierstrass theorem,Ais dense inC(K,R). Theorem 1 establishes that our architecture is capable of approximating any continuous function ...

work page 1972