pith. machine review for the scientific record.

arxiv: 2604.04949 · v1 · submitted 2026-03-30 · 💻 cs.IR · cs.AI · cs.CL

Recognition: no theorem link

Learning to Retrieve from Agent Trajectories

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 01:44 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords agent trajectories · information retrieval · LLM agents · learning to rank · relevance labeling · agentic search · behavioral signals · multi-turn interactions

The pith

Training retrieval models directly on agent trajectories improves evidence recall and task success for LLM search agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Information retrieval has relied on human logs such as clicks, yet LLM agents issue queries and consume results inside multi-turn reasoning loops that differ sharply from human behavior. The paper shows that behavioral signals inside agent trajectories, including which documents are browsed, which are rejected without browsing, and the reasoning traces that follow browsing, can be mined to create relevance labels. The LRAT framework uses these signals with weighted optimization to train retrievers. Experiments on in-domain and out-of-domain research benchmarks report gains in evidence recall, end-to-end task success, and execution efficiency that hold across agent architectures and scales. A sympathetic reader would care because retrieval is now embedded inside agent loops, so supervision that matches agent behavior offers a scalable alternative to human data.

Core claim

By systematically analyzing search agent trajectories, the authors identify browsing actions, unbrowsed rejections, and post-browse reasoning traces as signals that reveal document utility. The LRAT framework extracts these signals to generate high-quality retrieval supervision and applies weighted optimization to train retrievers. When retrievers trained under this paradigm are plugged into diverse agent architectures, they produce higher evidence recall, higher end-to-end task success rates, and greater execution efficiency on both in-domain and out-of-domain deep research benchmarks.

What carries the argument

LRAT framework that mines relevance labels from browsing actions, unbrowsed rejections, and post-browse reasoning traces in agent trajectories and incorporates relevance intensity via weighted optimization.
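The review never shows how these behavioral signals become training labels. As a minimal sketch (the `Step` fields, the 0/1/2 grading, and the max-merge across steps are our assumptions, not the authors' implementation), the mining step might look like:

```python
from dataclasses import dataclass

@dataclass
class Step:
    query: str
    retrieved: list           # doc ids returned for this query
    browsed: list             # doc ids the agent chose to open
    cited_after_browse: list  # doc ids referenced in post-browse reasoning

def mine_labels(trajectory):
    """Map each (query, doc) pair in a trajectory to a graded relevance label."""
    labels = {}
    for step in trajectory:
        for doc in step.retrieved:
            if doc in step.cited_after_browse:
                grade = 2.0  # browsed and used in reasoning: strong positive
            elif doc in step.browsed:
                grade = 1.0  # browsed but never cited: weak positive
            else:
                grade = 0.0  # surfaced yet left unbrowsed: rejection signal
            key = (step.query, doc)
            labels[key] = max(labels.get(key, 0.0), grade)
    return labels

traj = [Step("what is learning to rank?",
             retrieved=["d1", "d2", "d3"],
             browsed=["d1", "d2"],
             cited_after_browse=["d1"])]
labels = mine_labels(traj)  # d1 -> 2.0, d2 -> 1.0, d3 -> 0.0
```

Graded labels, rather than binary ones, are what give the weighted optimization a relevance intensity to weight by.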

If this is right

  • Retrievers achieve measurably higher evidence recall on deep research tasks.
  • Agents that use the resulting retrievers complete more tasks successfully from start to finish.
  • Execution becomes more efficient, requiring fewer steps or less time to reach correct answers.
  • The gains appear on both in-domain and out-of-domain benchmarks and across agent architectures of varying scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Agent trajectories collected during normal operation could serve as an ongoing, low-cost source of training data without separate human labeling.
  • The same trajectory-mining approach might be extended to train other agent components such as planners that also rely on multi-step interaction records.
  • Retrieval models could be updated periodically by feeding fresh trajectories back into LRAT, allowing adaptation as agent behaviors shift over time.

Load-bearing premise

Behavioral signals extracted from agent trajectories supply high-quality, unbiased relevance labels that generalize beyond the specific agent architectures used to generate the trajectories.

What would settle it

The central claim would be falsified by a test in which retrievers trained with LRAT on trajectories from one family of agents are evaluated with agents that use markedly different query-issuance or decision styles and show no gain in recall or task success.

read the original abstract

Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, retrieval is increasingly consumed by agents rather than human beings, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions exhibit a fundamental mismatch with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, where supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LRAT, a new training paradigm for retrieval models that derives supervision directly from multi-step agent interaction trajectories rather than human logs. It identifies behavioral signals (browsing actions, unbrowsed rejections, post-browse reasoning traces) to mine relevance labels, incorporates relevance intensity via weighted optimization, and reports consistent gains in evidence recall, end-to-end task success, and execution efficiency on in-domain and out-of-domain deep research benchmarks across diverse agent architectures and scales.

Significance. If the central results hold under independent verification, the work offers a scalable alternative to human-centric supervision for retrieval in agentic search systems. It directly addresses the mismatch between traditional IR assumptions and LLM-powered agent loops, with potential to improve downstream agent performance without requiring new human annotations.

major comments (2)
  1. [Abstract] The abstract asserts gains 'across diverse agent architectures' but provides no description of trajectory collection protocols that would ensure independence (e.g., use of distinct retrieval modules, backbones, or prompt regimes during data generation). This leaves open the risk that behavioral signals are entangled with the generating agent's policy, undermining the claim that labels are architecture-agnostic and generalizable.
  2. [Experimental Setup] The weakest assumption—that browsing actions, unbrowsed rejections, and post-browse traces yield high-quality, unbiased relevance labels—requires explicit controls or ablations showing that label noise from agent-specific heuristics does not drive the reported improvements. Without such analysis, out-of-domain gains may reflect self-reinforcement rather than external validity.
minor comments (2)
  1. [Abstract] The abstract would benefit from naming the specific in-domain and out-of-domain benchmarks and reporting effect sizes or statistical significance for the claimed improvements.
  2. [Methods] Notation for relevance intensity weighting should be introduced with an equation or pseudocode in the methods section to clarify how it differs from standard learning-to-rank losses.
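To make the requested pseudocode concrete: one plausible shape for relevance-intensity weighting (a hypothetical illustration, not the paper's actual loss; the temperature `tau` and grade scaling are assumed) is a softmax contrastive objective whose positive terms are scaled by the mined grade:

```python
import math

def weighted_contrastive_loss(scores, grades, tau=1.0):
    """Hypothetical relevance-intensity-weighted loss for one query.

    scores: retriever similarity score per candidate document
    grades: mined relevance grade per candidate (0 = unbrowsed rejection)
    Each positive (grade > 0) contributes a -log softmax term scaled by
    its grade, so strongly signaled documents dominate the gradient;
    zero-grade documents act purely as negatives in the denominator.
    """
    denom = sum(math.exp(s / tau) for s in scores)
    loss, total_weight = 0.0, 0.0
    for s, g in zip(scores, grades):
        if g > 0:
            loss += -g * math.log(math.exp(s / tau) / denom)
            total_weight += g
    return loss / total_weight if total_weight else 0.0
```

Under this form, a grade-2 document pulls the retriever twice as hard as a grade-1 document, which is one way "relevance intensity" could differ from a standard learning-to-rank loss with uniform positives.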

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and will revise the manuscript to incorporate clarifications and additional analysis.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts gains 'across diverse agent architectures' but provides no description of trajectory collection protocols that would ensure independence (e.g., use of distinct retrieval modules, backbones, or prompt regimes during data generation). This leaves open the risk that behavioral signals are entangled with the generating agent's policy, undermining the claim that labels are architecture-agnostic and generalizable.

    Authors: We agree that the abstract would benefit from a concise description of the trajectory collection process to better support the architecture-agnostic claim. In the revised version, we will expand the abstract to note that trajectories were generated using multiple independent agent configurations with distinct retrieval modules, backbones, and prompt regimes. These protocols are already specified in Section 3.2 of the manuscript, which details the multi-agent data collection to minimize policy entanglement. revision: yes

  2. Referee: [Experimental Setup] The weakest assumption—that browsing actions, unbrowsed rejections, and post-browse traces yield high-quality, unbiased relevance labels—requires explicit controls or ablations showing that label noise from agent-specific heuristics does not drive the reported improvements. Without such analysis, out-of-domain gains may reflect self-reinforcement rather than external validity.

    Authors: We acknowledge that further validation of label quality would strengthen the paper. While the manuscript already reports consistent out-of-domain gains across benchmarks, we agree that targeted ablations are warranted. In the revision, we will add new analysis in Section 5.3 with controls that vary agent heuristics and compare against perturbed labels, to demonstrate that gains are not attributable to self-reinforcement or agent-specific noise. revision: yes

Circularity Check

0 steps flagged

No circularity: supervision mined from external trajectories, not fitted by construction

full rationale

The paper derives retrieval supervision from behavioral signals in agent trajectories (browsing actions, unbrowsed rejections, post-browse traces) and applies weighted optimization in LRAT. No equations reduce the claimed improvements in recall or task success to a fitted parameter or self-defined quantity. No self-citations are invoked as uniqueness theorems or to smuggle ansatzes; the method is presented as a new paradigm trained on external interaction data. Experiments on in-domain and out-of-domain benchmarks provide independent validation rather than tautological confirmation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that agent behavioral signals accurately reflect document utility for retrieval purposes. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Behavioral signals from agent trajectories (browsing actions, unbrowsed rejections, post-browse reasoning traces) reliably indicate document relevance for training retrieval models.
    This assumption allows the derivation of supervision signals from trajectories without human annotations.

pith-pipeline@v0.9.0 · 5569 in / 1192 out tokens · 32108 ms · 2026-05-14T01:44:20.728844+00:00 · methodology

