FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

Bryan Dai; Chuan Hao; Feng Chang; Jia Deng; Ji-Rong Wen; Ran Tao; Shuo Tang; Wayne Xin Zhao; Xiaoqing Xiang; Yimeng Chen

arXiv: 2606.12087 · v1 · pith:SM7XCKNNnew · submitted 2026-06-10 · 💻 cs.CL

Jia Deng, Yimeng Chen, Xiaoqing Xiang, Ziyang Zeng, Shuo Tang, Wayne Xin Zhao, Feng Chang, Chuan Hao, Yuan Wei, Ran Tao, Bryan Dai, Ji-Rong Wen

Pith reviewed 2026-06-27 09:59 UTC · model grok-4.3

keywords: deep search agents · shortcut-resistant synthesis · training data generation · supervised fine-tuning · search benchmarks · evidence graphs · AI agents

The pith

A shortcut-aware synthesis framework produces training tasks that force genuine deep search, enabling top benchmark performance via supervised fine-tuning alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep search agents require questions whose answers stay hidden until enough evidence is gathered, yet existing synthesis methods allow shortcuts that collapse the intended process into cheaper routes. The paper identifies four shortcut risks—evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding—and diagnoses them through trajectory signatures such as solving cost and answer hit time. It presents the FORT framework to control these risks during entity selection, graph construction, question formulation, and refinement. The resulting data yields trajectories with longer pre-answer search and reduced shortcut patterns. Supervised fine-tuning on these trajectories produces FORT-Searcher, which records the strongest results among open-source agents of comparable size on demanding search benchmarks.

Core claim

The FORT Framework of Shortcut-Resistant Training-Data Synthesis constructs training tasks by explicitly managing four shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. When trajectories from these tasks are used for supervised fine-tuning, the resulting FORT-Searcher model achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks while exhibiting longer pre-answer search and fewer shortcut patterns than models trained on prior datasets.

What carries the argument

The FORT Framework of Shortcut-Resistant Training-Data Synthesis, which controls four shortcut risks at each stage of data creation to ensure realized search difficulty matches intended difficulty.

If this is right

FORT-synthesized tasks produce trajectories with measurably longer pre-answer search segments than existing open-source deep search datasets.
Models trained on FORT trajectories display lower prior-shortcut rates and answer-hit times that align with full evidence traversal.
Supervised fine-tuning alone on FORT data is sufficient to reach the highest scores among open-source agents of similar size on deep search benchmarks.
The same synthesis pipeline can be applied to generate additional training sets without requiring reinforcement learning stages.
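
The trajectory signatures named above can be made concrete with a small sketch. The paper's exact definitions are not given on this page, so everything here is an assumption: the `Step` schema is hypothetical, solving cost is taken to be the number of tool calls, answer hit time is the first step whose observation contains the gold answer, and a prior shortcut is a trajectory in which the agent names the gold answer before any observation has contained it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One turn of a search trajectory (hypothetical schema, not from the paper)."""
    action: str                            # "search", "visit", or "answer"
    observation: str                       # tool output; empty for "answer" steps
    proposed_answer: Optional[str] = None  # answer named by the agent BEFORE seeing this step's observation


def solving_cost(steps: List[Step]) -> int:
    """Assumed definition: number of tool calls, i.e. non-answer steps."""
    return sum(1 for s in steps if s.action != "answer")


def answer_hit_time(steps: List[Step], gold: str) -> Optional[int]:
    """Assumed definition: 1-based index of the first step whose observation
    contains the gold answer, or None if it never appears."""
    needle = gold.lower()
    for i, s in enumerate(steps, start=1):
        if needle in s.observation.lower():
            return i
    return None


def is_prior_shortcut(steps: List[Step], gold: str) -> bool:
    """Assumed definition: the agent names the gold answer no later than the
    step at which retrieved evidence first contains it."""
    hit = answer_hit_time(steps, gold)
    needle = gold.lower()
    for i, s in enumerate(steps, start=1):
        if s.proposed_answer and needle in s.proposed_answer.lower():
            # The proposal at step i precedes the observation at step i.
            return hit is None or i <= hit
    return False


def prior_shortcut_rate(trajectories: List[Tuple[List[Step], str]]) -> float:
    """Fraction of (steps, gold) pairs flagged as prior shortcuts."""
    if not trajectories:
        return 0.0
    flagged = sum(1 for steps, gold in trajectories if is_prior_shortcut(steps, gold))
    return flagged / len(trajectories)
```

Substring matching on the gold answer is a deliberately crude stand-in; a real audit would likely need answer normalization or a judge model.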

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The trajectory-signature diagnostics could be turned into an automated filter for auditing and cleaning existing search datasets.
Search-agent evaluations may need to include explicit shortcut-blocking variants to confirm that high scores reflect genuine capability.
The stage-wise risk control approach could transfer to other agent training domains where cheap reasoning paths undermine intended difficulty.
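
As a rough illustration of the first extension, an audit filter could flag tasks whose recorded signatures suggest a shortcut. This is a sketch under stated assumptions, not the paper's procedure: the record keys (`num_steps`, `answer_hit_step`, `prior_shortcut`, `intended_hops`), the flag names, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def audit_task(sig: Dict, min_hit_fraction: float = 0.5) -> List[str]:
    """Return shortcut flags for one task's signature record.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    flags: List[str] = []
    if sig["prior_shortcut"]:
        flags.append("prior_shortcut")
    hit = sig["answer_hit_step"]
    steps = sig["num_steps"]
    # Answer surfaced in evidence very early relative to trajectory length.
    if hit is not None and steps > 0 and hit / steps < min_hit_fraction:
        flags.append("early_answer_hit")
    # Solved in fewer steps than the number of hops the task was designed for.
    if steps < sig["intended_hops"]:
        flags.append("under_cost")
    return flags


def filter_dataset(records: List[Dict]) -> Tuple[List[Dict], Dict[str, List[str]]]:
    """Split records into kept tasks and a map of flagged task id -> flags."""
    kept: List[Dict] = []
    flagged: Dict[str, List[str]] = {}
    for rec in records:
        flags = audit_task(rec)
        if flags:
            flagged[rec["id"]] = flags
        else:
            kept.append(rec)
    return kept, flagged
```

Keeping the flags alongside the task ids, rather than silently dropping tasks, leaves room for repairing a task instead of discarding it.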

Load-bearing premise

That controlling the four named shortcut risks and checking the listed trajectory signatures is enough to remove all shortcuts so that benchmark gains reflect actual search skill.

What would settle it

A test showing that FORT-Searcher performance collapses to average levels once all four shortcut types are explicitly blocked in the evaluation benchmarks.

Figures

Figures reproduced from arXiv: 2606.12087 by Bryan Dai, Chuan Hao, Feng Chang, Jia Deng, Ji-Rong Wen, Ran Tao, Shuo Tang, Wayne Xin Zhao, Xiaoqing Xiang, Yimeng Chen, Yuan Wei, Ziyang Zeng.

**Figure 2.** Overview of FORT, a shortcut-resistant synthesis pipeline.

**Figure 3.** Shortcut repair by broadening a cross-domain aviation reference and relaxing a precise runtime

**Figure 4.** Failure repair by narrowing over-fuzzed clues.

**Figure 5.** Diagnostic example of an evidence co-coverage shortcut.

**Figure 6.** Diagnostic example of a single-clue selectivity shortcut.

**Figure 7.** Diagnostic example of an exposed-constant shortcut.

**Figure 8.** Diagnostic example of a prior-knowledge binding shortcut.

Original abstract

Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph structures, but structural complexity alone does not guarantee realized search difficulty: the intended search process can collapse through a cheaper identifying route. We formalize this gap with a shortcut-aware difficulty framework and identify four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. To diagnose their realized effects, we use trajectory signatures including solving cost, answer hit time, and prior-shortcut rate. Guided by this framework, we introduce FORT, a Framework of Shortcut-Resistant Training-Data Synthesis. FORT constructs shortcut-resistant training data by controlling shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show that FORT induces longer pre-answer search and fewer shortcut patterns than existing open-source deep search datasets. Using the resulting trajectories, we train FORT-Searcher with supervised fine-tuning (SFT) only, and it achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks. Relevant resources will be made available at https://github.com/RUCAIBox/FORT-Searcher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FORT gives a concrete synthesis pipeline to block four named shortcuts when building search training data, and the SFT model beats other open-source agents on benchmarks, but the shortcut checks stop at the training set.

The paper formalizes four shortcut risks in search task synthesis, builds a pipeline called FORT to control them, and shows that the resulting data trains a stronger SFT agent than prior open-source options.

What is new is the shortcut-aware difficulty framework itself. It names evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding as concrete failure modes, and pairs them with three trajectory signatures (solving cost, answer hit time, prior-shortcut rate) to measure whether the intended search actually happens. FORT then applies controls at each stage of data creation: entity selection, evidence graph construction, question formulation, and adversarial refinement. This is a step past simply adding more nodes or edges, which the abstract correctly notes does not guarantee realized difficulty.
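
Two of the four risks lend themselves to a simple structural check. The sketch below is one hypothetical way to operationalize them and is not taken from the paper: evidence co-coverage is read as a single document supporting more than one clue, and single-clue selectivity as a clue matched by too few candidate entities. The input mappings and default thresholds are assumptions.

```python
from typing import Dict, List, Set


def co_coverage(clue_docs: Dict[str, Set[str]], max_clues_per_doc: int = 1) -> Dict[str, List[str]]:
    """Return documents that support more clues than allowed.

    clue_docs maps a clue id to the set of document ids that support it
    (hypothetical representation of an evidence graph).
    """
    doc_clues: Dict[str, List[str]] = {}
    for clue, docs in clue_docs.items():
        for doc in docs:
            doc_clues.setdefault(doc, []).append(clue)
    return {
        doc: sorted(clues)
        for doc, clues in doc_clues.items()
        if len(clues) > max_clues_per_doc
    }


def selective_clues(clue_candidates: Dict[str, Set[str]], min_candidates: int = 2) -> List[str]:
    """Return clues matched by fewer than min_candidates entities.

    A clue with a single matching entity identifies the answer on its own,
    so the remaining clues never need to be searched.
    """
    return sorted(
        clue for clue, entities in clue_candidates.items()
        if len(entities) < min_candidates
    )
```

For example, if one page supports both clue `c1` and clue `c2`, a single visit to that page collapses two intended hops into one, which is the kind of gap between apparent and realized difficulty the framework is described as targeting.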

The paper does what it sets out to do on the training side. FORT data produces longer pre-answer search and lower shortcut rates than existing datasets, and the model trained on those trajectories reports the best overall numbers among comparable open-source search agents.

The soft spot is exactly the one flagged in the stress-test note. The trajectory signatures are measured only during synthesis and on the training trajectories. Nothing in the abstract indicates the same analysis was run on the model's outputs when it answers the benchmark questions. If those benchmarks still contain the four shortcut patterns, the performance edge could come from exploiting them rather than from deeper search. The abstract also gives limited detail on baseline selection and statistical checks, which leaves the strength of the empirical claim harder to judge from the summary alone.

This is for groups training search agents who want training data that pushes past superficial routes. It has a practical method and reported gains that are worth a referee's time, even though the evaluation would benefit from direct shortcut checks on the test trajectories.

Referee Report

2 major / 2 minor

Summary. The paper introduces FORT, a Framework of Shortcut-Resistant Training-Data Synthesis for deep search agents. It formalizes four shortcut risks (evidence co-coverage, single-clue selectivity, exposed constants, prior-knowledge binding) and three trajectory signatures (solving cost, answer hit time, prior-shortcut rate) to diagnose when structural complexity fails to produce realized search difficulty. FORT controls these risks during entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show FORT data yields longer pre-answer search and fewer shortcut patterns than prior open-source datasets; SFT on the resulting trajectories produces FORT-Searcher, which reports the best overall performance among comparable-size open-source agents on challenging deep search benchmarks.

Significance. If the shortcut-resistance properties transfer to benchmark evaluation without residual exploitation, the work supplies a concrete, controllable method for generating training data that forces genuine multi-step evidence acquisition. This could reduce reliance on post-hoc filtering or larger models to compensate for dataset artifacts in search-agent training.

major comments (2)

[Experiments] Experiments section: the trajectory signatures (solving cost, answer hit time, prior-shortcut rate) and shortcut-risk reductions are reported exclusively for the FORT synthesis process and the training trajectories; no equivalent analysis is provided for FORT-Searcher outputs on the evaluation benchmarks. Because the central claim equates benchmark gains with genuine search improvement rather than undetected shortcuts, the absence of these diagnostics on the test distributions is load-bearing.
[Experiments] Experiments section and Table reporting benchmark results: baseline comparisons and statistical significance are not detailed for the performance edge of FORT-Searcher; without reported controls for model size, training data volume, or variance across runs, it is unclear whether the reported superiority is robust or attributable to the shortcut-resistant construction.

minor comments (2)

The four shortcut risks are introduced without an explicit enumeration or pseudocode in the main text; a compact table or algorithm box would improve traceability from risk definition to synthesis steps.
The GitHub link is mentioned but no commit hash or data-release details are provided; reproducibility would benefit from explicit versioning of the released trajectories and synthesis code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing rigorous validation of shortcut resistance. We respond to each major comment below and outline planned revisions.

Referee: [Experiments] Experiments section: the trajectory signatures (solving cost, answer hit time, prior-shortcut rate) and shortcut-risk reductions are reported exclusively for the FORT synthesis process and the training trajectories; no equivalent analysis is provided for FORT-Searcher outputs on the evaluation benchmarks. Because the central claim equates benchmark gains with genuine search improvement rather than undetected shortcuts, the absence of these diagnostics on the test distributions is load-bearing.

Authors: We agree that the absence of trajectory signatures on the evaluation benchmarks leaves the central claim partially unverified. In the revised manuscript we will log and report solving cost, answer hit time, and prior-shortcut rate for FORT-Searcher on the benchmark tasks (where full trajectories can be collected under the same evaluation protocol), thereby directly testing whether benchmark gains arise from genuine multi-step evidence acquisition rather than residual shortcuts. revision: yes

Referee: [Experiments] Experiments section and Table reporting benchmark results: baseline comparisons and statistical significance are not detailed for the performance edge of FORT-Searcher; without reported controls for model size, training data volume, or variance across runs, it is unclear whether the reported superiority is robust or attributable to the shortcut-resistant construction.

Authors: The manuscript already restricts comparisons to open-source agents of comparable size (approximately 7B parameters) trained via SFT. We will expand the experiments section and results table to explicitly state these controls, add statistical significance tests (e.g., bootstrap confidence intervals or paired tests), and report performance variance across multiple random seeds. These additions will clarify that the observed edge is attributable to the FORT data construction rather than confounding factors. revision: yes
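
The bootstrap confidence interval mentioned in the response could look roughly like the following. This is a generic paired percentile bootstrap over per-question correctness, offered as a sketch; the function name, defaults, and 0/1 scoring are assumptions and nothing here reflects the authors' actual evaluation code.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    a: List[int],
    b: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the accuracy difference between two agents.

    a[i] and b[i] are 0/1 correctness of agent A and agent B on question i.
    Returns (observed_diff, ci_low, ci_high).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a)
    observed = (sum(a) - sum(b)) / n
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, low, high
```

Resampling questions rather than agents keeps the comparison paired, so question difficulty shared by both systems does not inflate the interval. An interval that excludes zero would support the claimed edge on that benchmark, though it says nothing about variance across training seeds.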

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

The paper presents a methodological framework (FORT) for synthesizing training data with defined shortcut risks and trajectory signatures, then reports empirical results: longer pre-answer search on FORT data versus existing datasets, and superior benchmark performance after SFT. These outcomes are validated against independent open-source datasets and benchmarks rather than reducing to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or derivations are shown that loop back to inputs by construction, and the central performance claim is not forced by the synthesis process itself. This is the expected non-finding for an empirical synthesis paper whose validity hinges on external comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical effectiveness of controlling four shortcut risks rather than new mathematical axioms or fitted constants; no free parameters are explicitly introduced in the abstract, and invented entities are limited to the diagnostic framework itself.

axioms (1)

Domain assumption: Training on tasks that force longer search trajectories without shortcuts improves agent performance on deep search benchmarks.
This underpins the motivation and experimental interpretation stated in the abstract.

invented entities (1)

Four shortcut risks (evidence co-coverage, single-clue selectivity, exposed constants, prior-knowledge binding) · no independent evidence
Purpose: To formalize and control gaps between apparent and realized search difficulty in task synthesis.
These are introduced as actionable risks identified by the framework.

Reference graph

Works this paper leans on

91 extracted references · 2 canonical work pages

[1]

arXiv preprint arXiv:2509.13305 , year=

Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning , author=. arXiv preprint arXiv:2509.13305 , year=

arXiv
[2]

arXiv preprint arXiv:2507.02592 , year=

Websailor: Navigating super-human reasoning for web agent , author=. arXiv preprint arXiv:2507.02592 , year=

Pith/arXiv arXiv
[3]

arXiv preprint arXiv:2603.28376 , year=

Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design , author=. arXiv preprint arXiv:2603.28376 , year=

arXiv
[4]

Proceedings of the 28th International Conference on Computational Linguistics , pages=

Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps , author=. Proceedings of the 28th International Conference on Computational Linguistics , pages=
[5]

Advances in Neural Information Processing Systems , volume=

Webdancer: Towards autonomous information seeking agency , author=. Advances in Neural Information Processing Systems , volume=
[6]

arXiv preprint arXiv:2510.14438 , year=

Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents , author=. arXiv preprint arXiv:2510.14438 , year=

Pith/arXiv arXiv
[7]

arXiv preprint arXiv:2603.11076 , year=

DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use , author=. arXiv preprint arXiv:2603.11076 , year=

arXiv
[8]

arXiv preprint arXiv:2603.15594 , year=

Openseeker: Democratizing frontier search agents by fully open-sourcing training data , author=. arXiv preprint arXiv:2603.15594 , year=

arXiv
[9]

arXiv preprint arXiv:2510.13913 , year=

Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms , author=. arXiv preprint arXiv:2510.13913 , year=

arXiv
[10]

arXiv preprint arXiv:2402.11924 , year=

MRKE: The multi-hop reasoning evaluation of LLMs by knowledge edition , author=. arXiv preprint arXiv:2402.11924 , year=

arXiv
[11]

arXiv preprint arXiv:2511.01323 , year=

DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness , author=. arXiv preprint arXiv:2511.01323 , year=

arXiv
[12]

arXiv preprint arXiv:2412.17032 , year=

Mintqa: A multi-hop question answering benchmark for evaluating llms on new and tail knowledge , author=. arXiv preprint arXiv:2412.17032 , year=

arXiv
[13]

arXiv preprint arXiv:2509.10446 , year=

Deepdive: Advancing deep search agents with knowledge graphs and multi-turn rl , author=. arXiv preprint arXiv:2509.10446 , year=

arXiv
[14]

arXiv preprint arXiv:2603.15726 , year=

Mirothinker-1.7 & h1: Towards heavy-duty research agents via verification , author=. arXiv preprint arXiv:2603.15726 , year=

arXiv
[15]

arXiv preprint arXiv:2507.15061 , year=

Webshaper: Agentically data synthesizing via information-seeking formalization , author=. arXiv preprint arXiv:2507.15061 , year=

arXiv
[16]

arXiv preprint arXiv:2510.24701 , year=

Tongyi DeepResearch Technical Report , author=. arXiv preprint arXiv:2510.24701 , year=

Pith/arXiv arXiv
[17]

Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

HotpotQA: A dataset for diverse, explainable multi-hop question answering , author=. Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

2018
[18]

Transactions of the Association for Computational Linguistics , volume=

MuSiQue: Multihop Questions via Single-hop Question Composition , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

2022
[19]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

FanOutQA: A multi-hop, multi-document question answering benchmark for large language models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=
[20]

arXiv preprint arXiv:2504.12516 , year=

Browsecomp: A simple yet challenging benchmark for browsing agents , author=. arXiv preprint arXiv:2504.12516 , year=

Pith/arXiv arXiv
[21]

Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025
[22]

arXiv preprint arXiv:2602.14234 , year=

Redsearcher: A scalable and cost-efficient framework for long-horizon search agents , author=. arXiv preprint arXiv:2602.14234 , year=

arXiv
[23]

arXiv preprint arXiv:2605.04036 , year=

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories , author=. arXiv preprint arXiv:2605.04036 , year=

Pith/arXiv arXiv
[24]

arXiv preprint arXiv:2509.00375 , year=

Open data synthesis for deep research , author=. arXiv preprint arXiv:2509.00375 , year=

arXiv
[25]

Findings of the Association for Computational Linguistics: EACL 2026 , pages=

SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback , author=. Findings of the Association for Computational Linguistics: EACL 2026 , pages=

2026
[26]

arXiv preprint arXiv:2508.16994 , year=

GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation , author=. arXiv preprint arXiv:2508.16994 , year=

arXiv
[27]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[28]

arXiv preprint arXiv:2504.19314 , year=

Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese , author=. arXiv preprint arXiv:2504.19314 , year=

Pith/arXiv arXiv
[29]

arXiv preprint arXiv:2506.13651 , year=

xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations , author=. arXiv preprint arXiv:2506.13651 , year=

arXiv
[30]

arXiv preprint arXiv:2506.01062 , year=

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models , author=. arXiv preprint arXiv:2506.01062 , year=

Pith/arXiv arXiv
[31]

arXiv preprint arXiv:2511.11793 , year=

Mirothinker: Pushing the performance boundaries of open-source research agents via model, context, and interactive scaling , author=. arXiv preprint arXiv:2511.11793 , year=

Pith/arXiv arXiv
[32]

arXiv preprint arXiv:2601.03267 , year=

Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=

Pith/arXiv arXiv
[33]

2026 , howpublished =

2026
[34]

arXiv preprint arXiv:2602.15763 , year=

Glm-5: from vibe coding to agentic engineering , author=. arXiv preprint arXiv:2602.15763 , year=

Pith/arXiv arXiv
[35]

2: Pushing the frontier of open large language models , author=

Deepseek-v3. 2: Pushing the frontier of open large language models , author=. arXiv preprint arXiv:2512.02556 , year=

Pith/arXiv arXiv
[36]

arXiv preprint arXiv:2602.10604 , year=

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters , author=. arXiv preprint arXiv:2602.10604 , year=

arXiv
[37]

arXiv preprint arXiv:2601.16725 , year=

Longcat-flash-thinking-2601 technical report , author=. arXiv preprint arXiv:2601.16725 , year=

arXiv
[38]

2026 , note =

Tencent Hunyuan , title =. 2026 , note =

2026
[39]

5: Visual Agentic Intelligence , author=

Kimi K2. 5: Visual Agentic Intelligence , author=. arXiv preprint arXiv:2602.02276 , year=

Pith/arXiv arXiv
[40]

2026 , note =

Alibaba Qwen Team , title =. 2026 , note =

2026
[41]

2025 , note =

Z.ai , title =. 2025 , note =

2025
[42]

2026 , eprint=

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent , author=. 2026 , eprint=

2026
[43]

arXiv preprint arXiv:2502.04644 , volume=

Agentic reasoning: Reasoning llms with tools for the deep research , author=. arXiv preprint arXiv:2502.04644 , volume=

arXiv
[44]

arXiv preprint arXiv:2511.07685 , year=

Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents , author=. arXiv preprint arXiv:2511.07685 , year=

arXiv
[45]

arXiv preprint arXiv:2511.09907 , year=

Learning to pose problems: Reasoning-driven and solver-adaptive data synthesis for large reasoning models , author=. arXiv preprint arXiv:2511.09907 , year=

Pith/arXiv arXiv
[46]

arXiv preprint arXiv:2505.20416 , year=

GraphGen: enhancing supervised fine-tuning for LLMs with knowledge-driven synthetic data generation , author=. arXiv preprint arXiv:2505.20416 , year=

arXiv
[47]

arXiv preprint arXiv:2509.23252 , year=

NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning , author=. arXiv preprint arXiv:2509.23252 , year=

arXiv
[48]

arXiv preprint arXiv:2508.00965 , year=

Vault: Vigilant adversarial updates via llm-driven retrieval-augmented generation for nli , author=. arXiv preprint arXiv:2508.00965 , year=

arXiv
[49]

arXiv e-prints , pages=

Iterresearch: Rethinking long-horizon agents via markovian state reconstruction , author=. arXiv e-prints , pages=
[50]

arXiv preprint arXiv:2210.03629 , year=

React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=

Pith/arXiv arXiv
[51]

Proceedings of the 57th annual meeting of the association for computational linguistics , pages=

Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop QA , author=. Proceedings of the 57th annual meeting of the association for computational linguistics , pages=
[52]

arXiv preprint arXiv:2510.05137 , year=

Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics , author=. arXiv preprint arXiv:2510.05137 , year=

arXiv
[53]

arXiv preprint arXiv:2510.06534 , year=

Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them , author=. arXiv preprint arXiv:2510.06534 , year=

arXiv
[54]

arXiv preprint arXiv:2509.21710 , year=

Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval , author=. arXiv preprint arXiv:2509.21710 , year=

arXiv
[55]

Proceedings of the 2025 International Conference on Generative Artificial Intelligence for Business , pages=

Toward verifiable misinformation detection: A multi-tool LLM agent framework , author=. Proceedings of the 2025 International Conference on Generative Artificial Intelligence for Business , pages=

2025
[56]

arXiv preprint arXiv:2509.25189 , year=

Infoagent: Advancing autonomous information-seeking agents , author=. arXiv preprint arXiv:2509.25189 , year=

arXiv
[57]

arXiv preprint arXiv:2503.09516 , year=

Search-r1: Training llms to reason and leverage search engines with reinforcement learning , author=. arXiv preprint arXiv:2503.09516 , year=

Pith/arXiv arXiv
[58]

arXiv preprint arXiv:2602.23286 , year=

Sparta: Scalable and principled benchmark of tree-structured multi-hop qa over text and tables , author=. arXiv preprint arXiv:2602.23286 , year=

arXiv
[59]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Deepresearcher: Scaling deep research via reinforcement learning in real-world environments , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[60]

arXiv preprint arXiv:2603.00582 , year=

Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research , author=. arXiv preprint arXiv:2603.00582 , year=

arXiv
[61]

arXiv preprint arXiv:2510.05592 , year=

In-the-flow agentic system optimization for effective planning and tool use , author=. arXiv preprint arXiv:2510.05592 , year=

arXiv
[62]

arXiv preprint arXiv:2601.06021 , year=

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards , author=. arXiv preprint arXiv:2601.06021 , year=

arXiv
[63]

arXiv preprint arXiv:2603.04751 , year=

Evaluating the Search Agent in a Parallel World , author=. arXiv preprint arXiv:2603.04751 , year=

Pith/arXiv arXiv
[64]

arXiv preprint arXiv:2603.12458 , year=

Shattering the Shortcut: A Topology-Regularized Benchmark for Multi-hop Medical Reasoning in LLMs , author=. arXiv preprint arXiv:2603.12458 , year=

arXiv
[65]

arXiv preprint arXiv:2409.02257 , year=

Mmlu-pro+: Evaluating higher-order reasoning and shortcut learning in llms , author=. arXiv preprint arXiv:2409.02257 , year=

arXiv
[66]

arXiv preprint arXiv:2508.02085 , year=

Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents , author=. arXiv preprint arXiv:2508.02085 , year=

arXiv
[67]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Hydrarag: Structured cross-source enhanced large language model reasoning , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[68]

Improving LLM’s Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization , author=. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics , pages=
[69]

Knowledge-Based Systems , volume=

Prompting large language models with knowledge graphs for question answering involving long-tail facts , author=. Knowledge-Based Systems , volume=. 2025 , publisher=

2025
[70]

arXiv preprint arXiv:2602.18137 , year=

Agentic Adversarial QA for Improving Domain-Specific LLMs , author=. arXiv preprint arXiv:2602.18137 , year=

arXiv
[71]

arXiv preprint arXiv:2510.14278 , year=

Prism: Agentic retrieval with llms for multi-hop question answering , author=. arXiv preprint arXiv:2510.14278 , year=

arXiv
[72]

arXiv preprint arXiv:2604.23783 , year=

S2G-RAG: Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA , author=. arXiv preprint arXiv:2604.23783 , year=

Pith/arXiv arXiv
[73]

Proceedings of the ACM Web Conference 2026 , pages=

Less is more: Compact clue selection for efficient retrieval-augmented generation reasoning , author=. Proceedings of the ACM Web Conference 2026 , pages=

2026
[74]

arXiv preprint arXiv:2603.26074 , year=

Not All Entities are Created Equal: A Dynamic Anonymization Framework for Privacy-Preserving Retrieval-Augmented Generation , author=. arXiv preprint arXiv:2603.26074 , year=

Pith/arXiv arXiv
[75]

2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA) , pages=

A survey on current trends and recent advances in text anonymization , author=. 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA) , pages=. 2025 , organization=

2025
[76]

Wang, Xinyuan and Mao, Wenyu and Wu, Junkang and Wang, Xiang and He, Xiangnan , journal=. R\^
[77]

arXiv preprint arXiv:2602.12852 , year=

WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning , author=. arXiv preprint arXiv:2602.12852 , year=

Pith/arXiv arXiv
[78]

arXiv preprint arXiv:2604.04949 , year=

Learning to Retrieve from Agent Trajectories , author=. arXiv preprint arXiv:2604.04949 , year=

Pith/arXiv arXiv
[79]

Journal of Economics, Finance and Accounting Studies , volume=

A Deterministic Trajectory-Level Evaluation Framework for Learning-Based Agentic Systems , author=. Journal of Economics, Finance and Accounting Studies , volume=
[80] Argus: Evidence Assembly for Scalable Deep Research Agents. arXiv preprint arXiv:2605.16217.

