pith. machine review for the scientific record. sign in

arxiv: 2505.04588 · v2 · pith:ZYQFF5SAnew · submitted 2025-05-07 · 💻 cs.CL

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Pith reviewed 2026-05-17 17:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM search capabilityreinforcement learningsimulated retrievalcurriculum rolloutdocument quality degradationRL training efficiencyinformation retrieval
0
0 comments X

The pith

A fine-tuned retrieval module with degrading document quality trains LLMs to match or beat real search engines via RL without live API calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ZeroSearch replaces real search engine rollouts in RL training with a simulated retrieval module to cut costs and control quality. First a smaller LLM is fine-tuned to generate both useful and noisy documents for queries. Then a curriculum rollout strategy starts with high-quality outputs and progressively degrades them to force stronger reasoning from the main model. The resulting capabilities transfer back to live search use. Experiments show a 7B retrieval module matches real search performance while a 14B module surpasses it.

Core claim

By first applying lightweight supervised fine-tuning to turn an LLM into a retrieval module that can produce useful and noisy documents, then running RL with a curriculum that incrementally degrades the quality of those generated documents, the framework elicits and improves the main model's reasoning and search capabilities, achieving results comparable to or better than training against an actual search engine.

What carries the argument

Curriculum-based rollout strategy that uses a fine-tuned retrieval module to generate documents whose quality is progressively degraded during training.

Load-bearing premise

That reasoning skills honed on increasingly degraded simulated documents will transfer to the variable but generally higher-quality results returned by real search engines.

What would settle it

After training with ZeroSearch, measure the model's accuracy on reasoning benchmarks that require live search queries and compare directly to an identical model trained with real search engine rollouts; a large performance drop would indicate the simulated curriculum did not produce transferable skills.

read the original abstract

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ZeroSearch, an RL framework to train LLMs for search-augmented reasoning without real search engine APIs. It first applies lightweight SFT to convert an LLM into a retrieval module that can emit both high-quality and noisy documents for a query. RL training then uses curriculum rollouts that progressively degrade the quality of these simulated documents to elicit stronger reasoning. Experiments across model sizes and RL algorithms claim that a 7B retrieval module matches real-search performance while a 14B module surpasses it, with good generalization to base and instruction-tuned models.

Significance. If the reported transfer from simulated curriculum training to real search engines holds, the framework would substantially lower the cost and instability barriers to RL-based search training, enabling wider exploration of search-augmented reasoning. The curriculum degradation idea is a concrete technical contribution that could be reused; however, the significance is currently limited by the absence of controls that isolate progressive degradation and confirm post-training generalization to live APIs.

major comments (3)
  1. [Experiments] Experiments section: the headline claim that a 7B retrieval module achieves comparable performance to a real search engine (and 14B surpasses it) is presented without reported baselines, exact metrics, statistical tests, or data-exclusion criteria, leaving the central performance equivalence with limited verifiable support.
  2. [Method] Method / Training procedure: the curriculum rollout that incrementally degrades document quality is described as the mechanism for eliciting reasoning, yet no ablation isolating progressive degradation from fixed-quality simulation is provided; without this control the contribution of the curriculum to transferable search behavior cannot be established.
  3. [Evaluation] Evaluation: all reported results use the fine-tuned retrieval module at inference; the manuscript contains no post-training evaluation in which the trained policy is paired with actual live search-engine results, which is required to substantiate the claim that the approach incentivizes real search capability.
minor comments (2)
  1. [Abstract] Abstract: the statement of 'extensive experiments' would be strengthened by explicit pointers to the tables or figures that contain the quantitative comparisons with real search engines.
  2. [Method] Notation: the precise mechanism and schedule used to degrade document quality during curriculum rollouts could be formalized with an equation or pseudocode for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas for strengthening the empirical rigor of the work. We have revised the manuscript to incorporate additional baselines, metrics, ablations, and real-API evaluations as detailed below.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline claim that a 7B retrieval module achieves comparable performance to a real search engine (and 14B surpasses it) is presented without reported baselines, exact metrics, statistical tests, or data-exclusion criteria, leaving the central performance equivalence with limited verifiable support.

    Authors: We agree that clearer reporting is needed. The revised Experiments section now includes a dedicated comparison table with exact metrics (accuracy, F1, and task-specific scores), real search engine baselines, statistical significance tests (paired t-tests with p-values), and explicit data exclusion criteria. These additions directly support the performance equivalence claims. revision: yes

  2. Referee: [Method] Method / Training procedure: the curriculum rollout that incrementally degrades document quality is described as the mechanism for eliciting reasoning, yet no ablation isolating progressive degradation from fixed-quality simulation is provided; without this control the contribution of the curriculum to transferable search behavior cannot be established.

    Authors: We concur that an isolating ablation is valuable. The revised manuscript adds an ablation study comparing curriculum degradation against fixed-quality (high and low) simulations across the same RL setups. Results show progressive degradation yields measurably stronger reasoning and better downstream transfer, which we now report with quantitative differences. revision: yes

  3. Referee: [Evaluation] Evaluation: all reported results use the fine-tuned retrieval module at inference; the manuscript contains no post-training evaluation in which the trained policy is paired with actual live search-engine results, which is required to substantiate the claim that the approach incentivizes real search capability.

    Authors: This concern is well-taken for confirming transfer. We have added a new post-training evaluation subsection that pairs the trained policies with live search engine APIs on held-out queries. The results demonstrate improved performance relative to non-ZeroSearch baselines, providing direct evidence of incentivized real-search behavior. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external real-search benchmarks

full rationale

The paper trains a retrieval module via SFT then applies curriculum RL with progressively degraded synthetic documents, but reports final performance by directly comparing the resulting policy against live search-engine results on standard QA benchmarks. No equations, fitted parameters, or self-citations are invoked to define the target metric or to force the reported equivalence; the evaluation distribution (real API) is independent of the training distribution (simulated documents). The derivation chain therefore contains no self-definitional, fitted-input, or self-citation-load-bearing reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that a fine-tuned LLM can serve as a controllable proxy for search engine output whose quality can be systematically degraded to train reasoning; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption A lightweight SFT stage can produce an LLM retrieval module whose generated documents can be incrementally degraded in quality to create progressively harder training scenarios.
    Invoked to justify the curriculum rollout strategy that replaces real search during RL.

pith-pipeline@v0.9.0 · 5604 in / 1334 out tokens · 47045 ms · 2026-05-17T17:38:01.454336+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

  2. CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

    cs.AI 2026-05 unverdicted novelty 7.0

    CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

  3. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 7.0

    SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.

  4. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

    cs.CL 2026-05 unverdicted novelty 7.0

    LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

  5. MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

    cs.CL 2025-11 unverdicted novelty 7.0

    MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

  6. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  7. SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

    cs.CL 2026-05 unverdicted novelty 6.0

    SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.

  8. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

  9. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.

  10. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 6.0

    SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

  11. AIPO: : Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  12. T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.

  13. Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

    cs.CL 2026-04 unverdicted novelty 6.0

    CalibAdv calibrates advantages in GRPO by downscaling negative signals from incorrect final answers using intermediate step correctness and rebalancing answer-level advantages, yielding better performance and training...

  14. Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model

    cs.LG 2026-04 unverdicted novelty 6.0

    TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.

  15. $\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

    cs.LG 2026-04 unverdicted novelty 6.0

    π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

  16. The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    cs.AI 2025-09 accept novelty 6.0

    Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

  17. Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

    cs.AI 2026-05 unverdicted novelty 5.0

    MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.

  18. CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

    cs.AI 2026-05 unverdicted novelty 5.0

    CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...

  19. Learning CLI Agents with Structured Action Credit under Selective Observation

    cs.AI 2026-05 unverdicted novelty 5.0

    CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.

  20. Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

    cs.CL 2025-10 unverdicted novelty 5.0

    ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.

  21. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  22. Agentic Reasoning for Large Language Models

    cs.AI 2026-01 unverdicted novelty 4.0

    The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 19 Pith papers · 19 internal anchors

  1. [1]

    A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023

  2. [2]

    Bohnet, V

    B. Bohnet, V . Q. Tran, P. Verga, R. Aharoni, D. Andor, L. B. Soares, J. Eisenstein, K. Ganchev, J. Herzig, K. Hui, et al. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037, 2022

  3. [3]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  4. [4]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  5. [5]

    W. Feng, C. Hao, Y . Zhang, J. Song, and H. Wang. Airrag: Activating intrinsic reasoning for retrieval augmented generation via tree-based search. arXiv preprint arXiv:2501.10053, 2025

  6. [6]

    L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty, Y . Fan, V . Y . Zhao, N. Lao, H. Lee, D.-C. Juan, et al. Rarr: Researching and revising what language models say, using language models. arXiv preprint arXiv:2210.08726, 2022

  7. [7]

    Y . Guo, L. Hou, R. Shao, P. G. Jin, V . Kumar, W. Weng, Y . Xie, and T.-Y . Liu. Deepseek-r1: Reinforcement learning for retrieval-augmented generation in large language models. arXiv preprint arXiv:2503.01234, 2025

  8. [8]

    Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

    X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020

  9. [9]

    Hou and et al

    Y . Hou and et al. Rl-based learning for reasoning and decision-making in large language models. In ACL, 2025

  10. [10]

    arXiv:2303.05398

    S. Imani, L. Du, and H. Shrivastava. Mathprompter: Mathematical reasoning using large language models. arXiv preprint arXiv:2303.05398, 2023

  11. [11]

    Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity.arXiv preprint arXiv:2403.14403,

    S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. C. Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. arXiv preprint arXiv:2403.14403, 2024

  12. [12]

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y . Xu, E. Ishii, Y . J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

  13. [13]

    Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement

    J. Jiang, J. Chen, J. Li, R. Ren, S. Wang, W. X. Zhao, Y . Song, and T. Zhang. Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. arXiv preprint arXiv:2412.12881, 2024

  14. [14]

    Enhancing llm reasoning with reward-guided tree search.arXiv preprint arXiv:2411.11694, 2024a

    J. Jiang, Z. Chen, Y . Min, J. Chen, X. Cheng, J. Wang, Y . Tang, H. Sun, J. Deng, W. X. Zhao, et al. Technical report: Enhancing llm reasoning with reward-guided tree search. arXiv preprint arXiv:2411.11694, 2024

  15. [15]

    Jiang, F

    Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y . Yang, J. Callan, and G. Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992, 2023

  16. [16]

    B. Jin, H. Zeng, Z. Yue, D. Wang, H. Zamani, and J. Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  17. [17]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017. 10

  18. [18]

    Kumar and et al

    R. Kumar and et al. Research: Autonomous retrieval decision-making in llms using reinforce- ment learning. In ICLR, 2025

  19. [19]

    Kumar, L

    V . Kumar, L. Hou, Y . Guo, R. Shao, P. G. Jin, W. Weng, Y . Xie, and T.-Y . Liu. Self-correcting language models with reinforcement learning. arXiv preprint arXiv:2409.06543, 2024

  20. [20]

    Kwiatkowski, J

    T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics , 7:453–466, 2019

  21. [21]

    Lewkowycz, A

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems , 35:3843–3857, 2022

  22. [22]

    X. Li, G. Dong, J. Jin, Y . Zhang, Y . Zhou, Y . Zhu, P. Zhang, and Z. Dou. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366, 2025

  23. [23]

    X. Li, J. Jin, G. Dong, H. Qian, Y . Zhu, Y . Wu, J.-R. Wen, and Z. Dou. Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776, 2025

  24. [24]

    X. Li, J. Jin, Y . Zhou, Y . Wu, Z. Li, Q. Ye, and Z. Dou. Retrollm: Empowering large language models to retrieve fine-grained evidence within generation. arXiv preprint arXiv:2412.11919, 2024

  25. [25]

    X. Li, W. Xu, R. Zhao, F. Jiao, S. Joty, and L. Bing. Can we further elicit reasoning in llms? critic-guided planning with retrieval-augmentation for solving challenging tasks. arXiv preprint arXiv:2410.01428, 2024

  26. [26]

    When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

    A. Mallen, A. Asai, V . Zhong, R. Das, H. Hajishirzi, and D. Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 7, 2022

  27. [27]

    Teaching language models to support answers with verified quotes

    J. Menick, M. Trebacz, V . Mikulik, J. Aslanides, F. Song, M. Chadwick, M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving, et al. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022

  28. [28]

    Measuring and Narrowing the Compositionality Gap in Language Models

    O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

  29. [29]

    O. Ram, Y . Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, and Y . Shoham. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023

  30. [30]

    Rashkin, V

    H. Rashkin, V . Nikolaev, M. Lamm, L. Aroyo, M. Collins, D. Das, S. Petrov, G. S. Tomar, I. Turc, and D. Reitter. Measuring attribution in natural language generation models. arXiv preprint arXiv:2112.12870, 2021

  31. [31]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  32. [32]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  33. [33]

    W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W.-t. Yih. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023

  34. [34]

    arXiv preprint arXiv:2104.07567 , year=

    K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021

  35. [35]

    H. Song, J. Jiang, Y . Min, J. Chen, Z. Chen, W. X. Zhao, L. Fang, and J.-R. Wen. R1- searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025. 11

  36. [36]

    Galactica: A Large Language Model for Science

    R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V . Kerkez, and R. Stojnic. Galactica: A large language model for science. CoRR, abs/2211.09085, 2022

  37. [37]

    Trivedi, N

    H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  38. [38]

    R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992

  39. [39]

    S. Xia, X. Li, Y . Liu, T. Wu, and P. Liu. Evaluating mathematical reasoning beyond accuracy. arXiv preprint arXiv:2404.05692, 2024

  40. [40]

    Yamauchi, S

    R. Yamauchi, S. Sonoda, A. Sannai, and W. Kumagai. Lpml: llm-prompting markup language for mathematical reasoning. arXiv preprint arXiv:2309.13078, 2023

  41. [41]

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024

  42. [42]

    Z. Yang, P. Qi, S. Zhang, Y . Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  43. [43]

    Yoran, T

    O. Yoran, T. Wolfson, B. Bogin, U. Katz, D. Deutch, and J. Berant. Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007, 2023

  44. [44]

    W. Yu, D. Iter, S. Wang, Y . Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, and M. Jiang. Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063, 2022

  45. [45]

    Zhang, Z

    J. Zhang, Z. Li, K. Das, B. Malin, and S. Kumar. Sac3: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency. InFindings of the Association for Computational Linguistics: EMNLP 2023 , pages 15445–15458, 2023

  46. [46]

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023

  47. [47]

    Y . Zhao, H. Yin, B. Zeng, H. Wang, T. Shi, C. Lyu, L. Wang, W. Luo, and K. Zhang. Marco-o1: Towards open reasoning models for open-ended solutions. arXiv preprint arXiv:2411.14405, 2024

  48. [48]

    DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

    Y . Zheng, D. Fu, X. Hu, X. Cai, L. Ye, P. Lu, and P. Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025. 12 0 25 50 75 100 125 150 175 200 Step 0.0 0.1 0.2 0.3 0.4 0.5Train Reward ZeroSearch Search-R1 (a) LLaMA-3.2-3B-Base 0 25 50 75 100 125 150 175 200 Step 0.10 0.15 0....

  49. [49]

    1896 – 1897. New York City, 1896 is a time Doc 3: The Alienist: A Novel (2017) · The Angel of Darkness (2018) · The Lost City of Z (2019) · The Devil in the White City (2019) · A Gentleman in Moscow (2019) Doc 4: The sequel to the acclaimed national bestseller The Alienist, Caleb Carr’s The Angel of Darkness is a breathtaking thriller set in 1897 New York...