Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.
hub Mixed citations
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Mixed citation behavior. Most common role is background (67%).
abstract
Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
HIPPO is a new RL framework that uses hint-anchored pairwise aggregation to distinguish and promote authentic reasoning deduction in LLMs instead of shortcut memorization from data overlap.
Qwen-AgentWorld are language world models that simulate multi-domain agent environments and boost general agent capabilities via decoupled RL simulation and unified foundation model training.
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.
ECPO improves GiGPO by shrinking low-count action advantages and suppressing noisy anchor states, yielding +5.2/+7.3 success gains on ALFWorld/WebShop with Qwen2.5-1.5B models at negligible extra cost.
Search-E1 uses GRPO interleaved with on-policy self-distillation to reach 0.440 average EM on seven QA benchmarks with Qwen2.5-3B, outperforming open-source baselines.
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
SearchSkill improves LLM query planning on knowledge QA by using explicit skill selection from an evolving SkillBank and a two-stage SFT process that aligns training with inference-time skill-grounded execution.
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.
CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
Mujica-MyGo decomposes multi-turn RAG interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.