pith. machine review for the scientific record.

arxiv: 2503.09516 · v5 · submitted 2025-03-12 · cs.CL · cs.AI · cs.IR

Recognition: 3 theorem links

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Dong Wang, Hamed Zamani, Hansi Zeng, Jiawei Han, Jinsung Yoon, Sercan Arik, Zhenrui Yue

Pith reviewed 2026-05-11 06:41 UTC · model grok-4.3

keywords: Search-R1 · reinforcement learning · LLM reasoning · search engines · retrieval-augmented generation · question answering · multi-turn interaction

The pith

LLMs trained with reinforcement learning learn to generate and use search queries during step-by-step reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that reinforcement learning can teach large language models to decide on their own when and what to search for while building an answer, rather than depending on fixed prompts or external instructions. By masking retrieved tokens during training and scoring only the final answer, the model discovers multi-turn search strategies that integrate fresh information into its chain of thought. This matters because current prompting approaches often leave models unable to use search engines effectively, resulting in outdated or incomplete reasoning on knowledge-intensive tasks. Experiments across seven question-answering datasets show consistent gains over standard retrieval-augmented baselines, with larger improvements on the 7B model than the 3B model.

Core claim

Search-R1 applies reinforcement learning to reasoning trajectories so that the LLM autonomously emits search queries at chosen points, receives real-time retrieval results, and continues reasoning with those results masked to stabilize training; an outcome-based reward then reinforces trajectories that reach correct final answers. This produces measurable improvements of 41 percent for Qwen2.5-7B and 20 percent for Qwen2.5-3B over comparable RAG baselines on seven QA datasets, while also yielding observations about response-length dynamics and the effects of different RL optimizers.
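A minimal sketch of the interaction loop described above may help. It assumes a tag protocol that this page does not specify: the `<search>`, `<information>`, and `<answer>` tags, the `rollout` function, and the `generate` and `retrieve` callables are hypothetical names chosen for illustration, and a scripted policy stands in for an actual LLM.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical tag protocol; the page does not state the real format.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(
    question: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    max_turns: int = 4,
) -> List[Tuple[str, bool]]:
    """Interleave generation and retrieval.

    Returns (text, is_retrieved) segments so that retrieved text can be
    told apart from model-generated text later on.
    """
    context = question
    segments: List[Tuple[str, bool]] = []
    for _ in range(max_turns):
        chunk = generate(context)
        segments.append((chunk, False))
        context += chunk
        if ANSWER_RE.search(chunk):
            break  # the policy committed to a final answer
        match = SEARCH_RE.search(chunk)
        if match is None:
            break  # neither a query nor an answer: stop the episode
        docs = retrieve(match.group(1).strip())
        info = "<information>" + " ".join(docs) + "</information>"
        segments.append((info, True))
        context += info
    return segments


def make_scripted_policy(outputs: List[str]) -> Callable[[str], str]:
    """Toy stand-in for an LLM: replays fixed outputs, ignoring the context."""
    it = iter(outputs)
    return lambda context: next(it)


def toy_retrieve(query: str) -> List[str]:
    """Toy stand-in for a search engine."""
    return [f"doc about {query}"]


segments = rollout(
    "Who wrote Dune?",
    make_scripted_policy(
        ["<search>Dune author</search>", "<answer>Frank Herbert</answer>"]
    ),
    toy_retrieve,
)
for text, is_retrieved in segments:
    print("retrieved" if is_retrieved else "generated", "|", text)
```

Tracking which segments came from the retriever is a design choice of this sketch; it keeps enough information around for a trainer to treat retrieved text differently from generated text.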

What carries the argument

Multi-turn search interactions optimized by outcome-based RL rewards and retrieved-token masking, which lets the model learn when to query without query-level supervision.
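Retrieved-token masking can be illustrated with a small sketch, assuming a trajectory is stored as token-id segments flagged as retrieved or model-generated. The function names, the segment layout, and the per-token loss values are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple


def build_loss_mask(
    segments: List[Tuple[List[int], bool]]
) -> Tuple[List[int], List[int]]:
    """Flatten (token_ids, is_retrieved) segments into ids plus a 0/1 mask.

    Mask is 1 for model-generated tokens and 0 for retrieved tokens.
    """
    token_ids: List[int] = []
    mask: List[int] = []
    for ids, is_retrieved in segments:
        token_ids.extend(ids)
        mask.extend([0 if is_retrieved else 1] * len(ids))
    return token_ids, mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over unmasked (model-generated) tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


# Hypothetical trajectory: generated tokens, a retrieved passage, more generation.
example_segments = [([1, 2, 3], False), ([9, 9], True), ([4, 5], False)]
ids, mask = build_loss_mask(example_segments)

# Made-up per-token losses; the retrieved tokens carry large values on purpose.
losses = [1.0, 1.0, 1.0, 100.0, 100.0, 2.0, 2.0]
loss = masked_mean(losses, mask)
print(mask, loss)
```

In this toy case the two retrieved tokens contribute nothing to the averaged loss, so the policy is not pushed to reproduce text it did not write.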

If this is right

  • Models learn to interleave search calls at useful moments inside long reasoning chains rather than only at the start.
  • Outcome-only rewards suffice to shape useful retrieval behavior across multiple turns.
  • Smaller models still show gains, though smaller than those for larger models under identical training.
  • Response length and search frequency change systematically as training proceeds.
  • The same RL setup supplies empirical comparisons among optimizers and model scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other external tools such as calculators or code interpreters if similar masking and outcome rewards are used.
  • Learned search timing might reduce unnecessary retrievals and shorten inference latency once the policy stabilizes.
  • Because the method requires no human preference data, it may scale to new domains where only final-answer correctness is available.
  • The observed response-length dynamics suggest a possible trade-off between exploration of searches and concise final answers that future work could tune explicitly.

Load-bearing premise

That scoring only the final answer plus token masking is enough for the model to discover effective multi-turn search behavior without any additional human or query-level signals.
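One way to picture an outcome-only reward is exact-match scoring on the final answer span. The sketch below assumes an `<answer>` tag format and a common QA-style normalization (lowercasing, stripping punctuation and articles); both are assumptions for illustration and are not confirmed on this page.

```python
import re
import string
from typing import List

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def outcome_reward(trajectory: str, gold_answers: List[str]) -> float:
    """Return 1.0 if the last <answer> span matches any gold answer, else 0.0.

    Nothing about the intermediate queries or reasoning is scored.
    """
    spans = ANSWER_RE.findall(trajectory)
    if not spans:
        return 0.0  # no final answer emitted
    prediction = normalize(spans[-1])
    return 1.0 if any(prediction == normalize(g) for g in gold_answers) else 0.0


hit = outcome_reward(
    "<search>Dune author</search><answer>The Frank Herbert.</answer>",
    ["Frank Herbert"],
)
miss = outcome_reward("<answer>Isaac Asimov</answer>", ["Frank Herbert"])
no_answer = outcome_reward("<search>Dune author</search>", ["Frank Herbert"])
print(hit, miss, no_answer)
```

Because the reward looks only at the final span, a trajectory with zero searches and one with many searches receive the same score when their answers match, which is exactly why the premise above is load-bearing.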

What would settle it

Retraining the same base models with the masking and outcome reward removed, or replaced by standard next-token prediction, and measuring whether the performance gap over RAG baselines disappears on the same seven datasets.

Original abstract

Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Search-R1, an RL extension for training LLMs to autonomously generate multiple search queries during step-by-step reasoning with real-time retrieval. It uses retrieved token masking to stabilize training and a simple outcome-based reward (final answer correctness). Experiments on seven QA datasets report gains of 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over RAG baselines, plus empirical insights on RL methods, model scale, and response length; code and checkpoints are released publicly.

Significance. If the gains are shown to arise from genuinely improved multi-turn search policies rather than confounds, the work would advance retrieval-augmented reasoning by demonstrating that outcome-only RL plus masking can suffice without process supervision. Public code and checkpoints are a clear strength for reproducibility.

major comments (2)
  1. [Experiments] Experiments section: the headline gains (41% and 20%) are reported without run-to-run variance, number of seeds, statistical significance tests, or explicit confirmation that retrieval corpus, top-k, and maximum turns are identical between Search-R1 and all RAG baselines; this is load-bearing for the claim that the RL policy itself drives the improvement.
  2. [Method] Method and RL optimization sections: the combination of sparse outcome reward and retrieved-token masking is asserted to let the model discover effective multi-turn search without query-level supervision, yet no trajectory analysis, search-frequency ablations, or checks for over-searching/reward hacking are provided to substantiate that the learned behavior is optimal rather than lucky or hacky.
minor comments (2)
  1. [Abstract] The abstract lists seven datasets but does not name them; adding the list would improve immediate readability.
  2. [Empirical Insights] Response-length dynamics are mentioned as an insight but lack a dedicated figure or table reference in the provided summary; ensure all claimed analyses have clear visual or tabular support.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline gains (41% and 20%) are reported without run-to-run variance, number of seeds, statistical significance tests, or explicit confirmation that retrieval corpus, top-k, and maximum turns are identical between Search-R1 and all RAG baselines; this is load-bearing for the claim that the RL policy itself drives the improvement.

    Authors: We agree that variance reporting and explicit confirmation of identical settings are important for validating the claims. In the revised manuscript, we will report results averaged over 3 random seeds with standard deviations and include statistical significance tests (e.g., paired t-tests) against the RAG baselines. We will also add an explicit statement in the experimental setup confirming that the retrieval corpus, top-k value, and maximum turns are identical across Search-R1 and all baselines, as implemented in the released code. This directly supports that the improvements arise from the learned RL policy.
    Revision: yes

  2. Referee: [Method] Method and RL optimization sections: the combination of sparse outcome reward and retrieved-token masking is asserted to let the model discover effective multi-turn search without query-level supervision, yet no trajectory analysis, search-frequency ablations, or checks for over-searching/reward hacking are provided to substantiate that the learned behavior is optimal rather than lucky or hacky.

    Authors: We acknowledge that additional analyses would provide stronger evidence for the optimality of the learned policy. The manuscript already includes empirical insights on response length dynamics, which help rule out trivial over-searching as the source of gains. In the revision, we will add qualitative examples of multi-turn search trajectories, an ablation varying search frequency (via modified rewards), and further checks on response patterns to address potential reward hacking. These additions will better substantiate that the model discovers effective search behaviors.
    Revision: yes
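The seed-level reporting proposed in the first response could be computed along these lines. This is a sketch with made-up placeholder scores (not results from the paper), pairing Search-R1 and a RAG baseline by seed; `summarize` and `paired_t` are hypothetical helpers, and `paired_t` returns only the t statistic and degrees of freedom.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic for matched scores a[i] vs b[i], plus degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder exact-match scores for 3 seeds; NOT numbers from the paper.
search_r1_em = [0.40, 0.42, 0.44]
rag_em = [0.30, 0.31, 0.35]

mean_sr1, std_sr1 = summarize(search_r1_em)
mean_rag, std_rag = summarize(rag_em)
t_stat, dof = paired_t(search_r1_em, rag_em)

print(f"Search-R1: {mean_sr1:.3f} +/- {std_sr1:.3f}")
print(f"RAG:       {mean_rag:.3f} +/- {std_rag:.3f}")
print(f"paired t = {t_stat:.2f} with {dof} degrees of freedom")
```

Turning the statistic into a p-value needs the t distribution, for example via `scipy.stats.ttest_rel` if SciPy is available. With only 3 seeds the test has little power, so reporting the per-seed differences alongside it would likely be more informative.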

Circularity Check

0 steps flagged

No circularity: purely empirical RL method with no derivation chain

Full rationale

The paper introduces Search-R1 as an RL training procedure for LLMs to generate search queries during reasoning, using outcome-based rewards and retrieved-token masking. It reports experimental results on seven QA datasets showing gains over RAG baselines, plus empirical insights on optimization and response lengths. No first-principles derivation, theorem, or prediction is claimed that could reduce to its own inputs by construction; the work is self-contained as a procedural method validated by public code and external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions plus the domain assumption that outcome rewards suffice to shape search behavior.

axioms (1)
  • domain assumption: Outcome-based reward is sufficient to optimize search query generation and retrieval use.
    Paper explicitly uses a simple outcome-based reward function without query-level signals.

pith-pipeline@v0.9.0 · 5531 in / 1036 out tokens · 44972 ms · 2026-05-11T06:41:26.119835+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

    cs.AI 2026-04 accept novelty 8.0

    AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

  2. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

  3. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  4. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  5. DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.

  6. LLM Agents Already Know When to Call Tools -- Even Without Reasoning

    cs.CL 2026-05 conditional novelty 7.0

    LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.

  7. AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

    cs.AI 2026-05 unverdicted novelty 7.0

    AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.

  8. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.

  9. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  10. Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

    cs.AI 2026-05 unverdicted novelty 7.0

    AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.

  11. Inference-Time Budget Control for LLM Search Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.

  12. Skill Retrieval Augmentation for Agentic AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.

  13. From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and eva...

  14. ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

  15. Latent Abstraction for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA...

  16. Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 7.0

    Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...

  17. Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

  18. AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning

    cs.IR 2026-04 unverdicted novelty 7.0

    A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.

  19. Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    Deep-Reporter introduces a unified agentic framework for grounded multimodal long-form generation via multimodal search, checklist-guided synthesis, and recurrent context management, plus the M2LongBench benchmark.

  20. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  21. VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...

  22. Retrieval Augmented Conversational Recommendation with Reinforcement Learning

    cs.IR 2026-04 unverdicted novelty 7.0

    RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.

  23. Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

  24. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  25. RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

    cs.CL 2026-05 unverdicted novelty 6.0

    RubricEM uses rubric-guided stagewise policy decomposition and reflection-based meta-policy evolution to improve long-horizon research agents beyond verifiable rewards.

  26. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

    cs.AI 2026-05 unverdicted novelty 6.0

    ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.

  27. PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

  28. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

  29. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.

  30. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

  31. Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  32. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 6.0

    SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

  33. SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillMaster is a training framework that lets LLM agents autonomously propose, update, and apply skills, yielding 8.8% and 9.3% higher success rates on ALFWorld and WebShop than prior methods.

  34. SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.

  35. Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.

  36. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  37. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  38. RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.

  39. A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

    cs.CL 2026-05 unverdicted novelty 6.0

    A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges p...

  40. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.

  41. When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

    cs.CL 2026-05 conditional novelty 6.0

    AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.

  42. T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.

  43. Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    Judge-R1 improves LLM judgment document generation by combining agentic legal information retrieval with GRPO-based rubric-guided optimization, outperforming baselines on the JuDGE benchmark.

  44. Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.

  45. To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs often misalign their self-perceived need for tools with true need and utility, but lightweight estimators trained on hidden states can improve tool-calling decisions and task performance across multiple models and tasks.

  46. SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering

    cs.CL 2026-04 unverdicted novelty 6.0

    SEARCH-R improves multi-hop question answering by training a fine-tuned Llama navigator for sub-question decomposition and using dependency-tree retrieval to quantify informational contribution of documents.

  47. JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 6.0

    JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

  48. Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.

  49. Pause or Fabricate? Training Language Models for Grounded Reasoning

    cs.CL 2026-04 conditional novelty 6.0

    GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task succe...

  50. DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

  51. TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only

    cs.CL 2026-04 unverdicted novelty 6.0

    TRN-R1-Zero is an RL-only post-training method that lets LLMs perform zero-shot node, edge, and graph reasoning on text-rich networks without supervised data or larger-model distillation.

  52. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  53. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.

  54. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.

  55. Towards Long-horizon Agentic Multimodal Search

    cs.CV 2026-04 unverdicted novelty 6.0

    LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

  56. Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

    cs.CL 2026-04 conditional novelty 6.0

    Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...

  57. ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning

    cs.IR 2026-04 unverdicted novelty 6.0

    ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.

  58. OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

    cs.AI 2026-04 unverdicted novelty 6.0

    OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.

  59. Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts

    cs.AI 2026-04 unverdicted novelty 6.0

    HERA evolves query-specific agent topologies via reward-guided sampling and refines role-specific prompts via credit assignment, yielding 38.69% average gains on six knowledge-intensive benchmarks.

  60. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.