AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Canonical reference. 78% of citing Pith papers cite this work as background.
abstract
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
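The abstract names two mechanisms, retrieved token masking and an outcome-based reward, without giving details. The following is a minimal sketch of how they might look, not the authors' implementation. The `<information>` tag names, the helper names, and the answer normalization rules are assumptions made for illustration only. The idea is that tokens copied in from the search engine are excluded from the training loss, and that the reward depends only on whether the final answer matches a reference.

```python
import re
import string


def build_loss_mask(tokens, open_tag="<information>", close_tag="</information>"):
    """Return 1 for tokens the policy generated and 0 for retrieved tokens.

    The tag names are hypothetical; any delimiter that brackets the
    retrieved passages would serve the same purpose.
    """
    mask = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
            mask.append(0)  # the delimiter itself is inserted by the environment
        elif tok == close_tag:
            mask.append(0)
            inside = False
        else:
            mask.append(0 if inside else 1)
    return mask


def masked_mean(per_token_values, mask):
    """Average per-token values over the positions where mask == 1."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(v * m for v, m in zip(per_token_values, mask)) / kept


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction, gold_answers):
    """Outcome-only reward: 1.0 if the final answer matches any reference."""
    return 1.0 if normalize(prediction) in {normalize(g) for g in gold_answers} else 0.0


# Illustrative rollout: reasoning, a search call, retrieved text, then an answer.
rollout = [
    "<think>", "who", "</think>",
    "<search>", "q", "</search>",
    "<information>", "doc", "text", "</information>",
    "<answer>", "Paris", "</answer>",
]
mask = build_loss_mask(rollout)
# Placeholder per-token losses; the large values sit on the retrieved tokens.
losses = [2.0] * 6 + [100.0] * 4 + [2.0] * 3
loss = masked_mean(losses, mask)
reward = exact_match_reward("The Paris.", ["paris"])
```

In this sketch the retrieved passage contributes nothing to `loss`, so the policy is not trained on text it did not write, and `reward` carries no intermediate signal about the quality of individual search queries.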
representative citing papers
CRAFT is a three-pillar credit assignment scheme that uses counterfactual token importance from GRPO sibling rollouts to provide signed per-token distillation signals in self-distilled agentic RL.
MetaSyn benchmark shows LLM agents recover at most 52.7% of relevant studies in meta-analysis pipelines due to failures in PI/ECO-based screening despite strong retrieval.
PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
Reasoning models naturally compress context via thinking traces, with reward-constrained optimization yielding 17-23% gains over baselines on long-context QA at high compression ratios.
A self-bootstrapping paradigm uses trajectories from a small seed model to activate pre-planned sub-question decomposition in target models, enabling consistent outperformance on multi-hop QA without external distillation.
SteER introduces an interactive framework for deep research with LLMs that uses cost-benefit analysis for user control at decision points and shows improved alignment over baselines.
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
LLM agents encode tool necessity in pre-generation hidden states with high linear decodability (AUROC 0.89-0.96); Probe&Prefill uses this to reduce tool calls 48% with 1.7% accuracy loss.
AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interventions that enhance performance.
AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
A normative-descriptive framework shows LLMs' tool-calling perceptions misalign with true need/utility for web search, and hidden-state estimators improve decisions over self-perceived baselines.
MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and evaluation protocol.
citing papers explorer
- RWGBench: Evaluating Scholarly Positioning in Related Work Generation
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.