DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

· 2025 · cs.AI · arXiv 2509.25454

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

open full Pith review browse 6 citing papers arXiv PDF

abstract

Although RLVR has become an essential component for developing advanced reasoning skills in language models, contemporary studies have documented training plateaus after thousands of optimization steps, i.e., notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search (MCTS) directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance gains over prolonged training. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves an average accuracy of 62.95\% and establishes a new state-of-the-art reasoning model, while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

cs.LG · 2026-06-24 · conditional · novelty 7.0

KernelPro combines LLM code generation, roofline-guided tool orchestration, and domain-adapted MCTS to produce GPU kernels that outperform prior automated and some hand-tuned baselines on KernelBench and VeOmni workloads.

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

cs.CL · 2026-06-24 · unverdicted · novelty 7.0 · 2 refs

LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.

Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRPO and prior tree variants on nine benchmarks.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Scaling Participation in Modular AI Systems

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

Modular AI systems assembled from contributed small models outperform monolithic LLMs by up to 15.4% on 15 tasks including reasoning and factuality while showing emergent problem-solving and benefits from contributor diversity.

citing papers explorer

Showing 6 of 6 citing papers.

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization cs.LG · 2026-06-24 · conditional · none · ref 32 · internal anchor
KernelPro combines LLM code generation, roofline-guided tool orchestration, and domain-adapted MCTS to produce GPU kernels that outperform prior automated and some hand-tuned baselines on KernelBench and VeOmni workloads.
Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing cs.CL · 2026-06-24 · unverdicted · none · ref 29 · 2 links · internal anchor
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning stat.ML · 2026-05-06 · unverdicted · none · ref 23 · internal anchor
InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRPO and prior tree variants on nine benchmarks.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 61 · 2 links · internal anchor
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning cs.LG · 2026-04-08 · unverdicted · none · ref 132 · internal anchor
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Scaling Participation in Modular AI Systems cs.AI · 2026-06-05 · unverdicted · none · ref 100 · internal anchor
Modular AI systems assembled from contributed small models outperform monolithic LLMs by up to 15.4% on 15 tasks including reasoning and factuality while showing emergent problem-solving and benefits from contributor diversity.

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer