ReST-MCTS*: LLM self-training via process reward guided tree search. Advances in Neural Information Processing Systems, 37:64735–64772, 2024a

Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, Tong Zhang · 2024 · arXiv 2412.11006

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: unclear 1

representative citing papers

SDR: Set-Distance Rewards for Radiology Report Generation

cs.AI · 2026-05-30 · unverdicted · novelty 6.0

Set-to-set distances on sentence embeddings provide a permutation-invariant reward signal that improves GRPO training and enables efficient test-time scaling for vision-language models generating chest X-ray reports.

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.

GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

cs.AI · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

cs.AI · 2026-05-09 · unverdicted · none · ref 25 · 2 links

An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
