AgentV-RL: Scaling Reward Modeling with Agentic Verifier

2026 · cs.CL · arXiv 2604.16004

4 Pith papers cite this work. Polarity classification is still indexing.



abstract

Verifiers have been shown to enhance LLM reasoning via test-time scaling (TTS). Yet they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while a lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art outcome reward models (ORMs) by 25.2%, positioning it as a promising paradigm for agentic reward modeling.

representative citing papers

Semantic Consistency Policy Optimization for Reinforcement Learning of LLM Agents

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

SCPO recovers step-level credit from successful siblings within rollout groups to reduce semantic inconsistency in group-based RL for LLM agents, matching or exceeding baselines on ALFWorld and WebShop.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy

cs.SE · 2026-06-23 · unverdicted · novelty 5.0

Agon is a new autonomous research system using prompt economy loops across 444 iterations to demonstrate scalable omnidisciplinary research and a taxonomy separating machine-fixable failures from those needing human judgment.

citing papers explorer

Showing 4 of 4 citing papers.

Semantic Consistency Policy Optimization for Reinforcement Learning of LLM Agents
cs.LG · 2026-06-24 · unverdicted · none · ref 42 · internal anchor
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
cs.LG · 2026-05-12 · unverdicted · none · ref 51 · 2 links · internal anchor
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
cs.CL · 2026-04-24 · unverdicted · none · ref 30 · internal anchor
Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy
cs.SE · 2026-06-23 · unverdicted · none · ref 24 · internal anchor
