Math-Shepherd: Verify and Reinforce LLMs Step-by-Step Without Human Annotations
10 Pith papers cite this work, alongside 39 external citations. Polarity classification is still indexing.
10 representative citing papers (2026)
- Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning
  Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance (sketched after this list).
- MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models
  MIRL uses mutual information to guide trajectory selection and provide separate rewards for visual perception in RLVR for VLMs, achieving 70.22% average accuracy with 25% fewer full trajectories.
- Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
  DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
- Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
  The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
- Milestone-Guided Policy Learning for Long-Horizon Language Agents
  BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82% (reward shaping sketched after this list).
- Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
  TraceLift trains reasoning planners using rewards that credit traces for both rubric quality and actual performance gains on a frozen executor, outperforming final-answer-only training on math and code tasks.
- Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
  Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
- Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
  PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non-mathematical reasoning benchmarks.
- Procedural Knowledge at Scale Improves Reasoning
  Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
- SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking
  SAT reduces reasoning tokens by up to 40% across multiple large reasoning models and benchmarks by adaptively pruning steps based on difficulty while maintaining or improving accuracy (pruning sketched after this list).
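The entropy-aware reweighting summarized for "Not All Tokens Learn Alike" above can be pictured with a minimal sketch. This is not the paper's implementation: the attention-tensor shape, the median threshold, and the anchor/explorer weights w_anchor and w_explorer are illustrative assumptions.

```python
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Per-token attention entropy. Assumed shape: attn is [heads, query, key];
    entropy is taken over keys and averaged over heads."""
    p = attn.clamp_min(1e-12)
    ent = -(p * p.log()).sum(dim=-1)   # [heads, query]
    return ent.mean(dim=0)             # [query]

def entropy_reweighted_pg_loss(logprobs, advantages, attn,
                               w_anchor=1.0, w_explorer=0.5):
    """Hypothetical reweighting: split tokens at the median entropy into
    low-entropy 'anchors' and high-entropy 'explorers', then down-weight
    explorers in a plain per-token policy-gradient loss."""
    ent = attention_entropy(attn)      # [seq]
    tau = ent.median()                 # assumed anchor/explorer threshold
    weights = torch.where(ent <= tau,
                          torch.full_like(ent, w_anchor),
                          torch.full_like(ent, w_explorer))
    return -(weights * advantages * logprobs).mean()
```

Up-weighting rather than down-weighting explorers, or making the weight a continuous function of entropy, are the obvious variations a real implementation would ablate.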
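BEACON's milestone partitioning and temporal reward shaping are likewise only named above; the sketch below shows one way such shaping could look. The Step structure, the linear time decay, and the bonus magnitudes are assumptions, not the paper's actual scheme.

```python
from dataclasses import dataclass

@dataclass
class Step:
    t: int                 # timestep within the episode
    milestone_hit: bool    # assumed signal: this step completed a milestone

def shaped_rewards(steps, final_success, horizon,
                   milestone_bonus=0.2, final_reward=1.0):
    """Hypothetical temporal shaping: a milestone completed early in a
    long-horizon episode earns a larger bonus than one completed late,
    on top of the sparse task reward at the end."""
    rewards = []
    for s in steps:
        r = milestone_bonus * (1.0 - s.t / horizon) if s.milestone_hit else 0.0
        rewards.append(r)
    if final_success and rewards:
        rewards[-1] += final_reward
    return rewards

# e.g. a 6-step episode hitting milestones at t=1 and t=4, ending in success
episode = [Step(t, t in (1, 4)) for t in range(6)]
print(shaped_rewards(episode, final_success=True, horizon=6))
```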
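SAT's difficulty-adaptive step pruning can be sketched under the assumption that each reasoning step carries an importance score and each problem a difficulty estimate in [0, 1]; both signals and the 40% cap are assumed inputs here, not SAT's actual mechanism.

```python
def prune_steps(steps, scores, difficulty, max_drop_frac=0.4):
    """Hypothetical adaptive pruning: the easier the problem (low difficulty),
    the larger the fraction of low-scoring reasoning steps that is dropped."""
    assert len(steps) == len(scores)
    drop_frac = max_drop_frac * (1.0 - difficulty)
    n_drop = int(len(steps) * drop_frac)
    if n_drop == 0:
        return list(steps)
    keep = sorted(range(len(steps)), key=lambda i: scores[i])[n_drop:]
    return [steps[i] for i in sorted(keep)]

print(prune_steps(["s1", "s2", "s3", "s4"], [0.9, 0.1, 0.8, 0.2], difficulty=0.0))
# easy problem: drops the lowest-scoring step -> ['s1', 's3', 's4']
```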