Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li + 2 more · 2024 · Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) · DOI 10.18653/v1/2024.acl-long.510

27 Pith papers cite this work, alongside 39 external citations (Crossref). Polarity classification is still being indexed.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.

Fork-Think with Confidence

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

Fork-think with confidence identifies forking points via model confidence in a single path before sampling continuations, cutting tokens up to 30% and runtime up to 57% on reasoning benchmarks while matching or exceeding parallel thinking performance.

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.

Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

cs.AI · 2026-05-05 · conditional · novelty 7.0 · 3 refs

TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured performance uplift on a frozen executor, outperforming outcome-only training on math and code benchmarks.

MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

MIRL uses mutual information to guide trajectory selection and provide separate rewards for visual perception in RLVR for VLMs, achieving 70.22% average accuracy with 25% fewer full trajectories.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

cs.CL · 2026-04-27 · unverdicted · novelty 7.0 · 2 refs

DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.

Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs

cs.AI · 2026-04-09 · accept · novelty 7.0

The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.

Learning with a Single Rollout via Monte Carlo Pass@k Critic

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

SR-PPO trains a Pass@k critic from single-rollout Monte Carlo outcomes to enable token-level advantage estimation in language model RL, yielding stable training and Pass@128 gains on math benchmarks.

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

Reasoning models from SFT, RL post-training and distillation exhibit alignment regressions versus matched instruction-tuned baselines on safety, toxicity, bias, ethics, privacy and robustness.

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

ADWIN adaptively selects training horizons in on-policy distillation via prefix alignment checks, cutting end-to-end cost by up to 4.1x while matching or exceeding full-rollout accuracy on math and code benchmarks.

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

GrowLoop proposes a human-seeded self-evolving framework that co-evolves rubrics and cases to evaluate conversational human-likeness with differentiated agreement rules.

Forecasting Downstream Performance of LLMs With Proxy Metrics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.

Verifier-Guided Code Translation via Meta-Step Decoding

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

Decoding Time Verification (DTV) interleaves verifier calls at structural boundaries during autoregressive code generation for C-to-Rust and JavaScript-to-TypeScript translation, raising pass rates while using fewer tokens than post-hoc baselines.

Milestone-Guided Policy Learning for Long-Horizon Language Agents

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.

Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non-mathematical reasoning benchmarks.

Procedural Knowledge at Scale Improves Reasoning

cs.CL · 2026-04-01 · unverdicted · novelty 6.0

Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

cs.LG · 2025-08-22 · unverdicted · novelty 6.0

In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.

Order Is Not Control

cs.LG · 2026-06-11 · unverdicted · novelty 5.0

Order is distinct from control, where control is defined as a local receiver-gated response law demonstrated across biological circuits and LLM response panels with reported prediction accuracies of 72-84%.

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

cs.LG · 2026-06-07 · unverdicted · novelty 5.0

Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

citing papers explorer

Showing 6 of 6 citing papers after filters.

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

cs.AI · 2026-06-02 · unverdicted · none · ref 30

CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · none · ref 57

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

cs.AI · 2026-05-05 · conditional · none · ref 33 · 3 links

TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured performance uplift on a frozen executor, outperforming outcome-only training on math and code benchmarks.

Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs

cs.AI · 2026-04-09 · accept · none · ref 99

The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

cs.AI · 2026-05-26 · unverdicted · none · ref 36

SBBT separates Brier-score calibration gains from AUROC ranking gains in prefix-conditioned success estimation for LLM math reasoning, with structure-aware signals yielding up to +0.110 AUROC over baselines.

SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking

cs.AI · 2026-04-09 · unverdicted · none · ref 35

SAT reduces reasoning tokens by up to 40% across multiple large reasoning models and benchmarks by adaptively pruning steps based on difficulty while maintaining or improving accuracy.
