DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
27 Pith papers cite this work, alongside 39 external citations. Polarity classification is still indexing.
roles
background 1
polarities
background 1
representative citing papers
Fork-think with confidence identifies forking points via model confidence in a single path before sampling continuations, cutting tokens up to 30% and runtime up to 57% on reasoning benchmarks while matching or exceeding parallel thinking performance.
CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy by 15.76 points over strong baselines on eight table benchmarks while speeding execution by 44.64%.
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured performance uplift on a frozen executor, outperforming outcome-only training on math and code benchmarks.
MIRL uses mutual information to guide trajectory selection and provide separate rewards for visual perception in RLVR for VLMs, achieving 70.22% average accuracy with 25% fewer full trajectories.
DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
SR-PPO trains a Pass@k critic from single-rollout Monte Carlo outcomes to enable token-level advantage estimation in language model RL, yielding stable training and Pass@128 gains on math benchmarks.
Reasoning models from SFT, RL post-training and distillation exhibit alignment regressions versus matched instruction-tuned baselines on safety, toxicity, bias, ethics, privacy and robustness.
ADWIN adaptively selects training horizons in on-policy distillation via prefix alignment checks, cutting end-to-end cost by up to 4.1x while matching or exceeding full-rollout accuracy on math and code benchmarks.
GrowLoop proposes a human-seeded self-evolving framework that co-evolves rubrics and cases to evaluate conversational human-likeness with differentiated agreement rules.
Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.
Decoding Time Verification (DTV) interleaves verifier calls at structural boundaries during autoregressive code generation for C-to-Rust and JavaScript-to-TypeScript translation, raising pass rates while using fewer tokens than post-hoc baselines.
BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.
Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non-mathematical reasoning benchmarks.
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.
Order is distinct from control, where control is defined as a local receiver-gated response law demonstrated across biological circuits and LLM response panels with reported prediction accuracies of 72-84%.
DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.
Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
citing papers explorer
- CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions
CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.
- From Table to Cell: Attention for Better Reasoning with TABALIGN
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy by 15.76 points over strong baselines on eight table benchmarks while speeding execution by 44.64%.
- Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured performance uplift on a frozen executor, outperforming outcome-only training on math and code benchmarks.
- Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
- Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking
SBBT separates Brier-score calibration gains from AUROC ranking gains in prefix-conditioned success estimation for LLM math reasoning, with structure-aware signals yielding up to +0.110 AUROC over baselines.
- SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking
SAT reduces reasoning tokens by up to 40% across multiple large reasoning models and benchmarks by adaptively pruning steps based on difficulty while maintaining or improving accuracy.