

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

27 Pith papers cite this work, alongside 39 external citations (source: Crossref). Polarity classification is still being indexed.


Citing papers by year: 2026 (25), 2025 (2).

Citation roles: background (1).

Citation polarities: background (1).


representative citing papers

Fork-Think with Confidence

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

Fork-Think with Confidence uses model confidence along a single reasoning path to identify forking points before sampling continuations, cutting token usage by up to 30% and runtime by up to 57% on reasoning benchmarks while matching or exceeding parallel-thinking performance.

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner that emits binary cell masks with a trained attention verifier, raising average accuracy by 15.76 points over strong baselines on eight table benchmarks while speeding up execution by 44.64%.

Learning with a Single Rollout via Monte Carlo Pass@k Critic

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

SR-PPO trains a Pass@k critic from single-rollout Monte Carlo outcomes to enable token-level advantage estimation in language model RL, yielding stable training and Pass@128 gains on math benchmarks.

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

ADWIN adaptively selects training horizons in on-policy distillation via prefix alignment checks, cutting end-to-end cost by up to 4.1x while matching or exceeding full-rollout accuracy on math and code benchmarks.

Forecasting Downstream Performance of LLMs With Proxy Metrics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.

Verifier-Guided Code Translation via Meta-Step Decoding

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

Decoding Time Verification (DTV) interleaves verifier calls at structural boundaries during autoregressive code generation for C-to-Rust and JavaScript-to-TypeScript translation, raising pass rates while using fewer tokens than post-hoc baselines.

Milestone-Guided Policy Learning for Long-Horizon Language Agents

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.

Procedural Knowledge at Scale Improves Reasoning

cs.CL · 2026-04-01 · unverdicted · novelty 6.0

Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.

Order Is Not Control

cs.LG · 2026-06-11 · unverdicted · novelty 5.0

Argues that order is distinct from control, defining control as a local, receiver-gated response law and demonstrating it across biological circuits and LLM response panels, with reported prediction accuracies of 72-84%.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
