Let's Verify Step by Step

Bowen Baker, Harri Edwards, Hunter Lightman, Teddy Lee, Vineet Kosaraju, Yura Burda · 2023 · cs.LG · arXiv 2305.20050

Canonical reference. 274 Pith papers cite this work; 81% of classified citations use it as background.

abstract

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
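
The distinction the abstract draws between outcome supervision (one signal for the final result) and process supervision (one signal per reasoning step) can be illustrated with a minimal sketch. Everything here is hypothetical: the candidate data, the function names, and in particular the choice to aggregate step-level scores by taking their product are assumptions made for illustration, not details stated on this page.

```python
from math import prod

# Hypothetical candidate solutions. "step_probs" stands in for per-step
# correctness scores (process supervision); "outcome_prob" stands in for a
# single score on the final result (outcome supervision).
candidates = [
    {"id": "A", "step_probs": [0.9, 0.9, 0.9], "outcome_prob": 0.70},
    {"id": "B", "step_probs": [0.99, 0.2, 0.99], "outcome_prob": 0.75},
]


def process_score(step_probs):
    """Aggregate step-level scores into one number.

    Using the product is an assumed aggregation rule for this sketch.
    """
    return prod(step_probs)


def best_by_process(cands):
    """Pick the candidate whose steps look most reliable overall."""
    return max(cands, key=lambda c: process_score(c["step_probs"]))["id"]


def best_by_outcome(cands):
    """Pick the candidate with the highest final-result score."""
    return max(cands, key=lambda c: c["outcome_prob"])["id"]


if __name__ == "__main__":
    print("process-based choice:", best_by_process(candidates))
    print("outcome-based choice:", best_by_outcome(candidates))
```

In this toy data, candidate B has one weak intermediate step that a single final-result score cannot see, so the two selection rules disagree. That is the kind of case where step-level feedback carries extra information.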

citation-role summary

background 25 · dataset 4 · method 2

citation-polarity summary

background 25 · use dataset 4 · use method 2

representative citing papers

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

cs.LG · 2026-06-28 · conditional · novelty 8.0

GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

quant-ph · 2025-10-23 · accept · novelty 8.0

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement

cs.AI · 2026-06-28 · conditional · novelty 7.0

Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.

Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents

cs.AI · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

GILP trains a parameterized backbone for valid actions and state predictions, then uses a consistency gate with LLM drafts to reduce hallucinated-state rate from 0.176 to 0.035 on GPT-4o-mini while raising success from 0.668 to 0.838.

VCT: A Verifiable Transcript System for LLM Conversations

cs.CR · 2026-06-22 · unverdicted · novelty 7.0

VCT abstracts non-linear LLM operations into authenticated state transitions via atomic Q&A hash chains, session Merkle roots, and account-level roots with joint signatures, plus protocols for deletions and concurrency detection.

A Verifiable Search Is Not a Learnable Chain-of-Thought

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Verifiable search procedures cannot be learned as forward chain-of-thought by language models; they instead learn memorization, verification, or require precomputed catalogs.

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

cs.AI · 2026-06-15 · unverdicted · novelty 7.0

DivInit improves agentic search breadth scaling by selecting diverse first-turn queries from a single model generation, delivering 5-7 point gains on multi-hop QA across five models and eight benchmarks at matched compute.

Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation

cs.DL · 2026-06-12 · conditional · novelty 7.0

This paper introduces a taxonomy of four LLM failure modes on research math proofs and empirically shows premise smuggling in all eight audited Gemini outputs, with a new audit instrument achieving 100% precision.

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

SWITCH uses explicit <swi> and </swi> boundary tokens to make latent chain-of-thought compatible with on-policy RL (GRPO) and open to causal mechanistic probing, outperforming prior hidden-state recurrence methods.

Agreement in Representation Space for Open-Ended Self-Consistency

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.

The Power of Test-Time Training for Approximate Sampling

cs.DS · 2026-06-09 · unverdicted · novelty 7.0

Establishes a quadratic lower bound on query complexity for sampling from large classes of distributions given approximate density oracles, answers an open question on optimality of random walks, and shows circumvention for bounded classes as an abstraction of TTT.

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.

VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

A paired-image benchmark reveals that many MLLMs fail to update predictions when task-critical visual evidence changes, even when they answer individual images correctly.

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces CHARM framework that detects cascading hallucinations in agentic RAG at 89.4% rate with 5.3% false positives and reduces error propagation by 82.1% on multi-hop QA benchmarks.

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

ChemCoTBench-V2 is a new rule-verifiable benchmark with 5,620 samples across 18 tasks that evaluates LLM chemical reasoning traces using deterministic chemistry rules and reference traces rather than final answers alone.

ResMerge: Residual-based Spectral Merging of Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

ResMerge improves merging of RL expert LLMs via a stable residual consensus backbone plus gated head correction, outperforming task-vector and spectral baselines in capability preservation.

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.

PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

PEFT-Arena reveals distinct stability-plasticity profiles across PEFT methods, with orthogonal finetuning achieving the best Pareto frontier under comparable parameter budgets, supported by weight-space spectral and activation-space retention analyses.

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.

citing papers explorer

Showing 50 of 274 citing papers.

AutoBG: A Board Game Design Assistant with Interactive Ideation, Iterative Rulebook Generation, and Individualized Feedback cs.HC · 2026-06-01 · unverdicted · none · ref 6 · internal anchor
AutoBG is a multi-module AI assistant that uses critic-driven iterative refinement on LLM-generated rulebooks, trained on 2.2K rulebooks and 180K reviews, to produce audience-tested designs that outperform GPT-5.4 baselines.
ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks cs.LG · 2026-05-31 · unverdicted · none · ref 11 · internal anchor
ThinkSwitch uses iterative self-distillation with QLoRA and spherical weight interpolation to raise both instruct and thinking checkpoint accuracy on small AIME and PubMedQA sets using only 15 human prompts per domain.
FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search cs.AI · 2026-05-30 · unverdicted · none · ref 27 · internal anchor
FALAT improves failure attribution in LLM agent trajectories via dependency-guided search, achieving 46.0% step-level accuracy on algorithm-generated and 29.1% on hand-crafted trajectories in the Who&When benchmark.
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs cs.AI · 2026-05-30 · unverdicted · none · ref 21 · internal anchor
LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.
Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning cs.CL · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
PPC adds a preplan stage to the question-plan-CoT paradigm, achieving best results on 39 of 40 metrics across five math benchmarks with no added inference tokens.
REPOT: Recoverable Program-of-Thought via Checkpoint Repair cs.SE · 2026-05-28 · unverdicted · none · ref 9 · internal anchor
RePoT recovers from PoT failures via deterministic verified replay and checkpoint repair, yielding +3 to +11pp gains on planning benchmarks and showing checkpoint state as the key recovery signal over error-only feedback.
Draft-OPD: On-Policy Distillation for Speculative Draft Models cs.CL · 2026-05-28 · unverdicted · none · ref 19 · internal anchor
Draft-OPD applies on-policy distillation via target-assisted generation and error replay to train speculative draft models, yielding over 5x lossless acceleration and gains over EAGLE-3 and DFlash.
Rethinking Stepwise Model Routing: A Cost-Efficient Table Reasoning Perspective cs.CL · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
EcoTab is a stepwise model routing framework that separately estimates uncertainties for table tokens and text tokens to improve cost-efficiency in table reasoning.
Rubric-Guided Process Reward for Stepwise Model Routing cs.AI · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
RoRo uses alternating optimization of a Rubricor and Judge to create process rewards from outcome-cost-process preference pairs, then combines them with outcome rewards via GRPO to train stepwise model routers that outperform baselines on five reasoning benchmarks.
Label-Free Reinforcement Learning via Cross-Model Entropy cs.LG · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
Cross-Model Entropy supplies a continuous label-free reward for RL post-training by averaging a generator's response log-likelihood under an independent verifier model, yielding win-rate gains on instruction following.
HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs cs.AI · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
HRBench evaluates 12 switching strategy settings across 6 LLMs and 5 benchmarks, showing prompt-based, routing, and speculative methods occupy distinct effectiveness-efficiency trade-off regions.
Verifiable Benchmarking of Long-Horizon Spatial Biology cs.AI · 2026-05-27 · unverdicted · none · ref 20 · internal anchor
Introduces SpatialBench-Long benchmark with 24 evaluations on spatial biology datasets from PDAC, glioblastoma, lung adenocarcinoma and optic nerve systems, reporting top model performance at 8/72 runs (11.1%).
Uncertainty-Aware Budget Allocation for Adaptive Test-Time Reasoning cs.CL · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
UAB uses ANLL from a single generation as a difficulty signal and a marginal-greedy concave optimization to allocate remaining sampling budget, yielding up to 3% higher average accuracy on reasoning benchmarks.
Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching cs.AI · 2026-05-25 · unverdicted · none · ref 9 · internal anchor
DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.
Credit Assignment with Resets in Language Model Reasoning cs.AI · 2026-05-25 · unverdicted · none · ref 7 · internal anchor
The paper introduces Random-Reset Policy Optimization (RRPO) and Self-Reset Policy Optimization (SRPO) that use resets to enable more precise credit assignment in RL for language model reasoning, with SRPO outperforming GRPO and RRPO across benchmarks.
Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning cs.CL · 2026-05-23 · unverdicted · none · ref 6 · internal anchor
GuardedRepair uses guarded best-of-N repair with symbolic checks, semantic diagnostics, and conservative policies to selectively replace LLM reasoning traces, raising GSM8K accuracy from 95.60% to 96.89% and ASDiv from 78.40% to 87.60% without breaking correct cases.
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning cs.AI · 2026-05-21 · unverdicted · none · ref 7 · 2 links · internal anchor
Search-E1 uses GRPO interleaved with on-policy self-distillation to reach 0.440 average EM on seven QA benchmarks with Qwen2.5-3B, outperforming open-source baselines.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning cs.LG · 2026-05-21 · unverdicted · none · ref 22 · 2 links · internal anchor
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
Manifold-Guided Attention Steering cs.LG · 2026-05-20 · unverdicted · none · ref 12 · internal anchor
MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning cs.LG · 2026-05-20 · unverdicted · none · ref 24 · internal anchor
Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.
Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning cs.LG · 2026-05-18 · unverdicted · none · ref 23 · internal anchor
CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.
Self-Supervised On-Policy Distillation for Reasoning Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 56 · internal anchor
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding cs.CV · 2026-05-15 · unverdicted · none · ref 44 · internal anchor
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts cs.LG · 2026-05-14 · unverdicted · none · ref 12 · 2 links · internal anchor
DualKV eliminates redundant prompt replication in RL training attention kernels via fused dual-KV CUDA operations and token repacking, delivering 1.63-3.82x policy-update speedups while remaining mathematically equivalent to standard attention.
Deep Pre-Alignment for VLMs cs.CV · 2026-05-14 · unverdicted · none · ref 56 · internal anchor
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards cs.CL · 2026-05-14 · unverdicted · none · ref 30 · internal anchor
CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation cs.CL · 2026-05-13 · unverdicted · none · ref 20 · 2 links · internal anchor
Local teachability collapse occurs in later trajectory segments during strong-to-weak OPD; a margin-based release rule using top-K teacher advantage and BIC change-point detection on sentence segments outperforms full-trajectory supervision on five in-domain benchmarks and preserves out-of-domain pe
Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture cs.LO · 2026-05-13 · unverdicted · partial · ref 42 · internal anchor
Introduces a trust-boundary architecture in Lean 4 with three certificate families and two operators that deliver sorry-free, axiom-audited assurances for LLM pipeline components.
STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes cs.CL · 2026-05-13 · unverdicted · none · ref 36 · internal anchor
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training cs.AI · 2026-05-13 · unverdicted · none · ref 30 · internal anchor
GRACE scores reasoning steps via gradient alignment and trajectory consistency to select data subsets that match full performance with 5% of the data on Qwen3-VL-2B-Instruct.
Holder Policy Optimisation cs.LG · 2026-05-12 · unverdicted · none · ref 49 · 2 links · internal anchor
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 25 · internal anchor
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
Internalizing Safety Understanding in Large Reasoning Models via Verification cs.AI · 2026-05-09 · unverdicted · none · ref 13 · internal anchor
Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization cs.AI · 2026-05-09 · unverdicted · none · ref 24 · 2 links · internal anchor
AgentPSO applies a particle-swarm-inspired update rule to evolve natural-language reasoning skills across multiple LLM agents, yielding gains over static and test-time multi-agent baselines with cross-benchmark transfer.
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding cs.CL · 2026-05-09 · unverdicted · none · ref 22 · internal anchor
PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
Sanity Checks for Long-Form Hallucination Detection cs.CL · 2026-05-08 · unverdicted · none · ref 11 · 2 links · internal anchor
Hallucination detectors on LLM reasoning traces often rely on final-answer artifacts rather than reasoning validity; once controlled, lightweight lexical trajectory features suffice for robust detection.
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning cs.AI · 2026-05-08 · unverdicted · none · ref 8 · internal anchor
Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 39 · 2 links · internal anchor
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training cs.AI · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.
Response Time Enhances Alignment with Heterogeneous Preferences cs.LG · 2026-05-07 · unverdicted · none · ref 67 · internal anchor
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation cs.CR · 2026-05-06 · unverdicted · none · ref 83 · internal anchor
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles cs.AI · 2026-05-03 · unverdicted · none · ref 16 · 2 links · internal anchor
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning cs.CL · 2026-05-03 · unverdicted · none · ref 98 · internal anchor
Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact and FEVER.
AI Alignment via Incentives and Correction cs.LG · 2026-05-02 · unverdicted · none · ref 36 · 2 links · internal anchor
AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM coding tasks.
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning cs.LG · 2026-04-30 · unverdicted · none · ref 41 · internal anchor
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched baselines.
IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning cs.CL · 2026-04-27 · unverdicted · none · ref 2 · internal anchor
IRIS interleaves staged curriculum supervised fine-tuning with reverse-curriculum reinforcement learning using a composite reward to improve mathematical reasoning in English and low-resource Indian languages, accompanied by a new 29k-problem multilingual dataset.
How LLMs Fail and Generalize in RTL Coding for Hardware Design? cs.CL · 2026-04-26 · unverdicted · none · ref 2 · internal anchor
Frontier LLMs reach a 90.8% pass-rate ceiling on VerilogEval RTL coding due to unsolvable functional errors that alignment and test-time scaling cannot fix, showing pretraining knowledge bounds the capability.
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance cs.CL · 2026-04-25 · unverdicted · none · ref 16 · internal anchor
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.
Large Language Models Decide Early and Explain Later cs.CL · 2026-04-24 · unverdicted · none · ref 6 · internal anchor
LLMs settle on their answer after a minority of CoT tokens and produce an average of 760 more as post-decision explanation, enabling early stopping that saves 500 tokens per query at a 2% accuracy cost.
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models cs.CL · 2026-04-23 · unverdicted · none · ref 6 · internal anchor
Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
