Trace length is a simple uncertainty signal in reasoning models
3 Pith papers cite this work. Polarity classification is still indexing.
Citing papers explorer
-
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks at lower compute by avoiding extra rollouts and critic models.
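To make the probe idea concrete, here is a minimal sketch of the general technique the abstract describes: fitting a lightweight linear value head on an actor's hidden states to predict expected reward, so that no extra rollouts or separate critic network are needed. The dimensions, synthetic data, and ridge solver below are illustrative assumptions, not POISE's actual implementation.

```python
import numpy as np

# Sketch of a value probe on hidden states (all names and data are illustrative).
rng = np.random.default_rng(0)

d_hidden = 64    # assumed dimensionality of the actor's hidden state
n_pairs = 500    # number of (hidden_state, verifiable_reward) training pairs

# Synthetic hidden states and binary verifiable rewards correlated with them.
true_w = rng.normal(size=d_hidden)
H = rng.normal(size=(n_pairs, d_hidden))
rewards = (H @ true_w + 0.1 * rng.normal(size=n_pairs) > 0).astype(float)

# The probe is ridge-regularized least squares on frozen hidden states:
# no extra rollouts to sample and no separate critic network to train.
lam = 1e-2
w = np.linalg.solve(H.T @ H + lam * np.eye(d_hidden), H.T @ rewards)

def value_estimate(h: np.ndarray) -> float:
    """Predicted expected reward (baseline) for a hidden state h."""
    return float(h @ w)

# Advantage for a sampled trajectory: observed reward minus the probe baseline.
h_new = rng.normal(size=d_hidden)
advantage = 1.0 - value_estimate(h_new)
```

The point of the sketch is the cost profile: once the actor's hidden states are cached, the baseline is a single linear solve rather than another forward pass or learned model.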
-
VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
VERDI derives three structural confidence signals from decomposed LLM verification traces and calibrates them with Platt-scaled logistic regression, achieving AUROC 0.72-0.91 on GPT models and 0.56-0.70 on Qwen models, where log-probabilities fail.
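Platt scaling, as named in the abstract, amounts to fitting a sigmoid (logistic regression) from raw scores to a calibrated probability. A minimal sketch under assumed synthetic data; the three feature names and all numbers are illustrative, not VERDI's:

```python
import numpy as np

# Sketch of Platt-scaled logistic calibration (features and data are assumed).
rng = np.random.default_rng(1)

# Three synthetic "structural signals" per verification trace, e.g.
# step-agreement rate, verdict flips, normalized trace length (assumed names).
n = 400
X = rng.normal(size=(n, 3))
true_w = np.array([1.5, -1.0, 0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

# Logistic regression fit by gradient descent; Platt scaling is exactly this
# sigmoid fit mapping raw signals to a calibrated correctness probability.
w, b = np.zeros(3), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / n)
    b -= 0.5 * float(np.mean(p - y))

def confidence(signals: np.ndarray) -> float:
    """Calibrated probability that the judge's verdict is correct."""
    return float(1.0 / (1.0 + np.exp(-(signals @ w + b))))
```

Because the calibrator is a three-parameter-plus-bias sigmoid, it can be fit on a small held-out set and applied in a single call per verification trace.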
-
SELFDOUBT: Uncertainty Quantification for Reasoning LLMs via the Hedge-to-Verify Ratio
SELFDOUBT introduces the Hedge-to-Verify Ratio, a single-pass uncertainty signal computed from reasoning traces: traces with no hedging are correct 96% of the time, and the signal outperforms semantic entropy at 10x lower cost.
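A ratio of this kind can be sketched generically as a count of hedging phrases divided by a count of verification phrases in a single trace; the marker lists below are illustrative assumptions, not SELFDOUBT's actual lexicon.

```python
import re

# Hypothetical hedge/verify marker lists (assumed for illustration only).
HEDGE_MARKERS = ["maybe", "perhaps", "i think", "not sure", "might be"]
VERIFY_MARKERS = ["let me check", "verify", "double-check", "confirms"]

def count_markers(text: str, markers: list[str]) -> int:
    """Count case-insensitive occurrences of any marker phrase in text."""
    text = text.lower()
    return sum(len(re.findall(re.escape(m), text)) for m in markers)

def hedge_to_verify_ratio(trace: str) -> float:
    """Single-pass uncertainty signal: hedges per verification step."""
    hedges = count_markers(trace, HEDGE_MARKERS)
    verifies = count_markers(trace, VERIFY_MARKERS)
    return hedges / max(verifies, 1)  # avoid division by zero

trace = ("Maybe the answer is 12. Let me check: 3 * 4 = 12, "
         "which confirms the result. I think that's right.")
ratio = hedge_to_verify_ratio(trace)  # 2 hedges, 2 verifications -> 1.0
```

The signal needs only the text of one generated trace, which is where the claimed cost advantage over sampling-based methods like semantic entropy comes from.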