CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.
Inference-aware fine-tuning for best-of-n sampling in large language models
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 8representative citing papers
Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.
Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.
RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
citing papers explorer
-
CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning
CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.