Inference-aware fine-tuning for best-of-n sampling in large language models
4 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (4) · Years: 2026 (4) · Verdicts: UNVERDICTED (4) · 4 representative citing papers
Citing papers explorer
-
Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.
-
What should post-training optimize? A test-time scaling law perspective
Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
-
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.
-
FASTER: Value-Guided Sampling for Fast RL
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
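The cited work and several of its citing papers build on best-of-n sampling: draw n independent completions from the model, score each with a reward model or verifier, and keep the highest-scoring one. A minimal sketch of that selection loop, with a hypothetical `sample_candidates` sampler and a toy deterministic `reward` standing in for an actual LLM and reward model:

```python
def sample_candidates(prompt, n):
    # Hypothetical stand-in for an LLM sampler: returns n candidate completions.
    return [f"{prompt}-candidate-{i}" for i in range(n)]

def reward(candidate):
    # Toy deterministic score standing in for a learned reward model / verifier.
    return (sum(ord(c) for c in candidate) % 101) / 100.0

def best_of_n(prompt, n=8):
    # Draw n candidates from the policy and keep the one with the highest reward.
    candidates = sample_candidates(prompt, n)
    return max(candidates, key=reward)
```

Inference-aware fine-tuning, as in the indexed paper, trains the base policy with this selection step in mind, so that the max over n samples (rather than a single sample) is what improves.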