Inference-aware fine-tuning for best-of-n sampling in large language models

Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, Aleksandra Faust · 2025 · arXiv 2412.15287

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · baseline: 1

representative citing papers

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

A preregistered, placebo-controlled decomposition shows that external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique do.

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.

Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

FADE is a self-adapting advantage for policy-gradient RL that reads training dynamics to balance positive/negative gradient mass and difficulty focus, yielding faster peak performance and better accuracy-diversity trade-offs than static baselines on LLM reasoning benchmarks.

REVES: REvision and VErification-Augmented Training for Test-Time Scaling

cs.LG · 2026-06-17 · unverdicted · novelty 6.0

REVES augments LLM post-training by decoupling revision and verification signals from successful multi-step trajectories, reporting a gain of 6.5 points on LiveCodeBench over RL baselines.

What should post-training optimize? A test-time scaling law perspective

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.

Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.

FASTER: Value-Guided Sampling for Fast RL

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

cs.LG · 2026-01-29 · unverdicted · novelty 6.0 · 2 refs

ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

cs.CV · 2026-06-06 · unverdicted · novelty 5.0

A survey of test-time scaling for multimodal foundation models that introduces a three-way taxonomy of sampling, feedback, and search approaches along with applications and benchmarks.

Polychromic Objectives for Reinforcement Learning

cs.LG · 2025-09-29 · unverdicted · novelty 5.0

Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
