Inference-aware fine-tuning for best-of-n sampling in large language models

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 · baseline 1

citation-polarity summary

years

2026: 10 · 2025: 2

verdicts

UNVERDICTED 12

representative citing papers

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

FADE is a self-adapting advantage for policy-gradient RL that reads training dynamics to balance positive/negative gradient mass and difficulty focus, yielding faster peak performance and better accuracy-diversity trade-offs than static baselines on LLM reasoning benchmarks.

FASTER: Value-Guided Sampling for Fast RL

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

Polychromic Objectives for Reinforcement Learning

cs.LG · 2025-09-29 · unverdicted · novelty 5.0

Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
