Optimizing language models for inference time objectives using reinforcement learning
2 Pith papers cite this work.
Field: cs.LG. Year: 2026.
Citing papers
- What should post-training optimize? A test-time scaling law perspective
  Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
- Compute Aligned Training: Optimizing for Test Time Inference
  Compute Aligned Training derives new loss functions by modeling test-time strategies as operators on the base policy, yielding empirical gains in test-time compute scaling over standard SFT and RL.
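Both citing papers build on best-of-N reward statistics. As background, a minimal sketch of the standard order-statistics estimator of the expected best-of-n reward when k ≥ n rollouts of a prompt are available (the tail extrapolation in the first paper targets the harder k < N regime and is not reproduced here; the function name and reward values are illustrative assumptions):

```python
from math import comb

def best_of_n_estimate(rewards, n):
    """Unbiased estimate of E[max reward over n i.i.d. rollouts],
    computed from k >= n observed rollout rewards.

    Order-statistics identity: the (i+1)-th smallest of the k rewards
    is the maximum of a uniformly random n-subset with probability
    C(i, n-1) / C(k, n), so weighting sorted rewards by these
    probabilities gives an unbiased estimate.
    """
    k = len(rewards)
    if not 1 <= n <= k:
        raise ValueError("need 1 <= n <= number of rollouts")
    ranked = sorted(rewards)  # ascending order
    total = sum(comb(i, n - 1) * r for i, r in enumerate(ranked))
    return total / comb(k, n)

# Example: hypothetical rewards from 4 rollouts of a single prompt.
print(best_of_n_estimate([0.1, 0.4, 0.2, 0.9], 2))
```

With n = 1 this reduces to the sample mean, and with n = k it returns the sample maximum, matching the two extremes of test-time compute.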