Optimizing language models for inference time objectives using reinforcement learning. arXiv preprint arXiv:2503.19595
7 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG 7
verdicts: UNVERDICTED 7
roles: background 1
polarities: background 1
representative citing papers
ReMax achieves the first sublinear regret bound for Gaussian rewards at M=2 by characterizing the optimal sampling distribution via an expected-improvement balance condition and separating saturation from underestimation effects.
FADE is a self-adapting advantage for policy-gradient RL that reads training dynamics to balance positive/negative gradient mass and difficulty focus, yielding faster peak performance and better accuracy-diversity trade-offs than static baselines on LLM reasoning benchmarks.
REVES augments LLM post-training by decoupling revision and verification signals from successful multi-step trajectories, reporting +6.5 point gains on LiveCodeBench over RL baselines.
Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
Derives new loss functions for SFT and RL that optimize directly for test-time inference operators like aggregation or filtering, with empirical gains in scaling.
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
citing papers explorer
- DecompRL: Solving Harder Problems by Learning Modular Code Generation
  DecompRL is an RL method that learns modular code decomposition for LLMs, enabling exponential candidate generation via recombination to solve harder coding problems with lower GPU cost.
- Finite-Time Regret Analysis of Retry-Aware Bandits
  ReMax achieves the first sublinear regret bound for Gaussian rewards at M=2 by characterizing the optimal sampling distribution via an expected-improvement balance condition and separating saturation from underestimation effects.
- Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL
  FADE is a self-adapting advantage for policy-gradient RL that reads training dynamics to balance positive/negative gradient mass and difficulty focus, yielding faster peak performance and better accuracy-diversity trade-offs than static baselines on LLM reasoning benchmarks.
- REVES: REvision and VErification--Augmented Training for Test-Time Scaling
  REVES augments LLM post-training by decoupling revision and verification signals from successful multi-step trajectories, reporting +6.5 point gains on LiveCodeBench over RL baselines.
- What should post-training optimize? A test-time scaling law perspective
  Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
- Compute Aligned Training: Optimizing for Test Time Inference
  Derives new loss functions for SFT and RL that optimize directly for test-time inference operators like aggregation or filtering, with empirical gains in scaling.
- Polychromic Objectives for Reinforcement Learning
  Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.