
Optimizing language models for inference time objectives using reinforcement learning. arXiv preprint arXiv:2503.19595

7 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

fields

cs.LG 7

years

2026 6 · 2025 1

verdicts

UNVERDICTED 7



representative citing papers

Finite-Time Regret Analysis of Retry-Aware Bandits

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

ReMax achieves the first sublinear regret bound for Gaussian rewards at M=2 by characterizing the optimal sampling distribution via an expected-improvement balance condition and separating saturation from underestimation effects.

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

FADE is a self-adapting advantage for policy-gradient RL that reads training dynamics to balance positive/negative gradient mass and difficulty focus, yielding faster peak performance and better accuracy-diversity trade-offs than static baselines on LLM reasoning benchmarks.

Polychromic Objectives for Reinforcement Learning

cs.LG · 2025-09-29 · unverdicted · novelty 5.0

Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
