Recognition: 2 theorem links · Lean Theorem
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
Pith reviewed 2026-05-12 03:40 UTC · model grok-4.3
The pith
Forward-KL regularization achieves the first tilde-O(epsilon^{-1}) upper bounds for offline contextual bandits under single-policy concentrability, in both tabular and general function approximation settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We give the first tilde-O(epsilon^{-1}) upper bounds in tabular and general function approximation settings, both under notions of single-policy concentrability. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem. Moreover, we provide rate-optimal lower bounds, manifesting the tightness of our upper bounds in terms of statistical rates. Our lower bounds also demonstrate that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, similarly to the reverse-KL case.
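For concreteness, a minimal rendering of the objective this claim refers to, assuming the common convention that forward KL penalizes KL(pi_ref || pi) with regularization strength eta; the symbols $\eta$, $\pi_{\mathrm{ref}}$, $\hat{r}$, and $J_\eta$ are notation introduced for this sketch, not quoted from the paper:
\[
\hat{\pi} \;\in\; \arg\max_{\pi}\; \mathbb{E}_{x}\!\left[\,\sum_{a}\pi(a\mid x)\,\hat{r}(x,a)\;-\;\eta^{-1}\,\mathrm{KL}\!\big(\pi_{\mathrm{ref}}(\cdot\mid x)\,\big\|\,\pi(\cdot\mid x)\big)\right],
\qquad
\mathrm{SubOpt}_{\mathrm{FKL}}(\hat{\pi}) \;=\; J_{\eta}(\pi^{*}_{\eta})-J_{\eta}(\hat{\pi}),
\]
where $J_{\eta}$ denotes the same objective evaluated with the true reward $r$, $\pi^{*}_{\eta}$ is its maximizer, and $\hat{r}$ is the (pessimistic) reward estimate fit from the offline data. The claimed fast rate is $\mathrm{SubOpt}_{\mathrm{FKL}}(\hat{\pi}) \le \epsilon$ after $n = \tilde{O}(\epsilon^{-1})$ samples.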
What carries the argument
Convex-analytical pipeline that exploits the pessimism principle to unify analysis of forward-KL-regularized objectives across tabular and function-approximation settings under single-policy concentrability.
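The pipeline itself is not reproduced here, but the pessimism step it leans on is easy to sketch. The toy code below is an illustration rather than the paper's algorithm: it plugs a Hoeffding-style lower confidence bound on empirical rewards into a single-context forward-KL-regularized objective, whose maximizer has a water-filling form found by bisection. The KL(pi_ref || pi) convention, the confidence bonus, and all names in the code are assumptions made for this sketch.

import numpy as np

def forward_kl_policy(r, pi_ref, eta, tol=1e-10):
    """Maximize <pi, r> - (1/eta) * KL(pi_ref || pi) over the simplex.

    First-order conditions give pi(a) = pi_ref(a) / (eta * (lam - r(a)))
    for a normalizing multiplier lam > max(r), found here by bisection.
    Assumes pi_ref(a) > 0 for every action a.
    """
    lo, hi = np.max(r), np.max(r) + 1.0 / eta   # the root lies in (lo, hi]
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        mass = np.sum(pi_ref / (eta * (lam - r)))
        if mass > 1.0:
            lo = lam                            # lam too small: total mass exceeds 1
        else:
            hi = lam
        if hi - lo < tol:
            break
    pi = pi_ref / (eta * (0.5 * (lo + hi) - r))
    return pi / pi.sum()                        # renormalize residual bisection error

def pessimistic_forward_kl(counts, reward_sums, pi_ref, eta, delta=0.05):
    """Pessimism step: feed a lower confidence bound on the rewards to the
    forward-KL solver, so that poorly covered actions are not over-trusted."""
    n = np.maximum(counts, 1)
    bonus = np.sqrt(np.log(2 * len(counts) / delta) / (2 * n))
    return forward_kl_policy(reward_sums / n - bonus, pi_ref, eta)

# Action 2 looks best on its tiny sample, but pessimism plus the
# pi_ref-anchored forward-KL term keeps the learned policy from chasing it.
counts = np.array([500, 300, 3])
reward_sums = np.array([260.0, 180.0, 2.7])
pi_ref = np.array([0.4, 0.4, 0.2])
print(pessimistic_forward_kl(counts, reward_sums, pi_ref, eta=1.0))

The resulting policy reweights pi_ref by the inverse gap to a normalizing multiplier, which loosely echoes the eta^{-1} sum of pi_ref(a) ϕ(y_a) expression quoted in the Lean link further down the page.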
If this is right
- The same convex pipeline applies uniformly to both tabular and general function approximation, removing the need for separate proof techniques.
- Rate-optimal lower bounds establish that the epsilon^{-1} scaling cannot be improved without stronger assumptions.
- In the low-regularization regime the forward-KL objective recovers the standard slow rates of unregularized offline contextual bandits.
- The bypass of mean-value-theorem arguments may simplify statistical analysis for other regularized decision-making problems.
Where Pith is reading between the lines
- The unification suggests the pessimism-based convex view could apply to forward-KL objectives in full Markov decision processes beyond bandits.
- If single-policy concentrability is approximately satisfied in practice, forward KL may become preferable to reverse KL for achieving faster convergence with limited offline data.
- The lower-bound construction that recovers the slow rate at low regularization offers a template for testing regularization strength in other offline RL algorithms.
Load-bearing premise
Single-policy concentrability holds between the behavior policy that collected the data and the target policy being learned.
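One common way to state this premise for contextual bandits (the paper's exact definition may differ, for instance in how contexts are weighted) is a bounded density ratio for the single comparator policy only:
\[
C^{\pi^{*}} \;:=\; \sup_{x,\;a\,:\,\mu(a\mid x)>0}\ \frac{\pi^{*}(a\mid x)}{\mu(a\mid x)} \;<\; \infty,
\]
where $\mu$ is the behavior policy that generated the offline data and $\pi^{*}$ is the target (here, regularized-optimal) policy. This is far weaker than all-policy concentrability, which requires the same ratio to be bounded uniformly over every candidate policy.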
What would settle it
An offline dataset satisfying single-policy concentrability on which the number of samples needed to reach epsilon-suboptimality with forward-KL regularization scales as epsilon^{-2} rather than epsilon^{-1} would refute the claimed fast rate.
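A hedged sketch of how such a test could be scored: fit the exponent p in n(epsilon) proportional to epsilon^{-p} from measured sample sizes. The (epsilon, n) pairs below are hypothetical placeholders, not results from the paper or from any experiment.

import numpy as np

def rate_exponent(epsilons, sample_sizes):
    """Fit log n = c + p * log(1/eps); p near 1 is consistent with the
    claimed fast rate, p near 2 with the unregularized slow rate."""
    x = np.log(1.0 / np.asarray(epsilons, dtype=float))
    y = np.log(np.asarray(sample_sizes, dtype=float))
    p, _ = np.polyfit(x, y, 1)                  # slope of the log-log fit
    return p

# Hypothetical (eps, n) pairs: samples needed to reach each target
# suboptimality in a simulated forward-KL-regularized bandit.
print(rate_exponent([0.1, 0.05, 0.02, 0.01], [1_000, 2_100, 5_300, 10_600]))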
read the original abstract
\emph{Kullback-Leibler} (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of \emph{reverse} or \emph{forward} KL. Recent studies have demonstrated $\epsilon^{-1}$-type fast rates for decision making under reverse KL regularization, in contrast to the standard $\epsilon^{-2}$-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in $\tilde{O}(\epsilon^{-2})$ slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline CBs. We give the first $\tilde{O}(\epsilon^{-1})$ upper bounds in tabular and general function approximation settings, both under notions of \emph{single-policy concentrability}. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem, which might be of independent interest. Moreover, we provide rate-optimal lower bounds, manifesting the tightness of our upper bounds in terms of statistical rates. Our lower bounds also demonstrate that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, similarly to the reverse-KL regularization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a convex-analytical pipeline for analyzing forward-KL-regularized offline contextual bandits. It establishes the first tilde-O(epsilon^{-1}) upper bounds under single-policy concentrability, both in the tabular setting and under general function approximation, while bypassing mean-value-theorem arguments used in prior work. Matching rate-optimal lower bounds are also provided, which recover the unregularized slow rate in the low-regularization regime.
Significance. If the central claims hold, the work closes a gap between forward-KL and reverse-KL regularization by delivering fast rates where only slow rates were previously known. The unification of tabular and general-function-approximation analyses via a pessimism-based convex pipeline, together with the explicit rate-optimality lower bounds, constitutes a substantive technical contribution that may be of independent interest beyond the specific setting.
minor comments (3)
- The abstract states that the pipeline 'completely bypasses the proof routines in previous works based on the mean value theorem'; the introduction or Section 3 should explicitly identify which prior proofs are bypassed and why the new argument avoids the MVT step.
- Notation for the single-policy concentrability coefficient (e.g., C_pi or similar) should be introduced once in the preliminaries and used consistently in all subsequent statements of the upper and lower bounds.
- The lower-bound construction that recovers the slow rate at low regularization is only sketched in the abstract; a short paragraph in the main text clarifying the regime in which the lower bound transitions from fast to slow would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive summary and significance assessment of our manuscript, as well as the recommendation for minor revision. We are pleased that the contributions—particularly the first tilde-O(epsilon^{-1}) upper bounds under single-policy concentrability for forward-KL regularization, the unified convex-analytical pipeline, and the matching lower bounds—are recognized as closing an important gap relative to reverse-KL results.
Circularity Check
No significant circularity
full rationale
The derivation relies on a convex-analytical pipeline that applies the pessimism principle directly to forward-KL-regularized objectives under the single-policy concentrability assumption. Upper bounds are obtained via this pipeline in both tabular and function-approximation settings, while lower bounds are constructed separately to match the rates and recover the slow regime at low regularization. No equation or claim reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation; the analysis bypasses prior mean-value-theorem arguments and remains self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization strength
axioms (1)
- domain assumption: single-policy concentrability
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · matched text: "our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · matched text: "SubOpt_FKL(π̂) = B_{h*}(r, r̂) = B_h(π̂, π*) … = η^{-1} Σ_a π_ref(a) ϕ(y_a)"