pith · machine review for the scientific record

arxiv: 2605.00155 · v1 · submitted 2026-04-30 · 💻 cs.LG · cs.CL · math.OC · stat.ML

Recognition: unknown

Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:00 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · math.OC · stat.ML
keywords RLHF · distributionally robust optimization · Wasserstein distance · regret optimization · water-filling structure · reward misspecification · policy gradient · over-optimization

The pith

DRRO for RLHF solves the inner worst-case regret exactly under an ℓ1 ambiguity set, yielding water-filling optimal policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning from human feedback relies on a learned reward model that is only a proxy for true human preferences, which can lead to over-optimization where proxy scores rise while actual quality falls. The paper introduces Wasserstein distributionally robust regret optimization, which guards against this misspecification by minimizing the worst-case regret relative to the best policy under plausible reward perturbations, rather than maximizing worst-case value as in standard DRO. It models the problem as a promptwise simplex allocation and shows that an ℓ1 ambiguity set around the estimated reward allows an exact closed-form solution for the inner worst-case regret. The solution takes a water-filling form that spreads policy probability mass across actions according to their uncertainty-adjusted advantages. This structure produces a practical policy-gradient method that adds only a sampled bonus term to standard PPO-style RLHF training.
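
To make the "sampled bonus on top of PPO" reading concrete, here is a minimal sketch of where such a term would enter a standard clipped policy-gradient update. Only the wiring reflects the description above; the function argument `bonus` is a hypothetical placeholder for the paper's sampled DRRO term, whose exact form is not reproduced here.

```python
import torch

def ppo_loss_with_bonus(logp_new, logp_old, advantage, bonus, clip_eps=0.2):
    """Standard clipped PPO surrogate where a per-sample robustness bonus is
    folded into the advantage. `bonus` stands in for the paper's sampled DRRO
    term; this sketch only shows where such a term would be added."""
    adjusted = advantage + bonus                           # regret-aware advantage
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio
    unclipped = ratio * adjusted
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adjusted
    return -torch.min(unclipped, clipped).mean()           # negate to maximize
```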

Core claim

Under the promptwise simplex allocation model with an ℓ1 ambiguity set, the inner worst-case regret in Wasserstein DRRO admits an exact solution, and the optimal policy has a water-filling structure. These results yield a policy-gradient algorithm with a simple sampled-bonus interpretation that requires only minor changes to existing RLHF pipelines.
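
Read literally, the claim concerns a min-max problem of roughly the following shape, written here as a hedged reconstruction rather than the paper's exact notation: Δ is the response simplex for one prompt, r̂ the estimated reward vector, and δ the ambiguity radius.

```latex
\min_{\pi \in \Delta} \; \max_{\|r - \hat{r}\|_1 \le \delta}
\Big( \underbrace{\max_i r_i}_{\text{best achievable under } r} - \langle r, \pi \rangle \Big)
```

The statement is then that, for fixed π, the inner maximization admits an exact solution, and the outer minimizer has the water-filling structure.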

What carries the argument

The promptwise simplex allocation model under an ℓ1 Wasserstein ambiguity set, which reduces the inner worst-case regret computation to an exact water-filling allocation over policy probabilities.
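
The water-filling reduction can be illustrated with the only formula the excerpt exposes, the ψ′ expression quoted in the Figure 2 caption below. This is a minimal sketch that assumes, since the excerpt does not say so explicitly, that the threshold t⋆ sits where the piecewise-constant ψ′ changes sign; the exact hedge magnitudes assigned to active responses come from the paper's Theorem 3.2 and are not reproduced here.

```python
import numpy as np

def psi_prime(t, rewards, delta):
    """ψ'(t) = -1 + (1/δ) Σ_{i: r_i > t} (r_1 - r_i), quoted from the Figure 2
    caption; valid at points t distinct from all reward values."""
    r1 = rewards.max()
    above = rewards[rewards > t]
    return -1.0 + (r1 - above).sum() / delta

def find_threshold(rewards, delta):
    """Scan one candidate t per linear piece of ψ, assuming (not stated in the
    excerpt) that t* is the point where ψ' changes sign."""
    rs = np.sort(np.unique(rewards))[::-1]        # reward values, descending
    midpoints = (rs[:-1] + rs[1:]) / 2.0          # one t inside each piece
    for t in midpoints:                           # largest t first
        if psi_prime(t, rewards, delta) >= 0:
            return t
    return rs[-1]                                 # very large budget: threshold at the worst reward

rewards = np.array([1.0, 0.7, 0.4, 0.1])
for delta in (0.2, 0.5):
    t_star = find_threshold(rewards, delta)
    print(delta, t_star, np.flatnonzero(rewards > t_star))  # active set grows with δ
```

On this toy instance the active support grows from two to three responses as δ increases, consistent with the caption's description that a larger ambiguity budget lowers the threshold.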

If this is right

  • The optimal policy under DRRO can be obtained by a simple water-filling adjustment to the estimated advantage values.
  • The resulting algorithm integrates into PPO or GRPO training loops by adding a sampled bonus term derived from the regret formulation.
  • DRRO produces less pessimistic policies than standard distributionally robust value optimization under the same ambiguity sets (see the sketch after this list).
  • Empirical results indicate that DRRO mitigates reward over-optimization more effectively than uncertainty penalties or conservative constraints.
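
To make the value-versus-regret contrast in the third point concrete, the sketch below writes down the two inner objectives on a single promptwise instance. The sampling-based adversary is a crude stand-in for the paper's exact inner solution, and `sample_l1_ball`, `worst_case_value`, and `worst_case_regret` are hypothetical helper names, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_l1_ball(r_hat, delta, n=2000):
    """Random reward vectors within the ℓ1 ball of radius δ around r_hat;
    a crude stand-in for the exact inner adversary."""
    d = rng.laplace(size=(n, r_hat.size))
    d = delta * rng.uniform(size=(n, 1)) * d / np.abs(d).sum(axis=1, keepdims=True)
    return r_hat + d

def worst_case_value(pi, r_hat, delta):
    # Standard DRO pessimizes this: the smallest value the policy can attain
    # over plausible rewards.
    r = sample_l1_ball(r_hat, delta)
    return (r @ pi).min()

def worst_case_regret(pi, r_hat, delta):
    # DRRO pessimizes this instead: the largest gap to the best response
    # under the same plausible reward perturbation.
    r = sample_l1_ball(r_hat, delta)
    return (r.max(axis=1) - r @ pi).max()
```

Comparing the policy that maximizes worst_case_value with the one that minimizes worst_case_regret on such a toy instance is one way to see, numerically, in what sense the regret objective is claimed to hedge less aggressively.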

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The water-filling structure may appear in trained language models when similar regret-based robustness is applied at scale.
  • The regret perspective could extend to other settings with objective misspecification, such as recommendation systems or robotic control.
  • The framework suggests that bounding regret rather than value may provide tighter guarantees on deployed performance gaps.

Load-bearing premise

The promptwise simplex allocation model sufficiently captures the essential dynamics of full-sequence RLHF optimization under reward misspecification.

What would settle it

A small-scale promptwise instance where the computed worst-case regret solution deviates from the predicted water-filling allocation under ℓ1 perturbations, or controlled RLHF experiments where the DRRO algorithm shows no reduction in over-optimization relative to standard baselines.
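
The first half of this test can be approximated without the paper's machinery: enumerate a grid on the simplex, score each policy with a sampled-perturbation estimate of its worst-case regret, and compare the support of the winner against the active set predicted by the water-filling threshold from the earlier sketch. A minimal, self-contained sketch under those assumptions:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def approx_worst_case_regret(pi, r_hat, delta, n=1500):
    """Approximate inner DRRO problem: sample ℓ1-bounded reward perturbations
    and take the largest gap between the best response and the policy's value."""
    d = rng.laplace(size=(n, r_hat.size))
    d = delta * rng.uniform(size=(n, 1)) * d / np.abs(d).sum(axis=1, keepdims=True)
    r = r_hat + d
    return (r.max(axis=1) - r @ pi).max()

def simplex_grid(dim, steps=10):
    """All probability vectors on a regular grid over the (dim-1)-simplex."""
    for cuts in itertools.combinations_with_replacement(range(steps + 1), dim - 1):
        yield np.diff([0, *cuts, steps]) / steps

r_hat, delta = np.array([1.0, 0.7, 0.4, 0.1]), 0.5

# Brute-force DRRO policy on the grid, using the approximate adversary.
best_pi = min(simplex_grid(r_hat.size),
              key=lambda pi: approx_worst_case_regret(pi, r_hat, delta))

# Its support should match the water-filling active set (responses above t*)
# if the closed-form solution is right; a persistent mismatch is the kind of
# counterexample this section asks for.
print("brute-force support:", np.flatnonzero(best_pi > 1e-3))
```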

Figures

Figures reproduced from arXiv: 2605.00155 by Jose Blanchet, Shang Liu, Yikai Wang.

Figure 1. Reward over-optimization in RLHF.
Figure 2. A water-filling view of Theorem 3.2. The colored blocks show the hedge assigned to nonmaximal responses above the threshold t⋆; the remaining mass is sent to the best response y1. As the ambiguity budget grows, the threshold drops and additional responses enter the active support. The function ψ is piecewise linear; at points t distinct from all reward values, ψ′(t) = −1 + (1/δ) Σ_{i: r_i > t} (r_1 − r_i) (3.11).
Figure 3. Main benchmark across RLHF update rules and mitigation baselines.
Figure 4. Direct comparison between DRO and DRRO on a matched backbone.
Figure 5. Internal DRRO ablation comparing fixed versus dynamic ambiguity schedules.
Original abstract

Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$ ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to PPO/GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Wasserstein distributionally robust regret optimization (DRRO) for RLHF to mitigate reward over-optimization (Goodharting). Modeling the problem via a promptwise simplex allocation under an ℓ1 ambiguity set, it claims that the inner worst-case regret admits an exact closed-form solution with a water-filling structure for the optimal policy. This yields a practical policy-gradient algorithm with a sampled-bonus interpretation requiring only minor changes to PPO/GRPO-style training. The framework is argued to be less pessimistic than standard DRO, with experiments indicating superior mitigation of over-optimization compared to baselines.

Significance. If the central derivations hold, the work offers a theoretically grounded alternative to pessimistic DRO methods in RLHF, with an exact solution and interpretable water-filling policy that could improve robustness to reward misspecification in LLM alignment. The sampled-bonus policy-gradient update and theoretical clarification on reduced pessimism relative to DRO represent potential advances for practical robust RLHF algorithms.

major comments (2)
  1. [Abstract and theoretical development of the promptwise model] The central claim of an exact solution for the inner worst-case regret and water-filling optimal policy under the ℓ1 ambiguity set is derived only for the promptwise simplex allocation model (as stated in the abstract). No full derivation, proof, or error analysis is provided in the manuscript, preventing verification that the closed-form regret solution is parameter-free or that it does not reduce to a fitted quantity by construction.
  2. [Promptwise simplex allocation model and § on the inner optimization] The promptwise simplex allocation model treats each prompt's allocation independently and memorylessly. In full-sequence RLHF, reward perturbations can exhibit correlations across tokens within a trajectory or across multiple responses to the same prompt; if such dependencies are material, the exact regret solution and the sampled-bonus policy-gradient update lose their guarantee of mitigating Goodharting more effectively than standard DRO.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our work. We address the major comments point by point below, providing clarifications and indicating planned revisions where appropriate.

point-by-point responses
  1. Referee: [Abstract and theoretical development of the promptwise model] The central claim of an exact solution for the inner worst-case regret and water-filling optimal policy under the ℓ1 ambiguity set is derived only for the promptwise simplex allocation model (as stated in the abstract). No full derivation, proof, or error analysis is provided in the manuscript, preventing verification that the closed-form regret solution is parameter-free or that it does not reduce to a fitted quantity by construction.

    Authors: The derivation of the exact closed-form solution for the worst-case regret under the ℓ1 ambiguity set, along with the water-filling structure of the optimal policy, is provided in Section 3.2 of the manuscript for the promptwise simplex allocation model. We solve the inner min-max problem by considering the dual formulation of the Wasserstein distance and obtain a threshold-based allocation rule that depends solely on the ambiguity radius and the nominal reward vector, without introducing additional fitted parameters. This ensures it does not reduce to a fitted quantity by construction. To address the verification concern, we will include a more detailed proof with intermediate steps and an error analysis in the supplementary material of the revised version. revision: partial

  2. Referee: [Promptwise simplex allocation model and § on the inner optimization] The promptwise simplex allocation model treats each prompt's allocation independently and memorylessly. In full-sequence RLHF, reward perturbations can exhibit correlations across tokens within a trajectory or across multiple responses to the same prompt; if such dependencies are material, the exact regret solution and the sampled-bonus policy-gradient update lose their guarantee of mitigating Goodharting more effectively than standard DRO.

    Authors: We acknowledge that the promptwise simplex allocation model assumes independence across prompts and memorylessness within sequences to derive the closed-form solution. In scenarios where reward perturbations exhibit significant correlations across tokens or responses, the strict theoretical guarantees for mitigating Goodharting may not hold. However, the DRRO framework remains less pessimistic than standard DRO, as established in our theoretical analysis, and the sampled-bonus algorithm offers a practical approach that shows improved empirical performance. We will add a section discussing these modeling assumptions and potential limitations in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity; the derivation is a self-contained optimization result.

full rationale

The paper defines the promptwise simplex allocation model and the ℓ1 ambiguity set explicitly, then derives the exact inner worst-case regret solution and water-filling policy structure by solving the resulting min-max optimization problem. This is a direct consequence of the ℓ1-ball geometry on the probability simplex and does not reduce to any fitted parameter, self-citation chain, or renamed input. The subsequent policy-gradient algorithm with sampled-bonus interpretation follows as an implementation of the closed-form result rather than an assumption smuggled in. No load-bearing self-citations or ansatz smuggling are required for the central claims; the derivation remains independent of the target RLHF application once the model is stated.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on modeling RLHF as a promptwise simplex allocation problem and on the choice of Wasserstein distance with the ℓ1 norm to define the ambiguity set around the estimated reward.

free parameters (1)
  • ambiguity radius
    Size of the Wasserstein ball around the learned reward; must be chosen or validated externally.
axioms (1)
  • domain assumption: the Wasserstein distance defines a valid ambiguity set for reward perturbations
    Invoked to model objective misspecification in RLHF.

pith-pipeline@v0.9.0 · 5578 in / 1233 out tokens · 44914 ms · 2026-05-09T21:00:45.746955+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 20 canonical work pages · 6 internal anchors

  1. Johannes Ackermann, Takashi Ishida, and Masashi Sugiyama. Off-policy corrected reward modeling for reinforcement learning from human feedback. arXiv preprint arXiv:2507.15507.

  2. Alekh Agarwal and Tong Zhang. Minimax regret optimization for robust machine learning under distribution shift. arXiv preprint arXiv:2202.05436.

  3. Eilyan Bitar. Distributionally robust regret minimization. arXiv preprint arXiv:2412.15406.

  4. Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Zhilin Wang, Oleksii Kuchaiev, Olivier Delalleau, and Tuo Zhao. Adversarial training of reward models. arXiv preprint arXiv:2504.06141.

  5. Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, and Bruno Castro da Silva. RLHF deciphered: A critical analysis of reinforcement learning from human feedback for LLMs. arXiv preprint arXiv:2404.08555.

  6. Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, and Xiaoyu Shen. The accuracy paradox in RLHF: When better reward models don't yield better language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2980–2989.

  7. Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, and Gang Pan. Mitigating reward over-optimization in RLHF via behavior-supported regularization. arXiv preprint arXiv:2503.18130.

  8. Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimization for sample efficient RLHF. arXiv preprint arXiv:2402.10500.

  9. Lukas-Benedikt Fiechtner and Jose Blanchet. Wasserstein distributionally robust regret optimization. arXiv preprint arXiv:2504.10796.

  10. Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290.

  11. Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. RewardBench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797.

  12. Wendi Li, Changdae Oh, and Sharon Li. General exploratory bonus for optimistic exploration in RLHF. arXiv preprint arXiv:2510.03269.

  13. Pangpang Liu, Chengchun Shi, and Will Wei Sun. Reinforcement learning from human feedback: A statistical perspective. arXiv preprint arXiv:2604.02507.

  14. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

  15. Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659.

  16. Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, and Olivier Pietquin. Countering reward over-optimization in LLM with demonstration-guided reinforcement learning. In Findings of the Association for Computational Linguistics: ACL 2024.

  17. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Yang, Junxiao Xu, Maosong Li, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  18. Lorenz Wolf, Robert Kirk, and Mirco Musolesi. Reward model overoptimisation in iterated RLHF. arXiv preprint arXiv:2505.18126.

  19. Yinglun Xu, Hangoo Kang, Tarun Suresh, Yuxuan Wan, and Gagandeep Singh. Learning a pessimistic reward model in RLHF. arXiv preprint arXiv:2505.20556.

  20. Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, and Yuan Shen. Reward-robust RLHF in LLMs. arXiv preprint arXiv:2409.15360.

  21. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu... DAPO: An open-source LLM reinforcement learning system at scale.

  22. Yuanzhao Zhai, Han Zhang, Yu Lei, Yue Yu, Kele Xu, Dawei Feng, Bo Ding, and Huaimin Wang. Uncertainty-penalized reinforcement learning from human feedback with diverse reward LoRA ensembles. arXiv preprint arXiv:2401.00243.

  23. Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.