Pith · machine review for the scientific record

arxiv: 2605.09946 · v2 · submitted 2026-05-11 · 💻 cs.GT

Recognition: 2 theorem links


Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:59 UTC · model grok-4.3

classification 💻 cs.GT
keywords risk-sensitive games · preference learning · NLHF · robust RLHF · convex risk measures · self-play · Stackelberg equilibrium

The pith

Risk-sensitive preference games produce policies robust across data strata while matching risk-neutral performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes preference-based fine-tuning of language models as a game where agents optimize convex risk measures of their preference losses rather than expected payoffs. This targets the problem that standard methods can hide failures in specific prompts, annotators, or safety strata by focusing only on average win rates. Translation invariance of the chosen risk metrics preserves monotonicity in the game, allowing sample-efficient self-play to converge quickly. A hierarchical formulation plus a two-timescale extragradient algorithm with bias correction handles statistical bias and reaches the Stackelberg equilibrium even in low-data regimes. The resulting policies remain stable across risk choices and deliver robustness without lowering average performance.

Core claim

Players in a preference game optimize convex risk measures of pairwise losses; translation invariance keeps the game monotone, so self-play converges rapidly and produces policies whose performance holds across data strata.

What carries the argument

Risk-sensitive preference game in which each player minimizes a convex risk measure applied to its preference loss, with translation invariance of the risk measure preserving monotonicity for self-play convergence.
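The tail-versus-mean distinction behind this machinery can be made concrete with a minimal sketch (not the paper's code; the loss distributions and the CVaR level here are invented for illustration): an empirical CVaR separates two per-prompt loss profiles that the mean conflates.

```python
import numpy as np

def cvar(losses, alpha=0.25):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of losses."""
    losses = np.sort(np.asarray(losses, dtype=float))
    k = max(1, int(np.ceil(alpha * len(losses))))
    return losses[-k:].mean()  # average over the upper tail

rng = np.random.default_rng(0)
# Hypothetical per-prompt preference losses for two policies with near-equal
# means but different tails (mimicking the paper's motivating scenario).
light_tail = rng.normal(0.5, 0.05, size=1000)
heavy_tail = np.concatenate([rng.normal(0.45, 0.05, 900),
                             rng.normal(0.95, 0.05, 100)])

print(round(light_tail.mean(), 3), round(heavy_tail.mean(), 3))  # similar means
print(round(cvar(light_tail), 3), round(cvar(heavy_tail), 3))    # CVaR separates them
```

With `alpha=1.0` the same function reduces to the ordinary mean, which is one way to read the paper's framing: risk-neutral NLHF is the `alpha=1` corner of this family.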

If this is right

  • Risk-adjusted policies remain stable when the risk parameter is varied.
  • Performance consistency holds across distinct data strata including prompts and safety-critical subsets.
  • Sample-complexity bounds scale explicitly with the risk level while controlling both structural and statistical bias.
  • The bias-corrected two-timescale extragradient method reaches the Stackelberg equilibrium even when data are scarce.
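The bias-corrected two-timescale idea in the last bullet can be sketched in a scalar toy problem (an illustrative sketch, not the paper's algorithm; the auxiliary bias samples are an assumption of the toy): a fast tracker estimates the oracle's bias while a slow iterate descends the corrected gradient.

```python
import numpy as np

# Toy two-timescale bias correction. The gradient oracle for
# f(theta) = 0.5 * theta^2 carries an unknown constant bias c, so plain
# stochastic gradient descent converges to -c instead of the minimizer 0.
# A tracker xi estimates the bias on a fast timescale from hypothetical
# auxiliary noisy bias samples; the slow iterate uses g - xi.
rng = np.random.default_rng(4)
c = 0.8                        # unknown oracle bias
a_fast, a_slow = 0.1, 0.01     # fast and slow step sizes
theta, theta_plain, xi = 3.0, 3.0, 0.0

for _ in range(20000):
    g = theta + c + 0.1 * rng.standard_normal()       # biased gradient at theta
    bias_sample = c + 0.1 * rng.standard_normal()     # assumed auxiliary estimate
    xi += a_fast * (bias_sample - xi)                 # fast: track the bias
    theta -= a_slow * (g - xi)                        # slow: corrected descent
    theta_plain -= a_slow * (theta_plain + c + 0.1 * rng.standard_normal())

print(round(xi, 2), round(theta, 2), round(theta_plain, 2))
# xi tracks c, the corrected iterate lands near 0, the uncorrected one near -c
```

The separation of step sizes is the point: the tracker equilibrates quickly relative to the leader's update, mirroring the role of ξ_t in the paper's hierarchical formulation.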

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same risk-sensitive structure could be used in other preference-driven settings such as recommendation or robotics control.
  • Explicit risk control may reduce the frequency of rare but high-cost failures that average-based methods overlook.
  • Testing the approach on larger models and real human feedback loops would reveal whether the low-sample advantages persist at scale.

Load-bearing premise

Many risk metrics are translation invariant, which is required to keep the game monotone so that self-play methods converge fast.
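This premise is easy to check numerically for two standard choices, CVaR and entropic risk, both of which satisfy ρ(X + c) = ρ(X) + c (a quick verification sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5000)
c = 2.7  # arbitrary constant shift

def cvar(losses, alpha=0.25):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of losses."""
    losses = np.sort(losses)
    k = max(1, int(np.ceil(alpha * len(losses))))
    return losses[-k:].mean()

def entropic(losses, beta=1.5):
    """Entropic risk: (1/beta) * log E[exp(beta * X)]."""
    return np.log(np.mean(np.exp(beta * losses))) / beta

for rho in (cvar, entropic):
    assert np.isclose(rho(x + c), rho(x) + c)  # rho(X + c) = rho(X) + c
```

For CVaR the identity is immediate (shifting every loss shifts the tail average by the same amount); for entropic risk the constant factors out of the log-expectation.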

What would settle it

A concrete test would be to run the risk-sensitive self-play algorithm on a simple preference dataset and check whether the learned policy converges to a Stackelberg equilibrium whose empirical risk matches the theoretical bound; failure of convergence or systematic underperformance relative to the risk-neutral baseline on new strata would falsify the claim.
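A stripped-down version of that test (purely illustrative: the LLM policies are replaced by a 2-dimensional bilinear game, which is monotone) shows extragradient self-play converging where plain gradient descent-ascent diverges:

```python
import numpy as np

# Bilinear saddle-point problem min_x max_y x^T A y: a monotone game whose
# unique equilibrium is the origin. Plain gradient descent-ascent spirals
# outward here, while the extragradient step converges.
A = np.array([[2.0, 1.0],
              [0.0, 2.0]])
eta = 0.2

def gda_step(x, y):
    return x - eta * (A @ y), y + eta * (A.T @ x)

def eg_step(x, y):
    # Extrapolate to a midpoint, then update with the midpoint's gradient.
    xm, ym = x - eta * (A @ y), y + eta * (A.T @ x)
    return x - eta * (A @ ym), y + eta * (A.T @ xm)

x_g, y_g = np.array([1.0, -1.0]), np.array([0.5, 1.5])
x_e, y_e = x_g.copy(), y_g.copy()
for _ in range(500):
    x_g, y_g = gda_step(x_g, y_g)
    x_e, y_e = eg_step(x_e, y_e)

print(np.linalg.norm(np.r_[x_e, y_e]))  # near zero: extragradient converged
print(np.linalg.norm(np.r_[x_g, y_g]))  # large: plain GDA diverged
```

This is exactly the property the paper leans on: once translation invariance keeps the risk-adjusted game monotone, extragradient-type self-play inherits this kind of convergence behavior.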

Figures

Figures reproduced from arXiv: 2605.09946 by Eric Mazumdar, Jake Gonzales, Lillian J. Ratliff, Max Horwitz.

Figure 1. RSPGs target tail behavior directly. (a) Mean-based methods cannot distinguish two policies with the same average win-rate but very different tails. (b) RSPG policies (teal) maintain tail performance without sacrificing mean win-rate, while prior NLHF methods (red) collapse on the tail.

Figure 2. The bias tracker ξ_t tracks the true bias b_m(θ_t) on a faster timescale than the leader's update θ_t (the IPO policy parameter).

Figure 3. Asymptotic squared-error floor vs. batch size m, in two regimes. The floor is measured as the mean of the last 500 Polyak-averaged iterates. Left: bias-dominated regime (m ∈ [15, 130]); the vanilla extragradient floor scales roughly ∝ 1/m², matching the predicted bias-floor scaling and descending more than two orders of magnitude across the range, while the TT-EG floor stays roughly flat near ~10⁻⁶, a ~100× reduction at m = 15.

Figure 4. Cross-play win-rate heatmap on the Random stratum; each cell reports the win-rate of the row policy.

Figure 5. Mean and CVaR_0.25 combined win-rate across four held-out strata (Random, Conflict, Sev-3, Sev-1), with mean ± σ across strata in the rightmost column. Risk-adjusted methods dominate on both metrics and remain stable across strata, while risk-neutral baselines degrade on CVaR and show higher variance across strata.

Figure 6. Trajectory at fixed m = 15: Polyak-averaged squared error ∥θ̄(t) − θ⋆∥² vs. iteration t. Oracle extragradient (black) decays at the unbiased O(1/T) rate of Theorem 11. Vanilla extragradient (blue) descends with the oracle for the first ~200 iterations, then plateaus at the bias floor ~10⁻⁴ predicted by Theorem 12. TT-extragradient (orange) breaks through this floor.

Figure 7. Asymptotic squared-error floor vs. batch size m, repeating the comparison of Figure 3: the vanilla extragradient floor scales roughly ∝ 1/m² in the bias-dominated regime (m ∈ [15, 130]), while the TT-EG floor stays roughly flat near ~10⁻⁶.

Figure 8. Per-prompt win-rate cumulative distribution functions (CDFs).

Figure 9. Per-prompt win-rate CDFs on the Conflict stratum (100 prompts where preference and safety labels disagree). For each (policy, opponent, prompt) triple, three per-prompt win-rates are computed: Preference WR (P[judge(y ≻ y′)] under the PairJudge), Safety WR (fraction of pairs where the policy response has lower Beaver cost), and Combined WR (the policy wins iff preferred and safer; otherwise inconclusive).

Figure 10. Per-prompt win-rate CDFs on the Sev-3 stratum (100 highest-severity unsafe prompts).

Figure 11. Per-prompt win-rate CDFs on the Sev-1 stratum (136 mildest-severity unsafe prompts).

Figure 12. Per-prompt win-rate CDFs on the Sev-3 stratum with EG-Ent (τ = 5) and Nash-MD added to the comparison. Absolute drops are not directly comparable across policies, since methods with higher means have more room to fall: the largest absolute drops on safety and combined win-rate are in fact incurred by the K = 8 risk-neutral methods (OMD, EG) and gDRO, which also achieve the highest means.

Figure 13. Tail drop in percentage points on the Random stratum, defined as the gap between mean win-rate…

Figure 14. CVaR_0.25/Mean ratio of the combined win-rate distribution per opponent on the Random stratum. Higher values indicate more consistent performance across prompts; the dashed line at 1.0 denotes perfect consistency.

Figure 15. Mean vs. risk-adjusted win-rate (top row) and Risk/Mean ratio (bottom row) on the Random stratum.

Figure 16. Tail drop on the Sev-3 stratum with EG-Ent (τ = 5) and Nash-MD added.

Figure 17. Mean vs. risk-adjusted win-rate (top row) and Risk/Mean ratio (bottom row).

Figure 18. Mean variance of win-rates across responses, averaged over prompts and opponents.

Figure 19. Mean variance of win-rates across responses.

Figure 20. Cross-play win-rates on the Conflict stratum (100 prompts where preference and safety labels disagree), across preference, safety, and combined metrics.

Figure 21. Cross-play win-rates on the Sev-3 stratum (100 highest-severity unsafe prompts), across preference, safety, and combined metrics.

Figure 22. Cross-play win-rates on the Sev-1 stratum (136 mildest-severity unsafe prompts), across preference, safety, and combined metrics.

Figure 23. Cross-play win-rates on the Sev-3 stratum with EG-Ent (τ = 5) and Nash-MD added.

Figure 24. Combined win-rate (left) and CVaR_0.25/Mean robustness ratio (right) broken down by harm category on the Random stratum.

Figure 25. Combined win-rate (left) and CVaR_0.25/Mean robustness ratio (right) broken down by harm category on the Sev-3 stratum.

Figure 26. Training loss and reward accuracy (fraction of training pairs on which the preferred response…).

Figure 27. KL drift to the SFT reference and gradient norm over training.
Original abstract

A growing line of work reframes preference-based fine-tuning of large language models game-theoretically: Nash Learning from Human Feedback (NLHF) recasts the problem as a zero-sum game over policies. However, optimization is over expected pairwise payoffs, thereby conflating policies with similar win rates but different tail behavior. As such, these methods are agnostic to where in the data distribution they succeed or fail: strong average performance can mask systematic failure across prompts, annotators, or safety-critical strata. We introduce risk-sensitive preference games, in which players optimize convex risk measures of their preference loss, exploiting structure in preference uncertainty. While risk-sensitivity generally breaks the zero-sum structure, we show that translation invariance of many risk metrics ensures that we retain monotonicity, yielding fast convergence of sample-efficient self-play methods. Furthermore, we establish algorithmic stability and offline sample complexity bounds that scale with risk, requiring simultaneous control of structural bias from nonlinear risk transformations, statistical bias in risk estimation, and concentration tailored to the risk-sensitive setting. To address statistical bias, we introduce a hierarchical game formulation and a two-timescale extragradient algorithm with bias correction that converges to the Stackelberg equilibrium and is especially effective in low-sample regimes. Empirically, risk-adjusted policies are robust across data strata, stable across risk choices, and match or exceed risk-neutral performance thereby achieving robustness without a performance tax.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces risk-sensitive preference games for robust LLM preference learning, reframing NLHF as a game where players optimize convex risk measures of pairwise preference losses instead of expected payoffs. It claims that translation invariance of risk metrics preserves monotonicity despite breaking zero-sum structure, enabling fast convergence via self-play methods like extragradient; it further derives algorithmic stability and offline sample-complexity bounds scaling with risk, introduces a hierarchical Stackelberg formulation with two-timescale bias-corrected extragradient, and reports empirical robustness across data strata without average-performance loss.

Significance. If the monotonicity preservation and sample-complexity results hold, the work offers a principled route to risk-aware RLHF that addresses tail failures across prompts or annotators while retaining or improving average win rates. The bias-correction mechanism for low-sample regimes and the empirical stability across risk levels are concrete strengths; the focus on structural bias from nonlinear risk transformations is a timely contribution to game-theoretic preference optimization.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (monotonicity claim): the central step asserts that translation invariance of convex risk measures preserves monotonicity when composed with the nonlinear pairwise preference operator (win-rate differences across strata), yet no explicit proof or counterexample check is provided; if monotonicity fails under this composition, the extragradient and two-timescale convergence arguments and all downstream sample-complexity bounds become unsupported.
  2. [§4] §4 (sample complexity): the stated bounds that 'scale with risk' are presented without explicit dependence on the risk parameter, concentration inequalities tailored to the risk-sensitive estimator, or full derivation of the structural-plus-statistical bias terms; this renders the scaling claim unverifiable from the given text.
  3. [Empirical section] Empirical section (Tables/Figures): robustness is reported across strata and risk levels, but no ablation isolates the effect of the monotonicity assumption or compares against regimes where the nonlinear composition might violate it, weakening the theory-practice link for the claimed 'no performance tax' result.
minor comments (2)
  1. [§2] Notation for the risk measure applied to the preference payoff matrix should be introduced with an explicit equation early in §2 to avoid ambiguity when the operator is later composed with the game value.
  2. [§4.2] The hierarchical game formulation in §4.2 would benefit from a diagram clarifying the leader-follower timing and the two-timescale updates.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments help clarify the presentation of our theoretical results. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (monotonicity claim): the central step asserts that translation invariance of convex risk measures preserves monotonicity when composed with the nonlinear pairwise preference operator (win-rate differences across strata), yet no explicit proof or counterexample check is provided; if monotonicity fails under this composition, the extragradient and two-timescale convergence arguments and all downstream sample-complexity bounds become unsupported.

    Authors: We agree that an explicit statement would improve clarity. In §3 we establish monotonicity via the fact that translation-invariant convex risk measures (ρ(X+c)=ρ(X)+c) preserve order when applied to the preference loss operator, because the operator itself is monotone in the underlying payoffs and the risk measure is monotone. The proof proceeds by showing that if E[π1] ≽ E[π2] then ρ(ℓ(π1)) ≽ ρ(ℓ(π2)) for the shifted losses. We will add a dedicated lemma with the full proof in the main text and include a short counterexample verification for CVaR and entropic risk in the appendix of the revision. revision: yes

  2. Referee: [§4] §4 (sample complexity): the stated bounds that 'scale with risk' are presented without explicit dependence on the risk parameter, concentration inequalities tailored to the risk-sensitive estimator, or full derivation of the structural-plus-statistical bias terms; this renders the scaling claim unverifiable from the given text.

    Authors: The dependence on the risk parameter is encoded in the Lipschitz constant L_ρ of the risk measure and in the sub-Gaussian parameter of the risk estimator (see the Bernstein-type concentration in the proof of Theorem 4.1). Structural bias is bounded by the deviation between ρ and expectation, while statistical bias scales as O(1/√n) with an extra factor from the risk level. We will expand the statement of the bound in §4 to display the explicit dependence on the risk parameter and move the full derivation (including the tailored concentration inequality) to the appendix for verifiability. revision: yes
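The statistical-bias point here can be illustrated with a toy Monte Carlo (hypothetical setup: the Gaussian losses, CVaR level, and replication counts are invented, and the observed shrinkage is not the paper's bound): the plug-in CVaR estimate's bias shrinks as the batch grows.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.25

def cvar(losses, alpha=0.25):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of losses."""
    losses = np.sort(losses)
    k = max(1, int(np.ceil(alpha * len(losses))))
    return losses[-k:].mean()

# Ground-truth CVaR_0.25 of a standard normal, via one very large sample.
truth = cvar(rng.standard_normal(2_000_000), alpha)

biases = {}
for n in (10, 100, 1000):
    est = np.mean([cvar(rng.standard_normal(n), alpha) for _ in range(4000)])
    biases[n] = est - truth
    print(n, round(biases[n], 3))  # bias shrinks as the batch grows
```

The small-batch estimate systematically understates the tail here, which is the kind of statistical bias the two-timescale correction is meant to track out.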

  3. Referee: [Empirical section] Empirical section (Tables/Figures): robustness is reported across strata and risk levels, but no ablation isolates the effect of the monotonicity assumption or compares against regimes where the nonlinear composition might violate it, weakening the theory-practice link for the claimed 'no performance tax' result.

    Authors: We acknowledge that an explicit ablation would tighten the theory-practice connection. In the revision we will add a controlled experiment that replaces the translation-invariant risk with a non-invariant surrogate (e.g., a shifted variance penalty) and reports the resulting degradation in both convergence speed and strata robustness, thereby isolating the role of the monotonicity-preserving property. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the derivation relies on standard convex analysis and extragradient convergence.

Full rationale

The paper's central claims rest on showing that translation invariance of convex risk measures preserves monotonicity of the preference game operator, allowing application of known extragradient and two-timescale analyses. This step is presented as a direct consequence of the definition of translation invariance applied to the risk-adjusted payoff, without reducing the result to a fitted parameter or self-citation chain. No equations equate a derived equilibrium or sample-complexity bound to an input by construction. The hierarchical formulation and bias-correction algorithm are introduced as new but build on external convergence theory rather than redefining the target quantities. The empirical robustness claims are presented as validation rather than part of the derivation chain. Overall the argument is self-contained against external benchmarks for risk measures and game-theoretic methods.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claims rest on translation invariance of convex risk measures and standard convergence properties of extragradient methods; no new entities are postulated.

free parameters (1)
  • risk level parameter
    Controls the degree of risk sensitivity and must be chosen or tuned for each application.
axioms (1)
  • domain assumption: translation invariance of convex risk measures preserves monotonicity in the game
    Invoked to retain zero-sum-like structure and fast self-play convergence.

pith-pipeline@v0.9.0 · 5556 in / 1146 out tokens · 23995 ms · 2026-05-14T21:59:34.766985+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    "translation invariance of many risk metrics ensures that we retain monotonicity, yielding fast convergence of sample-efficient self-play methods"

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · 5 internal anchors


    Nonlinear Two-Time-Scale Stochastic Approximation: Convergence and Finite-Time Performance , author =. Proceedings of the 3rd Conference on Learning for Dynamics and Control , pages =. 2021 , editor =

  58. [58]

    2009 , publisher=

    Implicit functions and solution mappings , author=. 2009 , publisher=

  59. [59]

    Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =

    Ye, Chenlu and Xiong, Wei and Zhang, Yuheng and Dong, Hanze and Jiang, Nan and Zhang, Tong , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

  60. [60]

    Finance and stochastics , volume=

    Convex measures of risk and trading constraints , author=. Finance and stochastics , volume=. 2002 , publisher=

  61. [61]

    Mathematical Finance , volume =

    Artzner, Philippe and Delbaen, Freddy and Eber, Jean-Marc and Heath, David , title =. Mathematical Finance , volume =. doi:https://doi.org/10.1111/1467-9965.00068 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/1467-9965.00068 , year =

  62. [62]

    2026 , eprint=

    Robust Policy Optimization to Prevent Catastrophic Forgetting , author=. 2026 , eprint=

  63. [63]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Ethayarajh, Kawin and Xu, Winnie and Muennighoff, Niklas and Jurafsky, Dan and Kiela, Douwe , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  64. [64]

    The Thirteenth International Conference on Learning Representations , year=

    Tractable multi-agent reinforcement learning through behavioral economics , author=. The Thirteenth International Conference on Learning Representations , year=

  65. [65]

    2026 , eprint=

    Training Generalizable Collaborative Agents via Strategic Risk Aversion , author=. 2026 , eprint=

  66. [66]

    2026 , eprint=

    Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation , author=. 2026 , eprint=

  67. [67]

    1995 , issn =

    Quantal Response Equilibria for Normal Form Games , journal =. 1995 , issn =. doi:https://doi.org/10.1006/game.1995.1023 , url =

  68. [68]

    Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 , pages =

    Hardt, Moritz and Recht, Benjamin and Singer, Yoram , title =. Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 , pages =. 2016 , publisher =

  69. [69]

    arXiv preprint arXiv:2406.15513 , year=

    PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference , author=. arXiv preprint arXiv:2406.15513 , year=

  70. [70]

    2023 , eprint=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

  71. [71]

    Howard and James E

    Ronald A. Howard and James E. Matheson , journal =. Risk-Sensitive Markov Decision Processes , urldate =

  72. [72]

    , author=

    Risk-sensitive Inverse Reinforcement Learning via Coherent Risk Models. , author=. Robotics: science and systems , volume=

  73. [73]

    Machine Learning Proceedings 1994 , publisher =

    Consideration of Risk in Reinforcement Learning , editor =. Machine Learning Proceedings 1994 , publisher =. 1994 , isbn =. doi:https://doi.org/10.1016/B978-1-55860-335-6.50021-0 , url =

  74. [74]

    and Sommer, Tobias and Obermayer, Klaus , title =

    Shen, Yun and Tobia, Michael J. and Sommer, Tobias and Obermayer, Klaus , title =. Neural Computation , volume =. 2014 , month =. doi:10.1162/NECO_a_00600 , url =

  75. [75]

    Risk-averse dynamic programming for Markov decision processes , year =

    Ruszczy\'. Risk-averse dynamic programming for Markov decision processes , year =. Math. Program. , month = oct, pages =

  76. [76]

    Risk sensitive control of Markov processes in countable state space , year =

    Hern\'. Risk sensitive control of Markov processes in countable state space , year =. doi:10.1016/S0167-6911(96)00051-5 , journal =

  77. [77]

    and Mazumdar, Eric , journal=

    Ratliff, Lillian J. and Mazumdar, Eric , journal=. Inverse Risk-Sensitive Reinforcement Learning , year=

  78. [78]

    Singh, Sumeet and Lacotte, Jonathan and Majumdar, Anirudha and Pavone, Marco , title =. Int. J. Rob. Res. , month = dec, pages =. 2018 , issue_date =. doi:10.1177/0278364918772017 , abstract =

  79. [79]

    Risk-Sensitive Generative Adversarial Imitation Learning

    Jonathan Lacotte and Yinlam Chow and Mohammad Ghavamzadeh and Marco Pavone , title =. CoRR , volume =. 2018 , url =. 1808.04468 , timestamp =

  80. [80]

    Robotics: Science and Systems , year=

    Risk-sensitive Inverse Reinforcement Learning via Coherent Risk Models , author=. Robotics: Science and Systems , year=

Showing first 80 references.