pith. machine review for the scientific record.

arxiv: 2402.01306 · v4 · submitted 2024-02-02 · 💻 cs.LG · cs.AI

KTO: Model Alignment as Prospect Theoretic Optimization

Dan Jurafsky, Douwe Kiela, Kawin Ethayarajh, Niklas Muennighoff, Winnie Xu

Pith reviewed 2026-05-12 12:13 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM alignment · prospect theory · human-aware loss · KTO · preference optimization · binary feedback · HALO · model alignment

The pith

KTO aligns LLMs by maximizing prospect-theoretic utility from binary desirability signals rather than paired preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing LLM alignment methods such as DPO implicitly build in human biases described by prospect theory, which partly explains their success over simple likelihood maximization. It introduces KTO, a new objective that uses a Kahneman-Tversky model of human utility to directly maximize the utility of generations. This approach requires only a binary desirability label for each generation instead of comparative preferences. KTO performs as well as or better than preference-based methods across model sizes from 1 billion to 30 billion parameters. The work implies that alignment success depends on choosing the right human-aware loss for the setting rather than seeking a single best method.

Core claim

Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.

What carries the argument

KTO, a human-aware loss (HALO) that applies the prospect theory value function to assign utilities to model outputs based on whether they are desirable or not and maximizes the resulting expected utility.
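
To make the mechanism concrete, here is a minimal per-example sketch of a KTO-style loss. It assumes a logistic value function applied to the log-probability ratio between the policy and a reference model, loosely following the σ(βz) and λD/λU notation that appears in the paper excerpts quoted further down this page. The function name, argument names, default values, and the fixed reference point `z_ref` are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def kto_style_loss(
    logp_policy: float,
    logp_ref: float,
    desirable: bool,
    z_ref: float = 0.0,     # assumed reference point separating gains from losses
    beta: float = 0.1,      # assumed scale on the implied reward
    lambda_d: float = 1.0,  # assumed weight for desirable examples
    lambda_u: float = 1.0,  # assumed weight for undesirable examples
) -> float:
    """Loss for one (prompt, output, binary label) triple.

    The implied reward is the log-ratio of policy to reference probability.
    Desirable outputs gain value as the reward rises above z_ref;
    undesirable outputs gain value as it falls below z_ref.
    """
    reward = logp_policy - logp_ref
    if desirable:
        value = lambda_d * sigmoid(beta * (reward - z_ref))
        return lambda_d - value
    value = lambda_u * sigmoid(beta * (z_ref - reward))
    return lambda_u - value


# Raising the policy's log-probability lowers the loss on a desirable output
# and raises it on an undesirable one.
print(kto_style_loss(-5.0, -10.0, desirable=True))
print(kto_style_loss(-5.0, -10.0, desirable=False))
```

Setting `lambda_u` larger than `lambda_d` would be one way to express loss aversion in this sketch; whether that helps in practice is an empirical question the sketch does not answer.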

If this is right

  • KTO matches or exceeds the performance of preference-based methods at scales from 1B to 30B using only binary signals.
  • Current alignment objectives implicitly incorporate prospect theory biases, explaining part of their success over cross-entropy.
  • There is no universally superior HALO; the best loss depends on the inductive biases appropriate for the setting.
  • Alignment can succeed by directly optimizing a utility function rather than preference log-likelihood.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Binary desirability labels may be sufficient for high-quality alignment because they allow direct utility maximization without needing preference pairs.
  • This approach could make alignment more accessible by reducing the data collection burden compared to methods requiring comparative judgments.
  • The lack of a universal best HALO suggests that practitioners should select the loss function based on how well its biases match the target domain.

Load-bearing premise

That the specific utility function from prospect theory literature accurately captures human judgments of LLM outputs and that optimizing it with only binary desirability labels is sufficient without additional modeling assumptions or reference-point choices.

What would settle it

If models trained with KTO on binary labels receive significantly lower human preference win rates than DPO-trained models on paired data, or if collected human ratings of output desirability deviate from the shape of the prospect theory value function used by KTO.
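
The "shape" at issue can be written down directly. The sketch below implements the classic Tversky-Kahneman power-law value function with the coefficients quoted in the simulated rebuttal on this page (α=0.88, β=0.88, λ=2.25). The function name `tk_value` is hypothetical, and the β here is the loss-side exponent, not a temperature on log-probability ratios. This is the textbook form that collected human ratings could be compared against, not necessarily the exact function KTO optimizes.

```python
def tk_value(z: float, alpha: float = 0.88, beta: float = 0.88, lam: float = 2.25) -> float:
    """Tversky-Kahneman value of an outcome z measured relative to a reference point at 0.

    Gains are concave (z ** alpha); losses are convex and scaled by the
    loss-aversion coefficient lam.
    """
    if z >= 0:
        return z ** alpha
    return -lam * ((-z) ** beta)


# Loss aversion: a unit loss weighs 2.25 times as much as a unit gain.
print(tk_value(1.0))   # 1.0
print(tk_value(-1.0))  # -2.25

# Diminishing sensitivity: doubling a gain less than doubles its value.
print(tk_value(2.0) < 2 * tk_value(1.0))  # True
```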

Original abstract

Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that existing LLM alignment methods (e.g., DPO) implicitly belong to a family of human-aware losses (HALOs) that encode prospect-theoretic biases from Kahneman-Tversky utility. It proposes KTO, which directly optimizes a prospect theory value function v(x) on binary desirability labels for generations rather than pairwise preferences, and reports that KTO matches or exceeds preference-based baselines across 1B–30B model scales.

Significance. If the empirical results hold under rigorous evaluation, the work is significant for showing that competitive alignment is possible with weaker (binary) supervision, which could reduce data collection costs. The HALO framing and observation that no single loss is universally optimal provide a useful conceptual lens for choosing alignment objectives based on inductive biases. The paper does not ship reproducible code or machine-checked proofs, so credit is limited to the conceptual contribution.

major comments (3)
  1. [§3] §3 (KTO objective): The reference point used to classify binary labels as gains or losses is not explicitly defined or ablated. Prospect theory's value function is defined relative to this point, so the lack of justification for the choice (e.g., zero, model prior expectation, or other) and the scaling of binary signals into numeric gains/losses is load-bearing for the claim that the specific Kahneman-Tversky utility provides the performance advantage.
  2. [§5] §5 (Experiments, Tables 1–3): Win-rate differences between KTO and DPO-style baselines are small (typically 1–3 points) at 7B–30B scales, yet no standard errors, number of evaluation prompts, or statistical tests are reported. This makes it impossible to assess whether KTO truly matches or exceeds the baselines, directly undermining the central empirical claim.
  3. [§3.2] §3.2 (Utility parameters): The prospect theory coefficients (α, β, λ) are taken directly from the 1992 literature without ablation or sensitivity analysis on the alignment task. If performance is sensitive to these fixed values, the results may reflect a particular loss shape rather than the claimed theoretical grounding.
minor comments (2)
  1. [§2] The definition of the HALO family in §2 could be made more precise by including an explicit mathematical characterization rather than a descriptive list.
  2. [Figure 2] Figure 2 (loss curves) lacks axis labels on the y-scale in some panels, reducing clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions that will be incorporated into the next version of the manuscript.

  1. Referee: [§3] §3 (KTO objective): The reference point used to classify binary labels as gains or losses is not explicitly defined or ablated. Prospect theory's value function is defined relative to this point, so the lack of justification for the choice (e.g., zero, model prior expectation, or other) and the scaling of binary signals into numeric gains/losses is load-bearing for the claim that the specific Kahneman-Tversky utility provides the performance advantage.

    Authors: We will revise §3 to explicitly state that the reference point is set to zero, with desirable generations assigned a positive scalar utility and undesirable generations a negative scalar utility. This choice follows directly from the binary supervision signal, which provides only a directional indicator rather than a magnitude; zero is the natural neutral point separating gains from losses. We will add a short paragraph justifying this mapping and noting that it preserves the key prospect-theoretic asymmetry (loss aversion) without requiring a model-dependent reference. A full ablation of alternative references is not performed, but the performance gains relative to symmetric losses (e.g., standard cross-entropy) are attributable to the functional form rather than the precise reference location. revision: partial

  2. Referee: [§5] §5 (Experiments, Tables 1–3): Win-rate differences between KTO and DPO-style baselines are small (typically 1–3 points) at 7B–30B scales, yet no standard errors, number of evaluation prompts, or statistical tests are reported. This makes it impossible to assess whether KTO truly matches or exceeds the baselines, directly undermining the central empirical claim.

    Authors: We agree that the lack of standard errors and statistical tests weakens the ability to interpret the small observed differences. In the revised manuscript we will report the exact number of evaluation prompts per benchmark, include standard errors obtained via bootstrap resampling over the evaluation set, and add paired statistical tests (e.g., Wilcoxon signed-rank) comparing KTO against each baseline. While the absolute margins are modest, the consistent pattern across model scales and the fact that KTO succeeds with strictly weaker (binary) supervision remain the central empirical observations. revision: yes

  3. Referee: [§3.2] §3.2 (Utility parameters): The prospect theory coefficients (α, β, λ) are taken directly from the 1992 literature without ablation or sensitivity analysis on the alignment task. If performance is sensitive to these fixed values, the results may reflect a particular loss shape rather than the claimed theoretical grounding.

    Authors: The parameters α=0.88, β=0.88, λ=2.25 are the canonical values reported by Tversky and Kahneman (1992) that produce the characteristic concave/convex shape and loss-aversion coefficient of prospect theory. Our contribution is to show that a loss derived from this established functional form is competitive for alignment, not to claim that these exact coefficients are optimal for the task. To address sensitivity concerns we will add an appendix analysis that perturbs the parameters within plausible ranges (e.g., λ ∈ [1.5, 3.0]) and demonstrates that KTO performance remains stable, supporting that the qualitative shape rather than the precise numerical values drives the results. revision: yes
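
The bootstrap standard errors promised in response 2 could be computed along the following lines. This is a sketch under stated assumptions: each evaluation prompt yields a 0/1 outcome per method (1 meaning the method's output beat a fixed comparison output), prompts are resampled jointly so the comparison stays paired, and the function name and the data are hypothetical placeholders rather than results from the paper. It covers only the bootstrap step; a paired test such as Wilcoxon signed-rank would need a separate routine, for example `scipy.stats.wilcoxon`.

```python
import random
import statistics


def bootstrap_win_rate_diff(outcomes_a, outcomes_b, n_resamples=2000, seed=0):
    """Paired bootstrap over prompts.

    Returns (mean, standard error) of the win-rate difference a - b,
    estimated from n_resamples resamples drawn with replacement.
    """
    if len(outcomes_a) != len(outcomes_b):
        raise ValueError("both methods must be scored on the same prompts")
    rng = random.Random(seed)
    n = len(outcomes_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample prompt indices so each draw keeps the a/b pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        win_a = sum(outcomes_a[i] for i in idx) / n
        win_b = sum(outcomes_b[i] for i in idx) / n
        diffs.append(win_a - win_b)
    return statistics.mean(diffs), statistics.stdev(diffs)


# Synthetic placeholder data: 100 prompts, a 5-point gap in raw win rate.
method_a = [1] * 60 + [0] * 40
method_b = [1] * 55 + [0] * 45
mean_diff, std_err = bootstrap_win_rate_diff(method_a, method_b)
print(f"difference = {mean_diff:.3f} +/- {std_err:.3f}")
```

With only a few hundred prompts, a gap of 1 to 3 points can easily sit within two standard errors, which is why the referee's request for the evaluation set size matters.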

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper adopts the Kahneman-Tversky prospect theory utility function directly from the 1992 external literature and defines KTO as a new HALO that maximizes this utility on binary desirability labels rather than preference log-likelihoods. No load-bearing step reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the implicit-bias analysis of prior methods (DPO etc.) and the performance claims at 1B-30B scales rest on independent empirical evaluation outside any tautological mapping. The reference-point and parameter choices are taken as given from prospect theory rather than optimized against the paper's own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the applicability of the prospect theory utility function to LLM outputs and on the empirical performance being driven by that choice rather than other factors.

free parameters (1)
  • prospect theory parameters (e.g., loss aversion coefficient)
    The utility function is taken from Kahneman-Tversky but its exact parameterization for LLM outputs may require selection or tuning.
axioms (1)
  • domain assumption: Humans perceive random variables in a biased but well-defined manner according to prospect theory
    Invoked to justify replacing log-likelihood of preferences with direct utility maximization.
invented entities (1)
  • Human-aware losses (HALOs): no independent evidence
    purpose: A family of loss functions that incorporate human decision biases
    Introduced to categorize existing alignment objectives and position KTO within them.

pith-pipeline@v0.9.0 · 5530 in / 1254 out tokens · 50130 ms · 2026-05-12T12:13:01.699710+00:00 · methodology

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  2. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

  3. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...

  4. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  5. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  6. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  7. Mind the Gap: Structure-Aware Consistency in Preference Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guara...

  8. Three Models of RLHF Annotation: Extension, Evidence, and Authority

    cs.CY 2026-04 unverdicted novelty 7.0

    RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.

  9. HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

  10. DDO-RM: Distribution-Level Policy Improvement after Reward Learning

    stat.ML 2026-04 unverdicted novelty 7.0

    DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.

  11. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  12. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  13. Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs

    cs.CL 2026-05 unverdicted novelty 6.0

    TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.

  14. Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

    cs.LG 2026-05 unverdicted novelty 6.0

    GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.

  15. Threshold-Guided Optimization for Visual Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

  16. Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

    cs.LG 2026-05 conditional novelty 6.0

    Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.

  17. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  18. PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.

  19. Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

    cs.CL 2026-05 unverdicted novelty 6.0

    Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.

  20. Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

    cs.SD 2026-04 unverdicted novelty 6.0

    Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.

  21. Representation-Guided Parameter-Efficient LLM Unlearning

    cs.CL 2026-04 unverdicted novelty 6.0

    REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

  22. AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.

  23. Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...

  24. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  25. Target Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.

  26. Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning

    cs.CL 2026-04 unverdicted novelty 6.0

    A hybrid fine-tuning objective using KL divergence for token calibration and Kahneman-Tversky optimization for semantic binding enables LLMs to produce outputs that match desired attribute distributions across repeate...

  27. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  28. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  29. StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    cs.CL 2026-05 unverdicted novelty 5.0

    StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

  30. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.

  31. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  32. K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance

    cs.IR 2026-04 unverdicted novelty 4.0

    K-CARE uses behavior-derived anchoring and expert prototype analogies to ground LLMs and improve relevance on knowledge-intensive e-commerce cases.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 30 Pith papers · 15 internal anchors

  1. [1]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,

  2. [2]

    Human irrationality: both bad and good for reward inference

    Chan, L., Critch, A., and Dragan, A. Human irrationality: both bad and good for reward inference. arXiv preprint arXiv:2111.06956,

  3. [3]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

  4. [4]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335,

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  6. [6]

    Towards ecologically valid research on language user interfaces

    De Vries, H., Bahdanau, D., and Manning, C. Towards ecologically valid research on language user interfaces. arXiv preprint arXiv:2007.14435,

  7. [7]

    The Llama 3 Herd of Models

    Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  8. [8]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858,

  9. [9]

    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11170–11189, 2024

    Hong, J., Lee, N., and Thorne, J. Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691,

  10. [10]

    Mistral 7B

    Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

  11. [11]

    OpenAssistant Conversations - Democratizing Large Language Model Alignment

    Köpf, A., Kilcher, Y., von Rütte, D., Anagnostidis, S., Tam, Z.-R., Stevens, K., Barhoum, A., Duc, N. M., Stanley, O., Nagyfi, R., et al. Openassistant conversations - democratizing large language model alignment. arXiv preprint arXiv:2304.07327,

  12. [12]

    When Humans Aren't Optimal: Robots that Collaborate with Risk-Aware Humans

    Kwon, M., Biyik, E., Talati, A., Bhasin, K., Losey, D. P., and Sadigh, D. When humans aren't optimal: Robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE international conference on human-robot interaction, pp. 43–52,

  13. [13]

    Nash Learning from Human Feedback

    Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., Tang, Y., Geist, M., Mesnard, T., Michi, A., et al. Nash learning from human feedback. arXiv preprint arXiv:2312.00886,

  14. [14]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177,

  15. [15]

    Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

    Rosset, C., Cheng, C.-A., Mitra, A., Santacroce, M., Awadallah, A., and Xie, T. Direct nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715,

  16. [16]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  17. [17]

    Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,

  18. [18]

    A Minimaximalist Approach to Reinforcement Learning from Human Feedback

    Swamy, G., Dann, C., Kidambi, R., Wu, Z. S., and Agarwal, A. A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056,

  19. [19]

    Fine-tuning Language Models for Factuality

    Tian, K., Mitchell, E., Yao, H., Manning, C. D., and Finn, C. Fine-tuning language models for factuality. arXiv preprint arXiv:2311.08401,

  20. [20]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  21. [21]

    Xu, H., Sharaf, A., Chen, Y., Tan, W., Shen, L., Van Durme, B., Murray, K., and Kim, Y. J. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417,

  22. [22]

    Qwen2 Technical Report

    Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671,

  23. [23]

    Self-Rewarding Language Models

    Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models. arXiv preprint arXiv:2401.10020,

  24. [24]

    Zhao, Y., Joshi, R., Liu, T., Khalman, M., Saleh, M., and Liu, P. J. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425,

  25. [25]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,

  26. [26]

    Fine-Tuning Language Models from Human Preferences

    Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593,

  27. [27]

    A. Related Work. LLM Alignment: Human feedback has been used to improve LLM capabilities in translation (Kreutzer et al., 2018), summarization (Stiennon et al., 2020), sentiment-conditioned generation (Ziegler et al., 2019), and instruction-following (Ouyang et al., 2022). The RLHF framework (Christian...

  28. [28]

    Still, momentum has largely shifted in favor of closed-form losses that directly operate on offline preferences, such as DPO (Rafailov et al., 2023)

    traditionally used to accomplish this is detailed in §2. Still, momentum has largely shifted in favor of closed-form losses that directly operate on offline preferences, such as DPO (Rafailov et al., 2023). This single stage of optimization distinguishes DPO from the conventional approach in preference-based RL, which learns a reward and then fits the pol...

  29. [29]

    self-training

    and IPO (Azar et al., 2024). Binary Feedback Despite not being a human-aware loss, unlikelihood training was among the first methods to align language models using a binary signal (Welleck et al., 2019). However, Korbak et al. (2023) found unlikelihood training to be worse than the CSFT baseline we tested in this work, which is among various approaches th...

  30. [30]

    As rθ tends to ±∞, the gradient will tend to zero since either (1 − σ(βz)) or σ(βz) will tend to zero

    This gradient is simple to interpret: if y is desirable, then d(y) is negative and we push up the probability of πθ(y|x) to minimize the loss; if y is undesirable, then d(y) is positive and we push down the probability of πθ(y|x) to minimize the loss. As rθ tends to ±∞, the gradient will tend to zero since either (1 − σ(βz)) or σ(βz) will tend to zero. Th...

  31. [31]

    and $(1 - p) \in (0, 0.5)$ respectively. If $p^{1/\beta}\,\pi_{\text{ref}}(y_a|x) < (1 - p)^{1/\beta}\,\pi_{\text{ref}}(y_b|x)$, then the optimal DPO policy is more likely to produce the minority-preferred $y_b$; the optimal KTO policy will strictly produce the majority-preferred $y_a$ for a loss-neutral value function ($\lambda_D = \lambda_U$). Proof. Where $u = \beta(r_\theta(x, y_a) - r_\theta(x, y_b))$, we can write the total DPO loss for x ...