pith · machine review for the scientific record

arxiv: 2305.18290 · v3 · submitted 2023-05-29 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Pith reviewed 2026-05-11 02:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords direct preference optimization · RLHF · language model alignment · preference learning · reward model reparameterization · closed-form policy · classification loss

The pith

A reparameterization of the reward model allows language models to be aligned with human preferences using only a simple classification loss instead of reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the standard RLHF pipeline, which fits a reward model and then runs reinforcement learning to optimize the language model, can be replaced by a direct optimization procedure. By rewriting the reward in terms of the policy and a fixed reference model, the optimal policy under the KL-regularized objective becomes available in closed form. This turns the entire alignment step into a supervised classification problem on preference pairs. The resulting method is stable, requires no on-policy sampling during training, and needs little hyperparameter tuning. Experiments indicate it performs as well as or better than PPO-based RLHF on sentiment control, summarization, and dialogue tasks.

Core claim

We show that the RLHF objective admits a closed-form expression for the optimal policy once the reward is reparameterized as a function of the policy's log-ratio to the reference policy, allowing the entire alignment problem to be solved with a single logistic loss on human preference data.

What carries the argument

The reparameterized reward r(x,y) = β log(π(y|x) / π_ref(y|x)) + β log Z(x), which makes the policy that maximizes the RLHF objective directly extractable without running reinforcement learning.
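As a concrete illustration, this reparameterization turns the preference loss into plain logistic regression on log-probability ratios. The sketch below is ours, not the paper's code, and the log-probabilities in the test values are illustrative; note how the intractable β log Z(x) term cancels when two responses to the same prompt are compared:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (margin_w - margin_l)).

    logp_*     : policy log-prob of the chosen (w) / rejected (l) response
    ref_logp_* : reference-model log-probs of the same responses
    The shared beta*log Z(x) term cancels in the difference below.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

Because the partition term is shared by both responses, only per-response log-probabilities under the policy and the frozen reference model are needed; no sampling and no reward-model forward pass occur during training.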

If this is right

  • No sampling from the current model is needed during the fine-tuning stage.
  • The training objective reduces to ordinary supervised learning on labeled preference pairs.
  • Hyperparameter search is limited to learning rate and the temperature β instead of full RL schedules.
  • The method can be implemented in standard language-model fine-tuning code without separate reward-model training or policy-gradient machinery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same reparameterization trick could be tested on tasks beyond single-turn dialogue, such as multi-turn conversations where the reference policy already encodes useful structure.
  • If the reference model is chosen poorly, performance may degrade more sharply than in two-stage RLHF, which can partially compensate by learning a separate reward model.
  • The closed-form relation suggests exploring whether other regularized objectives in control or planning admit similar direct solutions.

Load-bearing premise

Human preferences must follow the Bradley-Terry model exactly and the reference policy must remain fixed and suitable throughout training.
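The Bradley-Terry assumption is compact enough to state in code. A minimal sketch with made-up reward values (not from the paper):

```python
import math

def bradley_terry_prob(r_w, r_l):
    """P(y_w preferred over y_l) = sigmoid(r_w - r_l) under Bradley-Terry."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

# Equal rewards give a coin flip; a reward gap tilts the preference.
p_tie = bradley_terry_prob(1.0, 1.0)   # 0.5
p_gap = bradley_terry_prob(2.0, 0.0)   # sigmoid(2) ≈ 0.88
```

Everything downstream, including the DPO loss and the closed-form policy, inherits this logistic form; if real human preferences are heavier-tailed or intransitive, the guarantee weakens accordingly.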

What would settle it

Run DPO and standard RLHF on the same preference dataset and measure which produces higher win rates against held-out human judgments; consistently lower DPO win rates would undercut the practical claim that the closed-form substitution loses nothing, even though the mathematical derivation itself would stand.

Original abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that by reparameterizing the reward model under the Bradley-Terry preference model in the standard RLHF objective, the corresponding optimal policy can be expressed in closed form. This reduces the RLHF problem to a simple binary classification loss (DPO) on human preference pairs, eliminating the need to train a separate reward model or run reinforcement learning. Experiments on sentiment control, summarization, and single-turn dialogue show DPO matching or exceeding PPO-based RLHF while being simpler and more stable.

Significance. If the central derivation holds, the result is significant: it provides a mathematically clean and practically simpler alternative to the two-stage RLHF pipeline. The closed-form optimality under standard assumptions is a clear strength, and the empirical results on three tasks support that DPO is competitive without the instability or sampling overhead of RL. This could lower the barrier to preference-based alignment for large LMs.

major comments (2)
  1. [§3] §3, Eq. (5): the closed-form optimality of π* holds only when the reference policy π_ref is held fixed and the Bradley-Terry model is assumed to hold exactly; the manuscript does not discuss how sensitive the guarantee is to violations of either assumption (e.g., when human preferences deviate from the logistic form or when π_ref is updated).
  2. [§4] §4.2–4.3: the reported gains over RLHF are consistent, yet the experiments provide only minimal ablation on the scalar β (chosen once per task) and no sensitivity analysis on the choice of reference model; because β is the sole free parameter, this limits assessment of robustness.
minor comments (2)
  1. [Figure 1] Figure 1 caption and surrounding text could more explicitly contrast the DPO training loop with the standard RLHF loop to highlight the eliminated steps.
  2. [§3.2] The notation for the partition function Z(x) is introduced in §3 but its dependence on the policy is not restated when the loss is written in §3.2, which may confuse readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive review and constructive comments. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [§3] §3, Eq. (5): the closed-form optimality of π* holds only when the reference policy π_ref is held fixed and the Bradley-Terry model is assumed to hold exactly; the manuscript does not discuss how sensitive the guarantee is to violations of either assumption (e.g., when human preferences deviate from the logistic form or when π_ref is updated).

    Authors: We agree that the closed-form optimality in Eq. (5) is derived under the assumptions that the Bradley-Terry model holds exactly and that π_ref is held fixed. These are the standard assumptions in the RLHF literature from which the derivation begins. The manuscript presents the result under these conditions without claiming robustness to violations. To address the comment, we will add a brief discussion paragraph in Section 3 that explicitly states the assumptions, notes that empirical performance may degrade under strong violations, and points to related work on preference model misspecification. We do not plan to add new theoretical sensitivity bounds or extensive new experiments, as these would constitute a substantial extension. revision: partial

  2. Referee: [§4] §4.2–4.3: the reported gains over RLHF are consistent, yet the experiments provide only minimal ablation on the scalar β (chosen once per task) and no sensitivity analysis on the choice of reference model; because β is the sole free parameter, this limits assessment of robustness.

    Authors: We appreciate the point that limited ablation on β and the reference model restricts robustness assessment. In the original experiments β was selected via validation performance for each task. We will revise the experimental section to include an expanded ablation on β for the sentiment control task, reporting performance across a range of β values (e.g., 0.05 to 2.0) with corresponding plots. For the reference model, we used the base pretrained LM in all experiments, consistent with the theoretical setup; we will add a clarifying sentence in Section 4 explaining this choice and noting that alternative references (such as SFT-tuned models) are left for future work due to computational cost. revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation is a direct mathematical reparameterization under stated assumptions

Full rationale

The paper begins from the standard RLHF objective (maximize expected reward minus KL penalty to reference policy) and the Bradley-Terry model for preferences. It then algebraically reparameterizes the reward function in terms of the policy ratio, yielding a closed-form expression for the optimal policy and a simple classification loss. This equivalence holds exactly under the modeling assumptions; no parameter is fitted to the same data used for evaluation, no self-citation supplies a load-bearing uniqueness theorem, and β is treated as a fixed hyperparameter rather than a per-task fit. The central result is therefore a re-derivation, not a reduction to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the Bradley-Terry preference model and the standard KL-regularized RL objective; the partition function is shown to cancel analytically, leaving no new free parameters beyond the usual beta coefficient.

free parameters (1)
  • beta
    Scalar coefficient on the KL divergence term that controls how far the policy may deviate from the reference model; chosen by hand or grid search.
axioms (2)
  • domain assumption Bradley-Terry model: P(y_w > y_l) = sigma(r(y_w) - r(y_l))
    Used to express the preference probability in terms of the reward; appears in the derivation of the DPO loss.
  • standard math Optimal policy under KL penalty has closed form pi*(y) proportional to pi_ref(y) exp(r(y)/beta)
    Standard result from maximum-entropy RL; invoked to substitute reward in terms of policy.
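The two axioms compose into an identity that can be checked numerically on a toy discrete problem. The sketch below (our construction, with arbitrary rewards and a three-response action set) builds π* from the closed form and then inverts it to recover the reward exactly:

```python
import math

beta   = 0.5
pi_ref = [0.5, 0.3, 0.2]    # reference policy over 3 responses (toy numbers)
r      = [1.0, 0.0, -1.0]   # arbitrary rewards (toy numbers)

# Closed-form optimal policy: pi*(y) ∝ pi_ref(y) * exp(r(y) / beta)
weights = [p * math.exp(ri / beta) for p, ri in zip(pi_ref, r)]
Z = sum(weights)
pi_star = [w / Z for w in weights]

# Invert: beta * log(pi*/pi_ref) + beta * log Z should recover r exactly
r_recovered = [beta * math.log(ps / p) + beta * math.log(Z)
               for ps, p in zip(pi_star, pi_ref)]
```

The recovered rewards match the originals to machine precision, which is the sense in which the DPO derivation is a reparameterization rather than an approximation.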

pith-pipeline@v0.9.0 · 5564 in / 1451 out tokens · 33553 ms · 2026-05-11T02:27:33.876252+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning the Signature of Memorization in Autoregressive Language Models

    cs.CL 2026-04 accept novelty 8.0

    A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

  2. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  3. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...

  4. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

  5. Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

    math.OC 2026-05 unverdicted novelty 7.0

    Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends...

  6. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

  7. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE estimates value baselines for RL in LLMs from the actor's internal states via a lightweight probe and cross-rollout construction, matching DAPO performance with lower compute on math reasoning benchmarks.

  8. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  9. Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

    cs.CL 2026-05 unverdicted novelty 7.0

    Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.

  10. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  11. The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining

    cs.CY 2026-05 unverdicted novelty 7.0

    Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than ...

  12. Adaptive Prompt Embedding Optimization for LLM Jailbreaking

    cs.AI 2026-04 unverdicted novelty 7.0

    PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based wh...

  13. SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

    cs.CV 2026-04 unverdicted novelty 7.0

    Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.

  14. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  15. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  16. Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.

  17. Scaffold-Conditioned Preference Triplets for Controllable Molecular Optimization with Large Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    SCPT creates similarity-constrained preference triplets from scaffolds to train LLMs as conditional molecular editors that improve properties while keeping scaffolds intact.

  18. Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

    cs.CL 2026-04 conditional novelty 7.0

    SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

  19. Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

    cs.CL 2026-04 unverdicted novelty 7.0

    VeriGUI adds a Thinking-Verification-Action-Expectation loop and two-stage training on synthetic failures to reduce undetected action errors and improve recovery in GUI automation.

  20. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  21. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  22. Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on...

  23. Step Rejection Fine-Tuning: A Practical Distillation Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Step Rejection Fine-Tuning masks loss on erroneous steps identified by a critic LLM in unresolved trajectories, raising SWE-bench Verified resolution rate by 3.7% to 32.2% versus 2.4% for trajectory-level rejection.

  24. SkillEvolver: Skill Learning as a Meta-Skill

    cs.AI 2026-05 unverdicted novelty 6.0

    A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.

  25. Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning

    cs.LG 2026-05 conditional novelty 6.0

    Existing LLM unlearning methods fail honesty standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.

  26. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  27. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  28. Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

  29. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.

  30. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.

  31. Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

    cs.LG 2026-05 conditional novelty 6.0

    Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.

  32. LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

    cs.CR 2026-05 unverdicted novelty 6.0

    Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.

  33. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  34. Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

    cs.LG 2026-04 unverdicted novelty 6.0

    DRRO for RLHF replaces worst-case value with worst-case regret in Wasserstein DRO, producing an exact water-filling solution under l1 ambiguity and a practical sampled-bonus algorithm that reduces proxy over-optimization.

  35. Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

    cs.AI 2026-04 unverdicted novelty 6.0

    A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.

  36. Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

    cs.CL 2026-04 unverdicted novelty 6.0

    POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...

  37. HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

    cs.CL 2026-04 unverdicted novelty 6.0

    Tree-of-Writing achieves 0.93 Pearson correlation with human judgments by using a tree-structured workflow to aggregate sub-feature scores, outperforming standard LLM-as-a-judge and overlap metrics on the new HowToBench.

  38. Distillation Traps and Guards: A Calibration Knob for LLM Distillability

    cs.LG 2026-04 unverdicted novelty 6.0

    Reinforcement fine-tuning calibration makes LLM distillability adjustable, allowing optimized knowledge transfer or model IP safeguards via a combined task-KL-calibration objective.

  39. Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

    cs.CR 2026-04 unverdicted novelty 6.0

    Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.

  40. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  41. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  42. Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

    cs.AI 2026-04 unverdicted novelty 6.0

    A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.

  43. ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance

    cs.CL 2026-04 unverdicted novelty 6.0

    ContextLens improves LLM compliance assessment for GDPR and EU AI Act by grounding imperfect contexts through targeted questions on applicability, principles, and provisions while identifying missing factors, without ...

  44. MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

    cs.CL 2026-04 unverdicted novelty 6.0

    MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.

  45. MemReader: From Passive to Active Extraction for Long-Term Agent Memory

    cs.CL 2026-04 unverdicted novelty 6.0

    MemReader uses distilled passive and GRPO-trained active extractors to selectively write low-noise long-term memories, outperforming passive baselines on knowledge updating, temporal reasoning, and hallucination tasks.

  46. JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing

    cs.GT 2026-04 unverdicted novelty 6.0

    JD-BP jointly generates bids and pricing corrections via generative models, memory-less return-to-go, trajectory augmentation, and energy-based DPO to improve auto-bidding performance despite prediction errors and latency.

  47. Mitigating LLM biases toward spurious social contexts using direct preference optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.

  48. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    cs.CV 2024-03 conditional novelty 6.0

    Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

  49. Reinforced Self-Training (ReST) for Language Modeling

    cs.CL 2023-08 unverdicted novelty 6.0

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  50. Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

    cs.CV 2026-05 unverdicted novelty 5.0

    DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...

  51. Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

    cs.AI 2026-05 unverdicted novelty 5.0

    Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

  52. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 5.0

    Hybrid-DPO combining NLI and verifier scores delivers up to 6x NLI improvement over SFT baselines across multiple LLMs and domains while preserving answer coverage and inference speed.

  53. Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

    cs.LG 2026-05 unverdicted novelty 5.0

    Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.

  54. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.

  55. Cross-Lingual Jailbreak Detection via Semantic Codebooks

    cs.CL 2026-04 unverdicted novelty 5.0

    Semantic similarity to an English jailbreak codebook detects cross-lingual attacks with high accuracy on curated benchmarks but shows poor separability on diverse unsafe prompts.

  56. Explanation Quality Assessment as Ranking with Listwise Rewards

    cs.AI 2026-04 unverdicted novelty 5.0

    Explanation quality assessment is recast as ranking with listwise and pairwise losses that outperform regression, allow small models to match large ones on curated data, and enable stable convergence in reinforcement ...

  57. Mind DeepResearch Technical Report

    cs.AI 2026-04 unverdicted novelty 5.0

    MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

  58. Can Persona-Prompted LLMs Emulate Subgroup Values? An Empirical Analysis of Generalisability and Fairness in Cultural Alignment

    cs.CY 2026-04 unverdicted novelty 5.0

    LLMs show limited ability to emulate subgroup cultural values via persona prompts, with fine-tuning providing gains that come with widened fairness disparities.

  59. From Perception to Autonomous Computational Modeling: A Multi-Agent Approach

    cs.CE 2026-04 unverdicted novelty 5.0

    A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.

  60. Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    GRPO tuning on SLMs shows diminishing returns from hard math samples, with easier subsets matching full performance using 45% fewer steps and GSM8K training outperforming MATH training on numeric subsets.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 61 Pith papers · 4 internal anchors

  1. [1]

Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan....

  2. [2]

Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R...

  3. [3]

Pythia: A suite for analyzing large language models across training and scaling

S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023

  4. [4]

Generalized results for the existence and consistency of the MLE in the Bradley-Terry-Luce model

H. Bong and A. Rinaldo. Generalized results for the existence and consistency of the MLE in the Bradley-Terry-Luce model. International Conference on Machine Learning, 2022. arXiv:2110.11487

  5. [5]

    R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. doi: https://doi.org/10.2307/2334029

  6. [6]

Language models are few-shot learners

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei...

  7. [7]

Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  8. [8]

Language models are few-shot learners

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  9. [9]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4, 2023. arXiv preprint arXiv:2303.12712

  10. [10]

Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm

R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier. Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8. URL https://doi.org/10.1007/s10994-014-5458-8

  11. [11]

Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. ArXiv, abs/2304.00723, 2023

  12. [12]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  13. [13]

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips....

  14. [14]

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and...

  15. [15]

Contextual dueling bandits

M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi. Contextual dueling bandits. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 563–587, Paris, France, 03–06 Jul 2015. PMLR. URL https://proceedings.mlr.press/v40/Dudik15.html

  16. [16]

    D. Go, T. Korbak, G. Kruszewski, J. Rozen, N. Ryu, and M. Dymetman. Aligning language models with preferences through f-divergence minimization. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  17. [17]

A. Jain, B. Wojcik, T. Joachims, and A. Saxena. Learning trajectory preferences for manipulators via iterative improvement. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/...

  18. [18]

Sequence tutor: Conservative fine-tuning of sequence generation models with KL-control

N. Jaques, S. Gu, D. Bahdanau, J. M. Hernández-Lobato, R. E. Turner, and D. Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with KL-control. In International Conference on Machine Learning, pages 1645–1654. PMLR, 2017

  19. [19]

Human-centric dialog training via offline reinforcement learning

N. Jaques, J. H. Shen, A. Ghandeharioun, C. Ferguson, A. Lapedriza, N. Jones, S. S. Gu, and R. Picard. Human-centric dialog training via offline reinforcement learning. arXiv preprint arXiv:2010.05848, 2020

  20. [20]

On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting

T. Korbak, H. Elsahar, G. Kruszewski, and M. Dymetman. On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 16203–16220. Curran Associates, Inc., ...

  21. [21]

Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning

J. Kreutzer, J. Uyheng, and S. Riezler. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: ...

  22. [22]

Learning Dynamic Robot-to-Human Object Handover from Human Feedback

A. Kupcsik, D. Hsu, and W. S. Lee. Learning Dynamic Robot-to-Human Object Handover from Human Feedback, pages 161–176. Springer International Publishing, 01 2018. ISBN 978-3-319-51531-1. doi: 10.1007/978-3-319-51532-8_10

  23. [23]

    S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018

  24. [24]

    R. D. Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012

  25. [25]

A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/a...

  26. [26]

Cross-task generalization via natural language crowdsourcing instructions

S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long

  27. [27]

URL https://aclanthology.org/2022.acl-long.244

  28. [28]

Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond

    R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, and B. Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/K16-1028. URL https:// ac...

  29. [29]

Efficient large-scale language model training on GPU clusters using Megatron-LM

D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Anal...

  30. [30]

Training language models to follow instructions with human feedback

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, ...

  31. [31]

A deep reinforced model for abstractive summarization

R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkAClQgA-

  32. [32]

    X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019

  33. [33]

Reinforcement learning by reward-weighted regression for operational space control

J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pages 745–750, 2007

  34. [34]

    R. L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24(2):193–202, 1975. doi: https://doi.org/10.2307/2346567

  35. [35]

Language models are unsupervised multitask learners

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019. Ms., OpenAI

  36. [36]

Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization

R. Ramamurthy, P. Ammanabrolu, K. Brantley, J. Hessel, R. Sifa, C. Bauckhage, H. Hajishirzi, and Y. Choi. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview. ...

  37. [37]

Sequence level training with recurrent neural networks

M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732, 2015

  38. [38]

Active preference-based learning of reward functions

D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017

  39. [39]

A. Saha, A. Pacchiano, and J. Lee. Dueling RL: Reinforcement learning with trajectory preferences. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 6263–6289. PMLR, 25–27 Apr 2023. URL https://pro...

  40. [40]

V. Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T....

  41. [41]

Proximal policy optimization algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017

  42. [42]

Learning to summarize from human feedback

N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. Christiano. Learning to summarize from human feedback, 2022

  43. [43]

LaMDA: Language Models for Dialog Applications

R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, V. Zhao, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, P. Srinivasan, L. Man, K. Mei...

  44. [44]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  45. [45]

TL;DR: Mining Reddit to learn automatic summarization

M. Völske, M. Potthast, S. Syed, and B. Stein. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4508. URL https://aclanthology.org/W17-4508

  46. [46]

CarperAI/trlx: v0.6.0

L. von Werra, J. Tow, reciprocated, S. Matiana, A. Havrilla, cat state, L. Castricato, Alan, D. V. Phung, A. Thakur, A. Bukhtiyarov, aaronrmm, F. Milo, Daniel, D. King, D. Shin, E. Kim, J. Wei, M. Romero, N. Pochinkov, O. Sanseviero, R. Adithyan, S. Siu, T. Simonini, V. Blagojevic, X. Song, Z. Witten, alexandremuzio, and crumb. CarperAI/trlx: v0.6.0: LL...

  47. [47]

GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021

  48. [48]

Neural text generation with unlikelihood training

S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019

  49. [49]

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3–4):229–256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696

  50. [50]

Learning to extract coherent summary via deep reinforcement learning

Y. Wu and B. Hu. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018. ISBN 978-1-57735-800-8

  52. [52]

X. Yan, C. Luo, C. L. A. Clarke, N. Craswell, E. M. Voorhees, and P. Castells. Human preferences as dueling bandits. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 567–577, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450387323. doi: 10.1145/3...

  53. [53]

Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2011.12.028. URL https://www.sciencedirect.com/science/article/pii/S0022000012000281. JCSS Special Issue: Cloud Computing 2011

  54. [54]

Fine-tuning language models from human preferences

D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences, 2020
