
arXiv: 2305.18290 · v3 · submitted 2023-05-29 · 💻 cs.LG · cs.AI · cs.CL

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Pith reviewed 2026-05-11 02:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords direct preference optimization · RLHF · language model alignment · preference learning · reward model reparameterization · closed-form policy · classification loss

The pith

A reparameterization of the reward model allows language models to be aligned with human preferences using only a simple classification loss instead of reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the standard RLHF pipeline, which fits a reward model and then runs reinforcement learning to optimize the language model, can be replaced by a direct optimization procedure. By rewriting the reward in terms of the policy and a fixed reference model, the optimal policy under the KL-regularized objective becomes available in closed form. This turns the entire alignment step into a supervised classification problem on preference pairs. The resulting method is stable, requires no on-policy sampling during training, and needs little hyperparameter tuning. Experiments indicate it performs as well as or better than PPO-based RLHF on sentiment control, summarization, and dialogue tasks.

Core claim

We show that the RLHF objective admits a closed-form expression for the optimal policy once the reward is reparameterized as a function of the policy's log-ratio to the reference policy, allowing the entire alignment problem to be solved with a single logistic loss on human preference data.

What carries the argument

The reparameterized reward r(x,y) = β log(π(y|x) / π_ref(y|x)) + β log Z(x), which makes the policy that maximizes the RLHF objective directly extractable without running reinforcement learning.
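
To make the step from this reward to a classification loss explicit, here is a short derivation sketch. It assumes the Bradley-Terry preference model named elsewhere on this page. The symbols y_w (preferred response), y_l (dispreferred response), σ (the logistic function), and 𝒟 (the preference dataset) are notation introduced here for illustration.

```latex
% Substitute r(x,y) = \beta \log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x)
% into the Bradley-Terry preference probability.
\begin{align*}
p(y_w \succ y_l \mid x)
  &= \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr) \\
  &= \sigma\!\left(
       \beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
     - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \right)
\end{align*}
% The term \beta \log Z(x) appears in both rewards and cancels in the difference.
% Maximizing the likelihood of the observed preferences then gives a logistic loss in \pi alone:
\begin{equation*}
\mathcal{L}(\pi)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
       \beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
     - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
\end{equation*}
```

Because Z(x) depends only on the prompt, it never has to be computed, which is what removes the need for sampling or a separately trained reward model.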

If this is right

  • No sampling from the current model is needed during the fine-tuning stage.
  • The training objective reduces to ordinary supervised learning on labeled preference pairs.
  • Hyperparameter search is limited to learning rate and the temperature β instead of full RL schedules.
  • The method can be implemented in standard language-model fine-tuning code without separate reward-model training or policy-gradient machinery.
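
The points above can be illustrated with a minimal sketch of the per-pair loss, written in plain Python rather than any particular training framework. It assumes the sequence log-probabilities under the policy and the frozen reference model have already been computed. The function names, argument names, and the default `beta=0.1` are hypothetical choices for illustration, not values taken from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_pair_loss(policy_logp_w: float, policy_logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """Logistic loss on one preference pair (w = preferred, l = dispreferred).

    Inputs are log pi(y|x) summed over the response tokens, for the policy
    being trained and for the fixed reference model. beta is an assumed value.
    """
    # Implicit reward margin: beta * (log-ratio of preferred - log-ratio of dispreferred).
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def dpo_batch_loss(pairs, beta: float = 0.1) -> float:
    """Mean loss over an iterable of (policy_w, policy_l, ref_w, ref_l) tuples."""
    pairs = list(pairs)
    return sum(dpo_pair_loss(*p, beta=beta) for p in pairs) / len(pairs)
```

When the policy equals the reference model the margin is zero and the loss is log 2 for every pair; the loss falls as the policy raises the preferred response's log-ratio relative to the dispreferred one. No generation from the model appears anywhere in the computation.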

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reparameterization trick could be tested on tasks beyond single-turn dialogue, such as multi-turn conversations where the reference policy already encodes useful structure.
  • If the reference model is chosen poorly, performance may degrade more sharply than in two-stage RLHF that can learn a separate reward model.
  • The closed-form relation suggests exploring whether other regularized objectives in control or planning admit similar direct solutions.

Load-bearing premise

Human preferences must follow the Bradley-Terry model exactly and the reference policy must remain fixed and suitable throughout training.

What would settle it

Run DPO and standard RLHF on the same preference dataset and measure which produces higher win rates against held-out human judgments. If DPO is consistently worse, the claim that the closed-form solution matches RLHF in practice is undermined, even though the derivation still holds under its stated assumptions.
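
One simple way to score such a head-to-head comparison is a win rate over per-prompt judgments. The sketch below is only an illustration: the function name, the labels `"dpo"`, `"rlhf"`, and `"tie"`, and the convention that a tie counts as half a win are all assumptions, not the paper's evaluation protocol.

```python
from collections import Counter


def win_rate(judgments, method: str = "dpo") -> float:
    """Fraction of comparisons won by `method`, counting each tie as half a win.

    `judgments` is an iterable of labels, one per held-out prompt, naming the
    preferred system ("dpo", "rlhf") or "tie". All labels are hypothetical.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments supplied")
    return (counts[method] + 0.5 * counts["tie"]) / total
```

For example, `win_rate(["dpo", "rlhf", "dpo", "tie"])` gives 0.625. A real comparison would also need enough prompts, and some measure of uncertainty, before calling either method consistently better.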

Original abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that by reparameterizing the reward model under the Bradley-Terry preference model in the standard RLHF objective, the corresponding optimal policy can be expressed in closed form. This reduces the RLHF problem to a simple binary classification loss (DPO) on human preference pairs, eliminating the need to train a separate reward model or run reinforcement learning. Experiments on sentiment control, summarization, and single-turn dialogue show DPO matching or exceeding PPO-based RLHF while being simpler and more stable.

Significance. If the central derivation holds, the result is significant: it provides a mathematically clean and practically simpler alternative to the two-stage RLHF pipeline. The closed-form optimality under standard assumptions is a clear strength, and the empirical results on three tasks support that DPO is competitive without the instability or sampling overhead of RL. This could lower the barrier to preference-based alignment for large LMs.

major comments (2)
  1. [§3] §3, Eq. (5): the closed-form optimality of π* holds only when the reference policy π_ref is held fixed and the Bradley-Terry model is assumed to hold exactly; the manuscript does not discuss how sensitive the guarantee is to violations of either assumption (e.g., when human preferences deviate from the logistic form or when π_ref is updated).
  2. [§4] §4.2–4.3: the reported gains over RLHF are consistent, yet the experiments provide only minimal ablation on the scalar β (chosen once per task) and no sensitivity analysis on the choice of reference model; because β is the sole free parameter, this limits assessment of robustness.
minor comments (2)
  1. [Figure 1] Figure 1 caption and surrounding text could more explicitly contrast the DPO training loop with the standard RLHF loop to highlight the eliminated steps.
  2. [§3.2] The notation for the partition function Z(x) is introduced in §3 but its dependence on the policy is not restated when the loss is written in §3.2, which may confuse readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive review and constructive comments. We address each major comment below and indicate the revisions we will make.

  1. Referee: [§3] §3, Eq. (5): the closed-form optimality of π* holds only when the reference policy π_ref is held fixed and the Bradley-Terry model is assumed to hold exactly; the manuscript does not discuss how sensitive the guarantee is to violations of either assumption (e.g., when human preferences deviate from the logistic form or when π_ref is updated).

    Authors: We agree that the closed-form optimality in Eq. (5) is derived under the assumptions that the Bradley-Terry model holds exactly and that π_ref is held fixed. These are the standard assumptions in the RLHF literature from which the derivation begins. The manuscript presents the result under these conditions without claiming robustness to violations. To address the comment, we will add a brief discussion paragraph in Section 3 that explicitly states the assumptions, notes that empirical performance may degrade under strong violations, and points to related work on preference model misspecification. We do not plan to add new theoretical sensitivity bounds or extensive new experiments, as these would constitute a substantial extension. revision: partial

  2. Referee: [§4] §4.2–4.3: the reported gains over RLHF are consistent, yet the experiments provide only minimal ablation on the scalar β (chosen once per task) and no sensitivity analysis on the choice of reference model; because β is the sole free parameter, this limits assessment of robustness.

    Authors: We appreciate the point that limited ablation on β and the reference model restricts robustness assessment. In the original experiments β was selected via validation performance for each task. We will revise the experimental section to include an expanded ablation on β for the sentiment control task, reporting performance across a range of β values (e.g., 0.05 to 2.0) with corresponding plots. For the reference model, we used the supervised fine-tuned (SFT) model where one was available, consistent with the theoretical setup; we will add a clarifying sentence in Section 4 explaining this choice and noting that alternative reference models are left for future work due to computational cost. revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation is a direct mathematical reparameterization under stated assumptions


The paper begins from the standard RLHF objective (maximize expected reward minus KL penalty to reference policy) and the Bradley-Terry model for preferences. It then algebraically reparameterizes the reward function in terms of the policy ratio, yielding a closed-form expression for the optimal policy and a simple classification loss. This equivalence holds exactly under the modeling assumptions; no parameter is fitted to the same data used for evaluation, no self-citation supplies a load-bearing uniqueness theorem, and β is treated as a fixed hyperparameter rather than a per-task fit. The central result is therefore a re-derivation, not a reduction to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the Bradley-Terry preference model and the standard KL-regularized RL objective; the partition function is shown to cancel analytically, leaving no new free parameters beyond the usual beta coefficient.

free parameters (1)
  • β
    Scalar coefficient on the KL divergence term that controls how far the policy may deviate from the reference model; chosen by hand or grid search.
axioms (2)
  • domain assumption · Bradley-Terry model: P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l))
    Used to express the preference probability in terms of the reward; appears in the derivation of the DPO loss.
  • standard math · Optimal policy under KL penalty has closed form π*(y|x) ∝ π_ref(y|x) exp(r(x, y)/β)
    Standard result from maximum-entropy RL; invoked to substitute reward in terms of policy.
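
The second axiom can be checked numerically on a toy problem. The sketch below assumes a single prompt with three candidate responses, so the conditioning on x is dropped. The reference probabilities, rewards, and `beta` are made-up values, and the function names are hypothetical. It builds the closed-form policy and then confirms that β log(π*/π_ref) + β log Z gives back the original reward.

```python
import math


def optimal_policy(ref_probs: dict, rewards: dict, beta: float):
    """Return (pi_star, Z) with pi_star(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    weights = {y: ref_probs[y] * math.exp(rewards[y] / beta) for y in ref_probs}
    z = sum(weights.values())  # partition function for this single prompt
    return {y: w / z for y, w in weights.items()}, z


def implicit_reward(policy: dict, ref_probs: dict, beta: float, z: float) -> dict:
    """Invert the closed form: r(y) = beta * log(pi(y) / pi_ref(y)) + beta * log Z."""
    return {y: beta * math.log(policy[y] / ref_probs[y]) + beta * math.log(z)
            for y in policy}


# Toy values chosen for illustration only.
ref = {"a": 0.5, "b": 0.3, "c": 0.2}
r = {"a": 1.0, "b": 0.0, "c": -1.0}
beta = 0.5

pi_star, z = optimal_policy(ref, r, beta)
recovered = implicit_reward(pi_star, ref, beta, z)
```

Here `recovered` matches `r` up to floating-point error, and the difference `recovered["a"] - recovered["c"]` does not depend on `z`, which is why the partition function drops out of the preference loss.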

pith-pipeline@v0.9.0 · 5564 in / 1451 out tokens · 33553 ms · 2026-05-11T02:27:33.876252+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning the Signature of Memorization in Autoregressive Language Models

    cs.CL 2026-04 accept novelty 8.0

    A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

  2. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  3. What Drives Interactive Improvement from Feedback?

    cs.AI 2026-06 unverdicted novelty 7.0

    Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

  4. Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

    cs.AI 2026-06 unverdicted novelty 7.0

    PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing lea...

  5. Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement

    cs.AI 2026-06 conditional novelty 7.0

    Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.

  6. Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

    cs.LG 2026-06 unverdicted novelty 7.0

    LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.

  7. LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

    cs.LG 2026-06 unverdicted novelty 7.0

    LLMZero uses LLM agents to search training trajectories and discovers that capacity parameters accumulate monotonically while regularization parameters oscillate, leading to performance improvements of 9-140% on GRPO tasks.

  8. TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

    cs.LG 2026-06 unverdicted novelty 7.0

    TimeROME-DLM enables training-free knowledge editing in masked diffusion language models via temporal causal tracing and low-rank residual edit memory applied at inference time.

  9. Alignment Defends LLMs from Property Inference Attacks

    cs.LG 2026-06 unverdicted novelty 7.0

    Alignment defenses adapted from DPO and GRPO mitigate property inference attacks on LLMs while preserving utility.

  10. Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

    cs.AI 2026-06 unverdicted novelty 7.0

    LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

  11. PInVerify: An Offline Embodied Benchmark for Active Instance Verification

    cs.CV 2026-05 unverdicted novelty 7.0

    PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no re...

  12. Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

  13. Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

    cs.CL 2026-05 unverdicted novelty 7.0

    Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

  14. Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

    cs.LG 2026-05 unverdicted novelty 7.0

    Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...

  15. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  16. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  17. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

  18. A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  19. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...

  20. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

  21. Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

    math.OC 2026-05 unverdicted novelty 7.0

    Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends...

  22. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE estimates value baselines for RL in LLMs from the actor's internal states via a lightweight probe and cross-rollout construction, matching DAPO performance with lower compute on math reasoning benchmarks.

  23. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

  24. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  25. Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

    cs.CL 2026-05 unverdicted novelty 7.0

    Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.

  26. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  27. The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining

    cs.CY 2026-05 unverdicted novelty 7.0

    Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than ...

  28. Adaptive Prompt Embedding Optimization for LLM Jailbreaking

    cs.AI 2026-04 unverdicted novelty 7.0

    PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based wh...

  29. SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

    cs.CV 2026-04 unverdicted novelty 7.0

    Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.

  30. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  31. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  32. Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.

  33. Scaffold-Conditioned Preference Triplets for Controllable Molecular Optimization with Large Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    SCPT creates similarity-constrained preference triplets from scaffolds to train LLMs as conditional molecular editors that improve properties while keeping scaffolds intact.

  34. Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

    cs.CL 2026-04 conditional novelty 7.0

    SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

  35. Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

    cs.CL 2026-04 unverdicted novelty 7.0

    VeriGUI adds a Thinking-Verification-Action-Expectation loop and two-stage training on synthetic failures to reduce undetected action errors and improve recovery in GUI automation.

  36. LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

    cs.SE 2026-03 accept novelty 7.0

    LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.

  37. CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

    cs.LG 2026-02 unverdicted novelty 7.0

    CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.

  38. PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

    cs.CL 2025-12 conditional novelty 7.0

    PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with...

  39. EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

    cs.SE 2025-08 unverdicted novelty 7.0

    EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.

  40. Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries

    cs.CR 2026-06 unverdicted novelty 6.0

    Bandit algorithms learn optimal jailbreaks from noisy exploration and, paired with complexity-enhanced queries in FrankensteinBench, achieve up to 97% attack success on 15 open-weight LLMs.

  41. Skin-Deep: A Geometric Diagnostic for Alignment Fragility in Large Language Model Representations

    cs.AI 2026-06 unverdicted novelty 6.0

    Skin-Deep extracts a Geometric Fragility Score from LLM activations that identifies which initially safe models retain the most refusal after small LoRA fine-tuning.

  42. Open AI in the Wild: Adoption and Adaptation of Open Models on r/LocalLLaMA

    cs.HC 2026-06 unverdicted novelty 6.0

    Thematic analysis of r/LocalLLaMA discussions finds users define openness via reliability, local control, privacy, and adaptation under compute, licensing, and usability constraints.

  43. Steer, Don't Solve: Training Small Critic Models for Large Code Agents

    cs.SE 2026-06 unverdicted novelty 6.0

    A small SFT-trained critic provides intra-trajectory steering to frozen code agents, delivering +3 to +5 point gains on SWE-bench Verified at 30-92x lower cost than a strong teacher.

  44. Social World Model for Lifelong Social Intelligence

    cs.AI 2026-06 unverdicted novelty 6.0

    The Social World Model supplies a five-dimension decomposition and closed-loop training loop that lets a 7B open model match Gemini 3 Flash on social metrics while showing zero forgetting on ASCENT-Bench.

  45. Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

    cs.SD 2026-06 unverdicted novelty 6.0

    A code-mixing guided preference-learning method for TTS produces synthetic data that lowers mixed error rate when fine-tuning Whisper on the SEAME Mandarin-English corpus.

  46. MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

    cs.CL 2026-06 unverdicted novelty 6.0

    MARD-7B outperforms baselines and GPT-4o on novel drug pairs for mechanism-level DDI prediction via a new distillation pipeline with verifiable process rewards and releases all resources.

  47. Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

    cs.CL 2026-06 unverdicted novelty 6.0

    AdvGRPO stabilizes GRPO for joint attacker-defender optimization via multi-channel rewards and curriculum training, yielding effective transferable attacks and stronger co-trained defenders on safety benchmarks.

  48. Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

    cs.CV 2026-06 unverdicted novelty 6.0

    FaithRewriter is a prompt-enhancement framework that uses an MLLM-generated image as a visual anchor to guide LLM-based rewriting for more faithful text-to-image generation.

  49. Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair

    cs.CL 2026-06 unverdicted novelty 6.0

    TRI trains LLMs on goal-conditioned fill-in-the-middle tasks via PSM token rearrangement and symbolic verification to surgically repair erroneous CoT segments.

  50. ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

    cs.AI 2026-06 unverdicted novelty 6.0

    ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token ...

  51. Narrative Flattening: How Post-Training Compresses Thematic, Affective, and Stylistic Variation in LLM Fiction

    cs.CL 2026-05 unverdicted novelty 6.0

    Post-training on matched OLMo 32B checkpoints compresses thematic motion, affective prevalence, and linguistic diversity in fiction continuations relative to human baselines, producing narrative flattening that conver...

  52. Automating Formal Verification with Agent-Guided Tree Search

    cs.LO 2026-05 unverdicted novelty 6.0

    Agent-directed tree search improves LLM performance on Lean formal verification tasks, with context-based orchestration solving more intermediate specs at lower token cost than baseline agents.

  53. Model Unlearning Objectives Vary for Distinct Language Functions

    cs.CL 2026-05 unverdicted novelty 6.0

    Unlearning objectives should be tailored to distinct language functions, with a meta-learned RMU variant for dangerous knowledge and a multi-layer probe objective for toxicity, yielding strong results on four 7-8B models.

  54. TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    TIAR uses trajectory-informed advantage reweighting during GRPO to improve LLM abstention F1 scores on AbstentionBench while preserving accuracy.

  55. What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

    cs.CL 2026-05 unverdicted novelty 6.0

    Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggrega...

  56. Reinforcing Human Behavior Simulation via Verbal Feedback

    cs.LG 2026-05 unverdicted novelty 6.0

    DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

  57. ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

    cs.LG 2026-05 unverdicted novelty 6.0

    ClaimDiff-RL introduces reference-conditioned atomic claim differences verified by a multimodal judge as the reward signal for fine-grained RL in long-form image captioning.

  58. ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

    cs.LG 2026-05 unverdicted novelty 6.0

    ClaimDiff-RL replaces holistic scalar rewards with reference-conditioned atomic claim differences verified by a multimodal judge to improve the hallucination-missing-fact tradeoff in long-form image captioning.

  59. AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

    cs.CL 2026-05 unverdicted novelty 6.0

    AutoVecCoder combines VecPrompt for automated intrinsic knowledge synthesis and VecRL for efficiency-aligned RL to train an 8B LLM that achieves SOTA on SimdBench SSE/AVX subsets and sometimes exceeds -O3 compiler results.

  60. Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

    cs.CL 2026-05 unverdicted novelty 6.0

    Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 173 Pith papers · 4 internal anchors

  1. [1]

    Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan....

  2. [2]

    Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R...

  3. [3]

    S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023

  4. [4]

    H. Bong and A. Rinaldo. Generalized results for the existence and consistency of the MLE in the Bradley-Terry-Luce model. International Conference on Machine Learning, 2022. arXiv:2110.11487

  5. [5]

    R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. doi: https://doi.org/10.2307/2334029

  6. [6]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei...

  7. [7]

    Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  8. [8]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  9. [9]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4, 2023. arXiv preprint arXiv:2303.12712

  10. [10]

    Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm

    R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier. Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine Learning, 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8. URL https://doi.org/10.1007/s10994-014-5458-8

  11. [11]

    Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. ArXiv, abs/2304.00723, 2023

  12. [12]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  13. [13]

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips....

  14. [14]

    H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and...

  15. [15]

    M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi. Contextual dueling bandits. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 563–587, Paris, France, 03–06 Jul 2015. PMLR. URL https://proceedings.mlr.press/v40/Dudik15.html

  16. [16]

    D. Go, T. Korbak, G. Kruszewski, J. Rozen, N. Ryu, and M. Dymetman. Aligning language models with preferences through f-divergence minimization. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  17. [17]

    A. Jain, B. Wojcik, T. Joachims, and A. Saxena. Learning trajectory preferences for manipulators via iterative improvement. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/...

  18. [18]

    N. Jaques, S. Gu, D. Bahdanau, J. M. Hernández-Lobato, R. E. Turner, and D. Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In International Conference on Machine Learning, pages 1645–1654. PMLR, 2017

  19. [19]

    N. Jaques, J. H. Shen, A. Ghandeharioun, C. Ferguson, A. Lapedriza, N. Jones, S. S. Gu, and R. Picard. Human-centric dialog training via offline reinforcement learning. arXiv preprint arXiv:2010.05848, 2020

  20. [20]

    T. Korbak, H. Elsahar, G. Kruszewski, and M. Dymetman. On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 16203–16220. Curran Associates, Inc., ...

  21. [21]

    J. Kreutzer, J. Uyheng, and S. Riezler. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: ...

  22. [22]

    A. Kupcsik, D. Hsu, and W. S. Lee. Learning Dynamic Robot-to-Human Object Handover from Human Feedback, pages 161–176. Springer International Publishing, 01 2018. ISBN 978-3-319-51531-1. doi: 10.1007/978-3-319-51532-8_10

  23. [23]

    S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018

  24. [24]

    R. D. Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012

  25. [25]

    A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/a...

  26. [26]

    S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long

  27. [27]

    URL https://aclanthology.org/2022.acl-long.244

  28. [28]

    R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, and B. Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/K16-1028. URL https://ac...

  29. [29]

    Efficient large-scale language model training on gpu clusters using megatron-lm

    D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Anal...

  30. [30]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, ...

  31. [31]

    R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkAClQgA-

  32. [32]

    X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019

  33. [33]

    J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007

  34. [34]

    R. L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24(2):193–202, 1975. doi: https://doi.org/10.2307/2346567

  35. [35]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019. Ms., OpenAI

  36. [36]

    R. Ramamurthy, P. Ammanabrolu, K. Brantley, J. Hessel, R. Sifa, C. Bauckhage, H. Hajishirzi, and Y. Choi. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview. ...

  37. [37]

    Sequence Level Training with Recurrent Neural Networks

    M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732, 2015

  38. [38]

    D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017

  39. [39]

    A. Saha, A. Pacchiano, and J. Lee. Dueling rl: Reinforcement learning with trajectory preferences. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 6263–6289. PMLR, 25–27 Apr 2023. URL https://pro...

  40. [40]

    V. Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T....

  41. [41]

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017

  42. [42]

    N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. Christiano. Learning to summarize from human feedback, 2022

  43. [43]

    R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, V. Zhao, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, P. Srinivasan, L. Man, K. Mei...

  44. [44]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  45. [45]

    TL;DR: Mining Reddit to Learn Automatic Summarization

    M. Völske, M. Potthast, S. Syed, and B. Stein. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4508. URL https://aclanthology.org/W17-4508

  46. [46]

    L. von Werra, J. Tow, reciprocated, S. Matiana, A. Havrilla, cat state, L. Castricato, Alan, D. V. Phung, A. Thakur, A. Bukhtiyarov, aaronrmm, F. Milo, Daniel, D. King, D. Shin, E. Kim, J. Wei, M. Romero, N. Pochinkov, O. Sanseviero, R. Adithyan, S. Siu, T. Simonini, V. Blagojevic, X. Song, Z. Witten, alexandremuzio, and crumb. CarperAI/trlx: v0.6.0: LL...

  47. [47]

    B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021

  48. [48]

    S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019

  49. [49]

    R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3–4):229–256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696

  50. [50]

    Y. Wu and B. Hu. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press,

  51. [51]

    ISBN 978-1-57735-800-8

  52. [52]

    X. Yan, C. Luo, C. L. A. Clarke, N. Craswell, E. M. Voorhees, and P. Castells. Human preferences as dueling bandits. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 567–577, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450387323. doi: 10.1145/3...

  53. [53]

    Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2011.12.028. URL https://www.sciencedirect.com/science/article/pii/S0022000012000281. JCSS Special Issue: Cloud Computing 2011

  54. [54]

    D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences, 2020. Author Contributions: All authors provided valuable contributions to designing, analyzing, and iterating on experiments, writing and editing the paper, and generally managing the project’s progres...

  55. [55]

    Gordon Chi 2. Virginia Adams 3. Max Du 4. Kaili Huang

  56. [56]

    Ben Prystawski 6. Ioanna Vavelidou 7. Victor Kolev 8. Karel D’Oosterlinck

  57. [57]

    Ananth Agarwal 10. Tyler Lum 11. Mike Hardy 12. Niveditha Iyer

  58. [58]

    Helena Vasconcelos 14. Katherine Li 15. Chenchen Gu 16. Moritz Stephan

  59. [59]

    Swee Kiat Lim 18. Ethan Chi 19. Kaien Yang 20. Ryan Chi

  60. [60]

    Joy Yun 22. Abhay Singhal 23. Siyan Li 24. Amelia Hardy

  61. [61]

    Zhengxuan Wu (footnote 7: One volunteer did not respond for the DPO-PPO comparison.)