
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Archit Sharma, Chelsea Finn, Christopher D. Manning, Eric Mitchell, Rafael Rafailov, Stefano Ermon · 2023 · cs.LG · arXiv 2305.18290

Mixed citation behavior. Most common role is background (67%).

167 Pith papers citing it

Background 67% of classified citations


abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
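The "simple classification loss" mentioned in the abstract can be illustrated with a minimal sketch. The abstract itself does not spell out the formula; the version below follows the commonly cited form of the DPO objective, a logistic loss on the difference of policy-versus-reference log-ratios for the preferred and dispreferred responses. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from this page.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, for illustration only).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it relative to the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the policy's log-probability of the preferred response lowers the loss.
improved = dpo_loss(-9.0, -12.0, -10.0, -12.0)
```

Under these assumptions no reward model is fit and no sampling from the LM is needed: the loss only requires log-probabilities of already collected preference pairs, which is consistent with the abstract's claim of a lightweight procedure.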


citation-role summary

background 25 · method 9 · baseline 1 · dataset 1

citation-polarity summary

background 24 · use method 9 · baseline 1 · unclear 1 · use dataset 1



representative citing papers

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement

cs.AI · 2026-06-28 · conditional · novelty 7.0

Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

TimeROME-DLM enables training-free knowledge editing in masked diffusion language models via temporal causal tracing and low-rank residual edit memory applied at inference time.

Alignment Defends LLMs from Property Inference Attacks

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

Alignment defenses adapted from DPO and GRPO mitigate property inference attacks on LLMs while preserving utility.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 3 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

math.OC · 2026-05-09 · unverdicted · novelty 7.0

Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends on intrinsic manifold dimension.

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.

The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining

cs.CY · 2026-05-01 · unverdicted · novelty 7.0

Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than instruction-tuned ones.

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

cs.AI · 2026-04-27 · unverdicted · novelty 7.0

PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based white-box attacks on harmful-behavior benchmarks.

SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

