super hub Canonical reference

Fine-Tuning Language Models from Human Preferences

Alec Radford, Daniel M. Ziegler, Dario Amodei, Jeffrey Wu, Nisan Stiennon, Tom B. Brown · 2019 · cs.CL · arXiv 1909.08593

Canonical reference. 88% of citing Pith papers cite this work as background.

219 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 219 citing papers more from Alec Radford arXiv PDF

abstract

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 31 method 2 baseline 1

citation-polarity summary

background 30 baseline 1 support 1 unclear 1 use method 1

claims ledger

abstract Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentimen

authors

Alec Radford Daniel M. Ziegler Dario Amodei Jeffrey Wu Nisan Stiennon Tom B. Brown

co-cited works

representative citing papers

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics

cs.CL · 2026-06-06 · unverdicted · novelty 7.0

SVR learns a bank of contrastive rubrics from preference data via max-margin boundaries and prompt-conditioned selection, narrowing the gap to human rubrics on RubricBench from 24.1 to 0.3 points.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.

Efficient Exploration for Iterative Nash Preference Optimization

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

An explicitly exploratory iterative NLHF method achieves O(sqrt(T)) regret for Nash equilibria under general preference models, removing the exponential KL dependence that plagues standard iterative approaches.

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

DEPO formulates detector-evasive paraphrasing as a constrained MDP and solves it via Lagrangian primal-dual RL with GRPO-style updates to achieve evasion while satisfying a semantic-preservation constraint.

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

cs.AI · 2026-05-20 · conditional · novelty 7.0

DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generation tasks such as role-playing.

From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery

cs.CE · 2026-05-14 · unverdicted · novelty 7.0

QuantEvolver applies reinforcement fine-tuning to evolve an LLM policy for generating executable alpha factor expressions, yielding higher-quality and more complementary factors than prompt-based baselines on market benchmarks.

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

Convex Optimization with Nested Evolving Feasible Sets

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

For convex losses in nested evolving feasible sets, a lazy algorithm balances O(T^{1-β}) regret with O(T^β) movement for any β; for strongly convex or sharp losses, Frugal achieves zero regret with O(log T) movement, shown optimal by matching lower bound.

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

cs.SE · 2026-05-06 · unverdicted · novelty 7.0

Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

cs.AI · 2026-05-05 · conditional · novelty 7.0 · 3 refs

TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured performance uplift on a frozen executor, outperforming outcome-only training on math and code benchmarks.

Efficient Preference Poisoning Attack on Offline RLHF

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Preference poisoning against log-linear DPO reduces to a binary sparse approximation problem solved by lattice-reduction (BAL-A) and matching-pursuit (BMP-A) algorithms that carry recovery guarantees.

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.

Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

The work establishes a regret lower bound of Ω(T^{2/3} min(K,D)^{1/3}) for fair multi-user dueling bandits with heterogeneous Condorcet winners and gives algorithms achieving matching upper bounds up to logs.

Three Models of RLHF Annotation: Extension, Evidence, and Authority

cs.CY · 2026-04-28 · unverdicted · novelty 7.0

RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.

Interactive Episodic Memory with User Feedback

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.

citing papers explorer

Showing 32 of 32 citing papers after filters.

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment cs.AI · 2026-05-20 · conditional · none · ref 22 · internal anchor
DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation cs.AI · 2026-05-18 · unverdicted · none · ref 19 · internal anchor
PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generation tasks such as role-playing.
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards cs.AI · 2026-05-05 · conditional · none · ref 41 · 3 links · internal anchor
TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured performance uplift on a frozen executor, outperforming outcome-only training on math and code benchmarks.
Measuring Faithfulness in Chain-of-Thought Reasoning cs.AI · 2023-07-17 · conditional · none · ref 27 · internal anchor
Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures cs.AI · 2026-05-28 · unverdicted · none · ref 29 · internal anchor
TLO is a logit-based diagnostic that visualizes temporal patterns of LLM jailbreak failures on a calibrated 2D plane, distinguishing attacks with identical ASR and enabling early stopping that reduces successful jailbreaks by more than half.
BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization cs.AI · 2026-05-25 · unverdicted · none · ref 40 · internal anchor
BrickAnything generates buildable brick structures from 3D point clouds via geometry-conditioned autoregressive prediction with structure-aware tree tokenization and post-training for stability.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR cs.AI · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents cs.AI · 2026-05-19 · unverdicted · none · ref 43 · internal anchor
SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection cs.AI · 2026-05-13 · unverdicted · none · ref 50 · internal anchor
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models cs.AI · 2026-05-09 · unverdicted · none · ref 75 · 2 links · internal anchor
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models cs.AI · 2026-05-05 · unverdicted · none · ref 24 · internal anchor
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest cs.AI · 2026-04-09 · unverdicted · none · ref 110 · internal anchor
Many LLMs prioritize company ad incentives over user welfare by recommending pricier sponsored products, disrupting purchases, or concealing prices in comparisons.
AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning cs.AI · 2026-03-22 · conditional · none · ref 14 · internal anchor
AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downstream task success by 6.8-8.5%.
From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails cs.AI · 2025-10-15 · unverdicted · none · ref 70 · internal anchor
Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.
Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies cs.AI · 2024-12-03 · unverdicted · none · ref 59 · internal anchor
PGT optimizes latent goal embeddings for frozen policies via trajectory-level preference objectives, reporting 72-81.6% relative gains on 17 Minecraft tasks and 13.4% better OOD performance than fine-tuning.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 266 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework cs.AI · 2024-05-20 · unverdicted · none · ref 18 · internal anchor
OpenRLHF is a new open-source RLHF framework reporting 1.22x to 1.68x speedups and fewer lines of code than prior systems.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts cs.AI · 2023-09-19 · unverdicted · none · ref 77 · internal anchor
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning cs.AI · 2026-06-09 · unverdicted · none · ref 22 · internal anchor
DiRL extracts a reasoning-memorization direction from model representations inside GRPO to weight gradients and shape rewards so that exploration favors reasoning trajectories over memorization ones.
Characterizing initial human-AI proof formalization workflows cs.AI · 2026-06-02 · unverdicted · none · ref 141 · internal anchor
A controlled user study and qualitative survey find that AI assistance raises formalization accuracy for math proofs, with users flexibly combining multiple tools while retaining oversight.
Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models cs.AI · 2026-05-28 · unverdicted · none · ref 40 · internal anchor
An iterative writer-editor multi-agent LLM process improves perceived story quality in simulations of child collaborative storytelling.
On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective cs.AI · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
Post-training reweights a pretrained model's behavior distribution either within its existing accessible support (elicitation) or by expanding that support (creation), with both SFT and RL acting as free-energy minimization under different signals.
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety cs.AI · 2025-07-15 · unverdicted · none · ref 2 · internal anchor
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions cs.AI · 2025-01-27 · unverdicted · none · ref 194 · internal anchor
A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.
BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards cs.AI · 2026-06-27 · unverdicted · none · ref 57 · internal anchor
BV-Blend blends prompt-local and semantic-cluster historical reward statistics via SEM-derived weights to stabilize critic-free RL advantage estimation.
Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis cs.AI · 2026-06-02 · unverdicted · none · ref 22 · internal anchor
HAZDIAL shows structured agentic dialogue improves LLM-based hazard identification quality over single-pass methods on a curated golden dataset using classification and dialogue metrics.
Beyond Context: Large Language Models' Failure to Grasp Users' Intent cs.AI · 2025-12-24 · unverdicted · none · ref 103 · internal anchor
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
AI Alignment: A Comprehensive Survey cs.AI · 2023-10-30 · unverdicted · none · ref 13 · internal anchor
The paper surveys AI alignment by proposing the RICE principles and categorizing research into forward training-based alignment and backward assurance and governance approaches.
A Survey of Hallucination in Large Foundation Models cs.AI · 2023-09-12 · accept · none · ref 22 · internal anchor
A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.
Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security cs.AI · 2026-05-17 · unverdicted · none · ref 51 · internal anchor
A survey that maps risks along the agent workflow and consolidates metrics and benchmarks for safety, robustness, privacy, and security in agentic AI.
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems cs.AI · 2025-03-31 · unverdicted · none · ref 138 · internal anchor
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.
Autogenesis: A Self-Evolving Agent Protocol cs.AI · 2026-04-16 · unreviewed · ref 18 · 2 links · internal anchor

Fine-Tuning Language Models from Human Preferences

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer