hub Mixed citations

Deep reinforcement learning from human preferences

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei · 2017 · stat.ML · arXiv 1706.03741

Mixed citation behavior. Most common role is background (60%).

26 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 26 citing papers arXiv PDF

abstract

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 method 2

citation-polarity summary

background 6 unclear 2 extend 1 use method 1

representative citing papers

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

cond-mat.stat-mech · 2026-05-11 · unverdicted · novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.

The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

cs.CV · 2026-04-15 · conditional · novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

cs.SE · 2025-08-22 · unverdicted · novelty 7.0

EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

cs.CL · 2026-04-13 · unverdicted · novelty 6.0

Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

cs.AI · 2025-10-05 · unverdicted · novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

cs.RO · 2025-05-24 · conditional · novelty 6.0

VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

cs.LG · 2024-07-31 · unverdicted · novelty 6.0

Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

cs.CL · 2024-06-05 · conditional · novelty 6.0

OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

cs.LG · 2024-03-28 · unverdicted · novelty 6.0

Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.

Simple synthetic data reduces sycophancy in large language models

cs.CL · 2023-08-07 · unverdicted · novelty 6.0

Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Explanation Quality Assessment as Ranking with Listwise Rewards

cs.AI · 2026-04-27 · unverdicted · novelty 5.0

Explanation quality assessment is recast as ranking with listwise and pairwise losses that outperform regression, allow small models to match large ones on curated data, and enable stable convergence in reinforcement learning.

From Perception to Autonomous Computational Modeling: A Multi-Agent Approach

cs.CE · 2026-04-08 · unverdicted · novelty 5.0

A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.

Failure Modes of Maximum Entropy RLHF

cs.LG · 2025-09-24 · unverdicted · novelty 5.0

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks

cs.LG · 2025-09-08 · unverdicted · novelty 5.0

Proposes a probabilistic framework for latent agentic substructures in DNNs using log-score utilities and log pooling, with proofs on unanimity and an application to persona emergence in LLM alignment.

citing papers explorer

Showing 26 of 26 citing papers.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems cs.MA · 2024-10-09 · unverdicted · none · ref 54 · internal anchor
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 81 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Discovering Latent Knowledge in Language Models Without Supervision cs.CL · 2022-12-07 · conditional · none · ref 6 · internal anchor
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics cond-mat.stat-mech · 2026-05-11 · unverdicted · none · ref 16 · internal anchor
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization cs.CL · 2026-05-08 · unverdicted · none · ref 10 · internal anchor
Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice cs.LG · 2026-05-02 · unverdicted · none · ref 5 · internal anchor
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 26 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs cs.AI · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation cs.CV · 2026-04-15 · conditional · none · ref 33 · internal anchor
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention cs.SE · 2025-08-22 · unverdicted · none · ref 7 · internal anchor
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 216 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation cs.LG · 2026-05-06 · unverdicted · none · ref 72 · internal anchor
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale cs.CL · 2026-04-13 · unverdicted · none · ref 14 · internal anchor
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation cs.AI · 2025-10-05 · unverdicted · none · ref 21 · internal anchor
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning cs.RO · 2025-05-24 · conditional · none · ref 13 · internal anchor
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling cs.LG · 2024-07-31 · unverdicted · none · ref 17 · internal anchor
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision cs.CL · 2024-06-05 · conditional · none · ref 1 · internal anchor
OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models cs.LG · 2024-03-28 · unverdicted · none · ref 11 · internal anchor
Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.
Simple synthetic data reduces sycophancy in large language models cs.CL · 2023-08-07 · unverdicted · none · ref 10 · internal anchor
Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 215 · internal anchor
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Explanation Quality Assessment as Ranking with Listwise Rewards cs.AI · 2026-04-27 · unverdicted · none · ref 1 · internal anchor
Explanation quality assessment is recast as ranking with listwise and pairwise losses that outperform regression, allow small models to match large ones on curated data, and enable stable convergence in reinforcement learning.
From Perception to Autonomous Computational Modeling: A Multi-Agent Approach cs.CE · 2026-04-08 · unverdicted · none · ref 64 · internal anchor
A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.
Failure Modes of Maximum Entropy RLHF cs.LG · 2025-09-24 · unverdicted · none · ref 12 · internal anchor
Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks cs.LG · 2025-09-08 · unverdicted · none · ref 35 · internal anchor
Proposes a probabilistic framework for latent agentic substructures in DNNs using log-score utilities and log pooling, with proofs on unanimity and an application to persona emergence in LLM alignment.
Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning cs.LG · 2025-06-09 · unverdicted · none · ref 30 · internal anchor
Proposes token-significance and dynamic length rewards in RL to reduce LLM response length while preserving or improving reasoning correctness across benchmarks.
An Overview of Catastrophic AI Risks cs.CY · 2023-06-21 · accept · none · ref 129 · internal anchor
The paper categorizes sources of catastrophic AI risks into malicious use, AI race, organizational risks, and rogue AIs, providing illustrative stories and mitigation suggestions for each.

Deep reinforcement learning from human preferences

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer