Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
super hub Canonical reference
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and
authors
co-cited works
representative citing papers
Contrastive pair presentations yield exact identifiability characterizations via a geometric refinement of Angluin's condition, a new contrastive closure dimension for generation, mutual incomparability with text identification, and a single algorithm that tolerates any finite corruption budget.
Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
LLMs in a pre-specified cheap-talk benchmark over-reveal by 1.8-4.2x relative to the most-informative equilibrium, producing NMI of 0.78-0.94 against oracle values of 0.18-0.53 and exhibiting bias-tracking exaggeration rather than strategic coarsening.
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
First convergence analysis of DPO under federated and decentralized training, characterizing rates via client drift, communication frequency, preference heterogeneity, and graph spectral connectivity.
GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging and no correlation with human convincingness.
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
Moira parameterizes hierarchical RL policies for pair trading with LLMs and adapts them via prompt updates based on trajectory and episode feedback, outperforming baselines on real market data.
citing papers explorer
-
The Statistical Cost of Adaptation in Multi-Source Transfer Learning
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
-
Contrastive Identification and Generation in the Limit
Contrastive pair presentations yield exact identifiability characterizations via a geometric refinement of Angluin's condition, a new contrastive closure dimension for generation, mutual incomparability with text identification, and a single algorithm that tolerates any finite corruption budget.
-
Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)
Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
-
Revisable by Design: A Theory of Streaming LLM Agent Execution
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
-
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
Universal and Transferable Adversarial Attacks on Aligned Language Models
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
-
Discovering Latent Knowledge in Language Models Without Supervision
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
-
Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment
LLMs in a pre-specified cheap-talk benchmark over-reveal by 1.8-4.2x relative to the most-informative equilibrium, producing NMI of 0.78-0.94 against oracle values of 0.18-0.53 and exhibiting bias-tracking exaggeration rather than strategic coarsening.
-
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
-
Distributed Direct Preference Optimization
First convergence analysis of DPO under federated and decentralized training, characterizing rates via client drift, communication frequency, preference heterogeneity, and graph spectral connectivity.
-
GRASP: Deterministic argument ranking in interaction graphs
GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging and no correlation with human convincingness.
-
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
-
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.
-
Variance-aware Reward Modeling with Anchor Guidance
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
-
A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
-
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.
-
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
-
Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization
Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.
-
Theoretical Limits of Language Model Alignment
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
-
Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading
Moira parameterizes hierarchical RL policies for pair trading with LLMs and adapts them via prompt updates based on trajectory and episode feedback, outperforming baselines on real market data.
-
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
Mind the Gap: Structure-Aware Consistency in Preference Learning
Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guarantees for capacity-bounded models via the Margin-Capacity Profile.
-
Three Models of RLHF Annotation: Extension, Evidence, and Authority
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
-
IRIS: Interpolative R\'enyi Iterative Self-play for Large Language Model Fine-Tuning
IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.
-
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
-
Causal inference for social network formation
Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
-
Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineering evaluations.
-
AI Integrity: A New Paradigm for Verifiable AI Governance
AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.
-
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
-
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.
-
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
Many Preferences, Few Policies: Towards Scalable Language Model Personalization
PALM produces a small portfolio of LLMs that contains a near-optimal model for any user preference weight vector, with theoretical bounds on portfolio size and approximation quality.
-
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
-
Same Words, Different Judgments: How Preferences Vary Across Modalities
Human preferences for the same semantic content show near-chance agreement between text and audio, with audio raters using narrower decision thresholds, less length bias, and more user-oriented criteria.
-
MINT: Minimal Information Neuro-Symbolic Tree for Objective-Driven Knowledge-Gap Reasoning and Active Elicitation
MINT combines symbolic trees with neural uncertainty estimation and LLM query curation to achieve near-expert planning performance by asking a small number of targeted questions that close knowledge gaps.
-
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
-
Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs
OSPO redistributes sequence-level advantages in LLM RL training via Shapley-Owen values on semantic coalitions to improve token-level credit assignment without parametric value models.
-
ShareChat: A Dataset of Chatbot Conversations in the Wild
ShareChat is a large-scale dataset of 142,808 conversations from five major chatbot platforms that retains native affordances for cross-platform analyses of completeness, citations, and latency.
-
Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.
-
Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
-
EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
-
User-Assistant Bias in LLMs
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
-
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.