Concrete Problems in AI Safety
76 Pith papers cite this work.
abstract
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.
citing papers explorer
- The Statistical Cost of Adaptation in Multi-Source Transfer Learning
  Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling
  The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
- AI safety via debate
  AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.
- Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
  BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
- Theoretical Limits of Language Model Alignment
  The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as a covariance over base-model samples, with best-of-N approaching the theoretical limit.
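  The Jeffreys-divergence claim is consistent with a standard identity for the KL-regularized objective; the following reconstruction uses the usual exponential-tilt setup (our notation and derivation, not necessarily the paper's):

  ```latex
  % KL-regularized alignment: maximize E_pi[r] - beta * KL(pi || pi_0),
  % whose optimum is the exponential tilt pi*(y) = pi_0(y) exp(r(y)/beta) / Z.
  % The reward gain over the base model is then exactly a Jeffreys divergence:
  \Delta \;:=\; \mathbb{E}_{\pi^*}[r] - \mathbb{E}_{\pi_0}[r]
  \;=\; \beta\left(\mathrm{KL}(\pi^*\|\pi_0) + \mathrm{KL}(\pi_0\|\pi^*)\right)
  \;=\; \beta\, J(\pi^*, \pi_0).
  % Writing w(y) = exp(r(y)/beta), the gain is a covariance under pi_0 alone:
  \Delta \;=\; \frac{\operatorname{Cov}_{\pi_0}(r,\, w)}{\mathbb{E}_{\pi_0}[w]}.
  ```

  The last expression involves only base-model samples, matching the summary's "estimable as a covariance over base-model samples."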
- AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites
  AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
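  The mechanism named here, an explicit DAG of affordance prerequisites gating executability, is easy to picture as a data structure. A minimal sketch with hypothetical names, not the paper's implementation:

  ```python
  # Illustrative only: an action is executable iff every prerequisite in the
  # learned DAG is already satisfied in the current state.
  from typing import Dict, Set

  class AffordanceDAG:
      def __init__(self) -> None:
          # prereqs[a] = set of affordances that must hold before action a
          self.prereqs: Dict[str, Set[str]] = {}

      def add_prerequisite(self, action: str, required: str) -> None:
          self.prereqs.setdefault(action, set()).add(required)

      def executable(self, action: str, achieved: Set[str]) -> bool:
          # Executability check: prerequisites must be a subset of what holds now.
          return self.prereqs.get(action, set()) <= achieved

  dag = AffordanceDAG()
  dag.add_prerequisite("open_chest", "has_key")
  dag.add_prerequisite("has_key", "reach_table")
  print(dag.executable("open_chest", {"reach_table"}))             # False
  print(dag.executable("open_chest", {"reach_table", "has_key"}))  # True
  ```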
- Beyond Ability: The Four-Fold Spectrum of Power and the Logic of Full Inability
  Coalition Logic is extended by defining Full Inability (FI) as a distinct modality alongside Full Control, Positive Determination, and Adverse Determination, with algebraic structure, Klein four-group symmetry, and a sound, complete, conservative axiomatization CLFI that remains PSPACE-complete.
- A Logic of Inability
  A conservative extension of Coalition Logic introduces an inability operator as the negation of ability, with proofs of soundness, completeness, and conservativity plus analysis of its modal properties.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7×4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Discovering Agentic Safety Specifications from 1-Bit Danger Signals
  LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection, which encourages reward hacking.
- Navigating the Conceptual Multiverse
  The conceptual multiverse system, with a verification framework for decision structures, helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.
- Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
  R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
- Reinforcement Learning via Value Gradient Flow
  VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via a discrete value-guided gradient flow.
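  The transport step described above can be sketched as a Langevin-style discretization of a gradient flow toward pi*(a) ∝ pi_ref(a) exp(Q(a)/beta). Everything below (the toy Q, step size, standard-normal reference) is an illustrative assumption, not the paper's algorithm:

  ```python
  # Particles start as reference samples and follow the score of the
  # value-tilted target; noise makes this a Langevin discretization whose
  # stationary distribution is pi* rather than a single mode.
  import numpy as np

  rng = np.random.default_rng(0)

  def grad_q(a: np.ndarray) -> np.ndarray:
      return -2.0 * (a - 1.0)          # toy value function peaked at a = 1

  def grad_log_ref(a: np.ndarray) -> np.ndarray:
      return -a                        # score of a standard-normal reference

  beta, step = 1.0, 0.05
  particles = rng.standard_normal((256, 2))    # samples from the reference
  for _ in range(500):
      drift = grad_log_ref(particles) + grad_q(particles) / beta
      noise = np.sqrt(2.0 * step) * rng.standard_normal(particles.shape)
      particles += step * drift + noise        # discrete value-guided flow step
  print(particles.mean(axis=0))                # concentrates near the high-value region
  ```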
- The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
  The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.
- Learning Robustness at Test-Time from a Non-Robust Teacher
  A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
- AI Integrity: A New Paradigm for Verifiable AI Governance
  AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.
- Emotion Concepts and their Function in a Large Language Model
  Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
- A Generalist Agent
  Gato is a multi-modal, multi-task, multi-embodiment generalist policy that uses one transformer network to handle text, vision, games, and robotics tasks.
- Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements
  External control strategies are structurally impossible for sustaining AI safety beyond bounded capability thresholds; any remaining viable strategies must be intrinsic, with stable safety-compatible objectives.
- Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems
  Semantic Reward Collapse compresses distinct epistemic issues into unified rewards in preference optimization, risking loss of calibrated uncertainty; Constitutional Reward Stratification is proposed as a domain-stratified alternative framework.
- Overtrained, Not Misaligned
  Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
- SARC: A Governance-by-Architecture Framework for Agentic AI Systems
  SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-code baselines in synthetic procurement evaluations.
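  The four components named in the summary suggest a simple pipeline shape. A structural sketch only, with invented interfaces; the paper's constraint compiler is not reproduced:

  ```python
  # Hard constraints are enforced before acting; soft constraints are monitored,
  # audited, and routed for escalation if violated.
  from dataclasses import dataclass
  from typing import Callable, List

  @dataclass
  class Action:
      name: str
      cost: float

  Constraint = Callable[[Action], bool]

  class SARCPipeline:
      def __init__(self, hard: List[Constraint], soft: List[Constraint]):
          self.hard, self.soft = hard, soft

      def pre_action_gate(self, a: Action) -> bool:
          return all(c(a) for c in self.hard)           # block hard violations up front

      def action_time_monitor(self, a: Action) -> List[str]:
          return [f"soft overage: {a.name}" for c in self.soft if not c(a)]

      def post_action_auditor(self, a: Action, flags: List[str]) -> dict:
          return {"action": a.name, "flags": flags}     # audit record

      def escalation_router(self, record: dict) -> str:
          return "escalate" if record["flags"] else "log"

  pipe = SARCPipeline(hard=[lambda a: a.cost < 1000], soft=[lambda a: a.cost < 100])
  a = Action("purchase", cost=250.0)
  if pipe.pre_action_gate(a):
      record = pipe.post_action_auditor(a, pipe.action_time_monitor(a))
      print(pipe.escalation_router(record))             # "escalate" (soft overage)
  ```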
- Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
  Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to prune up to 50% of wasted tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates with no performance loss.
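  One way to picture the monitoring loop described above: the model interleaves special cue tokens with its output, and a lightweight monitor keys off them to prune wasted work or intervene before unsafe actions. Token names and the intervention policy below are hypothetical:

  ```python
  # Toy monitor over a token trace: block on unsafe cues, skip unrecognized cues.
  SAFE_CUES = {"<cue:retrieve>", "<cue:answer>"}
  UNSAFE_CUES = {"<cue:execute_shell>"}

  def monitor(trace: list) -> list:
      kept = []
      for token in trace:
          if token in UNSAFE_CUES:
              kept.append("<intervene:blocked>")  # recover a safe action instead
              break
          if token.startswith("<cue:") and token not in SAFE_CUES:
              continue                            # prune unrecognized/wasted segments
          kept.append(token)
      return kept

  print(monitor(["<cue:retrieve>", "docs", "<cue:execute_shell>", "rm -rf /"]))
  # ['<cue:retrieve>', 'docs', '<intervene:blocked>']
  ```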
- On the Blessing of Pre-training in Weak-to-Strong Generalization
  Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
- Understanding Annotator Safety Policy with Interpretability
  Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement such as policy ambiguity and value pluralism.
- You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
  NeWTral is a non-linear weight translation framework using MoE routing that reduces the average attack success rate on unsafe domain adapters from 70% to 13% across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
- Stayin' Aligned Over Time: Towards Longitudinal Human-LLM Alignment via Contextual Reflection and Privacy-Preserving Behavioral Data
  BITE is a methodological framework and browser system for collecting evolving user preferences on LLM outputs over time through context-triggered reflections and privacy-preserving behavioral data.
- A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing
  ROSS combines median smoothing with local instability measurement to create a robust OOD detector that outperforms prior methods by up to 40 AUROC points on CIFAR and ImageNet benchmarks while defending symmetrically against score attacks.
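  The two ingredients named in the summary, median smoothing and a local-instability measure, compose naturally. A hedged sketch, with the base score and the combination rule as stand-in assumptions:

  ```python
  # Median of the detector score over a noise neighborhood (robust to small
  # perturbations) plus the score's spread in that neighborhood (instability).
  import numpy as np

  def ood_score(x: np.ndarray) -> float:
      return float(np.linalg.norm(x))              # stand-in for a real detector score

  def ross_score(x: np.ndarray, sigma: float = 0.1, n: int = 64, lam: float = 1.0) -> float:
      noisy = x + sigma * np.random.randn(n, *x.shape)
      scores = np.array([ood_score(z) for z in noisy])
      smoothed = np.median(scores)                 # median smoothing
      instability = scores.std()                   # local instability measure
      return smoothed + lam * instability          # higher => more likely OOD

  print(ross_score(np.zeros(16)), ross_score(np.full(16, 3.0)))
  ```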
- AI Alignment via Incentives and Correction
  AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM coding tasks.
- Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
  The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for continuous production evaluation with an open-source implementation.
- Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing
  A framework unifies runtime monitoring for safety-critical ML into ODD, OOD, and OMS categories and demonstrates them on vision-based runway detection for landing.
- Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
  An uncertainty-aware RL framework using ensemble disagreement and annotation variability reduces reward-hacking trap visits by 93.7% across grid and continuous control tasks while remaining robust to 30% label noise.
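  The core discounting move can be stated in a few lines: shrink the proxy reward where an ensemble of reward estimates disagrees. The exact discount rule below is an assumption, not the paper's:

  ```python
  # Ensemble disagreement (std across reward estimates) discounts the mean reward,
  # so high-uncertainty, potentially hackable states earn less.
  import numpy as np

  def discounted_reward(rewards: np.ndarray, lam: float = 1.0) -> float:
      """rewards: per-ensemble-member estimates for one transition."""
      mean, disagreement = rewards.mean(), rewards.std()
      return float(mean / (1.0 + lam * disagreement))

  print(discounted_reward(np.array([1.0, 1.1, 0.9])))   # low disagreement: lightly discounted
  print(discounted_reward(np.array([2.0, 0.1, 1.2])))   # disputed: heavily discounted
  ```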
- When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
  Certain errors in proxy rewards for policy gradient methods can be benign or even beneficial, preventing policies from stalling on outputs with mediocre ground-truth rewards and yielding improved RLHF metrics and insights for reward design.
- Removing Sandbagging in LLMs by Training with Weak Supervision
  SFT on weak demonstrations followed by RL elicits full performance from sandbagging LLMs, but only when training and deployment are indistinguishable to the model.
- Post-AGI Economies: Autonomy and the First Fundamental Theorem of Welfare Economics
  The First Fundamental Theorem of Welfare Economics holds for autonomy-complete competitive equilibria that are autonomy-Pareto efficient, with the classical version recovered in the low-autonomy limit.
- AI Governance under Political Turnover: The Alignment Surface of Compliance Design
  A formal model shows that AI compliance designs in government create learnable approval boundaries that political successors can exploit, so that initial oversight gains come at the cost of increased long-term strategic vulnerability.
- Evaluation-driven Scaling for Scientific Discovery
  SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines, with examples including a 2x faster LASSO and new Erdős constructions.
- QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
  QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
- Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
  Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
- Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories
  Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories, including 3,632 exploits demonstrating task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.
- Long-Term Dynamical Evolution and Ejection of Near-Earth Asteroids
  Machine learning classifiers on initial orbital elements and convolutional neural networks on recurrence plots from short integrations classify long-term ejection of near-Earth asteroids with accuracy comparable to full numerical simulations.
- Human Cognition in Machines: A Unified Perspective of World Models
  The paper introduces a unified framework for world models that incorporates the full set of cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
- Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models
  Eight AI models show split value priorities at the top layer, divergent evidence preferences in the middle, and broad convergence on institutional sources at the bottom, with substantial sensitivity to scenario framing.
- EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
  EmbodiedGovBench is a benchmark framework that measures embodied agent systems on seven governance dimensions, including policy adherence, recovery success, and upgrade safety.
- PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC
  PriPG-RL trains RL policies for POMDPs by distilling knowledge from a privileged anytime-feasible MPC planner into a P2P-SAC policy, improving sample efficiency and performance in partially observable robotic navigation.
- Active Reward Machine Inference From Raw State Trajectories
  Reward machines can be inferred from raw state trajectories alone when sufficient data is available, with an active learning extension that queries trajectory extensions for better efficiency.
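  For readers unfamiliar with the object being inferred: a reward machine is a finite-state machine whose transitions fire on propositional events and emit rewards (after Toro Icarte et al.). A minimal example of the target structure; the inference procedure itself is not shown:

  ```python
  # A reward machine for "fetch coffee, then deliver it to the office".
  from typing import Dict, Tuple

  class RewardMachine:
      def __init__(self, transitions: Dict[Tuple[str, str], Tuple[str, float]], start: str):
          self.transitions = transitions    # (state, event) -> (next_state, reward)
          self.state = start

      def step(self, event: str) -> float:
          # Unmatched events leave the machine in place with zero reward.
          self.state, reward = self.transitions.get((self.state, event), (self.state, 0.0))
          return reward

  rm = RewardMachine({("u0", "got_coffee"): ("u1", 0.0),
                      ("u1", "at_office"): ("u2", 1.0)}, start="u0")
  print([rm.step(e) for e in ["at_office", "got_coffee", "at_office"]])  # [0.0, 0.0, 1.0]
  ```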
- Simulating the Evolution of Alignment and Values in Machine Intelligence
  Evolutionary simulations demonstrate that deceptive beliefs reach fixation in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
- Cognitive Comparability and the Limits of Governance: Evaluating Authority Under Radical Capability Asymmetry
  A six-dimension framework shows structural failures in four governance principles under radical capability asymmetry, two of which require new normative theory, and identifies a pattern of interdependent breakdown.
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
  VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth, achieving competitive performance and better generalization than SFT in vision-language models.
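  "R1-style rule-based rewards" typically means deterministic checks on format and answer correctness rather than a learned reward model. A sketch of that pattern, with tag names and weights as assumptions:

  ```python
  # Deterministic reward for tasks with clear ground truth: a small format bonus
  # plus a large accuracy bonus for an exact answer match.
  import re

  def rule_based_reward(response: str, ground_truth: str) -> float:
      reward = 0.0
      if re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.DOTALL):
          reward += 0.1                              # format reward
      m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
      if m and m.group(1).strip() == ground_truth.strip():
          reward += 1.0                              # accuracy reward
      return reward

  print(rule_based_reward("<think>box is left of cat</think><answer>left</answer>", "left"))
  # 1.1
  ```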
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.