super hub Canonical reference

Concrete Problems in AI Safety

Chris Olah, Dario Amodei, Jacob Steinhardt, John Schulman, Paul Christiano · 2016 · cs.AI · arXiv 1606.06565

Canonical reference. 90% of citing Pith papers cite this work as background.

147 Pith papers citing it

Background 90% of classified citations

open full Pith review browse 147 citing papers more from Chris Olah arXiv PDF

abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 40 method 1

citation-polarity summary

background 37 support 2 unclear 1 use method 1

claims ledger

abstract Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), a

authors

Chris Olah Dan Man\'e Dario Amodei Jacob Steinhardt John Schulman Paul Christiano

co-cited works

representative citing papers

Risks from Learned Optimization in Advanced Machine Learning Systems

cs.AI · 2019-06-05 · accept · novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning

math.ST · 2026-05-10 · unverdicted · novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

AI safety via debate

stat.ML · 2018-05-02 · conditional · novelty 8.0

AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

Theoretical Limits of Language Model Alignment

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.

Beyond Ability: The Four-Fold Spectrum of Power and the Logic of Full Inability

cs.LO · 2026-05-06 · unverdicted · novelty 7.0

Coalition Logic is extended by defining Full Inability (FI) as a distinct modality alongside Full Control, Positive Determination, and Adverse Determination, with algebraic structure, Klein four-group symmetry, and a sound, complete, conservative axiomatization CLFI that remains PSPACE-complete.

A Logic of Inability

cs.LO · 2026-04-30 · unverdicted · novelty 7.0

A conservative extension of Coalition Logic introduces an inability operator as negation of ability, with proofs of soundness, completeness, and conservativity plus analysis of its modal properties.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Discovering Agentic Safety Specifications from 1-Bit Danger Signals

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.

Navigating the Conceptual Multiverse

cs.HC · 2026-04-20 · unverdicted · novelty 7.0

The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

cs.LG · 2026-04-18 · unverdicted · novelty 7.0

HealthCraft is the first public RL safety environment for emergency medicine that evaluates frontier LLMs on trajectory-level safety with a dual-layer rubric, showing low multi-step performance and high safety failure rates.

Reinforcement Learning via Value Gradient Flow

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

cs.LG · 2026-04-13 · unverdicted · novelty 7.0 · 2 refs

The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.

Learning Robustness at Test-Time from a Non-Robust Teacher

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.

AI Integrity: A New Paradigm for Verifiable AI Governance

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Geographic Blind Spots in AI Control Monitors: A Cross-National Audit of Claude Opus 4.6

cs.CY · 2026-03-20 · unverdicted · novelty 7.0

Claude Opus 4.6 fabricates more answers on Global North AI contexts than Global South ones, creating an exploitable vulnerability in AI control monitors.

citing papers explorer

Showing 3 of 3 citing papers after filters.

A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 5 · internal anchor
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
Solving math word problems with process- and outcome-based feedback cs.LG · 2022-11-25 · unverdicted · none · ref 3 · internal anchor
On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.
Measuring Progress on Scalable Oversight for Large Language Models cs.HC · 2022-11-04 · unverdicted · none · ref 33 · internal anchor
Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.

Concrete Problems in AI Safety

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer