super hub Canonical reference

Concrete Problems in AI Safety

Chris Olah, Dario Amodei, Jacob Steinhardt, John Schulman, Paul Christiano · 2016 · cs.AI · arXiv 1606.06565

Canonical reference. 90% of citing Pith papers cite this work as background.

225 Pith papers citing it

Background 90% of classified citations

open full Pith review browse 225 citing papers more from Chris Olah arXiv PDF

abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 41 method 1

citation-polarity summary

background 38 support 2 unclear 1 use method 1

claims ledger

abstract Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), a

authors

Chris Olah Dan Man\'e Dario Amodei Jacob Steinhardt John Schulman Paul Christiano

co-cited works

representative citing papers

Risks from Learned Optimization in Advanced Machine Learning Systems

cs.AI · 2019-06-05 · accept · novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning

math.ST · 2026-05-10 · unverdicted · novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

AI safety via debate

stat.ML · 2018-05-02 · conditional · novelty 8.0

AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.

Managed Autonomy at Runtime: Gear-Based Safety and Governance for Single- and Multi-Agent Cyber-Physical Systems

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

\system{} combines five gears with utility-gated dispatch for safety in autonomous agents, proving stability for single agents and providing distributed guarantees for multi-agent CPS, evaluated on UR5 robots.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

Evolving Quantum Error-Correcting Encodings for Molecular Simulation

quant-ph · 2026-06-24 · conditional · novelty 7.0

LLM-driven evolutionary program synthesis discovers Generalized Superfast Encodings with exact distance 5 (and 6 on one instance) for molecular Hamiltonians, the first beyond distance 3.

Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts

cs.CL · 2026-06-20 · unverdicted · novelty 7.0

Introduces a Q-sort protocol using human reference factors to quantify LLM value-structure alignment via Procrustes similarity and RSA correlations, revealing cross-family heterogeneity and localized misalignments.

World Model Self-Distillation: Training World Models to Solve General Tasks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

VLM-Safe-RL adds frozen VLM signals as anticipatory costs to the CMDP Lagrangian update via dual-path CLIP, VLM-Lagrange, and confidence gating, outperforming baselines on Safety-Gymnasium FormulaOne while showing partial generalization.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

cs.AI · 2026-06-05 · unverdicted · novelty 7.0

MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.

Competing Auctions in Intermediated Markets

cs.GT · 2026-06-04 · unverdicted · novelty 7.0

Sealed-bid second-price intermediary auctions fully unravel into sealed first-price principal auctions while open formats unravel only partially, limiting intermediary design space when a credible first-price channel exists.

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

CVT-RL improves verified task success to 78.9% and reduces hacking to 3.9% in long-horizon language agents by combining intervention-validity gating with a selection-adjusted doubly robust PCCC estimator.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

cs.SE · 2026-05-29 · unverdicted · novelty 7.0

An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

Theoretical Limits of Language Model Alignment

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

citing papers explorer

Showing 25 of 25 citing papers after filters.

A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents cs.AI · 2025-12-23 · unverdicted · none · ref 5 · internal anchor
A new benchmark of 40 scenarios finds state-of-the-art LLMs exhibit outcome-driven constraint violations in 0-62.8% of cases under KPI pressure, with no consistent safety gains across model generations.
Beyond Binary Out-of-Distribution Detection: Characterizing Distributional Shifts with Multi-Statistic Diffusion Trajectories cs.LG · 2025-10-20 · unverdicted · none · ref 2 · internal anchor
DISC extracts multi-statistic trajectories from diffusion denoising to both detect and classify types of distributional shifts in OOD data.
Accelerated Learning with Linear Temporal Logic using Differentiable Simulation cs.LG · 2025-06-01 · unverdicted · none · ref 51 · internal anchor
Differentiable relaxation of LTL automata via soft labeling enables gradient-based RL from formal specifications, with theoretical bounds on discrete-differentiable discrepancy and up to 2x returns on nonlinear tasks.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation cs.AI · 2025-03-14 · conditional · none · ref 2 · internal anchor
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
Differentiable Evolutionary Reinforcement Learning cs.AI · 2025-12-15 · unverdicted · none · ref 1 · internal anchor
DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.
Constraint-Aware Reinforcement Learning via Adaptive Action Scaling cs.RO · 2025-10-13 · unverdicted · none · ref 6 · internal anchor
A separate regulator module adaptively scales actions in RL to reduce constraint violations while preserving exploration, yielding up to 126x fewer violations and over 10x higher returns on Safety Gym tasks.
When Should Users Check? Modeling Confirmation Frequency inMulti-Step Agentic AI Tasks cs.HC · 2025-10-06 · conditional · none · ref 5 · internal anchor
A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versus confirm-at-end.
On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization cs.CL · 2025-09-28 · unverdicted · none · ref 1 · internal anchor
Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
Limitations on Accurate, Trusted, Human-level Reasoning cs.LG · 2025-09-25 · unverdicted · none · ref 2 · internal anchor
An accurate and trusted AI system cannot achieve human-level reasoning because there exist tasks easily solvable by humans but not by the system.
WebSailor: Navigating Super-human Reasoning for Web Agent cs.CL · 2025-07-03 · conditional · none · ref 1 · internal anchor
WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models cs.CL · 2025-05-30 · conditional · none · ref 9 · internal anchor
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
Learning to Reason under Off-Policy Guidance cs.LG · 2025-04-21 · unverdicted · none · ref 17 · internal anchor
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model cs.CV · 2025-04-10 · unverdicted · none · ref 5 · internal anchor
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
AI Failures in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges cs.SE · 2025-03-25 · unverdicted · none · ref 15 · internal anchor
Mixed-methods study maps downstream developers' concerns, practices, and challenges with AI failures in PTM-based software.
How Generative AI Empowers Attackers and Defenders Across the Trust & Safety Landscape cs.HC · 2025-11-10 · unverdicted · none · ref 2 · internal anchor
Generative AI boosts attackers' ability to create harmful content at scale while also enabling defenders to detect threats, support users, and improve moderation processes.
RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation cs.RO · 2025-10-20 · unverdicted · none · ref 29 · internal anchor
RESample uses exploratory sampling guided by a lightweight Coverage Function to expand VLA training data coverage, yielding 12% performance gains on LIBERO and real-world tasks with 10-20% added samples.
Failure Modes of Maximum Entropy RLHF cs.LG · 2025-09-24 · unverdicted · none · ref 6 · internal anchor
Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins cs.RO · 2025-09-16 · unverdicted · none · ref 4 · internal anchor
ODiSAR uses a Transformer digital twin with reconstruction error and Monte Carlo dropout to detect OOD events in self-adaptive robots, reporting up to 98% AUROC on office navigation and maritime ship tasks.
Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks cs.LG · 2025-09-08 · unverdicted · none · ref 32 · internal anchor
Proposes a probabilistic framework for latent agentic substructures in DNNs using log-score utilities and log pooling, with proofs on unanimity and an application to persona emergence in LLM alignment.
HiPreNets: High-Precision Neural Networks through Progressive Training cs.LG · 2025-06-18 · unverdicted · none · ref 2 · internal anchor
HiPreNets progressively refines neural networks via residual learning and adaptive techniques to reduce both RMSE and L^∞ errors, outperforming standard networks on Feynman benchmarks and enabling fast high-dimensional ODE surrogates.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 13 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Phi-4-reasoning Technical Report cs.AI · 2025-04-30 · unverdicted · none · ref 6 · internal anchor
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
A Survey of Scaling in Large Language Model Reasoning cs.AI · 2025-04-02 · unverdicted · none · ref 5 · internal anchor
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems cs.AI · 2025-03-31 · unverdicted · none · ref 42 · internal anchor
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.
Strategic Preemption Under Shared Catastrophic Risk: The Suicide Region and the Race to Artificial General Intelligence q-fin.RM · 2025-12-08 · unreviewed · ref 2 · internal anchor

Concrete Problems in AI Safety

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer