pith. machine review for the scientific record.

arxiv: 1606.06565 · v2 · submitted 2016-06-21 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Concrete Problems in AI Safety

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 05:12 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG
keywords AI safety · machine learning accidents · side effects · reward hacking · scalable supervision · safe exploration · distributional shift

The pith

The main risks of accidents in AI systems come from five specific problems related to their objectives and learning processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to shift AI safety discussions toward concrete, actionable issues by defining accidents as unintended harmful behavior that emerges from flawed real-world designs. It groups five research problems into categories based on whether they stem from an incorrect objective, an objective that is too costly to check frequently, or unwanted behavior that occurs during training. A sympathetic reader would care because solving these problems could prevent common failures as AI systems take on more real-world responsibilities. The authors review relevant prior work and propose directions that apply to current advanced machine learning systems. They also raise the broader question of how to approach safety for future AI applications.

Core claim

Accidents in machine learning systems are unintended and harmful behaviors that arise from poor design. The authors present five practical problems that contribute to such accidents, grouped by origin: avoiding side effects and avoiding reward hacking arise from having the wrong objective function; scalable supervision addresses objectives that are too expensive to evaluate often; and safe exploration and distributional shift cover undesirable behavior during the learning process. Previous work is surveyed and research directions are suggested with emphasis on relevance to cutting-edge AI systems.

What carries the argument

A five-problem taxonomy that classifies accident risks according to whether they originate in the objective function or in the learning process itself.
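As an editorial illustration only (not code from the paper), the grouping can be written down directly from the abstract's categorization; the problem names and origins below come from the paper, while the encoding itself is ours.

    # Minimal sketch: the five-problem taxonomy from the abstract, grouped by origin.
    # Problem names and origins are the paper's; this data structure is illustrative.
    from enum import Enum

    class Origin(Enum):
        WRONG_OBJECTIVE = "wrong objective function"
        EXPENSIVE_OBJECTIVE = "objective too expensive to evaluate frequently"
        LEARNING_PROCESS = "undesirable behavior during learning"

    TAXONOMY = {
        "avoiding side effects": Origin.WRONG_OBJECTIVE,
        "avoiding reward hacking": Origin.WRONG_OBJECTIVE,
        "scalable supervision": Origin.EXPENSIVE_OBJECTIVE,
        "safe exploration": Origin.LEARNING_PROCESS,
        "distributional shift": Origin.LEARNING_PROCESS,
    }

    # Reproduce the paper's grouping by origin.
    for origin in Origin:
        problems = [p for p, o in TAXONOMY.items() if o is origin]
        print(f"{origin.value}: {', '.join(problems)}")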

If this is right

  • Research focused on avoiding side effects will reduce cases where AI pursues its goal while damaging unrelated aspects of its environment.
  • Work on avoiding reward hacking will limit AI from exploiting loopholes in its objective that produce unintended outcomes (a toy sketch follows this list).
  • Advances in scalable supervision will allow training on complex tasks without requiring human evaluation at every step.
  • Safe exploration methods will decrease the chance that AI takes dangerous actions while learning about its surroundings.
  • Handling distributional shift will improve reliability when an AI encounters conditions different from its training data.
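To make the reward-hacking item concrete, here is a deliberately tiny, hypothetical sketch in the spirit of the paper's cleaning-robot running example: an agent that greedily maximizes a flawed proxy reward ("no visible mess") prefers an action with zero true value. The action names and reward numbers are invented for illustration, not taken from the paper.

    # Toy reward hacking: greedy maximization of a hackable proxy reward.
    # Everything here is hypothetical; it is not an experiment from the paper.
    actions = ["clean_room", "hide_mess_under_rug", "do_nothing"]

    def true_reward(action):
        # What the designer actually wants: the room genuinely cleaned.
        return {"clean_room": 1.0, "hide_mess_under_rug": 0.0, "do_nothing": 0.0}[action]

    def proxy_reward(action):
        # What the agent is trained on: "no visible mess" -- a loophole,
        # since hiding the mess scores even better than cleaning it.
        return {"clean_room": 1.0, "hide_mess_under_rug": 1.2, "do_nothing": 0.0}[action]

    chosen = max(actions, key=proxy_reward)
    print(f"agent picks: {chosen}")  # -> hide_mess_under_rug
    print(f"proxy reward: {proxy_reward(chosen):.1f}, true reward: {true_reward(chosen):.1f}")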

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The problems may interact with one another, so progress on one could affect the difficulty of addressing the others.
  • The taxonomy might be extended to cover multi-agent systems or longer time horizons that the paper does not examine in detail.
  • Empirical tests could check whether systems that mitigate all five problems exhibit fewer unintended behaviors in controlled simulations.
  • The list could help guide safety standards for AI used in high-stakes domains such as transportation or healthcare.

Load-bearing premise

That these five problems represent the primary and most actionable sources of accident risk in real-world AI systems.

What would settle it

An observed case of unintended harmful behavior in a deployed AI system that cannot be traced to any of the five problems even after targeted mitigations are applied.

read the original abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript defines accidents in AI systems as unintended and harmful behavior arising from poor design of real-world systems. It presents five practical research problems related to accident risk, grouped by origin: wrong objective functions (avoiding side effects and avoiding reward hacking), expensive-to-evaluate objectives (scalable supervision), and issues during learning (safe exploration and distributional shift). The authors review prior work in each area, suggest research directions relevant to cutting-edge AI, and close by considering how to think productively about safety for forward-looking applications.

Significance. If the framing holds, the paper supplies a structured, actionable list of research problems that can orient the AI safety literature toward near-term, practical concerns rather than purely speculative ones. Its categorization by source (objective vs. learning process) offers a useful organizing lens, and the literature review integrates existing threads in ML with safety considerations. This approach has the potential to encourage safety work that is directly relevant to deployed systems without requiring new theoretical machinery.

minor comments (2)
  1. [Introduction] The definition of accidents in the opening could be grounded with one concrete, non-speculative example drawn from current ML deployments to improve accessibility.
  2. [concluding section] The final high-level section on productive thinking about safety would benefit from a short paragraph outlining minimal criteria (e.g., falsifiability or relevance to current systems) that future safety proposals should meet.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. The referee's summary accurately reflects the paper's focus on defining AI accidents and organizing five concrete research problems by their origins in objective functions, evaluation costs, and learning dynamics.

Circularity Check

0 steps flagged

No circularity: conceptual taxonomy without derivations or self-referential predictions

full rationale

The paper offers a high-level categorization of five AI safety research problems (avoiding side effects, avoiding reward hacking, scalable supervision, safe exploration, distributional shift) grouped by origin in objective functions or learning dynamics. This taxonomy is introduced via conceptual analysis and external literature review rather than any derivation chain, equations, fitted parameters, or first-principles predictions. No step claims a result that reduces by construction to its own inputs; the paper explicitly frames the list as practical and non-exhaustive. Self-citations appear only for background and do not bear load for any uniqueness theorem or forced conclusion. The work is self-contained as a forward-looking problem statement and carries no circularity under the specified criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on domain assumptions about AI goal-directed behavior and learning without introducing new entities or fitted parameters; the categorization itself is an ad hoc framing proposed for utility.

axioms (2)
  • domain assumption Machine learning systems can exhibit unintended and harmful behavior due to poor design of real-world AI systems.
    This is the core definition of 'accidents' used to motivate the entire discussion.
  • ad hoc to paper The five problems can be usefully categorized by their origin in objective functions or learning processes.
    The paper proposes this taxonomy as a productive way to organize research without deriving it from prior theorems.

pith-pipeline@v0.9.0 · 5462 in / 1475 out tokens · 56870 ms · 2026-05-11T05:12:03.516148+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

    cs.AI 2026-05 accept novelty 8.0

    AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

  2. The Statistical Cost of Adaptation in Multi-Source Transfer Learning

    math.ST 2026-05 unverdicted novelty 8.0

    Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

  3. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    cs.CL 2020-12 conditional novelty 8.0

    The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...

  4. AI safety via debate

    stat.ML 2018-05 conditional novelty 8.0

    AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.

  5. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  6. Theoretical Limits of Language Model Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

  7. AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

    cs.AI 2026-05 unverdicted novelty 7.0

    AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.

  8. Beyond Ability: The Four-Fold Spectrum of Power and the Logic of Full Inability

    cs.LO 2026-05 unverdicted novelty 7.0

    Coalition Logic is extended by defining Full Inability (FI) as a distinct modality alongside Full Control, Positive Determination, and Adverse Determination, with algebraic structure, Klein four-group symmetry, and a ...

  9. A Logic of Inability

    cs.LO 2026-04 unverdicted novelty 7.0

    A conservative extension of Coalition Logic introduces an inability operator as negation of ability, with proofs of soundness, completeness, and conservativity plus analysis of its modal properties.

  10. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  11. Discovering Agentic Safety Specifications from 1-Bit Danger Signals

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.

  12. Navigating the Conceptual Multiverse

    cs.HC 2026-04 unverdicted novelty 7.0

    The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choic...

  13. Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

    cs.CL 2026-04 unverdicted novelty 7.0

    R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

  14. Reinforcement Learning via Value Gradient Flow

    cs.LG 2026-04 unverdicted novelty 7.0

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  15. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...

  16. Learning Robustness at Test-Time from a Non-Robust Teacher

    cs.CV 2026-04 unverdicted novelty 7.0

    A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.

  17. AI Integrity: A New Paradigm for Verifiable AI Governance

    cs.AI 2026-04 unverdicted novelty 7.0

    AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.

  18. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  19. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  20. Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements

    cs.AI 2026-05 unverdicted novelty 6.0

    External control strategies are structurally impossible for sustaining AI safety beyond bounded capability thresholds; any remaining viable strategies must be intrinsic with stable safety-compatible objectives.

  21. Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    Semantic Reward Collapse compresses different epistemic issues into unified rewards in preference optimization, risking loss of calibrated uncertainty, with Constitutional Reward Stratification proposed as a domain-st...

  22. Overtrained, Not Misaligned

    cs.LG 2026-05 unverdicted novelty 6.0

    Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

  23. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 6.0

    Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

  24. SARC: A Governance-by-Architecture Framework for Agentic AI Systems

    cs.SE 2026-05 unverdicted novelty 6.0

    SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...

  25. Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

    cs.AI 2026-05 unverdicted novelty 6.0

    Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to prune up to 50% of wasted tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates wi...

  26. On the Blessing of Pre-training in Weak-to-Strong Generalization

    cs.LG 2026-05 unverdicted novelty 6.0

    Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.

  27. Understanding Annotator Safety Policy with Interpretability

    cs.AI 2026-05 unverdicted novelty 6.0

    Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

  28. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  29. Stayin' Aligned Over Time: Towards Longitudinal Human-LLM Alignment via Contextual Reflection and Privacy-Preserving Behavioral Data

    cs.HC 2026-05 unverdicted novelty 6.0

    A methodological framework and browser system BITE for collecting evolving user preferences on LLM outputs through context-triggered reflections and privacy-preserving data over time.

  30. A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing

    cs.CV 2026-05 unverdicted novelty 6.0

    ROSS combines median smoothing with local instability measurement to create a robust OOD detector that outperforms prior methods by up to 40 AUROC points on CIFAR and ImageNet benchmarks while defending symmetrically ...

  31. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.

  32. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...

  33. Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

    cs.AI 2026-05 unverdicted novelty 6.0

    The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...

  34. Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing

    cs.LG 2026-04 unverdicted novelty 6.0

    A framework unifies runtime monitoring for safety-critical ML into ODD, OOD, and OMS categories and demonstrates them on vision-based runway detection for landing.

  35. Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

    cs.LG 2026-04 unverdicted novelty 6.0

    Uncertainty-aware RL framework using ensemble disagreement and annotation variability reduces reward-hacking trap visits by 93.7% across grid and continuous control tasks while remaining robust to 30% label noise.

  36. When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    cs.LG 2026-04 unverdicted novelty 6.0

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...

  37. Removing Sandbagging in LLMs by Training with Weak Supervision

    cs.LG 2026-04 unverdicted novelty 6.0

    SFT on weak demonstrations followed by RL elicits full performance from sandbagging LLMs, but only when training and deployment are indistinguishable to the model.

  38. Post-AGI Economies: Autonomy and the First Fundamental Theorem of Welfare Economics

    econ.TH 2026-04 unverdicted novelty 6.0

    The First Fundamental Theorem of Welfare Economics holds for autonomy-complete competitive equilibria that are autonomy-Pareto efficient, with the classical version recovered in the low-autonomy limit.

  39. AI Governance under Political Turnover: The Alignment Surface of Compliance Design

    cs.AI 2026-04 unverdicted novelty 6.0

    A formal model shows that AI compliance designs in government create learnable approval boundaries that political successors can exploit, causing initial oversight gains to increase long-term strategic vulnerability.

  40. Evaluation-driven Scaling for Scientific Discovery

    cs.LG 2026-04 unverdicted novelty 6.0

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

  41. QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

    cs.CL 2026-04 unverdicted novelty 6.0

    QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.

  42. Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

    cs.AI 2026-04 unverdicted novelty 6.0

    Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.

  43. Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

    cs.CR 2026-04 unverdicted novelty 6.0

    Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.

  44. Long-Term Dynamical Evolution and Ejection of Near-Earth Asteroids

    astro-ph.EP 2026-04 unverdicted novelty 6.0

    Machine learning classifiers on initial orbital elements and convolutional neural networks on recurrence plots from short integrations classify long-term ejection of near-Earth asteroids with accuracy comparable to fu...

  45. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  46. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...

  47. Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Eight AI models show split value priorities at the top layer, divergent evidence preferences in the middle, and broad convergence on institutional sources at the bottom, with substantial sensitivity to scenario framing.

  48. EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.

  49. PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC

    cs.LG 2026-04 unverdicted novelty 6.0

    PriPG-RL trains RL policies for POMDPs by distilling knowledge from a privileged anytime-feasible MPC planner into a P2P-SAC policy, improving sample efficiency and performance in partially observable robotic navigation.

  50. Active Reward Machine Inference From Raw State Trajectories

    cs.RO 2026-04 unverdicted novelty 6.0

    Reward machines can be inferred from raw state trajectories alone when sufficient data is available, with an active learning extension that queries trajectory extensions for better efficiency.

  51. Simulating the Evolution of Alignment and Values in Machine Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.

  52. Cognitive Comparability and the Limits of Governance: Evaluating Authority Under Radical Capability Asymmetry

    cs.CY 2026-04 unverdicted novelty 6.0

    A six-dimension framework shows structural failures in four governance principles under radical capability asymmetry, with two requiring new normative theory and a pattern of interdependent breakdown.

  53. ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling

    cs.RO 2026-03 unverdicted novelty 6.0

    ROBOGATE applies adaptive boundary-focused sampling in simulation to discover robot policy failure boundaries, revealing a 97.65 percentage point performance gap for a VLA model between LIBERO and industrial scenarios.

  54. Alignment as Institutional Design: From Behavioral Correction to Transaction Structure in Intelligent Systems

    cs.CY 2026-03 unverdicted novelty 6.0

    AI alignment emerges when designers specify internal transaction structures that make aligned behavior the lowest-cost strategy for each component, transforming the problem from behavioral control into institutional design.

  55. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  56. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  57. Towards A Rigorous Science of Interpretable Machine Learning

    stat.ML 2017-02 unverdicted novelty 6.0

    The authors define interpretability for machine learning, specify when it is required, and propose a taxonomy for its rigorous evaluation while identifying open research questions.

  58. Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.

  59. Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

    cs.AI 2026-05 unverdicted novelty 5.0

    Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

  60. Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

    cs.AI 2026-05 unverdicted novelty 5.0

    Mechanical conscience is a supervisory filter that minimally corrects baseline AI policies to reduce cumulative deviation from admissible behavioral trajectories under epistemic uncertainty.

Reference graph

Works this paper leans on

171 extracted references · 171 canonical work pages · cited by 81 Pith papers · 4 internal anchors

  1. [1]

    Deep Learning with Differential Privacy

    Martin Abadi et al. “Deep Learning with Differential Privacy”. In: (in press (2016))

  2. [2]

    Exploration and apprenticeship learning in reinforcement learning

    Pieter Abbeel and Andrew Y Ng. “Exploration and apprenticeship learning in reinforcement learning”. In: Proceedings of the 22nd international conference on Machine learning. ACM. 2005, pp. 1–8

  3. [3]

    The Hidden Cost of Efficiency: Fairness and Discrimination in Predictive Modeling

    Julius Adebayo, Lalana Kagal, and Alex Pentland. The Hidden Cost of Efficiency: Fairness and Discrimination in Predictive Modeling . 2015

  4. [4]

    Taming the monster: A fast and simple algorithm for contextual bandits

    Alekh Agarwal et al. “Taming the monster: A fast and simple algorithm for contextual bandits”. In: (2014)

  5. [5]

    Domain-adversarial neural networks

    Hana Ajakan et al. “Domain-adversarial neural networks”. In: arXiv preprint arXiv:1412.4446 (2014)

  6. [6]

    Hiring by algorithm: predicting and preventing disparate impact

    Ifeoma Ajunwa et al. “Hiring by algorithm: predicting and preventing disparate impact”. In: Available at SSRN 2746078 (2016)

  7. [7]

    Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

    Dario Amodei et al. “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”. In: arXiv preprint arXiv:1512.02595 (2015)

  8. [8]

    Open Letter

    An Open Letter: Research Priorities for Robust and Beneficial Artificial Intelligence. Open Letter. Signed by 8,600 people; see attached research agenda. 2015

  9. [9]

    A method of moments for mixture models and hidden Markov models

    Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. “A method of moments for mixture models and hidden Markov models”. In: arXiv preprint arXiv:1203.0683 (2012)

  10. [10]

    Estimation of the parameters of a single equation in a complete system of stochastic equations

    Theodore W Anderson and Herman Rubin. “Estimation of the parameters of a single equation in a complete system of stochastic equations”. In: The Annals of Mathematical Statistics (1949), pp. 46–63

  11. [11]

    The asymptotic properties of estimates of the parameters of a single equation in a complete system of stochastic equations

    Theodore W Anderson and Herman Rubin. “The asymptotic properties of estimates of the parameters of a single equation in a complete system of stochastic equations”. In: The Annals of Mathematical Statistics (1950), pp. 570–582

  12. [12]

    Motivated value selection for artificial agents

    Stuart Armstrong. “Motivated value selection for artificial agents”. In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015

  13. [13]

    The mathematics of reduced impact: help needed

    Stuart Armstrong. The mathematics of reduced impact: help needed. 2012

  14. [14]

    Utility indifference

    Stuart Armstrong. Utility indifference. Tech. rep. Technical Report 2010-1. Oxford: Future of Humanity Institute, University of Oxford, 2010

  15. [15]

    The Risk of Automation for Jobs in OECD Countries

    Melanie Arntz, Terry Gregory, and Ulrich Zierahn. “The Risk of Automation for Jobs in OECD Countries”. In: OECD Social, Employment and Migration Working Papers (2016). url: http://dx.doi.org/10.1787/5jlz9h56dvq7-en

  16. [16]

    Open Letter

    Autonomous Weapons: An Open Letter from AI & Robotics Researchers. Open Letter. Signed by 20,000+ people. 2015

  17. [17]

    The AGI Containment Problem

    James Babcock, Janos Kramar, and Roman Yampolskiy. “The AGI Containment Problem”. In: The Ninth Conference on Artificial General Intelligence (2016)

  18. [18]

    Unsupervised supervised learning II: Margin-based classification without labels

    Krishnakumar Balasubramanian, Pinar Donmez, and Guy Lebanon. “Unsupervised supervised learning II: Margin-based classification without labels”. In: The Journal of Machine Learning Research 12 (2011), pp. 3119–3145

  19. [19]

    The security of machine learning

    Marco Barreno et al. “The security of machine learning”. In: Machine Learning 81.2 (2010), pp. 121–148

  20. [20]

    H-infinity optimal control and related minimax design problems: a dynamic game approach

    Tamer Başar and Pierre Bernhard. H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, 2008

  21. [21]

    Detecting changes in signals and systems—a survey

    Michèle Basseville. “Detecting changes in signals and systems—a survey”. In: Automatica 24.3 (1988), pp. 309–326

  22. [22]

    Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics

    F Berkenkamp, A Krause, and Angela P Schoellig. “Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics”. In: arXiv preprint arXiv:1602.04450 (2016)

  23. [23]

    The evolved radio and its implications for modelling the evolution of novel sensors

    Jon Bird and Paul Layzell. “The evolved radio and its implications for modelling the evolution of novel sensors”. In: Evolutionary Computation, 2002. CEC’02. Proceedings of the 2002 Congress on. Vol. 2. IEEE. 2002, pp. 1836–1841

  24. [24]

    Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification

    John Blitzer, Mark Dredze, Fernando Pereira, et al. “Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification”. In: ACL. Vol. 7. 2007, pp. 440–447

  25. [25]

    Domain adaptation with coupled subspaces

    John Blitzer, Sham Kakade, and Dean P Foster. “Domain adaptation with coupled subspaces”. In: International Conference on Artificial Intelligence and Statistics. 2011, pp. 173–181

  26. [26]

    Weight uncertainty in neural networks

    Charles Blundell et al. “Weight uncertainty in neural networks”. In: arXiv preprint arXiv:1505.05424 (2015)

  27. [27]

    Superintelligence: Paths, dangers, strategies

    Nick Bostrom. Superintelligence: Paths, dangers, strategies. OUP Oxford, 2014

  28. [28]

    Two high stakes challenges in machine learning

    Léon Bottou. “Two high stakes challenges in machine learning”. Invited talk at the 32nd International Conference on Machine Learning. 2015

  29. [29]

    Counterfactual Reasoning and Learning Systems

    Léon Bottou et al. “Counterfactual Reasoning and Learning Systems”. In: arXiv preprint arXiv:1209.2355 (2012)

  30. [30]

    Counterfactual reasoning and learning systems: The example of computational advertising

    Léon Bottou et al. “Counterfactual reasoning and learning systems: The example of computational advertising”. In: The Journal of Machine Learning Research 14.1 (2013), pp. 3207–3260

  31. [31]

    R-max-a general polynomial time algorithm for near-optimal reinforcement learning

    Ronen I Brafman and Moshe Tennenholtz. “R-max-a general polynomial time algorithm for near-optimal reinforcement learning”. In: The Journal of Machine Learning Research 3 (2003), pp. 213–231

  32. [32]

    The second machine age: work, progress, and prosperity in a time of brilliant technologies

    Erik Brynjolfsson and Andrew McAfee. The second machine age: work, progress, and prosperity in a time of brilliant technologies. WW Norton & Company, 2014

  33. [33]

    Open robotics

    Ryan Calo. “Open robotics”. In: Maryland Law Review 70.3 (2011)

  34. [34]

    AI Control

    Paul Christiano. AI Control. [Online; accessed 13-June-2016]. 2015. url: https://medium.com/ai-control

  35. [35]

    Risks of semi-supervised learning

    Fabio Cozman and Ira Cohen. “Risks of semi-supervised learning”. In: Semi-Supervised Learning (2006), pp. 56–72

  36. [36]

    Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents

    Andrew Critch. “Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents”. In: (2016)

  37. [37]

    Active reward learning

    Christian Daniel et al. “Active reward learning”. In: Proceedings of Robotics: Science & Systems. 2014

  38. [38]

    Ethical guidelines for a superintelligence

    Ernest Davis. “Ethical guidelines for a superintelligence.” In: Artif. Intell. 220 (2015), pp. 121–124

  39. [39]

    Maximum likelihood estimation of observer error-rates using the EM algorithm

    Alexander Philip Dawid and Allan M Skene. “Maximum likelihood estimation of observer error-rates using the EM algorithm”. In: Applied statistics (1979), pp. 20–28

  40. [40]

    Feudal reinforcement learning

    Peter Dayan and Geoffrey E Hinton. “Feudal reinforcement learning”. In: Advances in neural information processing systems. Morgan Kaufmann Publishers. 1993, pp. 271–271

  41. [41]

    Multi-objective optimization

    Kalyanmoy Deb. “Multi-objective optimization”. In: Search methodologies. Springer, 2014, pp. 403–449

  42. [42]

    Learning what to value

    Daniel Dewey. “Learning what to value”. In: Artificial General Intelligence. Springer, 2011, pp. 309–314

  43. [43]

    Reinforcement learning and the reward engineering principle

    Daniel Dewey. “Reinforcement learning and the reward engineering principle”. In: 2014 AAAI Spring Symposium Series. 2014

  44. [44]

    Unsupervised supervised learning I: Estimating classification and regression errors without labels

    Pinar Donmez, Guy Lebanon, and Krishnakumar Balasubramanian. “Unsupervised supervised learning I: Estimating classification and regression errors without labels”. In: The Journal of Machine Learning Research 11 (2010), pp. 1323–1351

  45. [45]

    Learning from labeled features using generalized expectation criteria

    Gregory Druck, Gideon Mann, and Andrew McCallum. “Learning from labeled features using generalized expectation criteria”. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM. 2008, pp. 595–602

  46. [46]

    Fairness through awareness

    Cynthia Dwork et al. “Fairness through awareness”. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. ACM. 2012, pp. 214–226

  47. [47]

    Computers and the theory of statistics: thinking the unthinkable

    Bradley Efron. “Computers and the theory of statistics: thinking the unthinkable”. In: SIAM review 21.4 (1979), pp. 460–480

  48. [48]

    Learning the preferences of ignorant, inconsistent agents

    Owain Evans, Andreas Stuhlmüller, and Noah D Goodman. “Learning the preferences of ignorant, inconsistent agents”. In: arXiv preprint arXiv:1512.05832 (2015)

  49. [49]

    Avoiding wireheading with value reinforcement learning

    Tom Everitt and Marcus Hutter. “Avoiding wireheading with value reinforcement learning”. In: arXiv preprint arXiv:1605.03143 (2016)

  50. [50]

    Self-Modification of Policy and Utility Function in Rational Agents

    Tom Everitt et al. “Self-Modification of Policy and Utility Function in Rational Agents”. In: arXiv preprint arXiv:1605.03142 (2016)

  51. [51]

    Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization

    Chelsea Finn, Sergey Levine, and Pieter Abbeel. “Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization”. In: arXiv preprint arXiv:1603.00448 (2016)

  52. [52]

    The future of employment: how susceptible are jobs to computerisation

    Carl Benedikt Frey and Michael A Osborne. “The future of employment: how susceptible are jobs to computerisation”. In: Retrieved September 7 (2013), p. 2013

  53. [53]

    Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

    Yarin Gal and Zoubin Ghahramani. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning”. In: arXiv preprint arXiv:1506.02142 (2015)

  54. [54]

    Learning with drift detection

    Joao Gama et al. “Learning with drift detection”. In: Advances in artificial intelligence–SBIA 2004. Springer, 2004, pp. 286–295

  56. [56]

    A Comprehensive Survey on Safe Reinforcement Learning

    Javier García and Fernando Fernández. “A Comprehensive Survey on Safe Reinforcement Learning”. In: Journal of Machine Learning Research 16 (2015), pp. 1437–1480

  57. [57]

    Asymptotic Convergence in Online Learning with Unbounded Delays

    Scott Garrabrant, Nate Soares, and Jessica Taylor. “Asymptotic Convergence in Online Learning with Unbounded Delays”. In: arXiv preprint arXiv:1604.05280 (2016)

  58. [58]

    Uniform Coherence

    Scott Garrabrant et al. “Uniform Coherence”. In: arXiv preprint arXiv:1604.05288 (2016)

  59. [59]

    Trusted Machine Learning for Probabilistic Models

    Shalini Ghosh et al. “Trusted Machine Learning for Probabilistic Models”. In: Reliable Ma- chine Learning in the Wild at ICML 2016 (2016)

  60. [60]

    Amplify scientific discovery with artificial intelligence

    Yolanda Gil et al. “Amplify scientific discovery with artificial intelligence”. In: Science 346.6206 (2014), pp. 171–172

  61. [61]

    Twitter sentiment classification using distant supervision

    Alec Go, Richa Bhayani, and Lei Huang. “Twitter sentiment classification using distant supervision”. In: CS224N Project Report, Stanford 1 (2009), p. 12

  62. [62]

    Generative adversarial nets

    Ian Goodfellow et al. “Generative adversarial nets”. In: Advances in Neural Information Processing Systems. 2014, pp. 2672–2680

  63. [63]

    Explaining and Harnessing Adversarial Examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and harnessing adversarial examples”. In: arXiv preprint arXiv:1412.6572 (2014)

  64. [64]

    Problems of monetary management: the UK experience

    Charles AE Goodhart. Problems of monetary management: the UK experience. Springer, 1984

  65. [65]

    Neural Turing Machines

    Alex Graves, Greg Wayne, and Ivo Danihelka. “Neural turing machines”. In: arXiv preprint arXiv:1410.5401 (2014)

  66. [66]

    Distantly Supervised Information Extraction Using Bootstrapped Patterns

    Sonal Gupta. “Distantly Supervised Information Extraction Using Bootstrapped Patterns”. PhD thesis. Stanford University, 2015

  67. [67]

    Cooperative Inverse Reinforcement Learning

    Dylan Hadfield-Menell et al. Cooperative Inverse Reinforcement Learning. 2016

  68. [68]

    The Off-Switch

    Dylan Hadfield-Menell et al. “The Off-Switch”. In: (2016)

  69. [69]

    Large sample properties of generalized method of moments estimators

    Lars Peter Hansen. “Large sample properties of generalized method of moments estimators”. In: Econometrica: Journal of the Econometric Society (1982), pp. 1029–1054

  70. [70]

    Nobel Lecture: Uncertainty Outside and Inside Economic Models

    Lars Peter Hansen. “Nobel Lecture: Uncertainty Outside and Inside Economic Models”. In: Journal of Political Economy 122.5 (2014), pp. 945–987

  71. [71]

    Tracking the best linear predictor

    Mark Herbster and Manfred K Warmuth. “Tracking the best linear predictor”. In: The Journal of Machine Learning Research 1 (2001), pp. 281–309

  72. [72]

    Model-based utility functions

    Bill Hibbard. “Model-based utility functions”. In: Journal of Artificial General Intelligence 3.1 (2012), pp. 1–24

  73. [73]

    Kernel methods in machine learning

    Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. “Kernel methods in machine learning”. In: The annals of statistics (2008), pp. 1171–1220

  74. [74]

    Robust dynamic programming

    Garud N Iyengar. “Robust dynamic programming”. In: Mathematics of Operations Research 30.2 (2005), pp. 257–280

  75. [75]

    Estimating the accuracies of multiple classifiers without labeled data

    Ariel Jaffe, Boaz Nadler, and Yuval Kluger. “Estimating the accuracies of multiple classifiers without labeled data”. In: arXiv preprint arXiv:1407.7644 (2014)

  76. [76]

    A formally verified hybrid system for the next-generation airborne collision avoidance system

    Jean-Baptiste Jeannin et al. “A formally verified hybrid system for the next-generation airborne collision avoidance system”. In: Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2015, pp. 21–36

  77. [77]

    Differential privacy and machine learning: A survey and review

    Zhanglong Ji, Zachary C Lipton, and Charles Elkan. “Differential privacy and machine learning: A survey and review”. In: arXiv preprint arXiv:1412.7584 (2014)

  78. [78]

    Learning Representations for Counterfactual Inference

    Fredrik D Johansson, Uri Shalit, and David Sontag. “Learning Representations for Counterfactual Inference”. In: arXiv preprint arXiv:1605.03661 (2016)

  79. [79]

    Planning and acting in partially observable stochastic domains

    Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. “Planning and acting in partially observable stochastic domains”. In: Artificial intelligence 101.1 (1998), pp. 99–134

  80. [80]

    Neural GPUs Learn Algorithms

    Lukasz Kaiser and Ilya Sutskever. “Neural GPUs learn algorithms”. In: arXiv preprint arXiv:1511.08228 (2015)

Showing first 80 references.