pith. machine review for the scientific record.

arxiv: 1606.06565 · v2 · submitted 2016-06-21 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Concrete Problems in AI Safety

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 05:12 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG
keywords AI safety · machine learning accidents · side effects · reward hacking · scalable supervision · safe exploration · distributional shift

The pith

The main risks of accidents in AI systems come from five specific problems related to their objectives and learning processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to shift AI safety discussions toward concrete, actionable issues by defining accidents as unintended harmful behavior that emerges from flawed real-world designs. It groups five research problems into categories based on whether they stem from an incorrect objective, an objective that is too costly to check frequently, or unwanted behavior that occurs during training. A sympathetic reader would care because solving these problems could prevent common failures as AI systems take on more real-world responsibilities. The authors review relevant prior work and propose directions that apply to current advanced machine learning systems. They also raise the broader question of how to approach safety for future AI applications.

Core claim

Accidents in machine learning systems are unintended and harmful behaviors that arise from poor design. The authors present five practical problems that contribute to such accidents, grouped by origin: avoiding side effects and avoiding reward hacking arise from having the wrong objective function; scalable supervision addresses objectives that are too expensive to evaluate often; and safe exploration and distributional shift cover undesirable behavior during the learning process. Previous work is surveyed and research directions are suggested with emphasis on relevance to cutting-edge AI systems.

What carries the argument

A five-problem taxonomy that classifies accident risks according to whether they originate in the objective function or in the learning process itself.
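As an editorial illustration only (not code from the paper), the grouping can be written down directly from the abstract's categorization; the problem names and origins below come from the paper, while the encoding itself is ours.

    # Minimal sketch: the five-problem taxonomy from the abstract, grouped by origin.
    # Problem names and origins are the paper's; this data structure is illustrative.
    from enum import Enum

    class Origin(Enum):
        WRONG_OBJECTIVE = "wrong objective function"
        EXPENSIVE_OBJECTIVE = "objective too expensive to evaluate frequently"
        LEARNING_PROCESS = "undesirable behavior during learning"

    TAXONOMY = {
        "avoiding side effects": Origin.WRONG_OBJECTIVE,
        "avoiding reward hacking": Origin.WRONG_OBJECTIVE,
        "scalable supervision": Origin.EXPENSIVE_OBJECTIVE,
        "safe exploration": Origin.LEARNING_PROCESS,
        "distributional shift": Origin.LEARNING_PROCESS,
    }

    # Reproduce the paper's grouping by origin.
    for origin in Origin:
        problems = [p for p, o in TAXONOMY.items() if o is origin]
        print(f"{origin.value}: {', '.join(problems)}")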

If this is right

  • Research focused on avoiding side effects will reduce cases where AI pursues its goal while damaging unrelated aspects of its environment.
  • Work on avoiding reward hacking will limit AI from exploiting loopholes in its objective that produce unintended outcomes (a toy sketch follows this list).
  • Advances in scalable supervision will allow training on complex tasks without requiring human evaluation at every step.
  • Safe exploration methods will decrease the chance that AI takes dangerous actions while learning about its surroundings.
  • Handling distributional shift will improve reliability when an AI encounters conditions different from its training data.
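To make the reward-hacking item concrete, here is a deliberately tiny, hypothetical sketch in the spirit of the paper's cleaning-robot running example: an agent that greedily maximizes a flawed proxy reward ("no visible mess") prefers an action with zero true value. The action names and reward numbers are invented for illustration, not taken from the paper.

    # Toy reward hacking: greedy maximization of a hackable proxy reward.
    # Everything here is hypothetical; it is not an experiment from the paper.
    actions = ["clean_room", "hide_mess_under_rug", "do_nothing"]

    def true_reward(action):
        # What the designer actually wants: the room genuinely cleaned.
        return {"clean_room": 1.0, "hide_mess_under_rug": 0.0, "do_nothing": 0.0}[action]

    def proxy_reward(action):
        # What the agent is trained on: "no visible mess" -- a loophole,
        # since hiding the mess scores even better than cleaning it.
        return {"clean_room": 1.0, "hide_mess_under_rug": 1.2, "do_nothing": 0.0}[action]

    chosen = max(actions, key=proxy_reward)
    print(f"agent picks: {chosen}")  # -> hide_mess_under_rug
    print(f"proxy reward: {proxy_reward(chosen):.1f}, true reward: {true_reward(chosen):.1f}")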

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The problems may interact with one another, so progress on one could affect the difficulty of addressing the others.
  • The taxonomy might be extended to cover multi-agent systems or longer time horizons that the paper does not examine in detail.
  • Empirical tests could check whether systems that mitigate all five problems exhibit fewer unintended behaviors in controlled simulations.
  • The list could help guide safety standards for AI used in high-stakes domains such as transportation or healthcare.

Load-bearing premise

That these five problems represent the primary and most actionable sources of accident risk in real-world AI systems.

What would settle it

An observed case of unintended harmful behavior in a deployed AI system that cannot be traced to any of the five problems even after targeted mitigations are applied.

read the original abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript defines accidents in AI systems as unintended and harmful behavior arising from poor design of real-world systems. It presents five practical research problems related to accident risk, grouped by origin: wrong objective functions (avoiding side effects and avoiding reward hacking), expensive-to-evaluate objectives (scalable supervision), and issues during learning (safe exploration and distributional shift). The authors review prior work in each area, suggest research directions relevant to cutting-edge AI, and close by considering how to think productively about safety for forward-looking applications.

Significance. If the framing holds, the paper supplies a structured, actionable list of research problems that can orient the AI safety literature toward near-term, practical concerns rather than purely speculative ones. Its categorization by source (objective vs. learning process) offers a useful organizing lens, and the literature review integrates existing threads in ML with safety considerations. This approach has the potential to encourage safety work that is directly relevant to deployed systems without requiring new theoretical machinery.

minor comments (2)
  1. [Introduction] The definition of accidents in the opening could be grounded with one concrete, non-speculative example drawn from current ML deployments to improve accessibility.
  2. [concluding section] The final high-level section on productive thinking about safety would benefit from a short paragraph outlining minimal criteria (e.g., falsifiability or relevance to current systems) that future safety proposals should meet.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. The referee's summary accurately reflects the paper's focus on defining AI accidents and organizing five concrete research problems by their origins in objective functions, evaluation costs, and learning dynamics.

Circularity Check

0 steps flagged

No circularity: conceptual taxonomy without derivations or self-referential predictions

full rationale

The paper offers a high-level categorization of five AI safety research problems (avoiding side effects, avoiding reward hacking, scalable supervision, safe exploration, distributional shift) grouped by origin in objective functions or learning dynamics. This taxonomy is introduced via conceptual analysis and external literature review rather than any derivation chain, equations, fitted parameters, or first-principles predictions. No step claims a result that reduces by construction to its own inputs; the paper explicitly frames the list as practical and non-exhaustive. Self-citations appear only for background and do not bear load for any uniqueness theorem or forced conclusion. The work is self-contained as a forward-looking problem statement and carries no circularity under the specified criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on domain assumptions about AI goal-directed behavior and learning without introducing new entities or fitted parameters; the categorization itself is an ad hoc framing proposed for utility.

axioms (2)
  • domain assumption Machine learning systems can exhibit unintended and harmful behavior due to poor design of real-world AI systems.
    This is the core definition of 'accidents' used to motivate the entire discussion.
  • ad hoc to paper The five problems can be usefully categorized by their origin in objective functions or learning processes.
    The paper proposes this taxonomy as a productive way to organize research without deriving it from prior theorems.

pith-pipeline@v0.9.0 · 5462 in / 1475 out tokens · 56870 ms · 2026-05-11T05:12:03.516148+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

    cs.AI 2026-05 accept novelty 8.0

    AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

  2. The Statistical Cost of Adaptation in Multi-Source Transfer Learning

    math.ST 2026-05 unverdicted novelty 8.0

    Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

  3. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    cs.CL 2020-12 conditional novelty 8.0

    The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...

  4. AI safety via debate

    stat.ML 2018-05 conditional novelty 8.0

    AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.

  5. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  6. Theoretical Limits of Language Model Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

  7. AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

    cs.AI 2026-05 unverdicted novelty 7.0

    AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.

  8. Beyond Ability: The Four-Fold Spectrum of Power and the Logic of Full Inability

    cs.LO 2026-05 unverdicted novelty 7.0

    Coalition Logic is extended by defining Full Inability (FI) as a distinct modality alongside Full Control, Positive Determination, and Adverse Determination, with algebraic structure, Klein four-group symmetry, and a ...

  9. A Logic of Inability

    cs.LO 2026-04 unverdicted novelty 7.0

    A conservative extension of Coalition Logic introduces an inability operator as negation of ability, with proofs of soundness, completeness, and conservativity plus analysis of its modal properties.

  10. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  11. Discovering Agentic Safety Specifications from 1-Bit Danger Signals

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.

  12. Navigating the Conceptual Multiverse

    cs.HC 2026-04 unverdicted novelty 7.0

    The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choic...

  13. Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

    cs.CL 2026-04 unverdicted novelty 7.0

    R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

  14. Reinforcement Learning via Value Gradient Flow

    cs.LG 2026-04 unverdicted novelty 7.0

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  15. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...

  16. Learning Robustness at Test-Time from a Non-Robust Teacher

    cs.CV 2026-04 unverdicted novelty 7.0

    A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.

  17. AI Integrity: A New Paradigm for Verifiable AI Governance

    cs.AI 2026-04 unverdicted novelty 7.0

    AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.

  18. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  19. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  20. Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements

    cs.AI 2026-05 unverdicted novelty 6.0

    External control strategies are structurally impossible for sustaining AI safety beyond bounded capability thresholds; any remaining viable strategies must be intrinsic with stable safety-compatible objectives.

  21. Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    Semantic Reward Collapse compresses different epistemic issues into unified rewards in preference optimization, risking loss of calibrated uncertainty, with Constitutional Reward Stratification proposed as a domain-st...

  22. Overtrained, Not Misaligned

    cs.LG 2026-05 unverdicted novelty 6.0

    Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

  23. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 6.0

    Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

  24. SARC: A Governance-by-Architecture Framework for Agentic AI Systems

    cs.SE 2026-05 unverdicted novelty 6.0

    SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...

  25. Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

    cs.AI 2026-05 unverdicted novelty 6.0

    Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to prune up to 50% of wasted tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates wi...

  26. On the Blessing of Pre-training in Weak-to-Strong Generalization

    cs.LG 2026-05 unverdicted novelty 6.0

    Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.

  27. Understanding Annotator Safety Policy with Interpretability

    cs.AI 2026-05 unverdicted novelty 6.0

    Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

  28. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  29. Stayin' Aligned Over Time: Towards Longitudinal Human-LLM Alignment via Contextual Reflection and Privacy-Preserving Behavioral Data

    cs.HC 2026-05 unverdicted novelty 6.0

    A methodological framework and browser system BITE for collecting evolving user preferences on LLM outputs through context-triggered reflections and privacy-preserving data over time.

  30. A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing

    cs.CV 2026-05 unverdicted novelty 6.0

    ROSS combines median smoothing with local instability measurement to create a robust OOD detector that outperforms prior methods by up to 40 AUROC points on CIFAR and ImageNet benchmarks while defending symmetrically ...

  31. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.

  32. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...

  33. Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

    cs.AI 2026-05 unverdicted novelty 6.0

    The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...

  34. Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing

    cs.LG 2026-04 unverdicted novelty 6.0

    A framework unifies runtime monitoring for safety-critical ML into ODD, OOD, and OMS categories and demonstrates them on vision-based runway detection for landing.

  35. Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

    cs.LG 2026-04 unverdicted novelty 6.0

    Uncertainty-aware RL framework using ensemble disagreement and annotation variability reduces reward-hacking trap visits by 93.7% across grid and continuous control tasks while remaining robust to 30% label noise.

  36. When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    cs.LG 2026-04 unverdicted novelty 6.0

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...

  37. Removing Sandbagging in LLMs by Training with Weak Supervision

    cs.LG 2026-04 unverdicted novelty 6.0

    SFT on weak demonstrations followed by RL elicits full performance from sandbagging LLMs, but only when training and deployment are indistinguishable to the model.

  38. Post-AGI Economies: Autonomy and the First Fundamental Theorem of Welfare Economics

    econ.TH 2026-04 unverdicted novelty 6.0

    The First Fundamental Theorem of Welfare Economics holds for autonomy-complete competitive equilibria that are autonomy-Pareto efficient, with the classical version recovered in the low-autonomy limit.

  39. AI Governance under Political Turnover: The Alignment Surface of Compliance Design

    cs.AI 2026-04 unverdicted novelty 6.0

    A formal model shows that AI compliance designs in government create learnable approval boundaries that political successors can exploit, causing initial oversight gains to increase long-term strategic vulnerability.

  40. Evaluation-driven Scaling for Scientific Discovery

    cs.LG 2026-04 unverdicted novelty 6.0

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

  41. QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

    cs.CL 2026-04 unverdicted novelty 6.0

    QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.

  42. Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

    cs.AI 2026-04 unverdicted novelty 6.0

    Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.

  43. Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

    cs.CR 2026-04 unverdicted novelty 6.0

    Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.

  44. Long-Term Dynamical Evolution and Ejection of Near-Earth Asteroids

    astro-ph.EP 2026-04 unverdicted novelty 6.0

    Machine learning classifiers on initial orbital elements and convolutional neural networks on recurrence plots from short integrations classify long-term ejection of near-Earth asteroids with accuracy comparable to fu...

  45. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  46. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...

  47. Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Eight AI models show split value priorities at the top layer, divergent evidence preferences in the middle, and broad convergence on institutional sources at the bottom, with substantial sensitivity to scenario framing.

  48. EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.

  49. PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC

    cs.LG 2026-04 unverdicted novelty 6.0

    PriPG-RL trains RL policies for POMDPs by distilling knowledge from a privileged anytime-feasible MPC planner into a P2P-SAC policy, improving sample efficiency and performance in partially observable robotic navigation.

  50. Active Reward Machine Inference From Raw State Trajectories

    cs.RO 2026-04 unverdicted novelty 6.0

    Reward machines can be inferred from raw state trajectories alone when sufficient data is available, with an active learning extension that queries trajectory extensions for better efficiency.

  51. Simulating the Evolution of Alignment and Values in Machine Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.

  52. Cognitive Comparability and the Limits of Governance: Evaluating Authority Under Radical Capability Asymmetry

    cs.CY 2026-04 unverdicted novelty 6.0

    A six-dimension framework shows structural failures in four governance principles under radical capability asymmetry, with two requiring new normative theory and a pattern of interdependent breakdown.

  53. ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling

    cs.RO 2026-03 unverdicted novelty 6.0

    ROBOGATE applies adaptive boundary-focused sampling in simulation to discover robot policy failure boundaries, revealing a 97.65 percentage point performance gap for a VLA model between LIBERO and industrial scenarios.

  54. Alignment as Institutional Design: From Behavioral Correction to Transaction Structure in Intelligent Systems

    cs.CY 2026-03 unverdicted novelty 6.0

    AI alignment emerges when designers specify internal transaction structures that make aligned behavior the lowest-cost strategy for each component, transforming the problem from behavioral control into institutional design.

  55. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  56. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  57. Towards A Rigorous Science of Interpretable Machine Learning

    stat.ML 2017-02 unverdicted novelty 6.0

    The authors define interpretability for machine learning, specify when it is required, and propose a taxonomy for its rigorous evaluation while identifying open research questions.

  58. Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.

  59. Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

    cs.AI 2026-05 unverdicted novelty 5.0

    Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

  60. Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

    cs.AI 2026-05 unverdicted novelty 5.0

    Mechanical conscience is a supervisory filter that minimally corrects baseline AI policies to reduce cumulative deviation from admissible behavioral trajectories under epistemic uncertainty.

Reference graph

Works this paper leans on

171 extracted references · 171 canonical work pages · cited by 81 Pith papers · 4 internal anchors

  1. [1]

    Deep Learning with Differential Privacy

    Martin Abadi et al. “Deep Learning with Differential Privacy”. In: (in press (2016))

  2. [2]

    Exploration and apprenticeship learning in reinforcement learning

    Pieter Abbeel and Andrew Y Ng. “Exploration and apprenticeship learning in reinforcement learning”. In: Proceedings of the 22nd international conference on Machine learning. ACM. 2005, pp. 1–8

  3. [3]

    The Hidden Cost of Efficiency: Fairness and Discrimination in Predictive Modeling

    Julius Adebayo, Lalana Kagal, and Alex Pentland. The Hidden Cost of Efficiency: Fairness and Discrimination in Predictive Modeling . 2015

  4. [4]

    Taming the monster: A fast and simple algorithm for contextual bandits

    Alekh Agarwal et al. “Taming the monster: A fast and simple algorithm for contextual bandits”. In: (2014)

  5. [5]

    Domain-adversarial neural networks

    Hana Ajakan et al. “Domain-adversarial neural networks”. In: arXiv preprint arXiv:1412.4446 (2014)

  6. [6]

    Hiring by algorithm: predicting and preventing disparate impact

    Ifeoma Ajunwa et al. “Hiring by algorithm: predicting and preventing disparate impact”. In: Available at SSRN 2746078 (2016)

  7. [7]

    Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

    Dario Amodei et al. “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”. In: arXiv preprint arXiv:1512.02595 (2015)

  8. [8]

    Open Letter

    An Open Letter: Research Priorities for Robust and Beneficial Artificial Intelligence. Open Letter. Signed by 8,600 people; see attached research agenda. 2015

  9. [9]

    A method of moments for mixture models and hidden Markov models

    Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. “A method of moments for mixture models and hidden Markov models”. In: arXiv preprint arXiv:1203.0683 (2012)

  10. [10]

    Estimation of the parameters of a single equation in a complete system of stochastic equations

    Theodore W Anderson and Herman Rubin. “Estimation of the parameters of a single equation in a complete system of stochastic equations”. In: The Annals of Mathematical Statistics (1949), pp. 46–63

  11. [11]

    The asymptotic properties of estimates of the parameters of a single equation in a complete system of stochastic equations

    Theodore W Anderson and Herman Rubin. “The asymptotic properties of estimates of the parameters of a single equation in a complete system of stochastic equations”. In: The Annals of Mathematical Statistics (1950), pp. 570–582

  12. [12]

    Motivated value selection for artificial agents

    Stuart Armstrong. “Motivated value selection for artificial agents”. In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015

  13. [13]

    The mathematics of reduced impact: help needed

    Stuart Armstrong. The mathematics of reduced impact: help needed. 2012

  14. [14]

    Utility indifference

    Stuart Armstrong. Utility indifference. Tech. rep. Technical Report 2010-1. Oxford: Future of Humanity Institute, University of Oxford, 2010

  15. [15]

    The Risk of Automation for Jobs in OECD Countries

    Melanie Arntz, Terry Gregory, and Ulrich Zierahn. “The Risk of Automation for Jobs in OECD Countries”. In: OECD Social, Employment and Migration Working Papers (2016). url: http://dx.doi.org/10.1787/5jlz9h56dvq7-en

  16. [16]

    Open Letter

    Autonomous Weapons: An Open Letter from AI & Robotics Researchers. Open Letter. Signed by 20,000+ people. 2015

  17. [17]

    The AGI Containment Problem

    James Babcock, Janos Kramar, and Roman Yampolskiy. “The AGI Containment Problem”. In: The Ninth Conference on Artificial General Intelligence (2016)

  18. [18]

    Unsupervised supervised learning II: Margin-based classification without labels

    Krishnakumar Balasubramanian, Pinar Donmez, and Guy Lebanon. “Unsupervised supervised learning II: Margin-based classification without labels”. In: The Journal of Machine Learning Research 12 (2011), pp. 3119–3145

  19. [19]

    The security of machine learning

    Marco Barreno et al. “The security of machine learning”. In: Machine Learning 81.2 (2010), pp. 121–148

  20. [20]

    H-infinity optimal control and related minimax design problems: a dynamic game approach

    Tamer Başar and Pierre Bernhard. H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, 2008

  21. [21]

    Detecting changes in signals and systems—a survey

    Michèle Basseville. “Detecting changes in signals and systems—a survey”. In: Automatica 24.3 (1988), pp. 309–326

  22. [22]

    Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics

    F Berkenkamp, A Krause, and Angela P Schoellig. “Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics”. In: arXiv preprint arXiv:1602.04450 (2016)

  23. [23]

    The evolved radio and its implications for modelling the evolution of novel sensors

    Jon Bird and Paul Layzell. “The evolved radio and its implications for modelling the evolution of novel sensors”. In: Evolutionary Computation, 2002. CEC’02. Proceedings of the 2002 Congress on. Vol. 2. IEEE. 2002, pp. 1836–1841

  24. [24]

    Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification

    John Blitzer, Mark Dredze, Fernando Pereira, et al. “Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification”. In: ACL. Vol. 7. 2007, pp. 440–447

  25. [25]

    Domain adaptation with coupled subspaces

    John Blitzer, Sham Kakade, and Dean P Foster. “Domain adaptation with coupled subspaces”. In: International Conference on Artificial Intelligence and Statistics. 2011, pp. 173–181

  26. [26]

    Weight uncertainty in neural networks

    Charles Blundell et al. “Weight uncertainty in neural networks”. In: arXiv preprint arXiv:1505.05424 (2015)

  27. [27]

    Superintelligence: Paths, dangers, strategies

    Nick Bostrom. Superintelligence: Paths, dangers, strategies. OUP Oxford, 2014

  28. [28]

    Two high stakes challenges in machine learning

    Léon Bottou. “Two high stakes challenges in machine learning”. Invited talk at the 32nd International Conference on Machine Learning. 2015

  29. [29]

    Counterfactual Reasoning and Learning Systems

    Léon Bottou et al. “Counterfactual Reasoning and Learning Systems”. In: arXiv preprint arXiv:1209.2355 (2012)

  30. [30]

    Counterfactual reasoning and learning systems: The example of computational advertising

    Léon Bottou et al. “Counterfactual reasoning and learning systems: The example of computational advertising”. In: The Journal of Machine Learning Research 14.1 (2013), pp. 3207–3260

  31. [31]

    R-max-a general polynomial time algorithm for near-optimal reinforcement learning

    Ronen I Brafman and Moshe Tennenholtz. “R-max-a general polynomial time algorithm for near-optimal reinforcement learning”. In: The Journal of Machine Learning Research 3 (2003), pp. 213–231

  32. [32]

    The second machine age: work, progress, and prosperity in a time of brilliant technologies

    Erik Brynjolfsson and Andrew McAfee. The second machine age: work, progress, and prosperity in a time of brilliant technologies. WW Norton & Company, 2014

  33. [33]

    Open robotics

    Ryan Calo. “Open robotics”. In: Maryland Law Review 70.3 (2011)

  34. [34]

    AI Control

    Paul Christiano. AI Control. [Online; accessed 13-June-2016]. 2015. url: https://medium.com/ai-control

  35. [35]

    Risks of semi-supervised learning

    Fabio Cozman and Ira Cohen. “Risks of semi-supervised learning”. In: Semi-Supervised Learning (2006), pp. 56–72

  36. [36]

    Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents

    Andrew Critch. “Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents”. In: (2016)

  37. [37]

    Active reward learning

    Christian Daniel et al. “Active reward learning”. In: Proceedings of Robotics: Science & Systems. 2014

  38. [38]

    Ethical guidelines for a superintelligence

    Ernest Davis. “Ethical guidelines for a superintelligence.” In: Artif. Intell. 220 (2015), pp. 121–124

  39. [39]

    Maximum likelihood estimation of observer error-rates using the EM algorithm

    Alexander Philip Dawid and Allan M Skene. “Maximum likelihood estimation of observer error-rates using the EM algorithm”. In: Applied statistics (1979), pp. 20–28

  40. [40]

    Feudal reinforcement learning

    Peter Dayan and Geoffrey E Hinton. “Feudal reinforcement learning”. In: Advances in neural information processing systems. Morgan Kaufmann Publishers. 1993, pp. 271–271

  41. [41]

    Multi-objective optimization

    Kalyanmoy Deb. “Multi-objective optimization”. In: Search methodologies. Springer, 2014, pp. 403–449

  42. [42]

    Learning what to value

    Daniel Dewey. “Learning what to value”. In: Artificial General Intelligence. Springer, 2011, pp. 309–314

  43. [43]

    Reinforcement learning and the reward engineering principle

    Daniel Dewey. “Reinforcement learning and the reward engineering principle”. In: 2014 AAAI Spring Symposium Series. 2014

  44. [44]

    Unsupervised supervised learning I: Estimating classification and regression errors without labels

    Pinar Donmez, Guy Lebanon, and Krishnakumar Balasubramanian. “Unsupervised supervised learning I: Estimating classification and regression errors without labels”. In: The Journal of Machine Learning Research 11 (2010), pp. 1323–1351

  45. [45]

    Learning from labeled features using generalized expectation criteria

    Gregory Druck, Gideon Mann, and Andrew McCallum. “Learning from labeled features using generalized expectation criteria”. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM. 2008, pp. 595–602

  46. [46]

    Fairness through awareness

    Cynthia Dwork et al. “Fairness through awareness”. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. ACM. 2012, pp. 214–226

  47. [47]

    Computers and the theory of statistics: thinking the unthinkable

    Bradley Efron. “Computers and the theory of statistics: thinking the unthinkable”. In: SIAM review 21.4 (1979), pp. 460–480

  48. [48]

    Learning the preferences of ignorant, inconsistent agents

    Owain Evans, Andreas Stuhlmüller, and Noah D Goodman. “Learning the preferences of ignorant, inconsistent agents”. In: arXiv preprint arXiv:1512.05832 (2015)

  49. [49]

    Avoiding wireheading with value reinforcement learning

    Tom Everitt and Marcus Hutter. “Avoiding wireheading with value reinforcement learning”. In: arXiv preprint arXiv:1605.03143 (2016)

  50. [50]

    Self-Modification of Policy and Utility Function in Rational Agents

    Tom Everitt et al. “Self-Modification of Policy and Utility Function in Rational Agents”. In: arXiv preprint arXiv:1605.03142 (2016)

  51. [51]

    Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization

    Chelsea Finn, Sergey Levine, and Pieter Abbeel. “Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization”. In: arXiv preprint arXiv:1603.00448 (2016)

  52. [52]

    The future of employment: how susceptible are jobs to computerisation

    Carl Benedikt Frey and Michael A Osborne. “The future of employment: how susceptible are jobs to computerisation”. In: Retrieved September 7 (2013), p. 2013

  53. [53]

    Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

    Yarin Gal and Zoubin Ghahramani. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning”. In: arXiv preprint arXiv:1506.02142 (2015)

  54. [54]

    Learning with drift detection

    Joao Gama et al. “Learning with drift detection”. In: Advances in artificial intelligence–SBIA 2004. Springer, 2004, pp. 286–295

  56. [56]

    A Comprehensive Survey on Safe Reinforcement Learning

    Javier García and Fernando Fernández. “A Comprehensive Survey on Safe Reinforcement Learning”. In: Journal of Machine Learning Research 16 (2015), pp. 1437–1480

  57. [57]

    Asymptotic Convergence in Online Learning with Unbounded Delays

    Scott Garrabrant, Nate Soares, and Jessica Taylor. “Asymptotic Convergence in Online Learning with Unbounded Delays”. In: arXiv preprint arXiv:1604.05280 (2016)

  58. [58]

    Uniform Coherence

    Scott Garrabrant et al. “Uniform Coherence”. In: arXiv preprint arXiv:1604.05288 (2016)

  59. [59]

    Trusted Machine Learning for Probabilistic Models

    Shalini Ghosh et al. “Trusted Machine Learning for Probabilistic Models”. In: Reliable Ma- chine Learning in the Wild at ICML 2016 (2016)

  60. [60]

    Amplify scientific discovery with artificial intelligence

    Yolanda Gil et al. “Amplify scientific discovery with artificial intelligence”. In: Science 346.6206 (2014), pp. 171–172

  61. [61]

    Twitter sentiment classification using distant supervision

    Alec Go, Richa Bhayani, and Lei Huang. “Twitter sentiment classification using distant supervision”. In: CS224N Project Report, Stanford 1 (2009), p. 12

  62. [62]

    Generative adversarial nets

    Ian Goodfellow et al. “Generative adversarial nets”. In: Advances in Neural Information Processing Systems. 2014, pp. 2672–2680

  63. [63]

    Explaining and Harnessing Adversarial Examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and harnessing adversarial examples”. In: arXiv preprint arXiv:1412.6572 (2014)

  64. [64]

    Problems of monetary management: the UK experience

    Charles AE Goodhart. Problems of monetary management: the UK experience. Springer, 1984

  65. [65]

    Neural Turing Machines

    Alex Graves, Greg Wayne, and Ivo Danihelka. “Neural turing machines”. In: arXiv preprint arXiv:1410.5401 (2014)

  66. [66]

    Distantly Supervised Information Extraction Using Bootstrapped Patterns

    Sonal Gupta. “Distantly Supervised Information Extraction Using Bootstrapped Patterns”. PhD thesis. Stanford University, 2015

  67. [67]

    Cooperative Inverse Reinforcement Learning

    Dylan Hadfield-Menell et al. Cooperative Inverse Reinforcement Learning. 2016

  68. [68]

    The Off-Switch

    Dylan Hadfield-Menell et al. “The Off-Switch”. In: (2016)

  69. [69]

    Large sample properties of generalized method of moments estimators

    Lars Peter Hansen. “Large sample properties of generalized method of moments estimators”. In: Econometrica: Journal of the Econometric Society (1982), pp. 1029–1054

  70. [70]

    Nobel Lecture: Uncertainty Outside and Inside Economic Models

    Lars Peter Hansen. “Nobel Lecture: Uncertainty Outside and Inside Economic Models”. In: Journal of Political Economy 122.5 (2014), pp. 945–987

  71. [71]

    Tracking the best linear predictor

    Mark Herbster and Manfred K Warmuth. “Tracking the best linear predictor”. In: The Journal of Machine Learning Research 1 (2001), pp. 281–309

  72. [72]

    Model-based utility functions

    Bill Hibbard. “Model-based utility functions”. In: Journal of Artificial General Intelligence 3.1 (2012), pp. 1–24

  73. [73]

    Kernel methods in machine learning

    Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. “Kernel methods in machine learning”. In: The annals of statistics (2008), pp. 1171–1220

  74. [74]

    Robust dynamic programming

    Garud N Iyengar. “Robust dynamic programming”. In: Mathematics of Operations Research 30.2 (2005), pp. 257–280

  75. [75]

    Estimating the accuracies of multiple classifiers without labeled data

    Ariel Jaffe, Boaz Nadler, and Yuval Kluger. “Estimating the accuracies of multiple classifiers without labeled data”. In: arXiv preprint arXiv:1407.7644 (2014)

  76. [76]

    A formally verified hybrid system for the next-generation airborne collision avoidance system

    Jean-Baptiste Jeannin et al. “A formally verified hybrid system for the next-generation airborne collision avoidance system”. In: Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2015, pp. 21–36

  77. [77]

    Differential privacy and machine learning: A survey and review

    Zhanglong Ji, Zachary C Lipton, and Charles Elkan. “Differential privacy and machine learning: A survey and review”. In: arXiv preprint arXiv:1412.7584 (2014)

  78. [78]

    Learning Representations for Counterfactual Inference

    Fredrik D Johansson, Uri Shalit, and David Sontag. “Learning Representations for Counterfactual Inference”. In: arXiv preprint arXiv:1605.03661 (2016)

  79. [79]

    Planning and acting in partially observable stochastic domains

    Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. “Planning and acting in partially observable stochastic domains”. In: Artificial intelligence 101.1 (1998), pp. 99–134

  80. [80]

    Neural GPUs Learn Algorithms

    Lukasz Kaiser and Ilya Sutskever. “Neural GPUs learn algorithms”. In: arXiv preprint arXiv:1511.08228 (2015)

Showing first 80 references.