Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.
super hub Canonical reference
Concrete Problems in AI Safety
Canonical reference. 90% of citing Pith papers cite this work as background.
abstract
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), a
authors
co-cited works
representative citing papers
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.
\system{} combines five gears with utility-gated dispatch for safety in autonomous agents, proving stability for single agents and providing distributed guarantees for multi-agent CPS, evaluated on UR5 robots.
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.
PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.
CVT-RL improves verified task success to 78.9% and reduces hacking to 3.9% in long-horizon language agents by combining intervention-validity gating with a selection-adjusted doubly robust PCCC estimator.
EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
Coalition Logic is extended by defining Full Inability (FI) as a distinct modality alongside Full Control, Positive Determination, and Adverse Determination, with algebraic structure, Klein four-group symmetry, and a sound, complete, conservative axiomatization CLFI that remains PSPACE-complete.
A conservative extension of Coalition Logic introduces an inability operator as negation of ability, with proofs of soundness, completeness, and conservativity plus analysis of its modal properties.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.
The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
citing papers explorer
No citing papers match the current filters.