Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.
super hub Canonical reference
Concrete Problems in AI Safety
Canonical reference. 90% of citing Pith papers cite this work as background.
abstract
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), a
authors
co-cited works
representative citing papers
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.
\system{} combines five gears with utility-gated dispatch for safety in autonomous agents, proving stability for single agents and providing distributed guarantees for multi-agent CPS, evaluated on UR5 robots.
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
VLM-Safe-RL adds frozen VLM signals as anticipatory costs to the CMDP Lagrangian update via dual-path CLIP, VLM-Lagrange, and confidence gating, outperforming baselines on Safety-Gymnasium FormulaOne while showing partial generalization.
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.
Sealed-bid second-price intermediary auctions fully unravel into sealed first-price principal auctions while open formats unravel only partially, limiting intermediary design space when a credible first-price channel exists.
Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.
PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.
CVT-RL improves verified task success to 78.9% and reduces hacking to 3.9% in long-horizon language agents by combining intervention-validity gating with a selection-adjusted doubly robust PCCC estimator.
EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
Coalition Logic is extended by defining Full Inability (FI) as a distinct modality alongside Full Control, Positive Determination, and Adverse Determination, with algebraic structure, Klein four-group symmetry, and a sound, complete, conservative axiomatization CLFI that remains PSPACE-complete.
citing papers explorer
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Chameleon: Recovering Cyber-Physical Systems from Memory Corruption Attacks via ML Surrogates
Chameleon recovers CPS from memory corruption attacks by swapping compromised compartments with ML surrogates that approximate original behavior (avg R²=0.96) while avoiding the same vulnerabilities.
-
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
-
Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories
Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.
-
Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety
Cognitive Firewall applies four gates (intent, zero-trust context, consistency, output risk) via an oversight model to cut jailbreak success to 2% or below on most tested sets while keeping over-refusal at 8%.
-
Heartbeat-Bound Hierarchical Credentials: Cryptographic Revocation for AI Agent Swarms
HBHC protocol binds hierarchical credentials to heartbeat proofs for deterministic bounded-time revocation in AI agent swarms without network round-trips.