Concrete Problems in AI Safety

Chris Olah, Dan Mané, Dario Amodei, Jacob Steinhardt, John Schulman, Paul Christiano · 2016 · cs.AI · arXiv 1606.06565

Canonical reference. 90% of citing Pith papers cite this work as background.

225 Pith papers citing it

abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

citation-role summary

background 41 · method 1

citation-polarity summary

background 38 · support 2 · unclear 1 · use method 1

authors

Chris Olah, Dan Mané, Dario Amodei, Jacob Steinhardt, John Schulman, Paul Christiano

representative citing papers

Risks from Learned Optimization in Advanced Machine Learning Systems

cs.AI · 2019-06-05 · accept · novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning

math.ST · 2026-05-10 · unverdicted · novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

AI safety via debate

stat.ML · 2018-05-02 · conditional · novelty 8.0

AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.

Managed Autonomy at Runtime: Gear-Based Safety and Governance for Single- and Multi-Agent Cyber-Physical Systems

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

The proposed system combines five gears with utility-gated dispatch for safety in autonomous agents, proving stability for single agents and providing distributed guarantees for multi-agent CPS, evaluated on UR5 robots.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

Evolving Quantum Error-Correcting Encodings for Molecular Simulation

quant-ph · 2026-06-24 · conditional · novelty 7.0

LLM-driven evolutionary program synthesis discovers Generalized Superfast Encodings with exact distance 5 (and 6 on one instance) for molecular Hamiltonians, the first beyond distance 3.

Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts

cs.CL · 2026-06-20 · unverdicted · novelty 7.0

Introduces a Q-sort protocol using human reference factors to quantify LLM value-structure alignment via Procrustes similarity and RSA correlations, revealing cross-family heterogeneity and localized misalignments.

World Model Self-Distillation: Training World Models to Solve General Tasks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

VLM-Safe-RL adds frozen VLM signals as anticipatory costs to the CMDP Lagrangian update via dual-path CLIP, VLM-Lagrange, and confidence gating, outperforming baselines on Safety-Gymnasium FormulaOne while showing partial generalization.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

cs.AI · 2026-06-05 · unverdicted · novelty 7.0

MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.

Competing Auctions in Intermediated Markets

cs.GT · 2026-06-04 · unverdicted · novelty 7.0

Sealed-bid second-price intermediary auctions fully unravel into sealed first-price principal auctions while open formats unravel only partially, limiting intermediary design space when a credible first-price channel exists.

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

CVT-RL improves verified task success to 78.9% and reduces hacking to 3.9% in long-horizon language agents by combining intervention-validity gating with a selection-adjusted doubly robust PCCC estimator.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

cs.SE · 2026-05-29 · unverdicted · novelty 7.0

An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

Theoretical Limits of Language Model Alignment

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

citing papers explorer

Showing 50 of 178 citing papers after filters.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders cs.AI · 2026-05-13 · accept · none · ref 3 · internal anchor
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
The Statistical Cost of Adaptation in Multi-Source Transfer Learning math.ST · 2026-05-10 · unverdicted · none · ref 167 · internal anchor
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
Managed Autonomy at Runtime: Gear-Based Safety and Governance for Single- and Multi-Agent Cyber-Physical Systems cs.AI · 2026-07-01 · unverdicted · none · ref 13 · internal anchor
The proposed system combines five gears with utility-gated dispatch for safety in autonomous agents, proving stability for single agents and providing distributed guarantees for multi-agent CPS, evaluated on UR5 robots.
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models cs.RO · 2026-06-25 · unverdicted · none · ref 24 · 2 links · internal anchor
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
Evolving Quantum Error-Correcting Encodings for Molecular Simulation quant-ph · 2026-06-24 · conditional · none · ref 20 · internal anchor
LLM-driven evolutionary program synthesis discovers Generalized Superfast Encodings with exact distance 5 (and 6 on one instance) for molecular Hamiltonians, the first beyond distance 3.
Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts cs.CL · 2026-06-20 · unverdicted · none · ref 44 · internal anchor
Introduces a Q-sort protocol using human reference factors to quantify LLM value-structure alignment via Procrustes similarity and RSA correlations, revealing cross-family heterogeneity and localized misalignments.
World Model Self-Distillation: Training World Models to Solve General Tasks cs.CV · 2026-06-10 · unverdicted · none · ref 4 · internal anchor
Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models cs.LG · 2026-06-09 · unverdicted · none · ref 22 · internal anchor
VLM-Safe-RL adds frozen VLM signals as anticipatory costs to the CMDP Lagrangian update via dual-path CLIP, VLM-Lagrange, and confidence gating, outperforming baselines on Safety-Gymnasium FormulaOne while showing partial generalization.
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies cs.RO · 2026-06-08 · unverdicted · none · ref 10 · internal anchor
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems cs.AI · 2026-06-05 · unverdicted · none · ref 1 · internal anchor
MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.
Competing Auctions in Intermediated Markets cs.GT · 2026-06-04 · unverdicted · none · ref 19 · internal anchor
Sealed-bid second-price intermediary auctions fully unravel into sealed first-price principal auctions while open formats unravel only partially, limiting intermediary design space when a credible first-price channel exists.
Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking cs.AI · 2026-06-04 · unverdicted · none · ref 25 · internal anchor
Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.
A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing cs.CL · 2026-06-03 · unverdicted · none · ref 4 · internal anchor
PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.
Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents cs.LG · 2026-06-03 · unverdicted · none · ref 3 · internal anchor
CVT-RL improves verified task success to 78.9% and reduces hacking to 3.9% in long-horizon language agents by combining intervention-validity gating with a selection-adjusted doubly robust PCCC estimator.
EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing cs.LG · 2026-05-30 · unverdicted · none · ref 77 · internal anchor
EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants cs.SE · 2026-05-29 · unverdicted · none · ref 5 · internal anchor
An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 1 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning cs.CV · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains cs.AI · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack cs.AI · 2026-05-12 · conditional · none · ref 1 · internal anchor
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
Theoretical Limits of Language Model Alignment cs.LG · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites cs.AI · 2026-05-07 · unverdicted · none · ref 73 · internal anchor
AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
Beyond Ability: The Four-Fold Spectrum of Power and the Logic of Full Inability cs.LO · 2026-05-06 · unverdicted · none · ref 8 · internal anchor
Coalition Logic is extended by defining Full Inability (FI) as a distinct modality alongside Full Control, Positive Determination, and Adverse Determination, with algebraic structure, Klein four-group symmetry, and a sound, complete, conservative axiomatization CLFI that remains PSPACE-complete.
A Logic of Inability cs.LO · 2026-04-30 · unverdicted · none · ref 6 · internal anchor
A conservative extension of Coalition Logic introduces an inability operator as negation of ability, with proofs of soundness, completeness, and conservativity plus analysis of its modal properties.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 128 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Discovering Agentic Safety Specifications from 1-Bit Danger Signals cs.AI · 2026-04-25 · unverdicted · none · ref 2 · internal anchor
LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.
Navigating the Conceptual Multiverse cs.HC · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF cs.CL · 2026-04-20 · unverdicted · none · ref 41 · internal anchor
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine cs.LG · 2026-04-18 · unverdicted · none · ref 3 · internal anchor
HealthCraft is the first public RL safety environment for emergency medicine that evaluates frontier LLMs on trajectory-level safety with a dual-layer rubric, showing low multi-step performance and high safety failure rates.
Reinforcement Learning via Value Gradient Flow cs.LG · 2026-04-15 · unverdicted · none · ref 3 · internal anchor
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts cs.LG · 2026-04-13 · unverdicted · none · ref 1 · 2 links · internal anchor
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.
Learning Robustness at Test-Time from a Non-Robust Teacher cs.CV · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
AI Integrity: A New Paradigm for Verifiable AI Governance cs.AI · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.
Emotion Concepts and their Function in a Large Language Model cs.AI · 2026-04-09 · unverdicted · none · ref 43 · internal anchor
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Geographic Blind Spots in AI Control Monitors: A Cross-National Audit of Claude Opus 4.6 cs.CY · 2026-03-20 · unverdicted · none · ref 8 · internal anchor
Claude Opus 4.6 fabricates more answers on Global North AI contexts than Global South ones, creating an exploitable vulnerability in AI control monitors.
Optimizing Visual Generative Models via Distribution-wise Rewards cs.LG · 2026-07-02 · unverdicted · none · ref 1 · internal anchor
Distribution-wise rewards with subset-replace strategy and post-hoc merging improve FID-50K on SiT (8.30 to 5.77) and EDM2 (3.74 to 3.52) while preserving diversity.
Chameleon: Recovering Cyber-Physical Systems from Memory Corruption Attacks via ML Surrogates cs.CR · 2026-07-01 · unverdicted · none · ref 41 · internal anchor
Chameleon recovers CPS from memory corruption attacks by swapping compromised compartments with ML surrogates that approximate original behavior (avg R²=0.96) while avoiding the same vulnerabilities.
Safety from Honesty in a Disinterested AI Predictor cs.AI · 2026-06-28 · unverdicted · none · ref 3 · internal anchor
A disinterested Bayesian Predictor trained on contextualized statements has low probability of producing harmful agency because dangerous behaviors require rare coordinated underestimation of harm with no training signal favoring them.
Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience cs.RO · 2026-06-25 · unverdicted · none · ref 35 · internal anchor
SCORE constrains sim RL to the support of a real-data policy via flow steering, raising average success on eight dexterous tasks from 37.8% to 89.9%.
The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report cs.CL · 2026-06-25 · unverdicted · none · ref 35 · internal anchor
Task conditioning suppresses safety-critical signal reporting in language and vision models that unconstrained versions report at higher rates, creating an inattentional gap that decouples benchmark safety from real-world safety.
Autodata: An agentic data scientist to create high quality synthetic data cs.AI · 2026-06-24 · unverdicted · none · ref 49 · internal anchor
Autodata introduces an agentic method with meta-optimization to create higher-quality synthetic data, yielding performance gains over standard methods on CS, legal, and math tasks.
Tensor-Based Batch Fuzzing with Adaptive Perturbation Scaling for Deep Neural Networks cs.SE · 2026-06-23 · unverdicted · none · ref 3 · internal anchor
A tensor-based batch fuzzing framework with adaptive perturbation scaling from specification ranges achieves up to 40X higher throughput and 4X more detected violations than sequential baselines on DNN benchmarks.
Reinforcement Learning Towards Broadly and Persistently Beneficial Models cs.AI · 2026-06-22 · unverdicted · none · ref 44 · internal anchor
Reinforcement learning on beneficial traits in realistic domains yields broad improvements on over 80% of out-of-distribution alignment benchmarks and greater resistance to adversarial steering.
Data Provenance for Image Auto-Regressive Generation cs.CV · 2026-06-22 · unverdicted · none · ref 11 · internal anchor
A post-hoc detection framework exploits generation-induced patterns in autoregressive image outputs to enable provenance tracing across multiple IAR models without altering the generation process.
Causal Reward World Models: Zero-shot Reward Design for Automated Skill Generation cs.RO · 2026-06-22 · unverdicted · none · ref 1 · internal anchor
CRWM pre-trains a causal model on multi-task interaction data to supply task-irrelevant causal priors that enable LLMs to synthesize executable reward functions zero-shot for robotic skill acquisition.
Uncertainty-Aware Reward Modeling for Stable RLHF cs.LG · 2026-06-18 · unverdicted · none · ref 2 · internal anchor
UARM equips reward models with quantile-based conformal prediction uncertainty and reweights GRPO advantages via heteroscedastic variance decomposition to improve calibration and reduce reward hacking in RLHF.
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning cs.CL · 2026-06-16 · unverdicted · none · ref 19 · internal anchor
The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.
ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space cs.AI · 2026-06-11 · unverdicted · none · ref 32 · internal anchor
ERTS encodes ethical dilemmas in a 22D space, applies 17 semantic perturbations under 6 constraints, and uses a 4-component index to test 6 models on 1500 cases, finding only 33% pass clearance.
Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier cs.LG · 2026-06-10 · unverdicted · none · ref 180 · internal anchor
PROPEL amortizes solver evaluation with a trained activation probe to optimize task generators toward a target solve rate, raising the share of learnable tasks from ~10% to ~20% in coding and SWE experiments.
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It cs.CL · 2026-06-09 · conditional · none · ref 19 · internal anchor
CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.
