Categorizing Variants of Goodhart's Law

David Manheim; Scott Garrabrant

arXiv: 1803.04585 · v4 · pith:QRNI77FCnew · submitted 2018-03-13 · cs.AI · q-fin.GN · stat.ML

classification: cs.AI · q-fin.GN · stat.ML

keywords: goodhart · discussion · failure · ambiguous · artificial · because · different · further

There are several distinct failure modes for overoptimization of systems on the basis of metrics. Overoptimization occurs when a metric that can be used to improve a system is used to such an extent that further optimization is ineffective or harmful; this is sometimes termed Goodhart's Law. This class of failure is often poorly understood, partly because the terminology for discussing it is ambiguous, and partly because discussion using this ambiguous terminology ignores distinctions between different failure modes of this general type. This paper expands on an earlier discussion by Garrabrant, which notes there are "(at least) four different mechanisms" that relate to Goodhart's Law. It explores these mechanisms further and specifies more clearly how they occur. This discussion should be helpful in better understanding these types of failures in economic regulation, in public policy, in machine learning, and in artificial intelligence alignment. The importance of Goodhart effects depends on the amount of power directed towards optimizing the proxy, so the increased optimization power offered by artificial intelligence makes these effects especially critical for that field.
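The abstract's closing claim, that Goodhart effects grow with the optimization power applied to the proxy, can be illustrated with a small simulation. The sketch below is not taken from the paper. It assumes one simple mechanism (picking the best of `n_candidates` options by a noisy proxy), and the function name `goodhart_gap`, the Gaussian distributions, and the parameter values are all hypothetical choices made for illustration.

```python
import random


def goodhart_gap(n_candidates, noise_sd=1.0, trials=2000, seed=0):
    """Average proxy score and true value of the option with the best proxy.

    Assumed toy model (not from the paper): each option has a true value
    drawn from N(0, 1), and the proxy is that value plus independent
    N(0, noise_sd) noise. "Optimization power" is modeled as the number
    of options the selector can choose among.
    """
    rng = random.Random(seed)
    total_proxy = 0.0
    total_true = 0.0
    for _ in range(trials):
        best_proxy = float("-inf")
        best_true = 0.0
        for _ in range(n_candidates):
            true_value = rng.gauss(0.0, 1.0)
            proxy = true_value + rng.gauss(0.0, noise_sd)
            # Select purely on the proxy, never on the true value.
            if proxy > best_proxy:
                best_proxy, best_true = proxy, true_value
        total_proxy += best_proxy
        total_true += best_true
    return total_proxy / trials, total_true / trials


# Compare what the proxy promises with what is actually obtained.
for n in (1, 10, 100):
    proxy_mean, true_mean = goodhart_gap(n)
    print(n, round(proxy_mean, 2), round(true_mean, 2),
          round(proxy_mean - true_mean, 2))
```

Under these assumptions, the gap between the selected option's proxy score and its true value widens as `n_candidates` increases, even though the true value also improves. The example covers only selection on a noisy proxy and does not represent the other mechanisms the paper discusses.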

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Risks from Learned Optimization in Advanced Machine Learning Systems
cs.AI 2019-06 accept novelty 9.0
Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
cs.SE 2026-05 unverdicted novelty 7.0
SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
cs.CL 2026-04 unverdicted novelty 7.0
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

Privacy, Prediction, and Allocation
cs.CR 2026-04 unverdicted novelty 7.0
Differentially private variants of individual and unit-level aid allocation strategies admit clean bounds on the tradeoffs between privacy, efficiency, and targeting precision across stochastic and distribution-free regimes.

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults
cs.LG 2026-06 unverdicted novelty 6.0
TS-Fault benchmark finds clean-data accuracy anti-correlates with robustness to structural faults, with all catastrophic failures under mechanism-level faults and foundation models most fragile.

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions
cs.CY 2026-05 conditional novelty 6.0
Healthcare LLM benchmarks overlook implicit assumptions about user behavior that split into task assumptions testable from conversation data and outcome assumptions requiring behavioral studies, shown by reanalyzing a...

Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers
cs.AI 2026-05 unverdicted novelty 6.0
Metis AI identifies digital tasks entangled in irreversibility, relationships, norms, and accountability that require human oversight rather than pure automation.

The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested
cs.AI 2026-05 unverdicted novelty 6.0
Frontier AI models can detect evaluation settings and alter their behavior, so standard test scores do not reliably support safety conclusions.

SARC: A Governance-by-Architecture Framework for Agentic AI Systems
cs.SE 2026-05 unverdicted novelty 6.0
SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...

The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting
cs.GT 2026-05 unverdicted novelty 6.0
Non-affine approval functions create unavoidable miscalibration in proper scoring rules for strategic agents, but step-function thresholds enable first-best screening without it, uniquely for the Brier score.

Automated alignment is harder than you think
cs.AI 2026-05 conditional novelty 6.0
AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.

Automated alignment is harder than you think
cs.AI 2026-05 unverdicted novelty 6.0
Automating alignment research with AI agents risks generating hard-to-detect errors in fuzzy tasks, producing misleading safety evaluations even without deliberate sabotage.

Automated alignment is harder than you think
cs.AI 2026-05 unverdicted novelty 6.0
Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.

AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
cs.AI 2026-04 unverdicted novelty 6.0
AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
cs.AI 2026-04 unverdicted novelty 6.0
AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 pe...

Simulating the Evolution of Alignment and Values in Machine Intelligence
cs.AI 2026-04 unverdicted novelty 6.0
Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.

Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning
cs.LG 2026-02 unverdicted novelty 6.0
Task information structure determines ML scaling success, with code's dense verifiable signals enabling predictable progress while sparse-feedback tasks like typical RL do not.

Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search
cs.LG 2026-06 unverdicted novelty 5.0
Empirical study of RLAIF for portable query generation finds reward shaping controls performance more than optimizer choice and a rule-based reward floor yields +0.147 quality gain.

The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act
cs.CY 2026-06 unverdicted novelty 5.0
No benchmark exists for doctrinal legal reasoning in LLMs, leaving the EU AI Act's accuracy mandate for judicial AI without an operational test.

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant
cs.LG 2026-06 unverdicted novelty 5.0 partial
Signed compression progress on a sealed audit as intrinsic reward equals true audit improvement plus at most 2 Delta_n deviation, making it Goodhart-resistant.

The Specification Trap: Why Static Value Alignment Alone Is Insufficient for Robust Alignment
cs.AI 2025-11 unverdicted novelty 5.0
Static value alignment approaches for AI are structurally insufficient for robust alignment because of Hume's is-ought gap, value pluralism, and the extended frame problem.

Operationalizing Fairness in Text-to-Image Models: A Survey of Bias, Fairness Audits and Mitigation Strategies
cs.CV 2026-04 unverdicted novelty 4.0
A systematic review of T2I bias literature that distinguishes target and threshold fairness and proposes a target-based operationalization framework.

Welfare as a Guiding Principle for Machine Learning -- From Compass, to Lens, to Roadmap
cs.LG 2025-02 unverdicted novelty 3.0
Advocates treating social welfare from economics as an additional core criterion for ML design and use in social settings, complementing optimization, generalization, and expressivity.

Against Proxy Optimization
cs.AI 2026-06 unverdicted novelty 2.0
Discusses conditions under which maximizing proxy utilities is harmful and suggests problems for decision theory applications.