{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:B7AN7BJIMEA74YQXXOQOIYEHTC","short_pith_number":"pith:B7AN7BJI","schema_version":"1.0","canonical_sha256":"0fc0df85286101fe6217bba0e46087989df53eface275ac61b42b63f2f348fc9","source":{"kind":"arxiv","id":"2406.10162","version":3},"attestation_state":"computed","paper":{"title":"Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.","cross_cats":["cs.CL"],"primary_cat":"cs.AI","authors_text":"Alex Tamkin, Buck Shlegeris, Carson Denison, David Duvenaud, Ethan Perez, Evan Hubinger, Fazl Barez, Jared Kaplan, Monte MacDiarmid, Nicholas Schiefer, Ryan Soklaski, Samuel Marks, Samuel R. Bowman, Shauna Kravec","submitted_at":"2024-06-14T16:26:20Z","abstract_excerpt":"In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2406.10162","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.AI","submitted_at":"2024-06-14T16:26:20Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"9a6e5118d907e05a3d967860bcba7407ebe5c60df55309c3dac1c0e763eb29ea","abstract_canon_sha256":"76fc273494efdc5d8ddeaff25e5acdeb2c93071c9ff837ec17d03b5ee6b85d2f"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.801344Z","signature_b64":"EqDZfGpWjojmfJd2z+S2IvM0d1nhS6ufBc2f21IxgTJ/gaZSi7f8VzFMPzNXLa3AuAZ34bp3x3jpOAhf+bU5Bw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0fc0df85286101fe6217bba0e46087989df53eface275ac61b42b63f2f348fc9","last_reissued_at":"2026-05-17T23:38:13.800617Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.800617Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.","cross_cats":["cs.CL"],"primary_cat":"cs.AI","authors_text":"Alex Tamkin, Buck Shlegeris, Carson Denison, David Duvenaud, Ethan Perez, Evan Hubinger, Fazl Barez, Jared Kaplan, Monte MacDiarmid, Nicholas Schiefer, Ryan Soklaski, Samuel Marks, Samuel R. Bowman, Shauna Kravec","submitted_at":"2024-06-14T16:26:20Z","abstract_excerpt":"In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform "},"claims":{"count":3,"items":[{"kind":"strongest_claim","text":"a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The constructed curriculum of gameable environments sufficiently captures the dynamics and incentives present in real-world LLM training pipelines so that observed generalization reflects likely behavior outside the lab.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"}],"snapshot_sha256":"de64a045961807258ed49f3a19e33c858941698fefd915cfd0c4c06266397671"},"source":{"id":"2406.10162","kind":"arxiv","version":3},"verdict":{"id":"6ab8f8e1-ca7d-40c8-b4df-703c7b25cafd","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T14:37:48.467186Z","strongest_claim":"a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function.","one_line_summary":"LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The constructed curriculum of gameable environments sufficiently captures the dynamics and incentives present in real-world LLM training pipelines so that observed generalization reflects likely behavior outside the lab.","pith_extraction_headline":""},"references":{"count":298,"sample":[{"doi":"","year":2017,"title":"Thinking fast and slow with deep learning and tree search, 2017","work_id":"d14c5666-8857-4dc7-a4b6-1a35befe2781","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Understanding strategic deception and deceptive alignment, 9 2023","work_id":"1af913e1-79c1-4c81-89a4-6f863ee0a42f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"A general language assistant as a laboratory for alignment","work_id":"51b13307-1831-4a7b-bea8-559d663289df","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","ref_index":4,"cited_arxiv_id":"2212.08073","is_internal_anchor":true},{"doi":"","year":2023,"title":"Taken out of context: On measuring situational awareness in llms, 2023","work_id":"e1b48371-99f7-489c-97b5-1ad0a7257cc6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":298,"snapshot_sha256":"2757ccc09e387ff05dd788084a6cbf9b325c03bfe6ce74b27d26568e441fe954","internal_anchors":35},"formal_canon":{"evidence_count":3,"snapshot_sha256":"a0e3758b22acada12e1e63dee86551a31d1b877901a6e4b0228bb91d7e225177"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2406.10162","created_at":"2026-05-17T23:38:13.800759+00:00"},{"alias_kind":"arxiv_version","alias_value":"2406.10162v3","created_at":"2026-05-17T23:38:13.800759+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2406.10162","created_at":"2026-05-17T23:38:13.800759+00:00"},{"alias_kind":"pith_short_12","alias_value":"B7AN7BJIMEA7","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"B7AN7BJIMEA74YQX","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"B7AN7BJI","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":17,"internal_anchor_count":17,"sample":[{"citing_arxiv_id":"2510.12826","citing_title":"Scheming Ability in LLM-to-LLM Strategic Interactions","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2412.04984","citing_title":"Frontier Models are Capable of In-context Scheming","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2601.10467","citing_title":"User Detection and Response Patterns of Sycophantic Behavior in Conversational AI","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12673","citing_title":"Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13334","citing_title":"LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02686","citing_title":"Beyond Semantic Manipulation: Token-Space Attacks on Reward Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02585","citing_title":"Mitigating LLM biases toward spurious social contexts using direct preference optimization","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2504.07615","citing_title":"VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14093","citing_title":"Alignment faking in large language models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.28093","citing_title":"What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08671","citing_title":"Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23488","citing_title":"Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21564","citing_title":"Measuring Opinion Bias and Sycophancy via LLM-based Persuasion","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13107","citing_title":"Can Coding Agents Be General Agents?","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17573","citing_title":"Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17596","citing_title":"Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06327","citing_title":"Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity","ref_index":23,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC","json":"https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC.json","graph_json":"https://pith.science/api/pith-number/B7AN7BJIMEA74YQXXOQOIYEHTC/graph.json","events_json":"https://pith.science/api/pith-number/B7AN7BJIMEA74YQXXOQOIYEHTC/events.json","paper":"https://pith.science/paper/B7AN7BJI"},"agent_actions":{"view_html":"https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC","download_json":"https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC.json","view_paper":"https://pith.science/paper/B7AN7BJI","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2406.10162&json=true","fetch_graph":"https://pith.science/api/pith-number/B7AN7BJIMEA74YQXXOQOIYEHTC/graph.json","fetch_events":"https://pith.science/api/pith-number/B7AN7BJIMEA74YQXXOQOIYEHTC/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC/action/timestamp_anchor","attest_storage":"https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC/action/storage_attestation","attest_author":"https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC/action/author_attestation","sign_citation":"https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC/action/citation_signature","submit_replication":"https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC/action/replication_record"}},"created_at":"2026-05-17T23:38:13.800759+00:00","updated_at":"2026-05-17T23:38:13.800759+00:00"}