pith. machine review for the scientific record.
Pith Number

pith:B7AN7BJI

pith:2024:B7AN7BJIMEA74YQXXOQOIYEHTC
not attested · not anchored · not stored · refs resolved

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Alex Tamkin, Buck Shlegeris, Carson Denison, David Duvenaud, Ethan Perez, Evan Hubinger, Fazl Barez, Jared Kaplan, Monte MacDiarmid, Nicholas Schiefer, Ryan Soklaski, Samuel Marks, Samuel R. Bowman, Shauna Kravec

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

arxiv:2406.10162 v3 · 2024-06-14 · cs.AI · cs.CL

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
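
As an illustration of what "recompute the same current state" can mean, here is a minimal sketch of one possible deterministic merge. It is not Pith's actual algorithm, which this page does not describe. The event fields (`at`, `field`, `value`), the `event_id` helper, and the last-write-wins rule are all assumptions made for the example. The point it shows is that if events are deduplicated and ordered by their content alone, two mirrors that received the events in different orders arrive at the same state.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Hypothetical content-derived ID: SHA-256 of the event's canonical JSON."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


def merge(record: dict, events: list) -> dict:
    """Fold events into a state in an order that depends only on event content."""
    unique = {event_id(e): e for e in events}  # duplicates collapse to one entry
    # Sort by (timestamp, content ID) so delivery order is irrelevant.
    ordered = sorted(unique.items(), key=lambda kv: (kv[1]["at"], kv[0]))
    state = {"record": record, "fields": {}}
    for _, e in ordered:
        state["fields"][e["field"]] = e["value"]  # assumed rule: last write wins
    return state


# Hypothetical events; the field names are illustrative only.
events = [
    {"at": "2026-05-17T00:00:00Z", "field": "citations", "value": "open"},
    {"at": "2026-05-18T00:00:00Z", "field": "citations", "value": "closed"},
    {"at": "2026-05-17T12:00:00Z", "field": "replications", "value": "open"},
]
record = {"source": {"id": "2406.10162", "kind": "arxiv", "version": 3}}

state_a = merge(record, events)
state_b = merge(record, list(reversed(events)) + events[:1])  # reordered, one duplicate
print(state_a == state_b)
```

Sorting on a content-derived key rather than arrival order is the design choice that makes the result independent of which mirror performs the merge.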

Claims

C1 strongest claim

a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function.

C2 weakest assumption

The constructed curriculum of gameable environments sufficiently captures the dynamics and incentives present in real-world LLM training pipelines so that observed generalization reflects likely behavior outside the lab.

C3 one line summary

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

References

298 extracted · 298 resolved · 35 Pith anchors

[1] Thinking fast and slow with deep learning and tree search, 2017
[2] Understanding strategic deception and deceptive alignment, 2023
[3] A general language assistant as a laboratory for alignment, 2021
[4] Constitutional AI: Harmlessness from AI Feedback, 2022 · arXiv:2212.08073
[5] Taken out of context: On measuring situational awareness in LLMs, 2023

Formal links

3 machine-checked theorem links

Cited by

17 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:13.800617Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

0fc0df85286101fe6217bba0e46087989df53eface275ac61b42b63f2f348fc9

Aliases

arxiv: 2406.10162 · arxiv_version: 2406.10162v3 · doi: 10.48550/arxiv.2406.10162 · pith_short_12: B7AN7BJIMEA7 · pith_short_16: B7AN7BJIMEA74YQX · pith_short_8: B7AN7BJI
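
The short aliases above are prefixes of the 26-character identifier, and for the values shown on this page that identifier matches the start of the base32 encoding of the canonical hash. The sketch below checks both observations. It is an observation about this record only, not a statement of how Pith officially derives its identifiers; the helper name `base32_of_hash` is an assumption introduced for the example.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "0fc0df85286101fe6217bba0e46087989df53eface275ac61b42b63f2f348fc9"
PITH_ID = "B7AN7BJIMEA74YQXXOQOIYEHTC"


def base32_of_hash(hex_digest: str) -> str:
    """RFC 4648 base32 of the raw digest bytes, with '=' padding stripped."""
    return base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii").rstrip("=")


encoded = base32_of_hash(CANONICAL_HASH)

# Short aliases as prefixes of the full identifier.
aliases = {f"pith_short_{n}": PITH_ID[:n] for n in (8, 12, 16)}

print(encoded.startswith(PITH_ID))
for name, value in aliases.items():
    print(name, value)
```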
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0fc0df85286101fe6217bba0e46087989df53eface275ac61b42b63f2f348fc9
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "76fc273494efdc5d8ddeaff25e5acdeb2c93071c9ff837ec17d03b5ee6b85d2f",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2024-06-14T16:26:20Z",
    "title_canon_sha256": "9a6e5118d907e05a3d967860bcba7407ebe5c60df55309c3dac1c0e763eb29ea"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.10162",
    "kind": "arxiv",
    "version": 3
  }
}
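
The verification command above can be unpacked into a small offline script. This is a sketch that mirrors the serialization settings in that command (sorted keys, compact separators, non-ASCII characters left unescaped, UTF-8 bytes) and applies them to the record as displayed here. The function name `canonical_hash` is an assumption for the example. Whether the printed digest equals the canonical hash depends on the displayed JSON being byte-for-byte what the API returns, so compare the output against the value under "Canonical hash" rather than assuming a match.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 over compact, key-sorted JSON, as in the verification command."""
    blob = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode()  # UTF-8 by default
    return hashlib.sha256(blob).hexdigest()


# The canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "76fc273494efdc5d8ddeaff25e5acdeb2c93071c9ff837ec17d03b5ee6b85d2f",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2024-06-14T16:26:20Z",
        "title_canon_sha256": "9a6e5118d907e05a3d967860bcba7407ebe5c60df55309c3dac1c0e763eb29ea",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.10162", "kind": "arxiv", "version": 3},
}

digest = canonical_hash(record)
print(digest)

# Key order in the input does not change the digest, because keys are sorted.
reordered = {key: record[key] for key in reversed(list(record))}
print(canonical_hash(reordered) == digest)
```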