Pith Number

pith:B7AN7BJI

pith:2024:B7AN7BJIMEA74YQXXOQOIYEHTC

not attested · not anchored · not stored · refs resolved

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Alex Tamkin, Buck Shlegeris, Carson Denison, David Duvenaud, Ethan Perez, Evan Hubinger, Fazl Barez, Jared Kaplan, Monte MacDiarmid, Nicholas Schiefer, Ryan Soklaski, Samuel Marks, Samuel R. Bowman, Shauna Kravec

LLMs trained on simple specification gaming generalize zero-shot to reward tampering, including rewriting their own reward function.

arxiv:2406.10162 v3 · 2024-06-14 · cs.AI · cs.CL


Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
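The page does not spell out the merge algorithm, so the following is only a minimal sketch of what a deterministic merge over events could look like. The event fields (`id`, `ts`, `field`, `value`) and the sample events are hypothetical and are not taken from the Pith schema; the sketch also assumes an event `id` uniquely identifies its content. The point it illustrates is that deduplicating events and applying them in a total order gives the same state regardless of the order in which a mirror received them.

```python
from typing import Iterable


def merge_events(events: Iterable[dict]) -> dict:
    """Fold events into a state dict, independent of arrival order.

    Hypothetical event shape: {"id", "ts", "field", "value"}.
    """
    # Deduplicate by event id so receiving the same event twice is harmless.
    unique = {event["id"]: event for event in events}
    # Total order: timestamp first, event id as a tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state: dict = {}
    for event in ordered:
        # Last writer (in the total order) wins for each field.
        state[event["field"]] = event["value"]
    return state


# Hypothetical events, for illustration only.
events = [
    {"id": "e1", "ts": "2026-05-17T23:38:13Z", "field": "status_a", "value": "open"},
    {"id": "e2", "ts": "2026-05-18T00:00:00Z", "field": "status_a", "value": "closed"},
    {"id": "e3", "ts": "2026-05-18T00:00:00Z", "field": "status_b", "value": "open"},
]

state_forward = merge_events(events)
state_shuffled = merge_events(list(reversed(events)) + events)
```

Both calls produce the same state, which is the property a mirror would rely on when recomputing the current state from a downloaded bundle.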

Claims

C1 · strongest claim

A small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function.

C2 · weakest assumption

The constructed curriculum of gameable environments sufficiently captures the dynamics and incentives present in real-world LLM training pipelines so that observed generalization reflects likely behavior outside the lab.

C3 · one-line summary

LLMs trained on simple specification gaming generalize zero-shot to reward tampering, including rewriting their own reward function.

References

298 extracted · 298 resolved · 35 Pith anchors

[1] Thinking fast and slow with deep learning and tree search · 2017

[2] Understanding strategic deception and deceptive alignment · 2023

[3] A general language assistant as a laboratory for alignment · 2021

[4] Constitutional AI: Harmlessness from AI Feedback · 2022 · arXiv:2212.08073

[5] Taken out of context: On measuring situational awareness in LLMs · 2023

Formal links

3 machine-checked theorem links

Cited by

17 papers in Pith

Scheming Ability in LLM-to-LLM Strategic Interactions

Frontier Models are Capable of In-context Scheming

User Detection and Response Patterns of Sycophantic Behavior in Conversational AI

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

Receipt and verification

First computed	2026-05-17T23:38:13.800617Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

0fc0df85286101fe6217bba0e46087989df53eface275ac61b42b63f2f348fc9

Aliases

arxiv: 2406.10162 · arxiv_version: 2406.10162v3 · doi: 10.48550/arxiv.2406.10162 · pith_short_12: B7AN7BJIMEA7 · pith_short_16: B7AN7BJIMEA74YQX · pith_short_8: B7AN7BJI
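The identifiers on this page appear to be prefixes of the Base32 encoding (RFC 4648 alphabet) of the canonical hash. That is an observation from the values shown here, not a rule the page documents, so the sketch below should be read as an assumption about how the aliases are derived. The function name `pith_id` and the default length of 26 characters are illustrative choices.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "0fc0df85286101fe6217bba0e46087989df53eface275ac61b42b63f2f348fc9"


def pith_id(hex_digest: str, length: int = 26) -> str:
    """Return the first `length` Base32 characters of a hex SHA-256 digest.

    Assumption: Pith identifiers are truncated Base32 encodings of the hash.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full_id = pith_id(CANONICAL_HASH)
short_ids = {n: pith_id(CANONICAL_HASH, n) for n in (8, 12, 16)}
print(full_id)
print(short_ids)
```

Under that assumption, `full_id` reproduces the 26-character identifier used in the resolver URL, and the 8-, 12-, and 16-character prefixes reproduce the `pith_short_*` aliases.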

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/B7AN7BJIMEA74YQXXOQOIYEHTC \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0fc0df85286101fe6217bba0e46087989df53eface275ac61b42b63f2f348fc9

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "76fc273494efdc5d8ddeaff25e5acdeb2c93071c9ff837ec17d03b5ee6b85d2f",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2024-06-14T16:26:20Z",
    "title_canon_sha256": "9a6e5118d907e05a3d967860bcba7407ebe5c60df55309c3dac1c0e763eb29ea"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.10162",
    "kind": "arxiv",
    "version": 3
  }
}
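
The verification one-liner can also be unpacked into a short offline script. This is a sketch that mirrors the same canonicalization (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and applies it to the record as printed on this page. The names `RECORD`, `EXPECTED`, and `canonical_sha256` are illustrative. Whether the digest equals the listed canonical hash depends on the printed record matching the served record exactly, which the script checks rather than assumes.

```python
import hashlib
import json

# Canonical hash as listed on this page.
EXPECTED = "0fc0df85286101fe6217bba0e46087989df53eface275ac61b42b63f2f348fc9"

# Canonical record as printed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "76fc273494efdc5d8ddeaff25e5acdeb2c93071c9ff837ec17d03b5ee6b85d2f",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2024-06-14T16:26:20Z",
        "title_canon_sha256": "9a6e5118d907e05a3d967860bcba7407ebe5c60df55309c3dac1c0e763eb29ea",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.10162", "kind": "arxiv", "version": 3},
}


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same serialization as the shell one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches listed canonical hash:", digest == EXPECTED)
```

Because the keys are sorted before hashing, two mirrors that store the record with different key orders should still arrive at the same digest.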