Pith Number

pith:KLYE5RSM

pith:2026:KLYE5RSM4VWGTG5O5F6NYV5H2T
not attested · not anchored · not stored · refs resolved

The Evaluation Trap: Benchmark Design as Theoretical Commitment

Theodore J Kalaitzidis

AI benchmarks embed unexamined theoretical assumptions that redefine capabilities to match what they can easily measure.

arxiv:2605.14167 v1 · 2026-05-13 · cs.AI · cs.CY

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{KLYE5RSM4VWGTG5O5F6NYV5H2T}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
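As an illustration of what a deterministic merge can look like, the Python sketch below folds a set of events into a record in a canonical order, so that two mirrors holding the same events in different arrival orders reach the same state. This is not Pith's actual algorithm: the Event fields, the merge function, the ordering rule, and the sample events are all hypothetical, and signature checking is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical signed event; field names are illustrative only."""
    event_id: str    # unique identifier, used as a tie-breaker
    timestamp: str   # ISO 8601 UTC string, so lexical order is time order
    field: str       # which part of the state the event sets
    value: str


def merge(record: dict, events: list[Event]) -> dict:
    """Fold events into the record in a canonical order.

    Duplicates are dropped and events are sorted by (timestamp, event_id),
    so the result does not depend on the order a mirror received them in.
    """
    state = dict(record)
    for ev in sorted(set(events), key=lambda e: (e.timestamp, e.event_id)):
        state[ev.field] = ev.value
    return state


# Made-up example data, not taken from the bundle.
base = {"source": "arxiv:2605.14167"}
events = [
    Event("e2", "2026-05-18T00:00:00Z", "citations", "open"),
    Event("e1", "2026-05-17T23:39:11Z", "author_claim", "open"),
    Event("e3", "2026-05-19T00:00:00Z", "author_claim", "claimed"),
]

state_a = merge(base, events)
# Same events, reversed and with one duplicate appended.
state_b = merge(base, list(reversed(events)) + events[:1])
print(state_a == state_b)
```

The design point is that the ordering key is derived from the events themselves rather than from arrival order, which is what lets independent mirrors agree without coordinating.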

Claims

C1 · strongest claim

Narrow evaluation reorganizes capability concepts: architectures and definitions are selected for benchmark legibility until evaluation ceases to track an independent object and instead produces a version of the target defined by its own operational assumptions.

C2 · weakest assumption

That evaluation criteria can be derived directly from technical capability claims in a way that avoids introducing new unexamined assumptions of its own, allowing the audit to reliably discriminate claimed capabilities from proxy behaviors.

C3 · one-line summary

AI benchmarks trap progress by operationalizing assumptions that redefine capabilities around the benchmarks themselves, and Epistematics provides an audit procedure to detect when evaluations cannot discriminate claimed capabilities from proxy behaviors.

References

28 extracted · 28 resolved · 1 Pith anchor

Receipt and verification
First computed: 2026-05-17T23:39:11.398338Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

52f04ec64ce56c699baee97cdc57a7d4cbb79ae60c0b8998ee3696fd1e4c1a1e

Aliases

arxiv: 2605.14167 · arxiv_version: 2605.14167v1 · doi: 10.48550/arxiv.2605.14167 · pith_short_12: KLYE5RSM4VWG · pith_short_16: KLYE5RSM4VWGTG5O · pith_short_8: KLYE5RSM
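The identifiers on this page appear to be related in a simple way: each pith_short_N alias is the first N characters of the full Pith Number, and the full number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The Python sketch below checks both relations for this record. That Pith Numbers are defined this way in general is an assumption drawn from this one record, not something the page states, and the helper name pith_number_from_hash is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "52f04ec64ce56c699baee97cdc57a7d4cbb79ae60c0b8998ee3696fd1e4c1a1e"
PITH_NUMBER = "KLYE5RSM4VWGTG5O5F6NYV5H2T"
SHORT_ALIASES = {8: "KLYE5RSM", 12: "KLYE5RSM4VWG", 16: "KLYE5RSM4VWGTG5O"}


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: base32-encode the digest bytes and keep a prefix.

    Assumption: the 26-character length is read off this record, not a spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_number_from_hash(CANONICAL_HASH)
print(derived == PITH_NUMBER)

# Each short alias should be a prefix of the full number.
for n, alias in sorted(SHORT_ALIASES.items()):
    print(n, PITH_NUMBER[:n] == alias)
```

For this record both checks hold, which suggests that a short alias can be expanded or validated against the canonical hash alone.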
Agent API
Verify this Pith Number yourself
# Fetch the record, re-serialize its canonical_record with sorted keys and compact separators, then print the SHA-256 of the UTF-8 bytes.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/KLYE5RSM4VWGTG5O5F6NYV5H2T \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 52f04ec64ce56c699baee97cdc57a7d4cbb79ae60c0b8998ee3696fd1e4c1a1e
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4d43e96ec2bf5edffbda7ffeb2ba3bf8e09b76bf0aad52231b3e7632afae60f3",
    "cross_cats_sorted": [
      "cs.CY"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-13T22:41:29Z",
    "title_canon_sha256": "fb6cd0e18708611d294f82d9479787dd97a031acc511c440a12a78bdf6b1abbe"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14167",
    "kind": "arxiv",
    "version": 1
  }
}
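
The verification command above hashes a canonical serialization of this record: keys sorted, compact separators, non-ASCII left unescaped, UTF-8 bytes. A minimal offline sketch of the same step in Python follows, using the record as displayed on this page. The function name canonical_sha256 is an assumption, and the sketch prints its digest for comparison with the canonical hash rather than asserting a match, since the served record may differ from the text shown here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "4d43e96ec2bf5edffbda7ffeb2ba3bf8e09b76bf0aad52231b3e7632afae60f3",
        "cross_cats_sorted": ["cs.CY"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2026-05-13T22:41:29Z",
        "title_canon_sha256": "fb6cd0e18708611d294f82d9479787dd97a031acc511c440a12a78bdf6b1abbe",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.14167", "kind": "arxiv", "version": 1},
}

# Same content with keys inserted in a different order.
reordered = {
    "source": {"version": 1, "kind": "arxiv", "id": "2605.14167"},
    "schema_version": "1.0",
    "metadata": dict(reversed(list(record["metadata"].items()))),
}

digest = canonical_sha256(record)
print(digest)
# Sorting keys makes the digest independent of insertion order.
print(digest == canonical_sha256(reordered))
```

Sorting keys and fixing the separators is what makes the hash reproducible across JSON libraries that would otherwise differ in key order or whitespace.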