pith:KLYE5RSM
The Evaluation Trap: Benchmark Design as Theoretical Commitment
AI benchmarks embed unexamined theoretical assumptions that redefine capabilities to match what they can easily measure.
arxiv:2605.14167 v1 · 2026-05-13 · cs.AI · cs.CY
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{KLYE5RSM4VWGTG5O5F6NYV5H2T}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Narrow evaluation reorganizes capability concepts: architectures and definitions are selected for benchmark legibility until evaluation ceases to track an independent object and instead produces a version of the target defined by its own operational assumptions.
Evaluation criteria can be derived directly from technical capability claims without introducing new unexamined assumptions, which allows the audit to reliably discriminate claimed capabilities from proxy behaviors.
AI benchmarks trap progress by operationalizing assumptions that redefine capabilities around the benchmarks themselves, and Epistematics provides an audit procedure to detect when evaluations cannot discriminate claimed capabilities from proxy behaviors.
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:39:11.398338Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
52f04ec64ce56c699baee97cdc57a7d4cbb79ae60c0b8998ee3696fd1e4c1a1e
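The identifier and the canonical hash on this page appear to be related: the 26-character Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the hash bytes, and the short form `KLYE5RSM` matches the first 8. This is an observation from the values shown here, not a rule stated by the `pith-number/v1.0` schema, so treat the sketch below as a consistency check under that assumption. The helper name `base32_prefix` is hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "52f04ec64ce56c699baee97cdc57a7d4cbb79ae60c0b8998ee3696fd1e4c1a1e"
PITH_NUMBER = "KLYE5RSM4VWGTG5O5F6NYV5H2T"
SHORT_FORM = "KLYE5RSM"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Hypothetical helper: base32-encode the digest bytes and keep a prefix.

    Assumption: the identifier is a truncated RFC 4648 base32 rendering of
    the SHA-256 digest. The page does not document this derivation.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


long_matches = base32_prefix(CANONICAL_HASH, len(PITH_NUMBER)) == PITH_NUMBER
short_matches = base32_prefix(CANONICAL_HASH, len(SHORT_FORM)) == SHORT_FORM
print(long_matches, short_matches)
```

If either comparison were false for another record, that would suggest the derivation assumed here does not hold in general, rather than that the record is invalid.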
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/KLYE5RSM4VWGTG5O5F6NYV5H2T \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 52f04ec64ce56c699baee97cdc57a7d4cbb79ae60c0b8998ee3696fd1e4c1a1e
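The same canonicalization can be run offline on a saved copy of the record. The sketch below mirrors the one-liner above: serialize with sorted keys, compact separators, and `ensure_ascii=False`, then take the SHA-256 of the UTF-8 bytes. The function names `canonical_bytes` and `canonical_sha256` are illustrative, and the `record` literal is copied from the canonical record JSON on this page. Whether the printed digest equals the expected hash depends on the served record matching this copy, so compare the output yourself rather than assuming a match.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize a record the way the shell one-liner does."""
    text = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# Copy of the canonical record shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "4d43e96ec2bf5edffbda7ffeb2ba3bf8e09b76bf0aad52231b3e7632afae60f3",
        "cross_cats_sorted": ["cs.CY"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2026-05-13T22:41:29Z",
        "title_canon_sha256": "fb6cd0e18708611d294f82d9479787dd97a031acc511c440a12a78bdf6b1abbe",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.14167", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted before hashing, reordering the fields of `record` leaves the digest unchanged, while any change to a value produces a different digest.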
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4d43e96ec2bf5edffbda7ffeb2ba3bf8e09b76bf0aad52231b3e7632afae60f3",
    "cross_cats_sorted": [
      "cs.CY"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-13T22:41:29Z",
    "title_canon_sha256": "fb6cd0e18708611d294f82d9479787dd97a031acc511c440a12a78bdf6b1abbe"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14167",
    "kind": "arxiv",
    "version": 1
  }
}