Pith Number

pith:HEOXNOUS

pith:2026:HEOXNOUS4HTWETZMVTL2AFQ5SI

not attested · not anchored · not stored · refs resolved

Holistic Evaluation and Failure Diagnosis of AI Agents

Alon Mecilati, Amos Rimon, David Connack, Edo Dekel, Gilad Dym, Jonatan Liberman, Liron Schliesser, Max Svidlo, Netta Madvil, Orel Shalom, Philip Tannor, Rotem Brazilay, Shai Nir, Shir Chorev, Yaron Friedman

Decomposing AI agent traces into independent spans enables precise failure diagnosis and higher accuracy.

arxiv:2605.14865 v1 · 2026-05-14 · cs.AI · cs.CL


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{HEOXNOUS4HTWETZMVTL2AFQ5SI}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
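The page does not describe the merge algorithm itself. As a rough illustration of what "deterministic" means here, the sketch below folds a set of events over a record in a fixed total order, so that two mirrors receiving the same events in different orders (or with duplicates) arrive at the same state. The event fields (`ts`, `field`, `value`), the ordering rule, and the function names are all hypothetical and are not taken from the Pith bundle format.

```python
import hashlib
import json


def event_key(event: dict) -> tuple:
    """Total order for events: timestamp first, content hash as tie-breaker.

    Hypothetical rule, chosen only so that ordering never depends on
    the order in which a mirror happened to receive the events.
    """
    encoded = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return (event["ts"], hashlib.sha256(encoded).hexdigest())


def merge(record: dict, events: list) -> dict:
    """Apply events to a copy of the record in sorted order, skipping duplicates."""
    state = dict(record)
    seen = set()
    for event in sorted(events, key=event_key):
        key = event_key(event)
        if key in seen:
            continue  # the same event delivered twice has no extra effect
        seen.add(key)
        state[event["field"]] = event["value"]
    return state


# Toy data, invented for illustration only.
record = {"id": "example"}
events = [
    {"ts": "2026-01-01T00:00:00Z", "field": "note", "value": "first"},
    {"ts": "2026-01-02T00:00:00Z", "field": "note", "value": "second"},
]

mirror_a = merge(record, events)
mirror_b = merge(record, list(reversed(events)) + events[:1])
print(mirror_a == mirror_b)
```

Both mirrors end with the later event's value because the fold order is fixed by the events' own content, not by arrival order.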

Claims

C1 · strongest claim

On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy.

C2 · weakest assumption

That agent traces can be meaningfully decomposed into independent spans whose separate assessments accurately capture failure causes without requiring full trace context for interdependent errors.

C3 · one line summary

A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.

References

28 extracted · 28 resolved · 1 Pith anchor

[1] Agentrx: Diagnosing ai agent failures from execution trajectories 2026

[2] Bhonsle et al 2025

[3] Why Do Multi-Agent LLM Systems Fail? 2025 · arXiv:2503.13657

[4] T-eval: Evaluating the tool utilization capability step by step 2024

[5] CrewAI: Framework for orchestrating role-playing, autonomous AI agents. https://www.crewai.com, 2024

Formal links

2 machine-checked theorem links

Receipt and verification

First computed	2026-05-17T23:38:56.194880Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

391d76ba92e1e7624f2cacd7a0161d92333544bbb669e870cfd014ad9b146c2e

Aliases

arxiv: 2605.14865 · arxiv_version: 2605.14865v1 · doi: 10.48550/arxiv.2605.14865 · pith_short_12: HEOXNOUS4HTW · pith_short_16: HEOXNOUS4HTWETZM · pith_short_8: HEOXNOUS
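The short aliases above are prefixes of the full Pith Number. In this record the full identifier also coincides with the unpadded Base32 (RFC 4648) encoding of the first 16 bytes of the canonical hash. The sketch below checks both relationships for this record; that the same derivation holds for every Pith Number is an assumption, since the page does not state the rule, and `base32_prefix` is a hypothetical helper name.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "391d76ba92e1e7624f2cacd7a0161d92333544bbb669e870cfd014ad9b146c2e"
PITH_NUMBER = "HEOXNOUS4HTWETZMVTL2AFQ5SI"


def base32_prefix(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest and drop '=' padding."""
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


derived = base32_prefix(CANONICAL_HASH)

# Short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": PITH_NUMBER[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(aliases)
```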


Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/HEOXNOUS4HTWETZMVTL2AFQ5SI \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 391d76ba92e1e7624f2cacd7a0161d92333544bbb669e870cfd014ad9b146c2e

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "4bf0ccb043aec5123deb548881ba820228eb266f6468279aa48086530c8cd17b",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-14T14:12:39Z",
    "title_canon_sha256": "8380426923e57cc7f78a39d036f01e8e64249855e3c0c2770805b5186bf4dfea"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14865",
    "kind": "arxiv",
    "version": 1
  }
}
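The canonicalization step from the verification command can also be run offline against the record printed above. The following is a minimal sketch that applies the same `json.dumps` settings (sorted keys, compact separators, `ensure_ascii=False`) and hashes the result with SHA-256. The function names are illustrative. Whether the digest matches the published canonical hash depends on the printed record being exactly what the resolver serves, so no expected digest is asserted here.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace, as UTF-8."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode()


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = {
    "metadata": {
        "abstract_canon_sha256": "4bf0ccb043aec5123deb548881ba820228eb266f6468279aa48086530c8cd17b",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2026-05-14T14:12:39Z",
        "title_canon_sha256": "8380426923e57cc7f78a39d036f01e8e64249855e3c0c2770805b5186bf4dfea",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.14865", "kind": "arxiv", "version": 1},
}

# Same content with top-level keys in a different order.
reordered = {
    "source": record["source"],
    "schema_version": record["schema_version"],
    "metadata": record["metadata"],
}

print(canonical_hash(record))
print(canonical_hash(record) == canonical_hash(reordered))
```

Sorting keys before hashing is what makes the digest independent of the order in which a producer happens to emit the fields.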