Pith Number

pith:YEBBADE6

pith:2026:YEBBADE6LKWFN5LJKE23EVDYLP

not attested · not anchored · not stored · refs pending

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Aditi Kumaresan, Wenjun Zeng, Yizheng Huang, Zi Wang

ProEval uses pre-trained Gaussian Processes as surrogates to estimate generative AI performance accurately with 8-65 times fewer samples while finding more failures.

arxiv:2604.23099 v2 · 2026-04-25 · cs.LG · cs.AI · stat.ML


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{YEBBADE6LKWFN5LJKE23EVDYLP}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp

2. Internet Archive

3. Author claim: open

4. Citations: open

5. Replications: open

✓ Portable graph bundle: live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1: strongest claim

Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.

C2: weakest assumption

The approach assumes that Gaussian Processes pre-trained on prior model evaluations can accurately serve as surrogates for the performance score function on new models and inputs, enabling effective transfer and active selection without significant distribution shift.

C3: one-line summary

ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.
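To make the Bayesian quadrature idea in the summary concrete, here is a minimal sketch of a GP-based estimate of mean performance over a finite pool of inputs. It is not the paper's estimator: the RBF kernel, the fixed hyperparameters, the constant prior mean, the toy `score` function, and every name below are assumptions chosen for illustration, and no pre-training or active selection is shown.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel on 1-D inputs (an illustrative choice)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def bq_estimate(x_pool, x_obs, y_obs, prior_mean=0.5, jitter=1e-6):
    """Posterior mean and variance of the average score over x_pool."""
    k_obs = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_pool, x_obs)
    # GP posterior mean at every pool point, given the observed scores.
    post_mean = prior_mean + k_cross @ np.linalg.solve(k_obs, y_obs - prior_mean)
    # GP posterior covariance across the pool.
    post_cov = rbf_kernel(x_pool, x_pool) - k_cross @ np.linalg.solve(k_obs, k_cross.T)
    # Uniform quadrature weights: the target is the plain average over the pool.
    weights = np.full(len(x_pool), 1.0 / len(x_pool))
    mean = float(weights @ post_mean)
    var = max(float(weights @ post_cov @ weights), 0.0)  # clip round-off negatives
    return mean, var

def score(x):
    """Hypothetical per-input performance score in [0.1, 0.9]."""
    return 0.5 + 0.4 * np.sin(6.0 * x)

x_pool = np.linspace(0.0, 1.0, 200)   # all inputs we care about
x_obs = np.linspace(0.0, 1.0, 10)     # the few inputs actually evaluated
estimate, variance = bq_estimate(x_pool, x_obs, score(x_obs))
true_mean = float(score(x_pool).mean())
print(f"BQ estimate {estimate:.4f} vs. pool mean {true_mean:.4f} "
      f"(posterior variance {variance:.2e})")
```

The point of the sketch is only that the surrogate integrates over the whole pool while scoring a small subset; how many samples that saves in practice depends on how well the GP prior matches the real score function.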

Receipt and verification

First computed	2026-06-03T01:05:14.026197Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

c102100c9e5aac56f5695135b254785bcde487d00422095cdee9f3d04150d695

Aliases

arxiv: 2604.23099 · arxiv_version: 2604.23099v2 · doi: 10.48550/arxiv.2604.23099 · pith_short_12: YEBBADE6LKWF · pith_short_16: YEBBADE6LKWFN5LJ · pith_short_8: YEBBADE6
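For this record, the identifier appears to be derivable from the canonical hash: Base32-encoding the SHA-256 digest (standard RFC 4648 alphabet) and keeping the first 26 characters reproduces the Pith Number, and the `pith_short_*` aliases are its 8-, 12-, and 16-character prefixes. This is an inference from the values shown on this page, not a documented rule, and the function names below are hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "c102100c9e5aac56f5695135b254785bcde487d00422095cdee9f3d04150d695"

def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest and truncate (assumed rule, inferred from this record)."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]

def short_aliases(pith_number: str) -> dict:
    """Short aliases as plain prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}

pith_number = pith_number_from_hash(CANONICAL_HASH)
aliases = short_aliases(pith_number)
print(pith_number)
print(aliases)
```

If the rule holds in general, a mirror could check that an identifier matches its record without contacting the resolver, but that should be confirmed against the schema before relying on it.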

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

# Fetch the record as JSON-LD, extract canonical_record, re-serialize it with
# sorted keys and compact separators, then print the SHA-256 of the UTF-8 bytes.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YEBBADE6LKWFN5LJKE23EVDYLP \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c102100c9e5aac56f5695135b254785bcde487d00422095cdee9f3d04150d695

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "27aa955aeb612d5985e11cb111df7a0cb07efde29a98cfb5b4b38a7d3fa64153",
    "cross_cats_sorted": [
      "cs.AI",
      "stat.ML"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-04-25T01:33:57Z",
    "title_canon_sha256": "cd42fd7037b61e8bf263d87feafe55bb858ba5c1794fd962351f6d568e0cc87f"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2604.23099",
    "kind": "arxiv",
    "version": 2
  }
}
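The same canonicalization can be run offline against the record printed above. The sketch below mirrors the serialization settings of the verification one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes); `canonical_sha256` is a hypothetical helper name, and the printed digest is meant to be compared by hand with the canonical hash listed on this page.

```python
import hashlib
import json

def canonical_sha256(record: dict) -> str:
    """Hash a record after deterministic JSON serialization."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "27aa955aeb612d5985e11cb111df7a0cb07efde29a98cfb5b4b38a7d3fa64153",
        "cross_cats_sorted": ["cs.AI", "stat.ML"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-04-25T01:33:57Z",
        "title_canon_sha256": "cd42fd7037b61e8bf263d87feafe55bb858ba5c1794fd962351f6d568e0cc87f",
    },
    "schema_version": "1.0",
    "source": {"id": "2604.23099", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash shown on this page
```

Because keys are sorted before hashing, the order in which fields appear in a downloaded copy does not affect the digest; any change to a value does.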