Pith Number

pith:5BX7IZ4P

pith:2026:5BX7IZ4PULQ2AKGJGNVBVSQLYO
not attested · not anchored · not stored · refs pending

Active Testing of Large Language Models via Approximate Neyman Allocation

Cong Liu, Jiancheng Zhang, Yinglun Zhu, Zeli Liu

Semantic entropy from surrogate models drives approximate Neyman allocation to evaluate generative LLM tasks with far fewer samples.

arxiv:2605.10075 v2 · 2026-05-11 · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{5BX7IZ4PULQ2AKGJGNVBVSQLYO}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.

C2 · weakest assumption

That semantic entropy signals extracted from surrogate models are sufficiently correlated with the per-example variance or informativeness that would be observed under the target model on generative tasks, so that the approximate Neyman allocation remains near-optimal.

C3 · one-line summary

Active testing via surrogate semantic entropy stratification and approximate Neyman allocation reduces MSE by up to 28% versus uniform sampling and saves about 23% of the labeling budget on language and multimodal benchmarks.
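
The allocation step named in the claims can be illustrated with the textbook Neyman rule, which gives stratum h a share of the labeling budget proportional to N_h · σ_h (stratum size times within-stratum standard deviation). The sketch below is not the paper's implementation: the function name `neyman_allocation`, the stratum sizes, and the standard deviations are hypothetical, and it simply assumes that some per-stratum spread estimate (for example, one derived from surrogate semantic entropy) is already available.

```python
from typing import List, Sequence


def neyman_allocation(sizes: Sequence[int], stds: Sequence[float], budget: int) -> List[int]:
    """Split `budget` labels across strata in proportion to N_h * sigma_h.

    `sizes` are stratum sizes N_h; `stds` are (assumed, non-negative) estimates
    of the within-stratum standard deviation sigma_h.
    """
    if len(sizes) != len(stds):
        raise ValueError("sizes and stds must have the same length")
    if budget < 0 or budget > sum(sizes):
        raise ValueError("budget must be between 0 and the total number of examples")
    if budget == 0:
        return [0] * len(sizes)

    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # No spread information at all: fall back to proportional allocation.
        weights = [float(n) for n in sizes]
        total = sum(weights)

    # Ideal (fractional) allocation, then round down and cap at the stratum size.
    raw = [budget * w / total for w in weights]
    alloc = [min(int(r), n) for r, n in zip(raw, sizes)]

    # Hand out the remaining labels by largest fractional remainder,
    # skipping strata that are already fully sampled.
    order = sorted(range(len(raw)), key=lambda h: raw[h] - int(raw[h]), reverse=True)
    leftover = budget - sum(alloc)
    while leftover > 0:
        for h in order:
            if leftover > 0 and alloc[h] < sizes[h]:
                alloc[h] += 1
                leftover -= 1
    return alloc


# Hypothetical strata: the small but high-spread stratum receives the most labels.
print(neyman_allocation([100, 100, 200], [0.5, 0.25, 0.125], 40))  # [20, 10, 10]
```

Uniform sampling would instead spend the budget in proportion to stratum size alone; the Neyman rule shifts labels toward strata where the target model's score is expected to vary most, which is where the variance reduction in the claims would come from if the surrogate signal tracks the target's variance.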

Receipt and verification
First computed: 2026-05-20T00:06:36.461780Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

e86ff4678fa2e1a028c9336a1aca0bc3b455452d24e87921e282e1279ae13d08

Aliases

arxiv: 2605.10075 · arxiv_version: 2605.10075v2 · doi: 10.48550/arxiv.2605.10075 · pith_short_12: 5BX7IZ4PULQ2 · pith_short_16: 5BX7IZ4PULQ2AKGJ · pith_short_8: 5BX7IZ4P
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/5BX7IZ4PULQ2AKGJGNVBVSQLYO \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e86ff4678fa2e1a028c9336a1aca0bc3b455452d24e87921e282e1279ae13d08
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "fb1111aff5e33eb3a3038456999127809635cbc44d3dc4e9f516a243835e327b",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-11T06:58:07Z",
    "title_canon_sha256": "7a39126d2994922bbf4e16ccee8c60cbcf6d033e94ec6fa371a7ac4fd30322d6"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.10075",
    "kind": "arxiv",
    "version": 2
  }
}
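
The verification command above can also be run offline against the record as printed on this page. The sketch below reuses the same serialization settings as the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes); the helper name `canonical_sha256` is hypothetical. The digest has not been checked here against the listed canonical hash, so treat the comparison as something to confirm yourself.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the serialization settings from the one-liner above."""
    blob = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode()                   # UTF-8 bytes
    return hashlib.sha256(blob).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "fb1111aff5e33eb3a3038456999127809635cbc44d3dc4e9f516a243835e327b",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2026-05-11T06:58:07Z",
        "title_canon_sha256": "7a39126d2994922bbf4e16ccee8c60cbcf6d033e94ec6fa371a7ac4fd30322d6",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.10075", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the "Canonical hash" value listed on this page
```

Because the keys are sorted before hashing, two mirrors that store the record with different key order should still arrive at the same digest.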