{"paper":{"title":"Active Testing of Large Language Models via Approximate Neyman Allocation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Semantic entropy from surrogate models drives approximate Neyman allocation to evaluate generative LLM tasks with far fewer samples.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Cong Liu, Jiancheng Zhang, Yinglun Zhu, Zeli Liu","submitted_at":"2026-05-11T06:58:07Z","abstract_excerpt":"Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm ta"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That semantic entropy signals extracted from surrogate models are sufficiently correlated with the per-example variance or informativeness that would be observed under the target model on generative tasks, so that the approximate Neyman allocation remains near-optimal.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Active testing via surrogate semantic entropy stratification and approximate Neyman allocation reduces MSE by up to 28% versus uniform sampling and saves about 23% of the labeling budget on language and multimodal benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Semantic entropy from surrogate models drives approximate Neyman allocation to evaluate generative LLM tasks with far fewer samples.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"bcec72c00d659e11307a20915cf8469f1750245d7fbf383c18837885f72dc246"},"source":{"id":"2605.10075","kind":"arxiv","version":2},"verdict":{"id":"94916162-b504-49ea-8570-5666397dfd1d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-12T04:00:05.146387Z","strongest_claim":"Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.","one_line_summary":"Active testing via surrogate semantic entropy stratification and approximate Neyman allocation reduces MSE by up to 28% versus uniform sampling and saves about 23% of the labeling budget on language and multimodal benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That semantic entropy signals extracted from surrogate models are sufficiently correlated with the per-example variance or informativeness that would be observed under the target model on generative tasks, so that the approximate Neyman allocation remains near-optimal.","pith_extraction_headline":"Semantic entropy from surrogate models drives approximate Neyman allocation to evaluate generative LLM tasks with far fewer samples."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.10075/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"ai_meta_artifact","ran_at":"2026-05-19T15:41:21.604574Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T12:01:17.914157Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T09:41:54.583122Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"a064baf51d33b2bc0c91082fc85da5e30fa1629ed37d516b64ffd3924e375a49"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}