pith. machine review for the scientific record.

arxiv: 2605.10075 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

Active Testing of Large Language Models via Approximate Neyman Allocation

Zeli Liu, Jiancheng Zhang, Cong Liu, Yinglun Zhu

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:00 UTC · model grok-4.3

classification 💻 cs.AI
keywords active testing · large language models · semantic entropy · Neyman allocation · generative tasks · evaluation efficiency · surrogate models · budget reduction

The pith

Semantic entropy from surrogate models drives approximate Neyman allocation to evaluate generative LLM tasks with far fewer samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating large language models on generative tasks grows expensive as models scale and target tasks increasingly demand expert labels. The paper introduces an active testing procedure that first extracts semantic entropy signals from cheaper surrogate models to partition the evaluation pool into strata of varying informativeness. It then applies an approximate version of Neyman allocation, which assigns more test examples to strata with higher estimated variance. Across language and multimodal benchmarks and multiple surrogate-target pairs, this approach cuts mean-squared error by up to 28 percent relative to uniform sampling while using an average of 22.9 percent fewer labels, and it stays close to the performance of an oracle that knows the true variances.
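To make the two-stage pattern concrete, here is a minimal sketch in Python. It assumes the meaning-clustering of sampled generations has already been done and that per-stratum standard-deviation estimates (`sigma_hat`) are derived from the surrogate; the function names, the equal-mass quantile scheme, and the rounding rule are illustrative choices, not the authors' implementation.

```python
import numpy as np

def semantic_entropy(cluster_counts):
    """Entropy over meaning-clusters of sampled generations (in the spirit of
    semantic uncertainty; the clustering step itself is assumed already done)."""
    p = np.asarray(cluster_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def stratify_by_entropy(entropies, num_strata=4):
    """Partition the pool into equal-mass strata by surrogate-entropy quantiles."""
    edges = np.quantile(entropies, np.linspace(0, 1, num_strata + 1))
    return np.clip(np.digitize(entropies, edges[1:-1]), 0, num_strata - 1)

def allocate_budget(strata, sigma_hat, budget):
    """Neyman-style split: stratum h gets budget * N_h * sigma_h / sum_k N_k * sigma_k."""
    ids, sizes = np.unique(strata, return_counts=True)
    weights = sizes * np.asarray(sigma_hat)[ids]
    n_h = np.floor(budget * weights / weights.sum()).astype(int)
    n_h[np.argmax(weights)] += budget - n_h.sum()  # rounding leftovers to the largest stratum
    return dict(zip(ids.tolist(), n_h.tolist()))
```

With `sigma_hat` calibrated on a small pilot sample, the returned dictionary says how many expensive target-model labels to draw from each stratum.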

Core claim

The central discovery is that semantic entropy extracted from surrogate models supplies a usable proxy for per-example variance on generative tasks, allowing an approximate Neyman allocation that selects an informative subset of the evaluation pool and thereby reduces the labeling budget needed to reach a target estimation error.

What carries the argument

Approximate Neyman allocation that uses semantic entropy signals from surrogate models to stratify the pool and allocate samples proportionally to estimated stratum variances.
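For reference, the textbook Neyman allocation rule the method approximates (standard survey-sampling material, not a formula quoted from this paper) is:

```latex
% Classical Neyman allocation: with total budget $n$, stratum sizes $N_h$,
% and stratum standard deviations $\sigma_h$, the variance-minimizing
% per-stratum sample size is
n_h \;=\; n \cdot \frac{N_h \, \sigma_h}{\sum_{k} N_k \, \sigma_k}.
% As we read the abstract, the "approximate" variant replaces $\sigma_h$
% with an estimate $\hat{\sigma}_h$ derived from surrogate semantic entropy.
```

The rule minimizes the variance of the stratified estimator when the σ_h are exact; any error in the estimates shows up as extra variance, which is what proximity to Oracle-Neyman measures.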

If this is right

  • Evaluation budgets for large generative models can be reduced without sacrificing accuracy on standard benchmarks.
  • The same stratification-plus-allocation pattern extends directly to multimodal and instruction-following tasks.
  • Surrogate models of modest size become practical proxies for guiding expensive target-model evaluations.
  • Repeated evaluations across model scales become cheaper, supporting continuous monitoring rather than one-time testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the correlation between surrogate entropy and target variance holds for open-ended generation, the method could lower the cost of measuring scaling laws on new capabilities.
  • The approach may generalize to other generative domains such as code or scientific text where variance is hard to estimate directly.
  • A natural next measurement would be how quickly the savings degrade when the surrogate is much weaker or trained on a different domain.

Load-bearing premise

Semantic entropy computed on surrogate models must remain sufficiently correlated with the actual per-example variance that would be observed when the target model is evaluated on the same generative tasks.
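One direct check of this premise (our suggestion; the abstract reports no such diagnostic) is the Pearson correlation between surrogate entropy and an empirical per-example variance on the target model:

```python
import numpy as np

def proxy_correlation(surrogate_entropy, target_scores):
    """Pearson r between surrogate semantic entropy and target-model variance.

    surrogate_entropy: (n_examples,) entropies from the cheap surrogate.
    target_scores: (n_examples, n_resamples) scores of several sampled target
    generations per example -- a hypothetical setup for estimating variance.
    """
    target_var = np.var(target_scores, axis=1, ddof=1)
    return float(np.corrcoef(surrogate_entropy, target_var)[0, 1])
```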

What would settle it

Run the procedure on a new surrogate-target pair drawn from the same benchmark distribution and measure whether the resulting mean-squared error exceeds that of uniform sampling by more than a small constant or deviates sharply from the oracle Neyman allocation.
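A toy version of that measurement, using the MSE criterion over T = 3,000 trials but with invented two-stratum data (all numbers and names are illustrative, not the paper's benchmarks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pool with one low-variance and one high-variance stratum.
pool = np.concatenate([rng.normal(0.8, 0.05, 5_000),
                       rng.normal(0.5, 0.30, 5_000)])
strata = np.repeat([0, 1], 5_000)
true_risk = pool.mean()
budget, trials = 100, 3_000

def uniform_estimate():
    """Mean over a uniformly drawn subset of the pool."""
    return pool[rng.choice(pool.size, budget, replace=False)].mean()

def neyman_estimate(sigma_hat=(0.05, 0.30)):
    """Stratified mean with a Neyman split based on (possibly estimated) sigmas."""
    sizes = np.bincount(strata)
    w = sizes * np.asarray(sigma_hat)
    n_h = np.maximum(1, np.round(budget * w / w.sum()).astype(int))
    parts = [pool[strata == h][rng.choice(sizes[h], n_h[h], replace=False)].mean()
             for h in range(2)]
    return float(np.dot(sizes / sizes.sum(), parts))

def mse(estimator):
    """MSE(R_hat) = (1/T) * sum_t (R_hat^(t) - R_D)^2."""
    return float(np.mean([(estimator() - true_risk) ** 2 for _ in range(trials)]))

print(f"uniform {mse(uniform_estimate):.2e}  neyman {mse(neyman_estimate):.2e}")
```

Repeating this with a deliberately miscalibrated `sigma_hat` shows how quickly the advantage over uniform sampling erodes as the surrogate's variance signal degrades.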

Figures

Figures reproduced from arXiv: 2605.10075 by Cong Liu, Jiancheng Zhang, Yinglun Zhu, Zeli Liu.

Figure 1
Figure 1: Performance comparison across four benchmarks with labeling budget … view at source ↗
Figure 2
Figure 2: Comparison with adapted LURE variants. We report the relative MSE (lower is better) of SE-LURE, PE-LURE, and our method against Uniform Sampling across four benchmarks and model pairs at budget M = 100. view at source ↗
Figure 3
Figure 3: Results of active testing across four benchmarks with various surrogate-target model pairs. view at source ↗
read the original abstract

Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces an active testing algorithm for LLMs on generative tasks that stratifies the evaluation pool using semantic entropy signals extracted from surrogate models and then applies approximate Neyman allocation based on variance estimates derived from those same surrogate signals. It reports empirical results across language and multimodal benchmarks with multiple surrogate-target model pairs, claiming significant improvements over baselines, close tracking of an Oracle-Neyman allocation, up to 28% MSE reduction versus uniform sampling, and an average of 22.9% budget savings.

Significance. If the central approximation holds, the work offers a practical way to reduce the growing evaluation costs for large generative models that require expert annotation. The empirical evaluation across diverse benchmarks and surrogate-target pairs is a clear strength, providing evidence of gains over uniform sampling and proximity to the oracle in the tested regimes.

major comments (2)
  1. [Abstract] The load-bearing assumption that surrogate semantic entropy signals are sufficiently correlated with (or proportional to) the per-example variance that would be observed under the target model on generative tasks is not directly validated. The abstract claims the method 'closely tracks Oracle-Neyman' and delivers 22.9% average budget savings, yet supplies no diagnostic such as a reported correlation coefficient between surrogate entropy and target variance or an efficiency-loss metric relative to true Neyman allocation.
  2. [Method description] Without details on how the approximate Neyman allocation is computed from surrogate signals (e.g., exact variance estimation procedure or handling of generative scoring functions), it is unclear whether the reported MSE reductions could be biased by the choice of surrogate or by mismatch in scale/architecture between surrogate and target.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence statement of the key modeling assumption and any stated limitations on surrogate-target mismatch.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and have revised the manuscript to incorporate additional validation and methodological details.

read point-by-point responses
  1. Referee: [Abstract] The load-bearing assumption that surrogate semantic entropy signals are sufficiently correlated with (or proportional to) the per-example variance that would be observed under the target model on generative tasks is not directly validated. The abstract claims the method 'closely tracks Oracle-Neyman' and delivers 22.9% average budget savings, yet supplies no diagnostic such as a reported correlation coefficient between surrogate entropy and target variance or an efficiency-loss metric relative to true Neyman allocation.

    Authors: We agree that a direct diagnostic would strengthen the presentation of the core assumption. While the empirical results demonstrate close tracking of the Oracle-Neyman allocation (with up to 28% MSE reduction and 22.9% average budget savings across benchmarks), we acknowledge the absence of explicit correlation or efficiency-loss metrics in the original submission. In the revised manuscript we have added a new appendix containing Pearson correlation coefficients between surrogate semantic entropy and target-model variance for every surrogate-target pair, together with an efficiency-loss metric that quantifies the sub-optimality of our approximate allocation relative to the true Neyman allocation. These additions provide the requested direct validation without altering the reported performance figures (an illustrative sketch of one such metric appears after these responses). revision: yes

  2. Referee: [Method description] Without details on how the approximate Neyman allocation is computed from surrogate signals (e.g., exact variance estimation procedure or handling of generative scoring functions), it is unclear whether the reported MSE reductions could be biased by the choice of surrogate or by mismatch in scale/architecture between surrogate and target.

    Authors: We have expanded the Method section (Section 3.2) to include the precise variance-estimation procedure: per-stratum variance is estimated as the product of the surrogate semantic entropy and a scaling factor obtained from a small pilot sample evaluated on the target model. For generative scoring functions we explicitly use the average negative log-likelihood over multiple sampled continuations. We have also added a dedicated paragraph discussing potential surrogate-target mismatch, including an ablation study that repeats the main experiments with surrogates of varying scale and architecture; the gains remain consistent. These clarifications address the concern about possible bias while preserving the original empirical claims (a sketch of the calibration step also follows below). revision: yes
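Response 1's efficiency-loss metric is not defined in the material above; one plausible reading, sketched here with invented names, compares the stratified estimator's variance under the approximate allocation with the Neyman optimum:

```python
import numpy as np

def stratified_variance(weights, sigma_true, n_h):
    """Variance of the stratified mean: sum_h W_h^2 * sigma_h^2 / n_h
    (no finite-population correction)."""
    return float(np.sum(weights ** 2 * sigma_true ** 2 / n_h))

def efficiency_loss(weights, sigma_true, sigma_hat, budget):
    """Relative excess variance from allocating with estimated sigma_hat
    instead of the true sigma. Zero means the approximation matches
    Oracle-Neyman; this is our guess at the metric, not the paper's definition."""
    alloc = lambda s: budget * weights * s / np.sum(weights * s)
    v_opt = stratified_variance(weights, sigma_true, alloc(sigma_true))
    v_apx = stratified_variance(weights, sigma_true, alloc(sigma_hat))
    return v_apx / v_opt - 1.0
```

For example, `efficiency_loss(np.array([0.5, 0.5]), np.array([0.05, 0.30]), np.array([0.10, 0.25]), budget=100)` is roughly 0.10, i.e. about 10 percent extra variance from the miscalibrated estimates.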
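Response 2's calibration step (surrogate entropy times a scale fit on a pilot sample scored by the target model) could be implemented as follows; the least-squares fit through the origin and all names are our assumptions, not the paper's stated procedure:

```python
import numpy as np

def calibrated_sigma(entropy_by_stratum, pilot_scores_by_stratum):
    """Per-stratum std estimate: surrogate semantic entropy times a scaling
    factor fit on a small pilot sample scored on the target model. Scores
    would be e.g. average negative log-likelihood over sampled continuations."""
    entropy = np.asarray(entropy_by_stratum, dtype=float)
    pilot_sigma = np.array([np.std(s, ddof=1) for s in pilot_scores_by_stratum])
    # Least-squares scale through the origin: minimizes ||scale*entropy - pilot_sigma||^2
    scale = float(entropy @ pilot_sigma / (entropy @ entropy))
    return scale * entropy
```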

Circularity Check

0 steps flagged

No circularity: surrogate-based approximation is externally validated rather than self-referential

full rationale

The derivation chain consists of stratifying the pool by semantic entropy extracted from an independent surrogate model and then performing Neyman allocation using variance signals from that same surrogate. The claimed MSE reductions and budget savings are reported as empirical results across surrogate-target pairs on external benchmarks, not as quantities forced by fitting parameters to the target evaluation data itself. No equations define the allocation weights in terms of the final performance metric, no self-citation supplies a uniqueness theorem that forbids alternatives, and the tracking of Oracle-Neyman is presented as an observed outcome rather than a definitional identity. The method is therefore validated against external benchmarks rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only view yields minimal ledger entries; the approach assumes surrogate entropy is a usable proxy without providing independent validation of that correlation.

invented entities (1)
  • surrogate semantic entropy signals · no independent evidence
    purpose: to stratify the evaluation pool and drive approximate Neyman allocation
    Introduced as the key input for deciding which examples to label; no independent evidence supplied in abstract that these signals match target-model variance.

pith-pipeline@v0.9.0 · 5458 in / 1162 out tokens · 56697 ms · 2026-05-12T04:00:05.146387+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 11 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631,

  2. [2]

    Scaling Up Active Testing to Large Language Models

    Gabrielle Berrada, Jannik Kossen, Freddie Bickford Smith, Muhammed Razzak, Yarin Gal, and Tom Rainforth. Scaling up active testing to large language models. arXiv preprint arXiv:2508.09093,

  3. [3]

    An experimental design framework for label-efficient supervised finetuning of large language models

    Gantavya Bhatt, Yifang Chen, Arnav Das, Jifan Zhang, Sang Truong, Stephen Mussmann, Yinglun Zhu, Jeff Bilmes, Simon Du, Kevin Jamieson, et al. An experimental design framework for label-efficient supervised finetuning of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6549–6560,

  4. [4]

    On Statistical Bias in Active Learning: How and When to Fix It

    Sebastian Farquhar, Yarin Gal, and Tom Rainforth. On statistical bias in active learning: How and when to fix it. arXiv preprint arXiv:2101.11665,

  5. [5]

    Are We Done with MMLU?

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologi...

  6. [6]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 10,

  7. [7]

    Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664,

  8. [8]

    Mixtraining: A Better Trade-off Between Compute and Performance

    Zexin Li, Jiancheng Zhang, Yufei Li, Yinglun Zhu, and Cong Liu. Mixtraining: A better trade-off between compute and performance. arXiv preprint arXiv:2502.19513,

  9. [9]

    Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation

    Aashish Anantha Ramakrishnan, Ardavan Saeedi, Hamid Reza Hassanzadeh, Fazlolah Mohaghegh, and Dongwon Lee. Generative active testing: Efficient llm evaluation via proxy task adaptation. arXiv preprint arXiv:2603.19264,

  10. [10]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022,

  11. [11]

    Active Learning for Convolutional Neural Networks: A Core-Set Approach

    Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489,

  12. [12]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,

  13. [13]

    Moq: mixture-of-format activation quantization for communication-efficient ai inference system

    Haonan Wang, Zeli Liu, Chao Fang, John Paul Walters, and Stephen P Crago. Moq: mixture-of-format activation quantization for communication-efficient ai inference system. In NeurIPS 2024 Workshop Machine Learning with new Compute Paradigms, 2024a. Haonan Wang, Zeli Liu, Kajimusugura Hoshino, Tuo Zhang, John Paul Walters, and Stephen Crago. Fedpai: Achieving...

  14. [14]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

  15. [15]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629,

  16. [16]

    Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

    Jiancheng Zhang and Yinglun Zhu. Towards multimodal active learning: Efficient learning with limited paired data. arXiv preprint arXiv:2510.03247,

  17. [17]

    Labelbench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning

    Jifan Zhang, Yifang Chen, Gregory Canal, Stephen Mussmann, Arnav M Das, Gantavya Bhatt, Yinglun Zhu, Jeffrey Bilmes, Simon Shaolei Du, Kevin Jamieson, et al. Labelbench: A comprehensive framework for benchmarking adaptive label-efficient learning. arXiv preprint arXiv:2306.09910,

  18. [18]

    Accelerating unbiased llm evaluation via synthetic feedback

    Zhaoyi Zhou, Yuda Song, and Andrea Zanette. Accelerating unbiased llm evaluation via synthetic feedback. arXiv preprint arXiv:2502.10563,

  19. [19]

    Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

    Yinglun Zhu and Robert Nowak. Active learning with neural networks: Insights from nonparametric statistics. Advances in Neural Information Processing Systems, 35:142–155, 2022a. Yinglun Zhu and Robert Nowak. Efficient active learning with abstention. Advances in Neural Information Processing Systems, 35:35379–35391, 2022b. Yinglun Zhu, Jiancheng Zhang, and...

  20. [20]

    Strategic Scaling of Test-Time Compute: A Bandit Learning Approach

    Bowen Zuo and Yinglun Zhu. Strategic scaling of test-time compute: A bandit learning approach. arXiv preprint arXiv:2506.12721,

  21. [21]

    Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

    Bowen Zuo, Dongruo Zhou, and Yinglun Zhu. Adaptive test-time compute allocation with evolving in-context demonstrations. arXiv preprint arXiv:2604.21018,