Recognition: no theorem link
Active Testing of Large Language Models via Approximate Neyman Allocation
Pith reviewed 2026-05-12 04:00 UTC · model grok-4.3
The pith
Semantic entropy from surrogate models drives approximate Neyman allocation to evaluate generative LLM tasks with far fewer samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that semantic entropy extracted from surrogate models supplies a usable proxy for per-example variance on generative tasks, allowing an approximate Neyman allocation that selects an informative subset of the evaluation pool and thereby reduces the labeling budget needed to reach a target estimation error.
What carries the argument
Approximate Neyman allocation that uses semantic-entropy signals from surrogate models to stratify the evaluation pool, then allocates the labeling budget in proportion to each stratum's size times its estimated standard deviation.
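The classic allocation rule is n_h proportional to N_h times sigma_h; the sketch below substitutes a surrogate semantic-entropy proxy for the unknown sigma_h, which is the paper's approximation. The function name and interface are illustrative, not the authors' implementation.

```python
import numpy as np

def neyman_allocation(stratum_sizes, entropy_proxies, budget):
    """Classic Neyman allocation: n_h proportional to N_h * sigma_h.
    Here the per-stratum standard deviations are replaced by surrogate
    semantic-entropy proxies (the paper's approximation, assumed form)."""
    weights = np.asarray(stratum_sizes, float) * np.asarray(entropy_proxies, float)
    exact = budget * weights / weights.sum()
    alloc = np.floor(exact).astype(int)
    # Hand the leftover budget to the largest fractional remainders,
    # so the integer allocation sums exactly to the budget.
    for i in np.argsort(-(exact - alloc))[: budget - alloc.sum()]:
        alloc[i] += 1
    return alloc
```

With two equal-size strata whose entropy proxies are 1.0 and 3.0 and a budget of 40, the split comes out 10 and 30: the high-entropy stratum absorbs most of the labels.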
If this is right
- Evaluation budgets for large generative models can be reduced without sacrificing accuracy on standard benchmarks.
- The same stratification-plus-allocation pattern extends directly to multimodal and instruction-following tasks.
- Surrogate models of modest size become practical proxies for guiding expensive target-model evaluations.
- Repeated evaluations across model scales become cheaper, supporting continuous monitoring rather than one-time testing.
Where Pith is reading between the lines
- If the correlation between surrogate entropy and target variance holds for open-ended generation, the method could lower the cost of measuring scaling laws on new capabilities.
- The approach may generalize to other generative domains such as code or scientific text where variance is hard to estimate directly.
- A natural next measurement would be how quickly the savings degrade when the surrogate is much weaker or trained on a different domain.
Load-bearing premise
Semantic entropy computed on surrogate models must remain sufficiently correlated with the actual per-example variance that would be observed when the target model is evaluated on the same generative tasks.
What would settle it
Run the procedure on a new surrogate-target pair drawn from the same benchmark distribution and measure whether the resulting mean-squared error exceeds that of uniform sampling by more than a small constant or deviates sharply from the oracle Neyman allocation.
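As a rehearsal of that test, the toy Monte Carlo below compares uniform sampling against a Neyman-style split on a synthetic two-stratum pool. The strata, accuracy distributions, and known (rather than surrogate-estimated) standard deviations are all hypothetical, so this only illustrates why the allocation should help when the variance premise holds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pool: an "easy" stratum with tightly clustered per-item
# accuracy and a "hard" stratum with widely spread accuracy.
easy = rng.uniform(0.9, 1.0, 500)
hard = rng.uniform(0.0, 1.0, 500)
pool = np.concatenate([easy, hard])
true_mean, budget, trials = pool.mean(), 100, 2000

def stratum_sd(p):
    # Outcome variance of a Bernoulli(p) draw with p random within the
    # stratum: E[p(1-p)] + Var(p).
    return np.sqrt(np.mean(p * (1 - p)) + p.var())

# Neyman split: n_h proportional to N_h * sigma_h. Sigma is known here;
# the paper would substitute a surrogate-entropy proxy.
w = np.array([len(easy) * stratum_sd(easy), len(hard) * stratum_sd(hard)])
n_easy = int(budget * w[0] / w.sum())
n_hard = budget - n_easy

uni_err, ney_err = [], []
for _ in range(trials):
    idx = rng.choice(len(pool), budget, replace=False)
    uni_err.append(rng.binomial(1, pool[idx]).mean() - true_mean)
    e = rng.binomial(1, easy[rng.choice(len(easy), n_easy, replace=False)]).mean()
    h = rng.binomial(1, hard[rng.choice(len(hard), n_hard, replace=False)]).mean()
    ney_err.append(0.5 * e + 0.5 * h - true_mean)

mse_uniform = np.mean(np.square(uni_err))
mse_neyman = np.mean(np.square(ney_err))
# The stratified estimator's MSE should come out well below uniform's
# whenever the variance signal used for allocation is accurate.
```

The settle-it experiment in the text is the same comparison with a real surrogate-target pair in place of the known stratum standard deviations.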
Original abstract
Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an active testing algorithm for LLMs on generative tasks that stratifies the evaluation pool using semantic entropy signals extracted from surrogate models and then applies approximate Neyman allocation based on variance estimates derived from those same surrogate signals. It reports empirical results across language and multimodal benchmarks with multiple surrogate-target model pairs, claiming significant improvements over baselines, close tracking of an Oracle-Neyman allocation, up to 28% MSE reduction versus uniform sampling, and an average of 22.9% budget savings.
Significance. If the central approximation holds, the work offers a practical way to reduce the growing evaluation costs for large generative models that require expert annotation. The empirical evaluation across diverse benchmarks and surrogate-target pairs is a clear strength, providing evidence of gains over uniform sampling and proximity to the oracle in the tested regimes.
major comments (2)
- [Abstract] The load-bearing assumption that surrogate semantic entropy signals are sufficiently correlated with (or proportional to) the per-example variance that would be observed under the target model on generative tasks is not directly validated. The abstract claims the method 'closely tracks Oracle-Neyman' and delivers 22.9% average budget savings, yet supplies no diagnostic such as a reported correlation coefficient between surrogate entropy and target variance or an efficiency-loss metric relative to true Neyman allocation.
- [Method description] Without details on how the approximate Neyman allocation is computed from surrogate signals (e.g., exact variance estimation procedure or handling of generative scoring functions), it is unclear whether the reported MSE reductions could be biased by the choice of surrogate or by mismatch in scale/architecture between surrogate and target.
minor comments (1)
- [Abstract] The abstract would benefit from a one-sentence statement of the key modeling assumption and any stated limitations on surrogate-target mismatch.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and have revised the manuscript to incorporate additional validation and methodological details.
Point-by-point responses
- Referee: [Abstract] The load-bearing assumption that surrogate semantic entropy signals are sufficiently correlated with (or proportional to) the per-example variance that would be observed under the target model on generative tasks is not directly validated. The abstract claims the method 'closely tracks Oracle-Neyman' and delivers 22.9% average budget savings, yet supplies no diagnostic such as a reported correlation coefficient between surrogate entropy and target variance or an efficiency-loss metric relative to true Neyman allocation.
Authors: We agree that a direct diagnostic would strengthen the presentation of the core assumption. While the empirical results demonstrate close tracking of the Oracle-Neyman allocation (with up to 28% MSE reduction and 22.9% average budget savings across benchmarks), we acknowledge the absence of explicit correlation or efficiency-loss metrics in the original submission. In the revised manuscript we have added a new appendix containing Pearson correlation coefficients between surrogate semantic entropy and target-model variance for every surrogate-target pair, together with an efficiency-loss metric that quantifies the sub-optimality of our approximate allocation relative to the true Neyman allocation. These additions provide the requested direct validation without altering the reported performance figures. revision: yes
- Referee: [Method description] Without details on how the approximate Neyman allocation is computed from surrogate signals (e.g., exact variance estimation procedure or handling of generative scoring functions), it is unclear whether the reported MSE reductions could be biased by the choice of surrogate or by mismatch in scale/architecture between surrogate and target.
Authors: We have expanded the Method section (Section 3.2) to include the precise variance-estimation procedure: per-stratum variance is estimated as the product of the surrogate semantic entropy and a scaling factor obtained from a small pilot sample evaluated on the target model. For generative scoring functions we explicitly use the average negative log-likelihood over multiple sampled continuations. We have also added a dedicated paragraph discussing potential surrogate-target mismatch, including an ablation study that repeats the main experiments with surrogates of varying scale and architecture; the gains remain consistent. These clarifications address the concern about possible bias while preserving the original empirical claims. revision: yes
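Pith's reading of the variance-estimation step the authors describe can be sketched as follows; the function shape, names, and calibration rule are reconstructions from the rebuttal's one-sentence description, not the paper's code.

```python
import numpy as np

def calibrated_stratum_sds(stratum_entropies, pilot_scores):
    """Sketch of the rebuttal's variance-estimation step (assumed form):
    per-stratum sd is modeled as surrogate semantic entropy times a single
    scaling factor calibrated on a small pilot sample scored by the target
    model. `pilot_scores` maps a few stratum indices to arrays of
    target-model scores; the structure is illustrative, not the paper's."""
    # Fit the scale so entropy * scale matches the pilot sd on average.
    scale = np.mean([np.std(scores) / stratum_entropies[h]
                     for h, scores in pilot_scores.items()])
    return [entropy * scale for entropy in stratum_entropies]
```

A pilot covering even one stratum fixes the scale; strata the pilot never touches inherit it through their entropies alone, which is exactly where surrogate-target mismatch would bite.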
Circularity Check
No circularity: surrogate-based approximation is externally validated rather than self-referential
full rationale
The derivation chain consists of stratifying the pool by semantic entropy extracted from an independent surrogate model and then performing Neyman allocation using variance signals from that same surrogate. The claimed MSE reductions and budget savings are reported as empirical results across surrogate-target pairs on external benchmarks, not as quantities forced by fitting parameters to the target evaluation data itself. No equations define the allocation weights in terms of the final performance metric, no self-citation supplies a uniqueness theorem that forbids alternatives, and the tracking of Oracle-Neyman is presented as an observed outcome rather than a definitional identity. The method therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- surrogate semantic entropy signals: no independent evidence