ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Pith reviewed 2026-05-08 08:17 UTC · model grok-4.3
The pith
ProEval uses pre-trained Gaussian Processes as surrogates to estimate generative AI performance to within 1 percent of ground truth with 8-65 times fewer samples while surfacing more diverse failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProEval frames performance estimation as Bayesian quadrature with pre-trained Gaussian Processes serving as surrogates for the score function that maps inputs to metrics, and failure discovery as superlevel set sampling that uses uncertainty to pick informative cases. The paper proves the resulting estimator is unbiased and bounded, and experiments on reasoning, safety alignment, and classification benchmarks show it reaches estimates within 1 percent of ground truth with 8-65 times fewer samples while surfacing more varied failures than baselines under the same constraints.
What carries the argument
Pre-trained Gaussian Processes as surrogates for the performance score function, supporting Bayesian quadrature for estimation and superlevel set sampling for failure discovery.
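A minimal sketch of the estimation half of that machinery, reduced to one dimension with an RBF kernel and a standard-normal input density so the kernel mean has a closed form; every name and choice here is an illustrative assumption, not ProEval's implementation:

```python
# Hypothetical sketch of GP-based Bayesian quadrature for score estimation.
# Assumes an RBF kernel and a standard-normal input density, which admit a
# closed-form kernel mean; none of this is ProEval's actual code.
import numpy as np

def rbf(a, b, ell=1.0):
    """RBF kernel matrix between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def bq_estimate(x, y, ell=1.0, jitter=1e-8):
    """Posterior-mean estimate of Z = E_{x~N(0,1)}[f(x)] from scores y = f(x)."""
    K = rbf(x, x, ell) + jitter * np.eye(len(x))
    # Kernel mean z_i = integral of k(x, x_i) against N(0, 1): RBF closed form.
    z = np.sqrt(ell**2 / (ell**2 + 1.0)) * np.exp(-0.5 * x**2 / (ell**2 + 1.0))
    w = np.linalg.solve(K, z)   # quadrature weights w = K^{-1} z
    return w @ y                # Z_hat = w^T y

rng = np.random.default_rng(0)
f = lambda x: np.tanh(x) ** 2          # stand-in score function
x = rng.normal(size=12)                # a small budget of evaluated inputs
print(bq_estimate(x, f(x)))            # close to the Monte Carlo reference below
print(f(rng.normal(size=100_000)).mean())
```

In the paper's setup the GP would instead be pre-trained on prior models' evaluation data and the test inputs actively selected, rather than the hand-set lengthscale and random draws used above.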
If this is right
- Performance estimates reach within 1 percent of ground truth using 8-65 times fewer samples than standard methods.
- The same or tighter evaluation budgets yield a greater variety of discovered failure cases.
- The approach works across reasoning tasks, safety alignment checks, and classification benchmarks.
- The Bayesian quadrature estimator is theoretically unbiased and bounded regardless of the specific inputs chosen; a standard-notation sketch of this property follows this list.
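That last property is the textbook Bayesian quadrature argument. A hedged restatement in generic notation (zero prior mean, kernel k, input measure π; this is the standard result, not the paper's actual theorem or notation):

```latex
% Target integral and GP-based BQ estimator (generic setup, not the paper's notation).
% Z = \int f(x)\, d\pi(x); observations y_i = f(x_i); K_{ij} = k(x_i, x_j).
\[
  \hat{Z} = z^\top K^{-1} y,
  \qquad
  z_i = \int k(x, x_i)\, d\pi(x).
\]
% Unbiasedness over the prior: if f \sim \mathcal{GP}(0, k), both Z and \hat{Z}
% are zero-mean linear functionals of f, so for any fixed design \{x_i\}:
\[
  \mathbb{E}_{f \sim \mathcal{GP}(0,k)}\bigl[\hat{Z} - Z\bigr] = 0 .
\]
```

The paper's added wrinkle is that the prior is pre-trained on earlier models' evaluations, which is exactly where the referee's distribution-shift concern below bites.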
Where Pith is reading between the lines
- With accumulated prior data the method could support ongoing monitoring of model families without repeated full-scale testing.
- Active input synthesis might lower dependence on broad human rating pools in safety reviews.
- The surrogate approach could be tested on non-generative models if suitable prior evaluation histories exist.
Load-bearing premise
Pre-trained Gaussian Processes from earlier evaluations must closely approximate the score function on new models and inputs without large distribution shifts.
What would settle it
A new model or benchmark family produces actual performance values that deviate more than 1 percent from the ProEval estimates even after the reduced sample count, or the method misses key failures that random sampling finds.
Original abstract
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ProEval, a proactive evaluation framework for generative AI models that uses pre-trained Gaussian Processes (GPs) as surrogates for performance score functions. It frames performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, providing uncertainty-aware active selection strategies. The paper claims to prove that the pre-trained GP-based BQ estimator is unbiased and bounded, and demonstrates empirically that it requires 8-65x fewer samples than baselines to achieve estimates within 1% of ground truth while identifying more diverse failures on reasoning, safety alignment, and classification benchmarks.
Significance. If the transfer assumptions for the pre-trained GP hold without significant distribution shift and the empirical efficiency gains prove robust, ProEval could meaningfully reduce the computational and human costs of evaluating new generative models, supporting more scalable safety and performance testing. The dual focus on efficient estimation and proactive failure discovery is a strength, particularly if the theoretical unbiasedness and boundedness results extend reliably to new models.
major comments (3)
- [Theoretical Analysis] Theoretical Analysis section (proof of unbiasedness for pre-trained GP-based BQ estimator): The claim that the estimator is unbiased and bounded relies on the performance function for a new model being drawn from (or well-approximated by) the posterior of the GP pre-trained on prior models. The manuscript provides no explicit conditions, bounds, or analysis on distribution shift arising from changes in model architecture, input distributions, or failure semantics; without this, the unbiasedness guarantee does not necessarily transfer, undermining the central theoretical contribution.
- [Experimental Results] Experimental Results section (efficiency claims of 8-65x fewer samples): The reported sample reductions to reach 1% error relative to ground truth depend on the transferred GP surrogate accurately guiding active selection. No ablation is described that varies the similarity between pre-training data and target model distributions, leaving open whether the gains hold under realistic model-specific shifts or are limited to low-shift cases.
- [Method] Method section (superlevel set sampling for failure discovery): The uncertainty-aware strategy for selecting inputs to reveal diverse failures uses the pre-trained GP posterior, but the manuscript does not detail how posterior predictive variance is adjusted or regularized to account for potential mismatch with new models, which is load-bearing for the claim of simultaneously better failure coverage under strict budgets.
minor comments (2)
- [Method] Notation for the GP kernel and quadrature weights could be clarified with an explicit equation reference when first introduced, to aid readers in following the BQ formulation.
- [Abstract] The abstract and experiments section would benefit from specifying the exact number of models, benchmarks, and total evaluation budgets used in the comparisons for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review of our manuscript. We address each major comment point by point below, providing clarifications on the theoretical assumptions, empirical robustness, and methodological details. Where appropriate, we indicate revisions that will be incorporated into the next version of the paper.
Point-by-point responses
Referee: [Theoretical Analysis] Theoretical Analysis section (proof of unbiasedness for pre-trained GP-based BQ estimator): The claim that the estimator is unbiased and bounded relies on the performance function for a new model being drawn from (or well-approximated by) the posterior of the GP pre-trained on prior models. The manuscript provides no explicit conditions, bounds, or analysis on distribution shift arising from changes in model architecture, input distributions, or failure semantics; without this, the unbiasedness guarantee does not necessarily transfer, undermining the central theoretical contribution.
Authors: We appreciate the referee highlighting the importance of clearly stating the assumptions in our theoretical analysis. The proof of unbiasedness and boundedness for the pre-trained GP-based Bayesian quadrature estimator is derived under the explicit modeling assumption that the target performance function is sampled from the posterior of the GP pre-trained on prior models; this assumption is stated in the Theoretical Analysis section. However, we agree that the manuscript does not provide a dedicated discussion of conditions under which the assumption holds or quantitative bounds on bias due to distribution shift. In the revised manuscript, we will expand the Theoretical Analysis section with: (i) a formal restatement of the transfer assumption, (ii) qualitative analysis of shift sources (architecture changes, input distribution shifts, evolving failure semantics), and (iii) guidance on when the approximation remains useful in practice, supported by the empirical similarity metrics used in our experiments. This addition will clarify the scope of the guarantees without changing the core result. revision: yes
Referee: [Experimental Results] Experimental Results section (efficiency claims of 8-65x fewer samples): The reported sample reductions to reach 1% error relative to ground truth depend on the transferred GP surrogate accurately guiding active selection. No ablation is described that varies the similarity between pre-training data and target model distributions, leaving open whether the gains hold under realistic model-specific shifts or are limited to low-shift cases.
Authors: We agree that an explicit ablation on distributional similarity would strengthen the empirical claims. Our current experiments already span multiple benchmark families (reasoning, safety alignment, classification) with pre-training performed on related but non-identical models, which we view as representative of realistic transfer scenarios. To directly respond to the concern, the revised Experimental Results section will include a new ablation study that systematically varies pre-training set composition by similarity (using embedding-based or task-overlap metrics) and reports the resulting sample-efficiency curves on held-out target models. This will quantify how efficiency gains degrade under increasing shift while confirming robustness in moderate-shift regimes typical of model evaluation. revision: yes
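A hedged sketch of the similarity score such an ablation could bin on, here mean pairwise cosine similarity between embedding sets; the embedding source and the metric itself are assumptions of this sketch, not the paper's stated procedure:

```python
# Hypothetical similarity score between a pre-training task set and a target
# task set, given fixed input embeddings; not the paper's actual metric.
import numpy as np

def set_similarity(pretrain_emb, target_emb):
    """Mean pairwise cosine similarity between two embedding matrices (n_i x d)."""
    a = pretrain_emb / np.linalg.norm(pretrain_emb, axis=1, keepdims=True)
    b = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    return float((a @ b.T).mean())

rng = np.random.default_rng(1)
low  = set_similarity(rng.normal(size=(50, 32)), rng.normal(size=(40, 32)))
base = rng.normal(size=(40, 32))
high = set_similarity(base + 0.1 * rng.normal(size=(40, 32)), base)
print(low, high)  # near 0 for unrelated sets, near 1 for near-duplicates
```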
Referee: [Method] Method section (superlevel set sampling for failure discovery): The uncertainty-aware strategy for selecting inputs to reveal diverse failures uses the pre-trained GP posterior, but the manuscript does not detail how posterior predictive variance is adjusted or regularized to account for potential mismatch with new models, which is load-bearing for the claim of simultaneously better failure coverage under strict budgets.
Authors: The superlevel set sampling procedure selects points using the posterior predictive mean and variance of the pre-trained GP, where elevated variance naturally encourages exploration in regions of potential mismatch. We acknowledge that the manuscript does not explicitly describe any additional regularization or adjustment for shift. In the revised Method section we will add: (i) a description of an optional variance-inflation mechanism that scales the predictive variance by a shift-detection factor computed from a small calibration set on the target model, (ii) the corresponding mathematical formulation, and (iii) pseudocode illustrating how the adjustment integrates with the acquisition function. This will make the handling of mismatch transparent while preserving the core uncertainty-aware selection strategy. revision: yes
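For concreteness, a hedged sketch of what a variance-inflated superlevel-set acquisition along those lines could look like; the straddle-style rule, the calibration-based shift factor, and every name here are assumptions of this sketch, not the formulation the authors promise for the revision:

```python
# Hypothetical sketch: superlevel-set sampling with shift-inflated variance.
# mu, sigma are the pre-trained GP's posterior mean/std on candidate inputs;
# tau is the failure threshold. None of this is ProEval's actual code.
import numpy as np

def shift_factor(calib_true, calib_mu, calib_sigma):
    """Inflation factor from a small calibration set on the target model:
    the RMS of standardized residuals, floored at 1 so variance never shrinks."""
    z = (calib_true - calib_mu) / calib_sigma
    return max(1.0, float(np.sqrt(np.mean(z ** 2))))

def select_next(mu, sigma, tau, gamma):
    """Straddle-style level-set rule: favor candidates that are uncertain
    (after inflation by gamma) and close to the failure threshold tau."""
    acq = 1.96 * gamma * sigma - np.abs(mu - tau)
    return int(np.argmax(acq))

# Toy usage: three candidates, two calibration points.
gamma = shift_factor(np.array([0.1, 0.9]), np.array([0.2, 0.7]), np.array([0.1, 0.1]))
idx = select_next(np.array([0.2, 0.8, 0.5]), np.array([0.05, 0.30, 0.10]),
                  tau=0.6, gamma=gamma)
print(gamma, idx)  # inflated by ~1.58; picks the uncertain near-threshold candidate
```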
Circularity Check
No circularity; the theoretical claim rests on standard BQ properties applied to a transferred surrogate.
Full rationale
The paper states a proof that the pre-trained GP-based BQ estimator is unbiased and bounded, with the GP surrogate constructed from prior model evaluations and then used for active selection on new inputs. No quoted equations or derivation steps reduce this unbiasedness to a tautology, a fitted parameter renamed as a prediction, or a self-citation chain that assumes the target result. The pre-training step is external to the BQ math itself, and the empirical sample-efficiency claims are presented as separate validation rather than derived from the same fitted quantities. The derivation chain therefore rests on standard, externally checkable BQ unbiasedness results.