ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Pith reviewed 2026-05-08 08:17 UTC · model grok-4.3
The pith
ProEval uses pre-trained Gaussian Processes as surrogates to estimate generative AI performance to within 1 percent of ground truth with 8-65 times fewer samples while surfacing more diverse failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProEval frames performance estimation as Bayesian quadrature with pre-trained Gaussian Processes serving as surrogates for the score function that maps inputs to metrics, and failure discovery as superlevel set sampling that uses uncertainty to pick informative cases. The paper proves the resulting estimator is unbiased and bounded, and experiments on reasoning, safety alignment, and classification benchmarks show it reaches estimates within 1 percent of ground truth with 8-65 times fewer samples while surfacing more varied failures than baselines under the same constraints.
What carries the argument
Pre-trained Gaussian Processes as surrogates for the performance score function, supporting Bayesian quadrature for estimation and superlevel set sampling for failure discovery.
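A minimal sketch of the estimation half of that machinery, reduced to one dimension with an RBF kernel and a standard-normal input density so the kernel mean has a closed form; every name and choice here is an illustrative assumption, not ProEval's implementation:

```python
# Hypothetical sketch of GP-based Bayesian quadrature for score estimation.
# Assumes an RBF kernel and a standard-normal input density, which admit a
# closed-form kernel mean; none of this is ProEval's actual code.
import numpy as np

def rbf(a, b, ell=1.0):
    """RBF kernel matrix between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def bq_estimate(x, y, ell=1.0, jitter=1e-8):
    """Posterior-mean estimate of Z = E_{x~N(0,1)}[f(x)] from scores y = f(x)."""
    K = rbf(x, x, ell) + jitter * np.eye(len(x))
    # Kernel mean z_i = integral of k(x, x_i) against N(0, 1): RBF closed form.
    z = np.sqrt(ell**2 / (ell**2 + 1.0)) * np.exp(-0.5 * x**2 / (ell**2 + 1.0))
    w = np.linalg.solve(K, z)   # quadrature weights w = K^{-1} z
    return w @ y                # Z_hat = w^T y

rng = np.random.default_rng(0)
f = lambda x: np.tanh(x) ** 2          # stand-in score function
x = rng.normal(size=12)                # a small budget of evaluated inputs
print(bq_estimate(x, f(x)))            # close to the Monte Carlo reference below
print(f(rng.normal(size=100_000)).mean())
```

In the paper's setup the GP would instead be pre-trained on prior models' evaluation data and the test inputs actively selected, rather than the hand-set lengthscale and random draws used above.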
If this is right
- Performance estimates reach within 1 percent of ground truth using 8-65 times fewer samples than standard methods.
- The same or tighter evaluation budgets yield a greater variety of discovered failure cases.
- The approach works across reasoning tasks, safety alignment checks, and classification benchmarks.
- The Bayesian quadrature estimator is theoretically unbiased and bounded regardless of the specific inputs chosen; a standard-notation sketch of this property follows this list.
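That last property is the textbook Bayesian quadrature argument. A hedged restatement in generic notation (zero prior mean, kernel k, input measure π; this is the standard result, not the paper's actual theorem or notation):

```latex
% Target integral and GP-based BQ estimator (generic setup, not the paper's notation).
% Z = \int f(x)\, d\pi(x); observations y_i = f(x_i); K_{ij} = k(x_i, x_j).
\[
  \hat{Z} = z^\top K^{-1} y,
  \qquad
  z_i = \int k(x, x_i)\, d\pi(x).
\]
% Unbiasedness over the prior: if f \sim \mathcal{GP}(0, k), both Z and \hat{Z}
% are zero-mean linear functionals of f, so for any fixed design \{x_i\}:
\[
  \mathbb{E}_{f \sim \mathcal{GP}(0,k)}\bigl[\hat{Z} - Z\bigr] = 0 .
\]
```

The paper's added wrinkle is that the prior is pre-trained on earlier models' evaluations, which is exactly where the referee's distribution-shift concern below bites.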
Where Pith is reading between the lines
- With accumulated prior data the method could support ongoing monitoring of model families without repeated full-scale testing.
- Active input synthesis might lower dependence on broad human rating pools in safety reviews.
- The surrogate approach could be tested on non-generative models if suitable prior evaluation histories exist.
Load-bearing premise
Pre-trained Gaussian Processes from earlier evaluations must closely approximate the score function on new models and inputs without large distribution shifts.
What would settle it
A new model or benchmark family produces actual performance values that deviate more than 1 percent from the ProEval estimates even after the reduced sample count, or the method misses key failures that random sampling finds.
Original abstract
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ProEval, a proactive evaluation framework for generative AI models that uses pre-trained Gaussian Processes (GPs) as surrogates for performance score functions. It frames performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, providing uncertainty-aware active selection strategies. The paper claims to prove that the pre-trained GP-based BQ estimator is unbiased and bounded, and demonstrates empirically that it requires 8-65x fewer samples than baselines to achieve estimates within 1% of ground truth while identifying more diverse failures on reasoning, safety alignment, and classification benchmarks.
Significance. If the transfer assumptions for the pre-trained GP hold without significant distribution shift and the empirical efficiency gains prove robust, ProEval could meaningfully reduce the computational and human costs of evaluating new generative models, supporting more scalable safety and performance testing. The dual focus on efficient estimation and proactive failure discovery is a strength, particularly if the theoretical unbiasedness and boundedness results extend reliably to new models.
major comments (3)
- [Theoretical Analysis] Theoretical Analysis section (proof of unbiasedness for pre-trained GP-based BQ estimator): The claim that the estimator is unbiased and bounded relies on the performance function for a new model being drawn from (or well-approximated by) the posterior of the GP pre-trained on prior models. The manuscript provides no explicit conditions, bounds, or analysis on distribution shift arising from changes in model architecture, input distributions, or failure semantics; without this, the unbiasedness guarantee does not necessarily transfer, undermining the central theoretical contribution.
- [Experimental Results] Experimental Results section (efficiency claims of 8-65x fewer samples): The reported sample reductions to reach 1% error relative to ground truth depend on the transferred GP surrogate accurately guiding active selection. No ablation is described that varies the similarity between pre-training data and target model distributions, leaving open whether the gains hold under realistic model-specific shifts or are limited to low-shift cases.
- [Method] Method section (superlevel set sampling for failure discovery): The uncertainty-aware strategy for selecting inputs to reveal diverse failures uses the pre-trained GP posterior, but the manuscript does not detail how posterior predictive variance is adjusted or regularized to account for potential mismatch with new models, which is load-bearing for the claim of simultaneously better failure coverage under strict budgets.
minor comments (2)
- [Method] Notation for the GP kernel and quadrature weights could be clarified with an explicit equation reference when first introduced, to aid readers in following the BQ formulation.
- [Abstract] The abstract and experiments section would benefit from specifying the exact number of models, benchmarks, and total evaluation budgets used in the comparisons for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review of our manuscript. We address each major comment point by point below, providing clarifications on the theoretical assumptions, empirical robustness, and methodological details. Where appropriate, we indicate revisions that will be incorporated into the next version of the paper.
Point-by-point responses
Referee: [Theoretical Analysis] Theoretical Analysis section (proof of unbiasedness for pre-trained GP-based BQ estimator): The claim that the estimator is unbiased and bounded relies on the performance function for a new model being drawn from (or well-approximated by) the posterior of the GP pre-trained on prior models. The manuscript provides no explicit conditions, bounds, or analysis on distribution shift arising from changes in model architecture, input distributions, or failure semantics; without this, the unbiasedness guarantee does not necessarily transfer, undermining the central theoretical contribution.
Authors: We appreciate the referee highlighting the importance of clearly stating the assumptions in our theoretical analysis. The proof of unbiasedness and boundedness for the pre-trained GP-based Bayesian quadrature estimator is derived under the explicit modeling assumption that the target performance function is sampled from the posterior of the GP pre-trained on prior models; this assumption is stated in the Theoretical Analysis section. However, we agree that the manuscript does not provide a dedicated discussion of conditions under which the assumption holds or quantitative bounds on bias due to distribution shift. In the revised manuscript, we will expand the Theoretical Analysis section with: (i) a formal restatement of the transfer assumption, (ii) qualitative analysis of shift sources (architecture changes, input distribution shifts, evolving failure semantics), and (iii) guidance on when the approximation remains useful in practice, supported by the empirical similarity metrics used in our experiments. This addition will clarify the scope of the guarantees without changing the core result. revision: yes
Referee: [Experimental Results] Experimental Results section (efficiency claims of 8-65x fewer samples): The reported sample reductions to reach 1% error relative to ground truth depend on the transferred GP surrogate accurately guiding active selection. No ablation is described that varies the similarity between pre-training data and target model distributions, leaving open whether the gains hold under realistic model-specific shifts or are limited to low-shift cases.
Authors: We agree that an explicit ablation on distributional similarity would strengthen the empirical claims. Our current experiments already span multiple benchmark families (reasoning, safety alignment, classification) with pre-training performed on related but non-identical models, which we view as representative of realistic transfer scenarios. To directly respond to the concern, the revised Experimental Results section will include a new ablation study that systematically varies pre-training set composition by similarity (using embedding-based or task-overlap metrics) and reports the resulting sample-efficiency curves on held-out target models. This will quantify how efficiency gains degrade under increasing shift while confirming robustness in moderate-shift regimes typical of model evaluation. revision: yes
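A hedged sketch of the similarity score such an ablation could bin on, here mean pairwise cosine similarity between embedding sets; the embedding source and the metric itself are assumptions of this sketch, not the paper's stated procedure:

```python
# Hypothetical similarity score between a pre-training task set and a target
# task set, given fixed input embeddings; not the paper's actual metric.
import numpy as np

def set_similarity(pretrain_emb, target_emb):
    """Mean pairwise cosine similarity between two embedding matrices (n_i x d)."""
    a = pretrain_emb / np.linalg.norm(pretrain_emb, axis=1, keepdims=True)
    b = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    return float((a @ b.T).mean())

rng = np.random.default_rng(1)
low  = set_similarity(rng.normal(size=(50, 32)), rng.normal(size=(40, 32)))
base = rng.normal(size=(40, 32))
high = set_similarity(base + 0.1 * rng.normal(size=(40, 32)), base)
print(low, high)  # near 0 for unrelated sets, near 1 for near-duplicates
```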
Referee: [Method] Method section (superlevel set sampling for failure discovery): The uncertainty-aware strategy for selecting inputs to reveal diverse failures uses the pre-trained GP posterior, but the manuscript does not detail how posterior predictive variance is adjusted or regularized to account for potential mismatch with new models, which is load-bearing for the claim of simultaneously better failure coverage under strict budgets.
Authors: The superlevel set sampling procedure selects points using the posterior predictive mean and variance of the pre-trained GP, where elevated variance naturally encourages exploration in regions of potential mismatch. We acknowledge that the manuscript does not explicitly describe any additional regularization or adjustment for shift. In the revised Method section we will add: (i) a description of an optional variance-inflation mechanism that scales the predictive variance by a shift-detection factor computed from a small calibration set on the target model, (ii) the corresponding mathematical formulation, and (iii) pseudocode illustrating how the adjustment integrates with the acquisition function. This will make the handling of mismatch transparent while preserving the core uncertainty-aware selection strategy. revision: yes
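For concreteness, a hedged sketch of what a variance-inflated superlevel-set acquisition along those lines could look like; the straddle-style rule, the calibration-based shift factor, and every name here are assumptions of this sketch, not the formulation the authors promise for the revision:

```python
# Hypothetical sketch: superlevel-set sampling with shift-inflated variance.
# mu, sigma are the pre-trained GP's posterior mean/std on candidate inputs;
# tau is the failure threshold. None of this is ProEval's actual code.
import numpy as np

def shift_factor(calib_true, calib_mu, calib_sigma):
    """Inflation factor from a small calibration set on the target model:
    the RMS of standardized residuals, floored at 1 so variance never shrinks."""
    z = (calib_true - calib_mu) / calib_sigma
    return max(1.0, float(np.sqrt(np.mean(z ** 2))))

def select_next(mu, sigma, tau, gamma):
    """Straddle-style level-set rule: favor candidates that are uncertain
    (after inflation by gamma) and close to the failure threshold tau."""
    acq = 1.96 * gamma * sigma - np.abs(mu - tau)
    return int(np.argmax(acq))

# Toy usage: three candidates, two calibration points.
gamma = shift_factor(np.array([0.1, 0.9]), np.array([0.2, 0.7]), np.array([0.1, 0.1]))
idx = select_next(np.array([0.2, 0.8, 0.5]), np.array([0.05, 0.30, 0.10]),
                  tau=0.6, gamma=gamma)
print(gamma, idx)  # inflated by ~1.58; picks the uncertain near-threshold candidate
```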
Circularity Check
No circularity; the theoretical claim rests on standard BQ properties applied to a transferred surrogate.
Full rationale
The paper states a proof that the pre-trained GP-based BQ estimator is unbiased and bounded, with the GP surrogate constructed from prior model evaluations and then used for active selection on new inputs. No quoted equations or derivation steps reduce this unbiasedness to a tautology, a fitted parameter renamed as a prediction, or a self-citation chain that assumes the target result. The pre-training step is external to the BQ math itself, and the empirical sample-efficiency claims are presented as separate validation rather than derived from the same fitted quantities. The derivation chain therefore rests on standard, externally checkable BQ unbiasedness results.