pith. machine review for the scientific record.

arxiv: 2605.07002 · v1 · submitted 2026-05-07 · 💻 cs.AI · math.ST · stat.ML · stat.TH

Recognition: no theorem link

Adaptive auditing of AI systems with anytime-valid guarantees

Jean Feng, Patrick Vossler, Siyu Zhou, Venkatesh Sivaraman, Yifan Mai

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3

classification 💻 cs.AI · math.ST · stat.ML · stat.TH
keywords adaptive auditing · anytime-valid inference · AI failure modes · hypothesis testing · robustness certification · generative AI evaluation · e-processes

The pith

Passing a stringent adaptive audit certifies an AI system as globally robust, provided the auditor is powerful enough to find any failures that exist.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up a hypothesis testing framework for auditing AI systems when the number and choice of test cases can change based on earlier results. It pits the model's claim that no failure mode exists below a performance threshold against the auditor's claim that they have a sampling strategy able to reveal such a failure. By translating the audit into simultaneous e-processes under safe anytime-valid inference, the method keeps error rates controlled at every step, even with small samples and flexible stopping rules. A central result shows that when the auditor is strong enough, failing to find problems under a stringent test effectively proves the system has no major failures anywhere.

Core claim

If the auditor is sufficiently powerful, the model's null hypothesis (no failure mode with performance below a target threshold) and the auditor's null hypothesis (a sampling strategy exists that will uncover a failure mode) are asymptotically inverses, so passage of a stringent audit certifies the AI system as being globally robust.
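
To make the duel concrete, here is one way to write the two nulls; the symbols are ours for illustration (μ(g) for the system's performance on candidate failure mode g, 𝒢 for the candidate set, τ for the threshold, Π for the auditor's strategy class), not necessarily the paper's:

    % Illustrative notation; the paper's own symbols may differ.
    H_0^M:\ \min_{g \in \mathcal{G}} \mu(g) \ \ge\ \tau
    \qquad \text{(model: no failure mode falls below the threshold)}

    H_0^A:\ \exists\, \pi \in \Pi \ \text{s.t. sampling under } \pi
    \ \text{uncovers some } g \in \mathcal{G} \ \text{with } \mu(g) < \tau
    \qquad \text{(auditor: a detecting strategy exists)}

On this reading, the inversion result says that for a sufficiently rich Π, rejecting H_0^A (the system survives a stringent audit) asymptotically implies H_0^M, which is what upgrades a passed audit from a local check into a global certificate.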

What carries the argument

Simultaneous e-processes that formalize 'testing by betting' for the two dueling null hypotheses under safe anytime-valid inference.
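
As a sketch of those mechanics (not the paper's actual procedure), the snippet below runs a testing-by-betting e-process against a single candidate failure mode, modeled as Bernoulli failure indicators with null H0: P(failure) ≤ τ. The function name, the plug-in betting rule, and the Bernoulli model are all assumptions for illustration:

    import numpy as np

    def betting_eprocess(failures, tau, alpha=0.05):
        """Anytime-valid test of H0: P(failure) <= tau by betting.

        The wealth W_t is a nonnegative supermartingale under H0, so
        Ville's inequality gives P(sup_t W_t >= 1/alpha) <= alpha,
        valid no matter when the auditor decides to stop.
        """
        wealth, trajectory = 1.0, []
        n_fail, n = 0, 0
        for t, x in enumerate(failures, start=1):
            # Predictable bet: plug-in estimate of the failure rate from
            # PAST observations only, truncated so wealth stays positive.
            p_hat = (n_fail + 0.5) / (n + 1.0)
            lam = float(np.clip((p_hat - tau) / (tau * (1 - tau)), 0.0, 0.5 / tau))
            wealth *= 1.0 + lam * (x - tau)   # supermartingale step under H0
            n_fail += x
            n += 1
            trajectory.append(wealth)
            if wealth >= 1.0 / alpha:         # optional stopping is safe here
                return t, trajectory          # reject H0: failure mode found
        return None, trajectory               # audit passes for this slice

    # Hypothetical usage: a slice whose true failure rate (0.3) breaches tau = 0.1.
    rng = np.random.default_rng(1)
    stop, _ = betting_eprocess(rng.random(200) < 0.3, tau=0.1)
    print("rejected at observation:", stop)

The paper's simultaneous construction runs e-processes for both dueling nulls at once; this sketch shows only the auditor's side against one slice.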

If this is right

  • The procedures keep type-I error controlled at every point during an audit, and statistically rigorous conclusions can sometimes be reached with as few as 20 observations.
  • Adaptive testing can reach statistically valid conclusions faster than pre-specified sampling plans.
  • Passing the audit under the dueling framework provides a direct certificate of global robustness rather than a local one.
  • The approach applies directly to the adaptive sampling and stopping rules already used in practical AI evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Auditors could design their sampling strategies explicitly to satisfy the power condition, turning routine checks into formal certifications.
  • The method could let evaluation teams stop data collection early once evidence against failure modes accumulates, lowering annotation costs.
  • Similar dueling-hypothesis setups might transfer to other adaptive testing domains such as software quality assurance or clinical trial monitoring.
  • Empirical checks on real generative models would show how many samples are typically needed before the certification threshold is crossed.

Load-bearing premise

The auditor must be powerful enough to uncover failure modes whenever they actually exist in the system.

What would settle it

An AI system passes the adaptive audit yet later shows a concrete failure mode below the target threshold when examined with additional, independent tests.

Figures

Figures reproduced from arXiv: 2605.07002 by Jean Feng, Patrick Vossler, Siyu Zhou, Venkatesh Sivaraman, Yifan Mai.

Figure 1. We (a) formalize flexible audits of failure modes in AI systems as dueling hypothesis …
Figure 2. Detecting failure modes with semi-synthetic data (Experiment 1). The rows correspond to …
Figure 3. Auditing an LLM pipeline for extraction of Social Determinants of Health (SDoH) from …
original abstract

A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance below a target threshold versus (ii) the auditor's null that asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting 'testing by betting', which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as being globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces a hypothesis testing framework for adaptive auditing of AI systems, addressing the challenges of limited annotations (often 10-50 cases) and data-dependent sampling/stopping rules that violate classical assumptions. It defines two dueling null hypotheses: (i) the model's null asserting no failure mode with performance below a target threshold, and (ii) the auditor's null asserting the existence of a sampling strategy to uncover such a mode. Leveraging Safe Anytime-Valid Inference (SAVI), the auditor is formalized via testing-by-betting, yielding simultaneous e-processes for the dueling hypotheses. The central theoretical result is that, if the auditor is sufficiently powerful, these hypotheses are asymptotically inverses, so that passing a stringent audit certifies global robustness of the AI system. Empirical results demonstrate anytime-valid type-I error control, outperformance over pre-specified tests, and valid conclusions with as few as 20 observations.

Significance. If the results hold, this provides a statistically rigorous approach to adaptive auditing of generative AI with anytime-valid guarantees, directly tackling the practical bottleneck of annotation costs. The use of SAVI e-processes for adaptive sampling and the proof of asymptotic inversion between dueling hypotheses (conditioned on auditor power) are notable strengths, offering a principled way to interpret audit outcomes as robustness certifications. This could influence AI safety evaluation standards by enabling valid inferences from opportunistic data collection rather than fixed protocols. Credit is due for grounding the framework in established SAVI literature while extending it to dueling perspectives and demonstrating small-sample efficiency.

major comments (2)
  1. [§4] §4 (Asymptotic Inversion Theorem): The central claim that the hypotheses are 'asymptotically inverses' under a 'sufficiently powerful' auditor is load-bearing for the robustness certification interpretation, yet the manuscript provides only a qualitative definition of auditor power without an explicit characterization (e.g., a lower bound on detection probability for existing failure modes under the adaptive strategy). This leaves the scope of the inversion result unclear and risks the claim reducing to a tautology if power is defined circularly via the audit outcome.
  2. [§3.1] §3.1 (SAVI Application): The proof that SAVI e-processes apply directly to the adaptive sampling and stopping rules under the dueling nulls assumes the framework's conditions hold without violation; however, the manuscript does not explicitly verify or cite the martingale properties or optional stopping conditions for the specific composite hypotheses and betting strategies used here. Any mismatch could undermine the anytime-valid type-I error control asserted in the abstract.
minor comments (3)
  1. [Abstract] Abstract and §5 (Experiments): The claim that the procedures 'outperform pre-specified testing methods' should specify the exact baselines (e.g., fixed-sample-size tests or Bonferroni-corrected procedures) and metrics (e.g., average sample size to rejection or power curves) for reproducibility.
  2. [§2] Notation in §2: The distinction between the model's null H_0^M and auditor's null H_0^A could be clarified with an explicit table or side-by-side comparison of their formal statements to aid readers unfamiliar with testing-by-betting.
  3. Figure captions (e.g., Figure 2): Captions should explicitly note the adaptive stopping rule and how the e-process trajectories relate to the dueling hypotheses to improve clarity for the empirical results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and positive recommendation for minor revision. We address each major comment below and indicate the revisions we will incorporate.

point-by-point responses
  1. Referee: [§4] §4 (Asymptotic Inversion Theorem): The central claim that the hypotheses are 'asymptotically inverses' under a 'sufficiently powerful' auditor is load-bearing for the robustness certification interpretation, yet the manuscript provides only a qualitative definition of auditor power without an explicit characterization (e.g., a lower bound on detection probability for existing failure modes under the adaptive strategy). This leaves the scope of the inversion result unclear and risks the claim reducing to a tautology if power is defined circularly via the audit outcome.

    Authors: We agree that the current qualitative definition of auditor power leaves the scope of the Asymptotic Inversion Theorem somewhat open to interpretation. In the revision, we will augment §4 with an explicit, non-circular characterization: the auditor is sufficiently powerful if its betting strategy ensures that, whenever a failure mode exists below the target threshold, the associated e-process grows to exceed 1/α with probability at least 1-δ (for user-specified α, δ) under the adaptive sampling rule. This condition is stated in terms of the e-process growth rate under the alternative and is independent of any particular audit outcome, thereby clarifying the conditions under which passage of the audit certifies global robustness. revision: yes
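
Written out, the power condition the rebuttal proposes would read (notation ours, with E_t the auditor's e-process):

    \text{for every } Q \text{ under which some failure mode has } \mu_Q(g) < \tau:
    \qquad \Pr_Q\!\left(\exists\, t:\ E_t \ge 1/\alpha\right) \ \ge\ 1 - \delta

This is a statement about e-process growth under the alternative and makes no reference to any realized audit outcome, which is what would break the circularity the referee flags.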

  2. Referee: [§3.1] §3.1 (SAVI Application): The proof that SAVI e-processes apply directly to the adaptive sampling and stopping rules under the dueling nulls assumes the framework's conditions hold without violation; however, the manuscript does not explicitly verify or cite the martingale properties or optional stopping conditions for the specific composite hypotheses and betting strategies used here. Any mismatch could undermine the anytime-valid type-I error control asserted in the abstract.

    Authors: We appreciate the referee's call for explicit verification. While the application follows from the general SAVI theory for e-processes under adapted filtrations, the manuscript does not contain a dedicated check for our composite dueling nulls. In the revised version we will insert a short paragraph (and supporting appendix material) in §3.1 that (i) confirms the chosen betting strategies produce non-negative supermartingales under each null and (ii) verifies that the optional stopping theorem applies because the data-dependent stopping time is adapted to the filtration generated by the sequential observations. We will cite the precise SAVI results that guarantee the anytime-valid type-I error control under these conditions. revision: yes
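
For reference, the two properties the promised appendix would need to verify are, in standard SAVI form (notation ours):

    E_0 = 1, \qquad \mathbb{E}_P\!\left[E_t \mid \mathcal{F}_{t-1}\right] \ \le\ E_{t-1}
    \quad \text{for every } P \in H_0 \quad \text{(nonnegative supermartingale)}

    \Pr_P\!\left(\sup_{t \ge 1} E_t \ge 1/\alpha\right) \ \le\ \alpha
    \quad \text{(Ville's inequality)}

Together with optional stopping, these give \mathbb{E}_P[E_T] \le 1 for any stopping time T adapted to the observation filtration, so rejecting when E_T \ge 1/\alpha keeps type-I error at level \alpha regardless of the data-dependent sampling and stopping.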

Circularity Check

0 steps flagged

Minor reliance on prior SAVI literature; core inversion proof and dueling hypotheses derived independently

full rationale

The paper introduces dueling null hypotheses (model's null on absence of failure modes below threshold vs. auditor's null on uncovering failure modes via adaptive sampling) and proves their asymptotic inversion under a 'sufficiently powerful auditor' condition. This proof is presented as an original result in the manuscript and does not reduce by construction to fitted parameters, self-definitions, or prior self-citations. SAVI e-processes and testing-by-betting are leveraged from established anytime-valid inference literature for simultaneous control, which is standard external support rather than a load-bearing self-citation chain. Empirical type-I error control and comparisons to pre-specified methods provide separate validation. No patterns of self-definitional claims, fitted inputs renamed as predictions, or ansatz smuggling appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claims depend on the applicability of SAVI to adaptive audit sampling and the 'sufficiently powerful auditor' condition; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Safe Anytime-Valid Inference (SAVI) properties hold for the adaptive sampling and stopping rules in AI audits.
    The e-processes and type-I error control rely on this prior framework applying without violation.
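
One cheap way to stress this axiom is a Monte Carlo spot-check: under a toy null (Bernoulli failures at exactly the threshold), the fraction of runs whose e-process ever crosses 1/α should stay below α, even with aggressive early stopping. A minimal sketch with all parameters assumed, not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    tau, alpha, n_runs, horizon = 0.10, 0.05, 2000, 500
    crossed = 0
    for _ in range(n_runs):
        wealth, n_fail, n = 1.0, 0, 0
        for _ in range(horizon):
            x = rng.random() < tau              # H0 holds at the boundary p = tau
            p_hat = (n_fail + 0.5) / (n + 1.0)  # predictable plug-in bet
            lam = min(max((p_hat - tau) / (tau * (1 - tau)), 0.0), 0.5 / tau)
            wealth *= 1.0 + lam * (x - tau)
            n_fail += x
            n += 1
            if wealth >= 1.0 / alpha:           # anytime rejection (stop early)
                crossed += 1
                break
    print(f"empirical anytime type-I rate: {crossed / n_runs:.3f} (bound: {alpha})")

If the empirical crossing rate exceeded α, the supermartingale property would be violated somewhere in the betting rule.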

pith-pipeline@v0.9.0 · 5583 in / 1196 out tokens · 69916 ms · 2026-05-11T00:49:52.787028+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

81 extracted references

  1. [1]

    Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles

    Lakshminarayanan, Balaji and Pritzel, Alexander and Blundell, Charles. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Advances in Neural Information Processing Systems 30

  2. [2]

    A Multiple Testing Procedure for Clinical Trials

    O'Brien, Peter C and Fleming, Thomas R. A Multiple Testing Procedure for Clinical Trials. Biometrics

  3. [3]

    Context-aware testing: A new paradigm for model testing with large language models

    Rauba, Paulius and Ruiz Luyten, Max and Seedat, Nabeel and van der Schaar, Mihaela. Context-aware testing: A new paradigm for model testing with large language models. Advances in Neural Information Processing Systems 37

  4. [4]

    Deep Residual Learning for Image Recognition

    He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian. Deep Residual Learning for Image Recognition. arXiv [cs.CV]

  5. [5]

    Deep Residual Learning for Image Recognition

    He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian. Deep Residual Learning for Image Recognition. CVPR

  6. [6]

    Discrete Sequential Boundaries for Clinical Trials

    Lan, K K Gordon and DeMets, David L. Discrete Sequential Boundaries for Clinical Trials. Biometrika

  7. [7]

    Group sequential methods in the design and analysis of clinical trials

    Pocock, Stuart J. Group sequential methods in the design and analysis of clinical trials. Biometrika

  8. [8]

    Deep Residual Learning for Image Recognition

    He, K and Zhang, X and Ren, S and Sun, J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  9. [9]

    Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness

    Kearns, Michael and Neel, Seth and Roth, Aaron and Wu, Zhiwei Steven. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. International Conference on Machine Learning

  10. [10]

    Prediction, Learning, and Games

    Cesa-Bianchi, Nicolo and Lugosi, Gabor. Prediction, Learning, and Games

  11. [11]

    Algorithmic Fairness: Choices, Assumptions, and Definitions

    Mitchell, Shira and Potash, Eric and Barocas, Solon and D'Amour, Alexander and Lum, Kristian. Algorithmic Fairness: Choices, Assumptions, and Definitions. Annu. Rev. Stat. Appl

  12. [12]

    Doubly robust confidence sequences for sequential causal inference

    Waudby-Smith, Ian and Arbour, David and Sinha, Ritwik and Kennedy, Edward H and Ramdas, Aaditya. Doubly robust confidence sequences for sequential causal inference. arXiv [math.ST]

  13. [13]

    Anytime-valid and asymptotically efficient inference driven by predictive recursion

    Dixit, Vaidehi and Martin, Ryan. Anytime-valid and asymptotically efficient inference driven by predictive recursion. Biometrika

  14. [14]

    Evaluating Model Robustness and Stability to Dataset Shift

    Subbaswamy, Adarsh and Adams, Roy and Saria, Suchi. Evaluating Model Robustness and Stability to Dataset Shift. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics

  15. [15]

    A Comparison of Some Control Chart Procedures

    Roberts, S W. A Comparison of Some Control Chart Procedures. Technometrics

  16. [16]

    On Optimum Methods in Quickest Detection Problems

    Shiryaev, A N. On Optimum Methods in Quickest Detection Problems. Theory Probab. Appl

  17. [17]

    A snapshot of the frontiers of fairness in machine learning

    Chouldechova, Alexandra and Roth, Aaron. A snapshot of the frontiers of fairness in machine learning. Commun. ACM

  18. [18]

    FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning

    Cabrera, Ángel Alexander and Epperson, Will and Hohman, Fred and Kahng, Minsuk and Morgenstern, Jamie and Chau, Duen Horng. FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning. 2019 IEEE Conference on Visual Analytics Science and Technology (VAST)

  19. [19]

    Multicalibration: Calibration for the (Computationally-Identifiable) Masses

    Hebert-Johnson, Ursula and Kim, Michael and Reingold, Omer and Rothblum, Guy. Multicalibration: Calibration for the (Computationally-Identifiable) Masses. International Conference on Machine Learning

  20. [20]

    Slice Finder: Automated Data Slicing for Model Validation

    Chung, Yeounoh and Kraska, Tim and Polyzotis, Neoklis and Tae, Ki Hyun and Whang, Steven Euijong. Slice Finder: Automated Data Slicing for Model Validation. 2019 IEEE 35th International Conference on Data Engineering (ICDE)

  21. [21]

    Active Testing: Sample-Efficient Model Evaluation

    Kossen, Jannik and Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom. Active Testing: Sample-Efficient Model Evaluation. Proceedings of the 38th International Conference on Machine Learning

  22. [22]

    Domino: Discovering Systematic Errors with Cross-Modal Embeddings

    Eyuboglu, Sabri and Varma, Maya and Saab, Khaled and Delbrouck, Jean-Benoit and Lee-Messer, Christopher and Dunnmon, Jared and Zou, James and Ré, Christopher. Domino: Discovering Systematic Errors with Cross-Modal Embeddings. International Conference on Learning Representations

  23. [23]

    Active multiple testing with proxy p-values and e-values

    Xu, Ziyu and Wang, Catherine and Wasserman, Larry and Roeder, Kathryn and Ramdas, Aaditya. Active multiple testing with proxy p-values and e-values. arXiv [stat.ME]

  24. [24]

    Online multiple testing with e-values

    Xu, Ziyu and Ramdas, Aaditya. Online multiple testing with e-values. arXiv [stat.ME]

  25. [25]

    Adaptive learn-then-test: Statistically valid and efficient hyperparameter selection

    Zecchin, Matteo and Park, Sangwoo and Simeone, Osvaldo. Adaptive learn-then-test: Statistically valid and efficient hyperparameter selection. arXiv [stat.ML]

  26. [26]

    Active hypothesis testing under computational budgets with applications to GWAS and LLM

    Kuang, Qi and Gang, Bowen and Xia, Yin. Active hypothesis testing under computational budgets with applications to GWAS and LLM. arXiv [stat.ME]

  27. [27]

    Scaling Up Active Testing to Large Language Models

    Berrada, Gabrielle and Kossen, Jannik and Smith, Freddie Bickford and Razzak, Muhammed and Gal, Yarin and Rainforth, Tom. Scaling Up Active Testing to Large Language Models. The Thirty-ninth Annual Conference on Neural Information Processing Systems

  28. [28]

    Is this model reliable for everyone? Testing for strong calibration

    Feng, Jean and Gossmann, Alexej and Pirracchio, Romain and Petrick, Nicholas and Pennello, Gene and Sahiner, Berkman. Is this model reliable for everyone? Testing for strong calibration. AISTATS

  29. [29]

    The statistical scope of multicalibration

    Noarov, Georgy and Roth, Aaron. The statistical scope of multicalibration. International Conference on Machine Learning

  30. [30]

    Universal inference

    Wasserman, Larry and Ramdas, Aaditya and Balakrishnan, Sivaraman. Universal inference. Proc. Natl. Acad. Sci. U. S. A

  31. [31]

    Holistic Evaluation of Language Models

    Liang, Percy and Bommasani, Rishi and Lee, Tony and Tsipras, Dimitris and Soylu, Dilara and Yasunaga, Michihiro and Zhang, Yian and Narayanan, Deepak and Wu, Yuhuai and Kumar, Ananya and Newman, Benjamin and Yuan, Binhang and Yan, Bobby and Zhang, Ce and Cosgrove, Christian Alexander and Manning, Christopher D and Re, Christopher and Acosta-Navas, Diana a...

  32. [32]

    A Brief Tutorial on Sample Size Calculations for Fairness Audits

    Singh, Harvineet and Xia, Fan and Kim, Mi-Ok and Pirracchio, Romain and Chunara, Rumi and Feng, Jean. A Brief Tutorial on Sample Size Calculations for Fairness Audits. Workshop on Regulatable Machine Learning at the 37th Conference on Neural Information Processing Systems

  33. [33]

    Red-teaming for generative AI : Silver bullet or security theater?

    Feffer, Michael and Sinha, Anusha and Lipton, Zachary C and Heidari, Hoda. Red-teaming for generative AI : Silver bullet or security theater?. arXiv [cs.CY]

  34. [34]

    Query-Efficient Black-Box Red Teaming via Bayesian Optimization

    Lee, Deokjae and Lee, Junyeong and Ha, Jung-Woo and Kim, Jin-Hwa and Lee, Sang-Woo and Lee, Hwaran and Song, Hyun Oh. Query-Efficient Black-Box Red Teaming via Bayesian Optimization. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  35. [35]

    Hypothesis testing with e-values

    Ramdas, Aaditya and Wang, Ruodu. Hypothesis testing with e-values. arXiv [math.ST]

  36. [36]

    E-detectors: a nonparametric framework for sequential change detection

    Shin, Jaehyeok and Ramdas, Aaditya and Rinaldo, Alessandro. E-detectors: a nonparametric framework for sequential change detection. New England Journal of Statistics in Data Science

  37. [37]

    Testing by betting: A strategy for statistical and scientific communication

    Shafer, Glenn. Testing by betting: A strategy for statistical and scientific communication. J. R. Stat. Soc. Ser. A Stat. Soc

  38. [38]

    Safe testing

    Grünwald, Peter and de Heide, Rianne and Koolen, Wouter. Safe testing. J. R. Stat. Soc. Series B Stat. Methodol

  39. [39]

    A data-driven framework for identifying patient subgroups on which an AI /machine learning model may underperform

    Subbaswamy, Adarsh and Sahiner, Berkman and Petrick, Nicholas and Pai, Vinay and Adams, Roy and Diamond, Matthew C and Saria, Suchi. A data-driven framework for identifying patient subgroups on which an AI /machine learning model may underperform. NPJ Digit. Med

  40. [40]

    EvalTree: Profiling Language Model weaknesses via hierarchical capability trees

    Zeng, Zhiyuan and Wang, Yizhong and Hajishirzi, Hannaneh and Koh, Pang Wei. EvalTree: Profiling Language Model weaknesses via hierarchical capability trees. Conference on Language Modeling

  41. [41]

    Adaptive Testing and Debugging of NLP Models

    Ribeiro, Marco Tulio and Lundberg, Scott. Adaptive Testing and Debugging of NLP Models. ACL 2022

  42. [42]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv [cs.AI]

  43. [43]

    LLMs Judging LLMs: A Simplex Perspective

    Vossler, Patrick and Xia, Fan and Mai, Yifan and Subbaswamy, Adarsh and Feng, Jean. LLMs Judging LLMs: A Simplex Perspective. International Conference on Artificial Intelligence and Statistics

  44. [44]

    When the domain expert has no time and the LLM developer has no clinical expertise: Real-world lessons from LLM co-design in a safety-net hospital

    Kothari, Avni and Vossler, Patrick and Digitale, Jean and Forouzannia, Mohammad and Rosenberg, Elise and Lee, Michele and Bryant, Jennee and Molina, Melanie and Marks, James and Zier, Lucas and Feng, Jean. When the domain expert has no time and the LLM developer has no clinical expertise: Real-world lessons from LLM co-design in a safety-net hospital. Pro...

  45. [45]

    MCGrad: Multicalibration at Web Scale

    Tax, Niek and Perini, Lorenzo and Linder, Fridolin and Haimovich, Daniel and Karamshuk, Dima and Okati, Nastaran and Vojnovic, Milan and Apostolopoulos, Pavlos Athanasios. MCGrad: Multicalibration at Web Scale. arXiv [cs.LG]

  46. [46]

    LLM Evals: Everything You Need to Know

    Husain, Hamel and Shankar, Shreya. LLM Evals: Everything You Need to Know. Hamel's Blog

  47. [47]

    Quantifying Local Model Validity using Active Learning

    Lämmle, Sven and Bogoclu, Can and Vosshall, Robert and Haselhoff, Anselm and Roos, Dirk. Quantifying Local Model Validity using Active Learning. Uncertainty in Artificial Intelligence

  48. [48]

    Sequential tests of statistical hypotheses

    Wald, A. Sequential tests of statistical hypotheses. Ann. Math. Stat

  49. [49]

    HiBayES: A Hierarchical Bayesian modeling framework for AI Evaluation Statistics

    Luettgau, Lennart and Coppock, Harry and Dubois, Magda and Summerfield, Christopher and Ududec, Cozmin. HiBayES: A Hierarchical Bayesian modeling framework for AI Evaluation Statistics. arXiv [cs.AI]

  50. [50]

    Active sequential hypothesis testing

    Naghshvar, Mohammad and Javidi, Tara. Active sequential hypothesis testing. Ann. Stat

  51. [51]

    Active sequential two-sample testing

    Li, Weizhi and Kadambi, Prad and Saidi, Pouria and Ramamurthy, Karthikeyan Natesan and Dasarathy, Gautam and Berisha, Visar. Active sequential two-sample testing. Transact. Mach. Learn. Res

  52. [52]

    Automated Hypothesis Validation with Agentic Sequential Falsifications

    Huang, Kexin and Jin, Ying and Li, Ryan and Li, Michael Y and Candes, Emmanuel and Leskovec, Jure. Automated Hypothesis Validation with Agentic Sequential Falsifications. Forty-second International Conference on Machine Learning

  53. [53]

    Multi-armed sequential hypothesis testing by betting

    Sandoval, Ricardo J and Waudby-Smith, Ian and Jordan, Michael I. Multi-armed sequential hypothesis testing by betting. arXiv [stat.ME]

  54. [54]

    Active fairness auditing

    Yan, Tom and Zhang, Chicheng. Active fairness auditing. International Conference on Machine Learning

  55. [55]

    Audit me if you can: Query-efficient active fairness auditing of black-box LLMs

    Hartmann, David and Pohlmann, Lena and Hanslik, Lelia and Gießing, Noah and Berendt, Bettina and Delobelle, Pieter. Audit me if you can: Query-efficient active fairness auditing of black-box LLMs. arXiv [cs.LG]

  56. [56]

    Anchor points: Benchmarking models with much fewer examples

    Vivek, Rajan and Ethayarajh, Kawin and Yang, Diyi and Kiela, Douwe. Anchor points: Benchmarking models with much fewer examples. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

  57. [57]

    On statistical bias in active learning: How and when to fix it

    Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom. On statistical bias in active learning: How and when to fix it. international conference on learning representations

  58. [58]

    Admissible online closed testing must employ e-values

    Fischer, Lasse and Ramdas, Aaditya. Admissible online closed testing must employ e-values. arXiv [stat.ME]

  59. [59]

    Family-wise error rate control with E-values

    Hartog, Will and Lei, Lihua. Family-wise error rate control with E-values. arXiv [stat.ME]

  60. [60]

    E-values for adaptive clinical trials: Anytime-valid monitoring in practice

    Sokolova, Alexandra and Sokolov, Vadim. E-values for adaptive clinical trials: Anytime-valid monitoring in practice. arXiv [stat.ME]

  61. [61]

    The Caltech-UCSD Birds-200-2011 Dataset

    Wah, Catherine and Branson, Steve and Welinder, Peter and Perona, Pietro and Belongie, Serge. The Caltech-UCSD Birds-200-2011 Dataset. CaltechAUTHORS

  62. [62]

    ASPEST: Bridging the gap between active learning and selective prediction

    Chen, Jiefeng and Yoon, Jinsung and Ebrahimi, Sayna and Arik, Sercan and Jha, Somesh and Pfister, Tomas. ASPEST: Bridging the gap between active learning and selective prediction. Transact. Mach. Learn. Res

  63. [63]

    AcTracer: Active testing of large language model via multi-stage sampling

    Huang, Yuheng and Song, Jiayang and Hu, Qiang and Juefei-Xu, Felix and Ma, Lei. AcTracer: Active testing of large language model via multi-stage sampling. arXiv [cs.SE]

  64. [64]

    Adaptive testing of computer vision models

    Gao, Irena and Ilharco, Gabriel and Lundberg, Scott and Ribeiro, Marco Tulio. Adaptive testing of computer vision models. IEEE/CVF International Conference on Computer Vision

  65. [65]

    AutoBencher: Towards Declarative Benchmark Construction

    Li, Xiang Lisa and Kaiyom, Farzaan and Liu, Evan Zheran and Mai, Yifan and Liang, Percy and Hashimoto, Tatsunori. AutoBencher: Towards Declarative Benchmark Construction. The Thirteenth International Conference on Learning Representations

  66. [66]

    Beyond accuracy: Behavioral testing of NLP models with CheckList

    Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer. Beyond accuracy: Behavioral testing of NLP models with CheckList. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

  67. [67]

    Red teaming language models with language models

    Perez, Ethan and Huang, Saffron and Song, Francis and Cai, Trevor and Ring, Roman and Aslanides, John and Glaese, Amelia and McAleese, Nat and Irving, Geoffrey. Red teaming language models with language models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

  68. [68]

    Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned

    Ganguli, Deep and Lovitt, Liane and Kernion, Jackson and Askell, Amanda and Bai, Yuntao and Kadavath, Saurav and Mann, Ben and Perez, Ethan and Schiefer, Nicholas and Ndousse, Kamal and Jones, Andy and Bowman, Sam and Chen, Anna and Conerly, Tom and DasSarma, Nova and Drain, Dawn and Elhage, Nelson and El-Showk, Sheer and Fort, Stanislav and Hatfield-Dodd...

  69. [69]

    Recommended Content and Format of Non-Clinical Bench Performance Testing Information in Premarket Submissions

    U.S. Food and Drug Administration. Recommended Content and Format of Non-Clinical Bench Performance Testing Information in Premarket Submissions. U.S. Food and Drug Administration

  70. [70]

    E-valuator: Reliable agent verifiers with sequential hypothesis testing

    Sadhuka, Shuvom and Prinster, Drew and Fannjiang, Clara and Scalia, Gabriele and Regev, Aviv and Wang, Hanchen. E-valuator: Reliable agent verifiers with sequential hypothesis testing. arXiv [cs.LG]

  71. [71]

    Testing Fisher, Neyman, Pearson, and Bayes

    Christensen, Ronald. Testing Fisher, Neyman, Pearson, and Bayes. Am. Stat

  72. [72]

    Product Evals in Three Simple Steps

    Yan, Eugene. Product Evals in Three Simple Steps. eugeneyan.com

  73. [73]

    Demystifying evals for AI agents

    Anthropic. Demystifying evals for AI agents. Engineering at Anthropic: Inside the team building reliable AI systems

  74. [74]

    Neural network learning: theoretical foundations

    Anthony, Martin and Bartlett, Peter L. Neural network learning: theoretical foundations

  75. [75]

    Consistency of random forests

    Scornet, Erwan and Biau, Gérard and Vert, Jean-Philippe. Consistency of random forests. Ann. Stat

  76. [76]

    Multivariate smoothing spline functions

    Cox, Dennis D. Multivariate smoothing spline functions. SIAM J. Numer. Anal

  77. [77]

    On the asymptotics of random forests

    Scornet, Erwan. On the asymptotics of random forests. J. Multivar. Anal

  78. [78]

    Finite-time analysis of the multiarmed bandit problem

    Auer, Peter and Cesa-Bianchi, Nicolò and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Mach. Learn

  79. [79]

    Reinforcement learning: An introduction, 2nd ed

    Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction, 2nd ed

  80. [80]

    General Agent Evaluation

    Bandel, Elron and Yehudai, Asaf and Eden, Lilach and Sagron, Yehoshua and Perlitz, Yotam and Venezian, Elad and Razinkov, Natalia and Ergas, Natan and Ifergan, Shlomit Shachor and Shlomov, Segev and Jacovi, Michal and Choshen, Leshem and Ein-Dor, Liat and Katz, Yoav and Shmueli-Scheuer, Michal. General Agent Evaluation. arXiv [cs.AI]

Showing first 80 references.