pith. machine review for the scientific record.

arxiv: 2605.07002 · v1 · submitted 2026-05-07 · 💻 cs.AI · math.ST · stat.ML · stat.TH

Recognition: no theorem link

Adaptive auditing of AI systems with anytime-valid guarantees

Jean Feng, Patrick Vossler, Siyu Zhou, Venkatesh Sivaraman, Yifan Mai

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3

classification 💻 cs.AI · math.ST · stat.ML · stat.TH
keywords adaptive auditing · anytime-valid inference · AI failure modes · hypothesis testing · robustness certification · generative AI evaluation · e-processes

The pith

Passing a stringent adaptive audit certifies an AI system as globally robust, provided the auditor is powerful enough to find any failures that exist.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up a hypothesis testing framework for auditing AI systems when the number and choice of test cases can change based on earlier results. It pits the model's claim that no failure mode exists below a performance threshold against the auditor's claim that they have a sampling strategy able to reveal such a failure. By translating the audit into simultaneous e-processes under safe anytime-valid inference, the method keeps error rates controlled at every step, even with small samples and flexible stopping rules. A central result shows that when the auditor is strong enough, failing to find problems under a stringent test effectively proves the system has no major failures anywhere.

Core claim

If the auditor is sufficiently powerful, the model's null hypothesis (no failure mode with performance below a target threshold) and the auditor's null hypothesis (a sampling strategy exists that will uncover a failure mode) are asymptotically inverses, so passage of a stringent audit certifies the AI system as being globally robust.
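
To make the duel concrete, here is one way to write the two nulls; the symbols are ours for illustration (μ(g) for the system's performance on candidate failure mode g, 𝒢 for the candidate set, τ for the threshold, Π for the auditor's strategy class), not necessarily the paper's:

    % Illustrative notation; the paper's own symbols may differ.
    H_0^M:\ \min_{g \in \mathcal{G}} \mu(g) \ \ge\ \tau
    \qquad \text{(model: no failure mode falls below the threshold)}

    H_0^A:\ \exists\, \pi \in \Pi \ \text{s.t. sampling under } \pi
    \ \text{uncovers some } g \in \mathcal{G} \ \text{with } \mu(g) < \tau
    \qquad \text{(auditor: a detecting strategy exists)}

On this reading, the inversion result says that for a sufficiently rich Π, rejecting H_0^A (the system survives a stringent audit) asymptotically implies H_0^M, which is what upgrades a passed audit from a local check into a global certificate.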

What carries the argument

Simultaneous e-processes that formalize 'testing by betting' for the two dueling null hypotheses under safe anytime-valid inference.
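
As a sketch of those mechanics (not the paper's actual procedure), the snippet below runs a testing-by-betting e-process against a single candidate failure mode, modeled as Bernoulli failure indicators with null H0: P(failure) ≤ τ. The function name, the plug-in betting rule, and the Bernoulli model are all assumptions for illustration:

    import numpy as np

    def betting_eprocess(failures, tau, alpha=0.05):
        """Anytime-valid test of H0: P(failure) <= tau by betting.

        The wealth W_t is a nonnegative supermartingale under H0, so
        Ville's inequality gives P(sup_t W_t >= 1/alpha) <= alpha,
        valid no matter when the auditor decides to stop.
        """
        wealth, trajectory = 1.0, []
        n_fail, n = 0, 0
        for t, x in enumerate(failures, start=1):
            # Predictable bet: plug-in estimate of the failure rate from
            # PAST observations only, truncated so wealth stays positive.
            p_hat = (n_fail + 0.5) / (n + 1.0)
            lam = float(np.clip((p_hat - tau) / (tau * (1 - tau)), 0.0, 0.5 / tau))
            wealth *= 1.0 + lam * (x - tau)   # supermartingale step under H0
            n_fail += x
            n += 1
            trajectory.append(wealth)
            if wealth >= 1.0 / alpha:         # optional stopping is safe here
                return t, trajectory          # reject H0: failure mode found
        return None, trajectory               # audit passes for this slice

    # Hypothetical usage: a slice whose true failure rate (0.3) breaches tau = 0.1.
    rng = np.random.default_rng(1)
    stop, _ = betting_eprocess(rng.random(200) < 0.3, tau=0.1)
    print("rejected at observation:", stop)

The paper's simultaneous construction runs e-processes for both dueling nulls at once; this sketch shows only the auditor's side against one slice.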

If this is right

  • The procedures keep type-I error controlled at every point during an audit, and statistically rigorous conclusions can sometimes be reached with as few as 20 observations.
  • Adaptive testing can reach statistically valid conclusions faster than pre-specified sampling plans.
  • Passing the audit under the dueling framework provides a direct certificate of global robustness rather than a local one.
  • The approach applies directly to the adaptive sampling and stopping rules already used in practical AI evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Auditors could design their sampling strategies explicitly to satisfy the power condition, turning routine checks into formal certifications.
  • The method could let evaluation teams stop data collection early once evidence against failure modes accumulates, lowering annotation costs.
  • Similar dueling-hypothesis setups might transfer to other adaptive testing domains such as software quality assurance or clinical trial monitoring.
  • Empirical checks on real generative models would show how many samples are typically needed before the certification threshold is crossed.

Load-bearing premise

The auditor must be powerful enough to uncover failure modes whenever they actually exist in the system.

What would settle it

An AI system passes the adaptive audit yet later shows a concrete failure mode below the target threshold when examined with additional, independent tests.

Figures

Figures reproduced from arXiv: 2605.07002 by Jean Feng, Patrick Vossler, Siyu Zhou, Venkatesh Sivaraman, Yifan Mai.

Figure 1. We (a) formalize flexible audits of failure modes in AI systems as dueling hypothesis …
Figure 2. Detecting failure modes with semi-synthetic data (Experiment 1). The rows correspond to …
Figure 3. Auditing an LLM pipeline for extraction of Social Determinants of Health (SDoH) from …
original abstract

A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance below a target threshold versus (ii) the auditor's null that asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting 'testing by betting', which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as being globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces a hypothesis testing framework for adaptive auditing of AI systems, addressing the challenges of limited annotations (often 10-50 cases) and data-dependent sampling/stopping rules that violate classical assumptions. It defines two dueling null hypotheses: (i) the model's null asserting no failure mode with performance below a target threshold, and (ii) the auditor's null asserting the existence of a sampling strategy to uncover such a mode. Leveraging Safe Anytime-Valid Inference (SAVI), the auditor is formalized via testing-by-betting, yielding simultaneous e-processes for the dueling hypotheses. The central theoretical result is that, if the auditor is sufficiently powerful, these hypotheses are asymptotically inverses, so that passing a stringent audit certifies global robustness of the AI system. Empirical results demonstrate anytime-valid type-I error control, outperformance over pre-specified tests, and valid conclusions with as few as 20 observations.

Significance. If the results hold, this provides a statistically rigorous approach to adaptive auditing of generative AI with anytime-valid guarantees, directly tackling the practical bottleneck of annotation costs. The use of SAVI e-processes for adaptive sampling and the proof of asymptotic inversion between dueling hypotheses (conditioned on auditor power) are notable strengths, offering a principled way to interpret audit outcomes as robustness certifications. This could influence AI safety evaluation standards by enabling valid inferences from opportunistic data collection rather than fixed protocols. Credit is due for grounding the framework in established SAVI literature while extending it to dueling perspectives and demonstrating small-sample efficiency.

major comments (2)
  1. [§4] §4 (Asymptotic Inversion Theorem): The central claim that the hypotheses are 'asymptotically inverses' under a 'sufficiently powerful' auditor is load-bearing for the robustness certification interpretation, yet the manuscript provides only a qualitative definition of auditor power without an explicit characterization (e.g., a lower bound on detection probability for existing failure modes under the adaptive strategy). This leaves the scope of the inversion result unclear and risks the claim reducing to a tautology if power is defined circularly via the audit outcome.
  2. [§3.1] §3.1 (SAVI Application): The proof that SAVI e-processes apply directly to the adaptive sampling and stopping rules under the dueling nulls assumes the framework's conditions hold without violation; however, the manuscript does not explicitly verify or cite the martingale properties or optional stopping conditions for the specific composite hypotheses and betting strategies used here. Any mismatch could undermine the anytime-valid type-I error control asserted in the abstract.
minor comments (3)
  1. [Abstract] Abstract and §5 (Experiments): The claim that the procedures 'outperform pre-specified testing methods' should specify the exact baselines (e.g., fixed-sample-size tests or Bonferroni-corrected procedures) and metrics (e.g., average sample size to rejection or power curves) for reproducibility.
  2. [§2] Notation in §2: The distinction between the model's null H_0^M and auditor's null H_0^A could be clarified with an explicit table or side-by-side comparison of their formal statements to aid readers unfamiliar with testing-by-betting.
  3. Figure captions (e.g., Figure 2): Captions should explicitly note the adaptive stopping rule and how the e-process trajectories relate to the dueling hypotheses to improve clarity for the empirical results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and positive recommendation for minor revision. We address each major comment below and indicate the revisions we will incorporate.

point-by-point responses
  1. Referee: [§4] §4 (Asymptotic Inversion Theorem): The central claim that the hypotheses are 'asymptotically inverses' under a 'sufficiently powerful' auditor is load-bearing for the robustness certification interpretation, yet the manuscript provides only a qualitative definition of auditor power without an explicit characterization (e.g., a lower bound on detection probability for existing failure modes under the adaptive strategy). This leaves the scope of the inversion result unclear and risks the claim reducing to a tautology if power is defined circularly via the audit outcome.

    Authors: We agree that the current qualitative definition of auditor power leaves the scope of the Asymptotic Inversion Theorem somewhat open to interpretation. In the revision, we will augment §4 with an explicit, non-circular characterization: the auditor is sufficiently powerful if its betting strategy ensures that, whenever a failure mode exists below the target threshold, the associated e-process grows to exceed 1/α with probability at least 1-δ (for user-specified α, δ) under the adaptive sampling rule. This condition is stated in terms of the e-process growth rate under the alternative and is independent of any particular audit outcome, thereby clarifying the conditions under which passage of the audit certifies global robustness. revision: yes
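
Written out, the power condition the rebuttal proposes would read (notation ours, with E_t the auditor's e-process):

    \text{for every } Q \text{ under which some failure mode has } \mu_Q(g) < \tau:
    \qquad \Pr_Q\!\left(\exists\, t:\ E_t \ge 1/\alpha\right) \ \ge\ 1 - \delta

This is a statement about e-process growth under the alternative and makes no reference to any realized audit outcome, which is what would break the circularity the referee flags.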

  2. Referee: [§3.1] §3.1 (SAVI Application): The proof that SAVI e-processes apply directly to the adaptive sampling and stopping rules under the dueling nulls assumes the framework's conditions hold without violation; however, the manuscript does not explicitly verify or cite the martingale properties or optional stopping conditions for the specific composite hypotheses and betting strategies used here. Any mismatch could undermine the anytime-valid type-I error control asserted in the abstract.

    Authors: We appreciate the referee's call for explicit verification. While the application follows from the general SAVI theory for e-processes under adapted filtrations, the manuscript does not contain a dedicated check for our composite dueling nulls. In the revised version we will insert a short paragraph (and supporting appendix material) in §3.1 that (i) confirms the chosen betting strategies produce non-negative supermartingales under each null and (ii) verifies that the optional stopping theorem applies because the data-dependent stopping time is adapted to the filtration generated by the sequential observations. We will cite the precise SAVI results that guarantee the anytime-valid type-I error control under these conditions. revision: yes
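
For reference, the two properties the promised appendix would need to verify are, in standard SAVI form (notation ours):

    E_0 = 1, \qquad \mathbb{E}_P\!\left[E_t \mid \mathcal{F}_{t-1}\right] \ \le\ E_{t-1}
    \quad \text{for every } P \in H_0 \quad \text{(nonnegative supermartingale)}

    \Pr_P\!\left(\sup_{t \ge 1} E_t \ge 1/\alpha\right) \ \le\ \alpha
    \quad \text{(Ville's inequality)}

Together with optional stopping, these give \mathbb{E}_P[E_T] \le 1 for any stopping time T adapted to the observation filtration, so rejecting when E_T \ge 1/\alpha keeps type-I error at level \alpha regardless of the data-dependent sampling and stopping.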

Circularity Check

0 steps flagged

Minor reliance on prior SAVI literature; core inversion proof and dueling hypotheses derived independently

full rationale

The paper introduces dueling null hypotheses (model's null on absence of failure modes below threshold vs. auditor's null on uncovering failure modes via adaptive sampling) and proves their asymptotic inversion under a 'sufficiently powerful auditor' condition. This proof is presented as an original result in the manuscript and does not reduce by construction to fitted parameters, self-definitions, or prior self-citations. SAVI e-processes and testing-by-betting are leveraged from established anytime-valid inference literature for simultaneous control, which is standard external support rather than a load-bearing self-citation chain. Empirical type-I error control and comparisons to pre-specified methods provide separate validation. No patterns of self-definitional claims, fitted inputs renamed as predictions, or ansatz smuggling appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claims depend on the applicability of SAVI to adaptive audit sampling and the 'sufficiently powerful auditor' condition; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Safe Anytime-Valid Inference (SAVI) properties hold for the adaptive sampling and stopping rules in AI audits.
    The e-processes and type-I error control rely on this prior framework applying without violation.
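
One cheap way to stress this axiom is a Monte Carlo spot-check: under a toy null (Bernoulli failures at exactly the threshold), the fraction of runs whose e-process ever crosses 1/α should stay below α, even with aggressive early stopping. A minimal sketch with all parameters assumed, not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    tau, alpha, n_runs, horizon = 0.10, 0.05, 2000, 500
    crossed = 0
    for _ in range(n_runs):
        wealth, n_fail, n = 1.0, 0, 0
        for _ in range(horizon):
            x = rng.random() < tau              # H0 holds at the boundary p = tau
            p_hat = (n_fail + 0.5) / (n + 1.0)  # predictable plug-in bet
            lam = min(max((p_hat - tau) / (tau * (1 - tau)), 0.0), 0.5 / tau)
            wealth *= 1.0 + lam * (x - tau)
            n_fail += x
            n += 1
            if wealth >= 1.0 / alpha:           # anytime rejection (stop early)
                crossed += 1
                break
    print(f"empirical anytime type-I rate: {crossed / n_runs:.3f} (bound: {alpha})")

If the empirical crossing rate exceeded α, the supermartingale property would be violated somewhere in the betting rule.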

pith-pipeline@v0.9.0 · 5583 in / 1196 out tokens · 69916 ms · 2026-05-11T00:49:52.787028+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

81 extracted references

  1. [1]

    Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles

    Lakshminarayanan, Balaji and Pritzel, Alexander and Blundell, Charles. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Advances in Neural Information Processing Systems 30

  2. [2]

    A Multiple Testing Procedure for Clinical Trials

    O'Brien, Peter C and Fleming, Thomas R. A Multiple Testing Procedure for Clinical Trials. Biometrics

  3. [3]

    Context-aware testing: A new paradigm for model testing with large language models

    Rauba, Paulius and Ruiz Luyten, Max and Seedat, Nabeel and van der Schaar, Mihaela. Context-aware testing: A new paradigm for model testing with large language models. Advances in Neural Information Processing Systems 37

  4. [4]

    Deep Residual Learning for Image Recognition

    He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian. Deep Residual Learning for Image Recognition. arXiv [cs.CV]

  5. [5]

    Deep Residual Learning for Image Recognition

    He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian. Deep Residual Learning for Image Recognition. CVPR

  6. [6]

    Discrete Sequential Boundaries for Clinical Trials

    Lan, K K Gordon and DeMets, David L. Discrete Sequential Boundaries for Clinical Trials. Biometrika

  7. [7]

    Group sequential methods in the design and analysis of clinical trials

    Pocock, Stuart J. Group sequential methods in the design and analysis of clinical trials. Biometrika

  8. [8]

    Deep Residual Learning for Image Recognition

    He, K and Zhang, X and Ren, S and Sun, J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  9. [9]

    Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness

    Kearns, Michael and Neel, Seth and Roth, Aaron and Wu, Zhiwei Steven. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. International Conference on Machine Learning

  10. [10]

    Prediction, Learning, and Games

    Cesa-Bianchi, Nicolo and Lugosi, Gabor. Prediction, Learning, and Games

  11. [11]

    Algorithmic Fairness: Choices, Assumptions, and Definitions

    Mitchell, Shira and Potash, Eric and Barocas, Solon and D'Amour, Alexander and Lum, Kristian. Algorithmic Fairness: Choices, Assumptions, and Definitions. Annu. Rev. Stat. Appl

  12. [12]

    Doubly robust confidence sequences for sequential causal inference

    Waudby-Smith, Ian and Arbour, David and Sinha, Ritwik and Kennedy, Edward H and Ramdas, Aaditya. Doubly robust confidence sequences for sequential causal inference. arXiv [math.ST]

  13. [13]

    Anytime-valid and asymptotically efficient inference driven by predictive recursion

    Dixit, Vaidehi and Martin, Ryan. Anytime-valid and asymptotically efficient inference driven by predictive recursion. Biometrika

  14. [14]

    Evaluating Model Robustness and Stability to Dataset Shift

    Subbaswamy, Adarsh and Adams, Roy and Saria, Suchi. Evaluating Model Robustness and Stability to Dataset Shift. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics

  15. [15]

    A Comparison of Some Control Chart Procedures

    Roberts, S W. A Comparison of Some Control Chart Procedures. Technometrics

  16. [16]

    On Optimum Methods in Quickest Detection Problems

    Shiryaev, A N. On Optimum Methods in Quickest Detection Problems. Theory Probab. Appl

  17. [17]

    A snapshot of the frontiers of fairness in machine learning

    Chouldechova, Alexandra and Roth, Aaron. A snapshot of the frontiers of fairness in machine learning. Commun. ACM

  18. [18]

    FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning

    Cabrera, Ángel Alexander and Epperson, Will and Hohman, Fred and Kahng, Minsuk and Morgenstern, Jamie and Chau, Duen Horng. FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning. 2019 IEEE Conference on Visual Analytics Science and Technology (VAST)

  19. [19]

    Multicalibration: Calibration for the (Computationally-Identifiable) Masses

    Hebert-Johnson, Ursula and Kim, Michael and Reingold, Omer and Rothblum, Guy. Multicalibration: Calibration for the (Computationally-Identifiable) Masses. International Conference on Machine Learning

  20. [20]

    Slice Finder: Automated Data Slicing for Model Validation

    Chung, Yeounoh and Kraska, Tim and Polyzotis, Neoklis and Tae, Ki Hyun and Whang, Steven Euijong. Slice Finder: Automated Data Slicing for Model Validation. 2019 IEEE 35th International Conference on Data Engineering (ICDE)

  21. [21]

    Active Testing: Sample-Efficient Model Evaluation

    Kossen, Jannik and Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom. Active Testing: Sample-Efficient Model Evaluation. Proceedings of the 38th International Conference on Machine Learning

  22. [22]

    Domino: Discovering Systematic Errors with Cross-Modal Embeddings

    Eyuboglu, Sabri and Varma, Maya and Saab, Khaled and Delbrouck, Jean-Benoit and Lee-Messer, Christopher and Dunnmon, Jared and Zou, James and Ré, Christopher. Domino: Discovering Systematic Errors with Cross-Modal Embeddings. International Conference on Learning Representations

  23. [23]

    Active multiple testing with proxy p-values and e-values

    Xu, Ziyu and Wang, Catherine and Wasserman, Larry and Roeder, Kathryn and Ramdas, Aaditya. Active multiple testing with proxy p-values and e-values. arXiv [stat.ME]

  24. [24]

    Online multiple testing with e-values

    Xu, Ziyu and Ramdas, Aaditya. Online multiple testing with e-values. arXiv [stat.ME]

  25. [25]

    Adaptive learn-then-test: Statistically valid and efficient hyperparameter selection

    Zecchin, Matteo and Park, Sangwoo and Simeone, Osvaldo. Adaptive learn-then-test: Statistically valid and efficient hyperparameter selection. arXiv [stat.ML]

  26. [26]

    Active hypothesis testing under computational budgets with applications to GWAS and LLM

    Kuang, Qi and Gang, Bowen and Xia, Yin. Active hypothesis testing under computational budgets with applications to GWAS and LLM. arXiv [stat.ME]

  27. [27]

    Scaling Up Active Testing to Large Language Models

    Berrada, Gabrielle and Kossen, Jannik and Smith, Freddie Bickford and Razzak, Muhammed and Gal, Yarin and Rainforth, Tom. Scaling Up Active Testing to Large Language Models. The Thirty-ninth Annual Conference on Neural Information Processing Systems

  28. [28]

    Is this model reliable for everyone? Testing for strong calibration

    Feng, Jean and Gossmann, Alexej and Pirracchio, Romain and Petrick, Nicholas and Pennello, Gene and Sahiner, Berkman. Is this model reliable for everyone? Testing for strong calibration. AISTATS

  29. [29]

    The statistical scope of multicalibration

    Noarov, Georgy and Roth, Aaron. The statistical scope of multicalibration. International Conference on Machine Learning

  30. [30]

    Universal inference

    Wasserman, Larry and Ramdas, Aaditya and Balakrishnan, Sivaraman. Universal inference. Proc. Natl. Acad. Sci. U. S. A

  31. [31]

    Holistic Evaluation of Language Models

    Liang, Percy and Bommasani, Rishi and Lee, Tony and Tsipras, Dimitris and Soylu, Dilara and Yasunaga, Michihiro and Zhang, Yian and Narayanan, Deepak and Wu, Yuhuai and Kumar, Ananya and Newman, Benjamin and Yuan, Binhang and Yan, Bobby and Zhang, Ce and Cosgrove, Christian Alexander and Manning, Christopher D and Re, Christopher and Acosta-Navas, Diana a...

  32. [32]

    A Brief Tutorial on Sample Size Calculations for Fairness Audits

    Singh, Harvineet and Xia, Fan and Kim, Mi-Ok and Pirracchio, Romain and Chunara, Rumi and Feng, Jean. A Brief Tutorial on Sample Size Calculations for Fairness Audits. Workshop on Regulatable Machine Learning at the 37th Conference on Neural Information Processing Systems

  33. [33]

    Red-teaming for generative AI : Silver bullet or security theater?

    Feffer, Michael and Sinha, Anusha and Lipton, Zachary C and Heidari, Hoda. Red-teaming for generative AI : Silver bullet or security theater?. arXiv [cs.CY]

  34. [34]

    Query-Efficient Black-Box Red Teaming via Bayesian Optimization

    Lee, Deokjae and Lee, Junyeong and Ha, Jung-Woo and Kim, Jin-Hwa and Lee, Sang-Woo and Lee, Hwaran and Song, Hyun Oh. Query-Efficient Black-Box Red Teaming via Bayesian Optimization. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  35. [35]

    Hypothesis testing with e-values

    Ramdas, Aaditya and Wang, Ruodu. Hypothesis testing with e-values. arXiv [math.ST]

  36. [36]

    E-detectors: a nonparametric framework for sequential change detection

    Shin, Jaehyeok and Ramdas, Aaditya and Rinaldo, Alessandro. E-detectors: a nonparametric framework for sequential change detection. New England Journal of Statistics in Data Science

  37. [37]

    Testing by betting: A strategy for statistical and scientific communication

    Shafer, Glenn. Testing by betting: A strategy for statistical and scientific communication. J. R. Stat. Soc. Ser. A Stat. Soc

  38. [38]

    Safe testing

    Grünwald, Peter and de Heide, Rianne and Koolen, Wouter. Safe testing. J. R. Stat. Soc. Series B Stat. Methodol

  39. [39]

    A data-driven framework for identifying patient subgroups on which an AI /machine learning model may underperform

    Subbaswamy, Adarsh and Sahiner, Berkman and Petrick, Nicholas and Pai, Vinay and Adams, Roy and Diamond, Matthew C and Saria, Suchi. A data-driven framework for identifying patient subgroups on which an AI /machine learning model may underperform. NPJ Digit. Med

  40. [40]

    EvalTree: Profiling Language Model weaknesses via hierarchical capability trees

    Zeng, Zhiyuan and Wang, Yizhong and Hajishirzi, Hannaneh and Koh, Pang Wei. EvalTree: Profiling Language Model weaknesses via hierarchical capability trees. Conference on Language Modeling

  41. [41]

    Adaptive Testing and Debugging of NLP Models

    Ribeiro, Marco Tulio and Lundberg, Scott. Adaptive Testing and Debugging of NLP Models. ACL 2022

  42. [42]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv [cs.AI]

  43. [43]

    LLMs Judging LLMs: A Simplex Perspective

    Vossler, Patrick and Xia, Fan and Mai, Yifan and Subbaswamy, Adarsh and Feng, Jean. LLMs Judging LLMs: A Simplex Perspective. International Conference on Artificial Intelligence and Statistics

  44. [44]

    When the domain expert has no time and the LLM developer has no clinical expertise: Real-world lessons from LLM co-design in a safety-net hospital

    Kothari, Avni and Vossler, Patrick and Digitale, Jean and Forouzannia, Mohammad and Rosenberg, Elise and Lee, Michele and Bryant, Jennee and Molina, Melanie and Marks, James and Zier, Lucas and Feng, Jean. When the domain expert has no time and the LLM developer has no clinical expertise: Real-world lessons from LLM co-design in a safety-net hospital. Pro...

  45. [45]

    MCGrad: Multicalibration at Web Scale

    Tax, Niek and Perini, Lorenzo and Linder, Fridolin and Haimovich, Daniel and Karamshuk, Dima and Okati, Nastaran and Vojnovic, Milan and Apostolopoulos, Pavlos Athanasios. MCGrad: Multicalibration at Web Scale. arXiv [cs.LG]

  46. [46]

    LLM Evals: Everything You Need to Know

    Husain, Hamel and Shankar, Shreya. LLM Evals: Everything You Need to Know. Hamel's Blog

  47. [47]

    Quantifying Local Model Validity using Active Learning

    Lämmle, Sven and Bogoclu, Can and Vosshall, Robert and Haselhoff, Anselm and Roos, Dirk. Quantifying Local Model Validity using Active Learning. Uncertainty in Artificial Intelligence

  48. [48]

    Sequential tests of statistical hypotheses

    Wald, A. Sequential tests of statistical hypotheses. Ann. Math. Stat

  49. [49]

    HiBayES: A Hierarchical Bayesian modeling framework for AI Evaluation Statistics

    Luettgau, Lennart and Coppock, Harry and Dubois, Magda and Summerfield, Christopher and Ududec, Cozmin. HiBayES: A Hierarchical Bayesian modeling framework for AI Evaluation Statistics. arXiv [cs.AI]

  50. [50]

    Active sequential hypothesis testing

    Naghshvar, Mohammad and Javidi, Tara. Active sequential hypothesis testing. Ann. Stat

  51. [51]

    Active sequential two-sample testing

    Li, Weizhi and Kadambi, Prad and Saidi, Pouria and Ramamurthy, Karthikeyan Natesan and Dasarathy, Gautam and Berisha, Visar. Active sequential two-sample testing. Transact. Mach. Learn. Res

  52. [52]

    Automated Hypothesis Validation with Agentic Sequential Falsifications

    Huang, Kexin and Jin, Ying and Li, Ryan and Li, Michael Y and Candes, Emmanuel and Leskovec, Jure. Automated Hypothesis Validation with Agentic Sequential Falsifications. Forty-second International Conference on Machine Learning

  53. [53]

    Multi-armed sequential hypothesis testing by betting

    Sandoval, Ricardo J and Waudby-Smith, Ian and Jordan, Michael I. Multi-armed sequential hypothesis testing by betting. arXiv [stat.ME]

  54. [54]

    Active fairness auditing

    Yan, Tom and Zhang, Chicheng. Active fairness auditing. International Conference on Machine Learning

  55. [55]

    Audit me if you can: Query-efficient active fairness auditing of black-box LLMs

    Hartmann, David and Pohlmann, Lena and Hanslik, Lelia and Gießing, Noah and Berendt, Bettina and Delobelle, Pieter. Audit me if you can: Query-efficient active fairness auditing of black-box LLMs. arXiv [cs.LG]

  56. [56]

    Anchor points: Benchmarking models with much fewer examples

    Vivek, Rajan and Ethayarajh, Kawin and Yang, Diyi and Kiela, Douwe. Anchor points: Benchmarking models with much fewer examples. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

  57. [57]

    On statistical bias in active learning: How and when to fix it

    Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom. On statistical bias in active learning: How and when to fix it. international conference on learning representations

  58. [58]

    Admissible online closed testing must employ e-values

    Fischer, Lasse and Ramdas, Aaditya. Admissible online closed testing must employ e-values. arXiv [stat.ME]

  59. [59]

    Family-wise error rate control with E-values

    Hartog, Will and Lei, Lihua. Family-wise error rate control with E-values. arXiv [stat.ME]

  60. [60]

    E-values for adaptive clinical trials: Anytime-valid monitoring in practice

    Sokolova, Alexandra and Sokolov, Vadim. E-values for adaptive clinical trials: Anytime-valid monitoring in practice. arXiv [stat.ME]

  61. [61]

    The Caltech-UCSD Birds-200-2011 Dataset

    Wah, Catherine and Branson, Steve and Welinder, Peter and Perona, Pietro and Belongie, Serge. The Caltech-UCSD Birds-200-2011 Dataset. CaltechAUTHORS

  62. [62]

    ASPEST: Bridging the gap between active learning and selective prediction

    Chen, Jiefeng and Yoon, Jinsung and Ebrahimi, Sayna and Arik, Sercan and Jha, Somesh and Pfister, Tomas. ASPEST: Bridging the gap between active learning and selective prediction. Transact. Mach. Learn. Res

  63. [63]

    AcTracer: Active testing of large language model via multi-stage sampling

    Huang, Yuheng and Song, Jiayang and Hu, Qiang and Juefei-Xu, Felix and Ma, Lei. AcTracer: Active testing of large language model via multi-stage sampling. arXiv [cs.SE]

  64. [64]

    Adaptive testing of computer vision models

    Gao, Irena and Ilharco, Gabriel and Lundberg, Scott and Ribeiro, Marco Tulio. Adaptive testing of computer vision models. IEEE/CVF International Conference on Computer Vision

  65. [65]

    AutoBencher: Towards Declarative Benchmark Construction

    Li, Xiang Lisa and Kaiyom, Farzaan and Liu, Evan Zheran and Mai, Yifan and Liang, Percy and Hashimoto, Tatsunori. AutoBencher: Towards Declarative Benchmark Construction. The Thirteenth International Conference on Learning Representations

  66. [66]

    Beyond accuracy: Behavioral testing of NLP models with CheckList

    Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer. Beyond accuracy: Behavioral testing of NLP models with CheckList. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

  67. [67]

    Red teaming language models with language models

    Perez, Ethan and Huang, Saffron and Song, Francis and Cai, Trevor and Ring, Roman and Aslanides, John and Glaese, Amelia and McAleese, Nat and Irving, Geoffrey. Red teaming language models with language models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

  68. [68]

    Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned

    Ganguli, Deep and Lovitt, Liane and Kernion, Jackson and Askell, Amanda and Bai, Yuntao and Kadavath, Saurav and Mann, Ben and Perez, Ethan and Schiefer, Nicholas and Ndousse, Kamal and Jones, Andy and Bowman, Sam and Chen, Anna and Conerly, Tom and DasSarma, Nova and Drain, Dawn and Elhage, Nelson and El-Showk, Sheer and Fort, Stanislav and Hatfield-Dodd...

  69. [69]

    Recommended Content and Format of Non-Clinical Bench Performance Testing Information in Premarket Submissions

    U.S. Food and Drug Administration. Recommended Content and Format of Non-Clinical Bench Performance Testing Information in Premarket Submissions. U.S. Food and Drug Administration

  70. [70]

    E-valuator: Reliable agent verifiers with sequential hypothesis testing

    Sadhuka, Shuvom and Prinster, Drew and Fannjiang, Clara and Scalia, Gabriele and Regev, Aviv and Wang, Hanchen. E-valuator: Reliable agent verifiers with sequential hypothesis testing. arXiv [cs.LG]

  71. [71]

    Testing Fisher, Neyman, Pearson, and Bayes

    Christensen, Ronald. Testing Fisher, Neyman, Pearson, and Bayes. Am. Stat

  72. [72]

    Product Evals in Three Simple Steps

    Yan, Eugene. Product Evals in Three Simple Steps. eugeneyan.com

  73. [73]

    Demystifying evals for AI agents

    Anthropic. Demystifying evals for AI agents. Engineering at Anthropic: Inside the team building reliable AI systems

  74. [74]

    Neural network learning: theoretical foundations

    Anthony, Martin and Bartlett, Peter L. Neural network learning: theoretical foundations

  75. [75]

    Consistency of random forests

    Scornet, Erwan and Biau, Gérard and Vert, Jean-Philippe. Consistency of random forests. Ann. Stat

  76. [76]

    Multivariate smoothing spline functions

    Cox, Dennis D. Multivariate smoothing spline functions. SIAM J. Numer. Anal

  77. [77]

    On the asymptotics of random forests

    Scornet, Erwan. On the asymptotics of random forests. J. Multivar. Anal

  78. [78]

    Finite-time analysis of the multiarmed bandit problem

    Auer, Peter and Cesa-Bianchi, Nicolò and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Mach. Learn

  79. [79]

    Reinforcement learning: An introduction, 2nd ed

    Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction, 2nd ed

  80. [80]

    General Agent Evaluation

    Bandel, Elron and Yehudai, Asaf and Eden, Lilach and Sagron, Yehoshua and Perlitz, Yotam and Venezian, Elad and Razinkov, Natalia and Ergas, Natan and Ifergan, Shlomit Shachor and Shlomov, Segev and Jacovi, Michal and Choshen, Leshem and Ein-Dor, Liat and Katz, Yoav and Shmueli-Scheuer, Michal. General Agent Evaluation. arXiv [cs.AI]

Showing first 80 references.