Adaptive auditing of AI systems with anytime-valid guarantees
Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3
The pith
Passing a stringent adaptive audit certifies an AI system as globally robust, provided the auditor is powerful enough to uncover failures whenever they exist.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
If the auditor is sufficiently powerful, the model's null hypothesis (no failure mode performs below a target threshold) and the auditor's null hypothesis (some sampling strategy will uncover a failure mode) are asymptotically inverses of each other, so passing a stringent audit certifies the AI system as globally robust.
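One hedged way to write the two nulls in symbols. The notation here (g for a candidate failure mode, perf(g) for the model's performance on it, tau for the target threshold) is assumed for illustration rather than taken from the paper:

```latex
% Hypothetical notation, not the manuscript's:
% g ranges over candidate failure modes, perf(g) is model performance on g,
% and tau is the target performance threshold.
\[
  H_0^{\mathrm{M}}:\; \inf_{g}\operatorname{perf}(g) \ge \tau
  \qquad \text{vs.} \qquad
  H_0^{\mathrm{A}}:\; \exists\, g \text{ reachable by some sampling strategy with } \operatorname{perf}(g) < \tau .
\]
```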
What carries the argument
Simultaneous e-processes that formalize 'testing by betting' for the two dueling null hypotheses under safe anytime-valid inference.
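The betting mechanics behind an e-process can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's construction: the threshold `mu0`, the fixed bet fraction `lam`, and outcomes in [0, 1] are all choices made for the sketch.

```python
def betting_e_process(outcomes, mu0=0.8, alpha=0.05):
    """Test H0: 'mean performance >= mu0' for outcomes in [0, 1] by betting.

    Each round the auditor stakes a fraction lam of current wealth against
    H0. Under H0 the wealth process is a nonnegative supermartingale, so by
    Ville's inequality P(wealth ever reaches 1/alpha) <= alpha, no matter
    when or how adaptively the audit stops.
    """
    lam = 0.5  # fixed bet fraction for the sketch; adaptive bets are also valid
    wealth, trajectory = 1.0, []
    for x in outcomes:
        # The bet pays off when observed performance x falls below mu0.
        wealth *= 1.0 + lam * (mu0 - x)
        trajectory.append(wealth)
        if wealth >= 1.0 / alpha:
            return trajectory, True  # enough evidence: a failure mode is flagged
    return trajectory, False  # audit ends without rejecting the model's null
```

On a mode that fails every probe (`outcomes = [0.0] * 20`) the wealth crosses 1/alpha within a handful of rounds, consistent with the small-sample claims; on a mode that always succeeds the wealth decays and never rejects.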
If this is right
- The procedures keep type-I error controlled at any point during an audit, even with as few as 20 observations.
- Adaptive testing can reach statistically valid conclusions faster than fixed, pre-specified sampling plans.
- Passing the audit under the dueling framework provides a direct certificate of global robustness rather than a local one.
- The approach applies directly to the adaptive sampling and stopping rules already used in practical AI evaluations.
Where Pith is reading between the lines
- Auditors could design their sampling strategies explicitly to satisfy the power condition, turning routine checks into formal certifications.
- The method could let evaluation teams stop data collection early once evidence against failure modes accumulates, lowering annotation costs.
- Similar dueling-hypothesis setups might transfer to other adaptive testing domains such as software quality assurance or clinical trial monitoring.
- Empirical checks on real generative models would show how many samples are typically needed before the certification threshold is crossed.
Load-bearing premise
The auditor must be powerful enough to uncover failure modes whenever they actually exist in the system.
What would settle it
An AI system passes the adaptive audit yet later shows a concrete failure mode below the target threshold when examined with additional, independent tests.
read the original abstract
A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance below a target threshold versus (ii) the auditor's null that asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting 'testing by betting', which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as being globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a hypothesis testing framework for adaptive auditing of AI systems, addressing the challenges of limited annotations (often 10-50 cases) and data-dependent sampling/stopping rules that violate classical assumptions. It defines two dueling null hypotheses: (i) the model's null asserting no failure mode with performance below a target threshold, and (ii) the auditor's null asserting the existence of a sampling strategy to uncover such a mode. Leveraging Safe Anytime-Valid Inference (SAVI), the auditor is formalized via testing-by-betting, yielding simultaneous e-processes for the dueling hypotheses. The central theoretical result is that, if the auditor is sufficiently powerful, these hypotheses are asymptotically inverses, so that passing a stringent audit certifies global robustness of the AI system. Empirical results demonstrate anytime-valid type-I error control, outperformance of pre-specified tests, and valid conclusions with as few as 20 observations.
Significance. If the results hold, this provides a statistically rigorous approach to adaptive auditing of generative AI with anytime-valid guarantees, directly tackling the practical bottleneck of annotation costs. The use of SAVI e-processes for adaptive sampling and the proof of asymptotic inversion between dueling hypotheses (conditioned on auditor power) are notable strengths, offering a principled way to interpret audit outcomes as robustness certifications. This could influence AI safety evaluation standards by enabling valid inferences from opportunistic data collection rather than fixed protocols. Credit is due for grounding the framework in established SAVI literature while extending it to dueling perspectives and demonstrating small-sample efficiency.
major comments (2)
- [§4] §4 (Asymptotic Inversion Theorem): The central claim that the hypotheses are 'asymptotically inverses' under a 'sufficiently powerful' auditor is load-bearing for the robustness certification interpretation, yet the manuscript provides only a qualitative definition of auditor power without an explicit characterization (e.g., a lower bound on detection probability for existing failure modes under the adaptive strategy). This leaves the scope of the inversion result unclear and risks the claim reducing to a tautology if power is defined circularly via the audit outcome.
- [§3.1] §3.1 (SAVI Application): The proof that SAVI e-processes apply directly to the adaptive sampling and stopping rules under the dueling nulls assumes the framework's conditions hold without violation; however, the manuscript does not explicitly verify or cite the martingale properties or optional stopping conditions for the specific composite hypotheses and betting strategies used here. Any mismatch could undermine the anytime-valid type-I error control asserted in the abstract.
minor comments (3)
- [Abstract] Abstract and §5 (Experiments): The claim that the procedures 'outperform pre-specified testing methods' should specify the exact baselines (e.g., fixed-sample-size tests or Bonferroni-corrected procedures) and metrics (e.g., average sample size to rejection or power curves) for reproducibility.
- [§2] Notation in §2: The distinction between the model's null H_0^M and auditor's null H_0^A could be clarified with an explicit table or side-by-side comparison of their formal statements to aid readers unfamiliar with testing-by-betting.
- Figure captions (e.g., Figure 2): Captions should explicitly note the adaptive stopping rule and how the e-process trajectories relate to the dueling hypotheses to improve clarity for the empirical results.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and positive recommendation for minor revision. We address each major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [§4] §4 (Asymptotic Inversion Theorem): The central claim that the hypotheses are 'asymptotically inverses' under a 'sufficiently powerful' auditor is load-bearing for the robustness certification interpretation, yet the manuscript provides only a qualitative definition of auditor power without an explicit characterization (e.g., a lower bound on detection probability for existing failure modes under the adaptive strategy). This leaves the scope of the inversion result unclear and risks the claim reducing to a tautology if power is defined circularly via the audit outcome.
Authors: We agree that the current qualitative definition of auditor power leaves the scope of the Asymptotic Inversion Theorem somewhat open to interpretation. In the revision, we will augment §4 with an explicit, non-circular characterization: the auditor is sufficiently powerful if its betting strategy ensures that, whenever a failure mode exists below the target threshold, the associated e-process grows to exceed 1/α with probability at least 1-δ (for user-specified α, δ) under the adaptive sampling rule. This condition is stated in terms of the e-process growth rate under the alternative and is independent of any particular audit outcome, thereby clarifying the conditions under which passage of the audit certifies global robustness. revision: yes
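Read literally, the condition the authors promise could be rendered as follows. This is a hedged transcription of the rebuttal's wording, not a quote from the manuscript; here E_t is the auditor's e-process and H_1^M denotes the distributions harboring a sub-threshold failure mode:

```latex
% Hedged transcription of the rebuttal's power condition:
\[
  \forall\, P \in H_1^{\mathrm{M}}:\qquad
  P\!\left( \exists\, t:\ E_t \ge \tfrac{1}{\alpha} \right) \ \ge\ 1 - \delta .
\]
% Whenever a failure mode exists below the threshold, the e-process crosses
% the rejection boundary 1/alpha with probability at least 1 - delta under
% the adaptive sampling rule, independently of any particular audit outcome.
```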
-
Referee: [§3.1] §3.1 (SAVI Application): The proof that SAVI e-processes apply directly to the adaptive sampling and stopping rules under the dueling nulls assumes the framework's conditions hold without violation; however, the manuscript does not explicitly verify or cite the martingale properties or optional stopping conditions for the specific composite hypotheses and betting strategies used here. Any mismatch could undermine the anytime-valid type-I error control asserted in the abstract.
Authors: We appreciate the referee's call for explicit verification. While the application follows from the general SAVI theory for e-processes under adapted filtrations, the manuscript does not contain a dedicated check for our composite dueling nulls. In the revised version we will insert a short paragraph (and supporting appendix material) in §3.1 that (i) confirms the chosen betting strategies produce non-negative supermartingales under each null and (ii) verifies that the optional stopping theorem applies because the data-dependent stopping time is adapted to the filtration generated by the sequential observations. We will cite the precise SAVI results that guarantee the anytime-valid type-I error control under these conditions. revision: yes
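The two conditions the revision promises to verify are standard SAVI requirements; in the usual notation (E_t the e-process, F_{t-1} the filtration of past observations), their general form, stated here only as a reminder, is:

```latex
% Standard SAVI conditions (general form, not the paper's specific proof):
\begin{align*}
  &\text{(i) nonnegative supermartingale under each null:} &
    E_t \ge 0, \quad
    \mathbb{E}_P\!\left[ E_t \mid \mathcal{F}_{t-1} \right] &\le E_{t-1}
    \quad \text{for all } P \in H_0, \\
  &\text{(ii) Ville's inequality, giving anytime validity:} &
    P\!\left( \sup_{t \ge 1} E_t \ge \tfrac{1}{\alpha} \right) &\le \alpha
    \quad \text{for all } P \in H_0 .
\end{align*}
```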
Circularity Check
Minor reliance on prior SAVI literature; core inversion proof and dueling hypotheses derived independently
full rationale
The paper introduces dueling null hypotheses (model's null on absence of failure modes below threshold vs. auditor's null on uncovering failure modes via adaptive sampling) and proves their asymptotic inversion under a 'sufficiently powerful auditor' condition. This proof is presented as an original result in the manuscript and does not reduce by construction to fitted parameters, self-definitions, or prior self-citations. SAVI e-processes and testing-by-betting are leveraged from established anytime-valid inference literature for simultaneous control, which is standard external support rather than a load-bearing self-citation chain. Empirical type-I error control and comparisons to pre-specified methods provide separate validation. No patterns of self-definitional claims, fitted inputs renamed as predictions, or ansatz smuggling appear in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Safe Anytime-Valid Inference (SAVI) properties hold for the adaptive sampling and stopping rules in AI audits.
Reference graph
Works this paper leans on
-
[1]
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Lakshminarayanan, Balaji and Pritzel, Alexander and Blundell, Charles. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Advances in Neural Information Processing Systems 30
-
[2]
A Multiple Testing Procedure for Clinical Trials
O'Brien, Peter C and Fleming, Thomas R. A Multiple Testing Procedure for Clinical Trials. Biometrics
-
[3]
Context-aware testing: A new paradigm for model testing with large language models
Rauba, Paulius and Ruiz Luyten, Max and Seedat, Nabeel and van der Schaar, Mihaela. Context-aware testing: A new paradigm for model testing with large language models. Advances in Neural Information Processing Systems 37
-
[5]
Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian. Deep Residual Learning for Image Recognition. CVPR
-
[6]
Discrete Sequential Boundaries for Clinical Trials
Lan, K K Gordon and DeMets, David L. Discrete Sequential Boundaries for Clinical Trials. Biometrika
-
[7]
Group sequential methods in the design and analysis of clinical trials
Pocock, Stuart J. Group sequential methods in the design and analysis of clinical trials. Biometrika
-
[9]
Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness
Kearns, Michael and Neel, Seth and Roth, Aaron and Wu, Zhiwei Steven. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. International Conference on Machine Learning
-
[10]
Prediction, Learning, and Games
Cesa-Bianchi, Nicolo and Lugosi, Gabor. Prediction, Learning, and Games
-
[11]
Algorithmic Fairness: Choices, Assumptions, and Definitions
Mitchell, Shira and Potash, Eric and Barocas, Solon and D'Amour, Alexander and Lum, Kristian. Algorithmic Fairness: Choices, Assumptions, and Definitions. Annu. Rev. Stat. Appl
-
[12]
Doubly robust confidence sequences for sequential causal inference
Waudby-Smith, Ian and Arbour, David and Sinha, Ritwik and Kennedy, Edward H and Ramdas, Aaditya. Doubly robust confidence sequences for sequential causal inference. arXiv [math.ST]
-
[13]
Anytime-valid and asymptotically efficient inference driven by predictive recursion
Dixit, Vaidehi and Martin, Ryan. Anytime-valid and asymptotically efficient inference driven by predictive recursion. Biometrika
-
[14]
Evaluating Model Robustness and Stability to Dataset Shift
Subbaswamy, Adarsh and Adams, Roy and Saria, Suchi. Evaluating Model Robustness and Stability to Dataset Shift. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
-
[15]
A Comparison of Some Control Chart Procedures
Roberts, S W. A Comparison of Some Control Chart Procedures. Technometrics
-
[16]
On Optimum Methods in Quickest Detection Problems
Shiryaev, A N. On Optimum Methods in Quickest Detection Problems. Theory Probab. Appl
-
[17]
A snapshot of the frontiers of fairness in machine learning
Chouldechova, Alexandra and Roth, Aaron. A snapshot of the frontiers of fairness in machine learning. Commun. ACM
-
[18]
FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning
Cabrera, Ángel Alexander and Epperson, Will and Hohman, Fred and Kahng, Minsuk and Morgenstern, Jamie and Chau, Duen Horng. FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning. 2019 IEEE Conference on Visual Analytics Science and Technology (VAST)
-
[19]
Multicalibration: Calibration for the (Computationally-Identifiable) Masses
Hebert-Johnson, Ursula and Kim, Michael and Reingold, Omer and Rothblum, Guy. Multicalibration: Calibration for the (Computationally-Identifiable) Masses. International Conference on Machine Learning
-
[20]
Slice Finder: Automated Data Slicing for Model Validation
Chung, Yeounoh and Kraska, Tim and Polyzotis, Neoklis and Tae, Ki Hyun and Whang, Steven Euijong. Slice Finder: Automated Data Slicing for Model Validation. 2019 IEEE 35th International Conference on Data Engineering (ICDE)
-
[21]
Active Testing: Sample-Efficient Model Evaluation
Kossen, Jannik and Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom. Active Testing: Sample-Efficient Model Evaluation. Proceedings of the 38th International Conference on Machine Learning
-
[22]
Domino: Discovering Systematic Errors with Cross-Modal Embeddings
Eyuboglu, Sabri and Varma, Maya and Saab, Khaled and Delbrouck, Jean-Benoit and Lee-Messer, Christopher and Dunnmon, Jared and Zou, James and Ré, Christopher. Domino: Discovering Systematic Errors with Cross-Modal Embeddings. International Conference on Learning Representations
-
[23]
Active multiple testing with proxy p-values and e-values
Xu, Ziyu and Wang, Catherine and Wasserman, Larry and Roeder, Kathryn and Ramdas, Aaditya. Active multiple testing with proxy p-values and e-values. arXiv [stat.ME]
-
[24]
Online multiple testing with e-values
Xu, Ziyu and Ramdas, Aaditya. Online multiple testing with e-values. arXiv [stat.ME]
-
[25]
Adaptive learn-then-test: Statistically valid and efficient hyperparameter selection
Zecchin, Matteo and Park, Sangwoo and Simeone, Osvaldo. Adaptive learn-then-test: Statistically valid and efficient hyperparameter selection. arXiv [stat.ML]
-
[26]
Active hypothesis testing under computational budgets with applications to GWAS and LLM
Kuang, Qi and Gang, Bowen and Xia, Yin. Active hypothesis testing under computational budgets with applications to GWAS and LLM. arXiv [stat.ME]
-
[27]
Scaling Up Active Testing to Large Language Models
Berrada, Gabrielle and Kossen, Jannik and Smith, Freddie Bickford and Razzak, Muhammed and Gal, Yarin and Rainforth, Tom. Scaling Up Active Testing to Large Language Models. The Thirty-ninth Annual Conference on Neural Information Processing Systems
-
[28]
Is this model reliable for everyone? Testing for strong calibration
Feng, Jean and Gossmann, Alexej and Pirracchio, Romain and Petrick, Nicholas and Pennello, Gene and Sahiner, Berkman. Is this model reliable for everyone? Testing for strong calibration. AISTATS
-
[29]
The statistical scope of multicalibration
Noarov, Georgy and Roth, Aaron. The statistical scope of multicalibration. International Conference on Machine Learning
-
[30]
Universal inference
Wasserman, Larry and Ramdas, Aaditya and Balakrishnan, Sivaraman. Universal inference. Proc. Natl. Acad. Sci. U. S. A
-
[31]
Holistic Evaluation of Language Models
Liang, Percy and Bommasani, Rishi and Lee, Tony and Tsipras, Dimitris and Soylu, Dilara and Yasunaga, Michihiro and Zhang, Yian and Narayanan, Deepak and Wu, Yuhuai and Kumar, Ananya and Newman, Benjamin and Yuan, Binhang and Yan, Bobby and Zhang, Ce and Cosgrove, Christian Alexander and Manning, Christopher D and Re, Christopher and Acosta-Navas, Diana a...
-
[32]
A Brief Tutorial on Sample Size Calculations for Fairness Audits
Singh, Harvineet and Xia, Fan and Kim, Mi-Ok and Pirracchio, Romain and Chunara, Rumi and Feng, Jean. A Brief Tutorial on Sample Size Calculations for Fairness Audits. Workshop on Regulatable Machine Learning at the 37th Conference on Neural Information Processing Systems
-
[33]
Red-teaming for generative AI: Silver bullet or security theater?
Feffer, Michael and Sinha, Anusha and Lipton, Zachary C and Heidari, Hoda. Red-teaming for generative AI: Silver bullet or security theater? arXiv [cs.CY]
-
[34]
Query-Efficient Black-Box Red Teaming via Bayesian Optimization
Lee, Deokjae and Lee, Junyeong and Ha, Jung-Woo and Kim, Jin-Hwa and Lee, Sang-Woo and Lee, Hwaran and Song, Hyun Oh. Query-Efficient Black-Box Red Teaming via Bayesian Optimization. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
-
[35]
Hypothesis testing with e-values
Ramdas, Aaditya and Wang, Ruodu. Hypothesis testing with e-values. arXiv [math.ST]
-
[36]
E-detectors: a nonparametric framework for sequential change detection
Shin, Jaehyeok and Ramdas, Aaditya and Rinaldo, Alessandro. E-detectors: a nonparametric framework for sequential change detection. New England Journal of Statistics in Data Science
-
[37]
Testing by betting: A strategy for statistical and scientific communication
Shafer, Glenn. Testing by betting: A strategy for statistical and scientific communication. J. R. Stat. Soc. Ser. A Stat. Soc
-
[38]
Safe testing
Grünwald, Peter and de Heide, Rianne and Koolen, Wouter. Safe testing. J. R. Stat. Soc. Series B Stat. Methodol
-
[39]
A data-driven framework for identifying patient subgroups on which an AI /machine learning model may underperform
Subbaswamy, Adarsh and Sahiner, Berkman and Petrick, Nicholas and Pai, Vinay and Adams, Roy and Diamond, Matthew C and Saria, Suchi. A data-driven framework for identifying patient subgroups on which an AI /machine learning model may underperform. NPJ Digit. Med
-
[40]
EvalTree: Profiling Language Model weaknesses via hierarchical capability trees
Zeng, Zhiyuan and Wang, Yizhong and Hajishirzi, Hannaneh and Koh, Pang Wei. EvalTree: Profiling Language Model weaknesses via hierarchical capability trees. Conference on Language Modeling
-
[41]
Adaptive Testing and Debugging of NLP Models
Ribeiro, Marco Tulio and Lundberg, Scott. Adaptive Testing and Debugging of NLP Models. ACL 2022
-
[42]
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv [cs.AI]
-
[43]
LLMs Judging LLMs: A Simplex Perspective
Vossler, Patrick and Xia, Fan and Mai, Yifan and Subbaswamy, Adarsh and Feng, Jean. LLMs Judging LLMs: A Simplex Perspective. International Conference on Artificial Intelligence and Statistics
-
[44]
When the domain expert has no time and the LLM developer has no clinical expertise: Real-world lessons from LLM co-design in a safety-net hospital
Kothari, Avni and Vossler, Patrick and Digitale, Jean and Forouzannia, Mohammad and Rosenberg, Elise and Lee, Michele and Bryant, Jennee and Molina, Melanie and Marks, James and Zier, Lucas and Feng, Jean. When the domain expert has no time and the LLM developer has no clinical expertise: Real-world lessons from LLM co-design in a safety-net hospital. Pro...
-
[45]
MCGrad: Multicalibration at Web Scale
Tax, Niek and Perini, Lorenzo and Linder, Fridolin and Haimovich, Daniel and Karamshuk, Dima and Okati, Nastaran and Vojnovic, Milan and Apostolopoulos, Pavlos Athanasios. MCGrad: Multicalibration at Web Scale. arXiv [cs.LG]
-
[46]
LLM Evals: Everything You Need to Know
Husain, Hamel and Shankar, Shreya. LLM Evals: Everything You Need to Know. Hamel's Blog
-
[47]
Quantifying Local Model Validity using Active Learning
Lämmle, Sven and Bogoclu, Can and Vosshall, Robert and Haselhoff, Anselm and Roos, Dirk. Quantifying Local Model Validity using Active Learning. Uncertainty in Artificial Intelligence
-
[48]
Sequential tests of statistical hypotheses
Wald, A. Sequential tests of statistical hypotheses. Ann. Math. Stat
-
[49]
HiBayES: A Hierarchical Bayesian modeling framework for AI Evaluation Statistics
Luettgau, Lennart and Coppock, Harry and Dubois, Magda and Summerfield, Christopher and Ududec, Cozmin. HiBayES: A Hierarchical Bayesian modeling framework for AI Evaluation Statistics. arXiv [cs.AI]
-
[50]
Active sequential hypothesis testing
Naghshvar, Mohammad and Javidi, Tara. Active sequential hypothesis testing. Ann. Stat
-
[51]
Active sequential two-sample testing
Li, Weizhi and Kadambi, Prad and Saidi, Pouria and Ramamurthy, Karthikeyan Natesan and Dasarathy, Gautam and Berisha, Visar. Active sequential two-sample testing. Transact. Mach. Learn. Res
-
[52]
Automated Hypothesis Validation with Agentic Sequential Falsifications
Huang, Kexin and Jin, Ying and Li, Ryan and Li, Michael Y and Candes, Emmanuel and Leskovec, Jure. Automated Hypothesis Validation with Agentic Sequential Falsifications. Forty-second International Conference on Machine Learning
-
[53]
Multi-armed sequential hypothesis testing by betting
Sandoval, Ricardo J and Waudby-Smith, Ian and Jordan, Michael I. Multi-armed sequential hypothesis testing by betting. arXiv [stat.ME]
-
[54]
Active fairness auditing
Yan, Tom and Zhang, Chicheng. Active fairness auditing. International Conference on Machine Learning
-
[55]
Audit me if you can: Query-efficient active fairness auditing of black-box LLMs
Hartmann, David and Pohlmann, Lena and Hanslik, Lelia and Gießing, Noah and Berendt, Bettina and Delobelle, Pieter. Audit me if you can: Query-efficient active fairness auditing of black-box LLMs. arXiv [cs.LG]
-
[56]
Anchor points: Benchmarking models with much fewer examples
Vivek, Rajan and Ethayarajh, Kawin and Yang, Diyi and Kiela, Douwe. Anchor points: Benchmarking models with much fewer examples. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
-
[57]
On statistical bias in active learning: How and when to fix it
Farquhar, Sebastian and Gal, Yarin and Rainforth, Tom. On statistical bias in active learning: How and when to fix it. international conference on learning representations
-
[58]
Admissible online closed testing must employ e-values
Fischer, Lasse and Ramdas, Aaditya. Admissible online closed testing must employ e-values. arXiv [stat.ME]
-
[59]
Family-wise error rate control with E-values
Hartog, Will and Lei, Lihua. Family-wise error rate control with E-values. arXiv [stat.ME]
-
[60]
E-values for adaptive clinical trials: Anytime-valid monitoring in practice
Sokolova, Alexandra and Sokolov, Vadim. E-values for adaptive clinical trials: Anytime-valid monitoring in practice. arXiv [stat.ME]
-
[61]
The Caltech-UCSD Birds-200-2011 Dataset
Wah, Catherine and Branson, Steve and Welinder, Peter and Perona, Pietro and Belongie, Serge. The Caltech-UCSD Birds-200-2011 Dataset. CaltechAUTHORS
-
[62]
ASPEST: Bridging the gap between active learning and selective prediction
Chen, Jiefeng and Yoon, Jinsung and Ebrahimi, Sayna and Arik, Sercan and Jha, Somesh and Pfister, Tomas. ASPEST: Bridging the gap between active learning and selective prediction. Transact. Mach. Learn. Res
-
[63]
AcTracer: Active testing of large language model via multi-stage sampling
Huang, Yuheng and Song, Jiayang and Hu, Qiang and Juefei-Xu, Felix and Ma, Lei. AcTracer: Active testing of large language model via multi-stage sampling. arXiv [cs.SE]
-
[64]
Adaptive testing of computer vision models
Gao, Irena and Ilharco, Gabriel and Lundberg, Scott and Ribeiro, Marco Tulio. Adaptive testing of computer vision models. IEEE/CVF International Conference on Computer Vision
-
[65]
AutoBencher: Towards Declarative Benchmark Construction
Li, Xiang Lisa and Kaiyom, Farzaan and Liu, Evan Zheran and Mai, Yifan and Liang, Percy and Hashimoto, Tatsunori. AutoBencher: Towards Declarative Benchmark Construction. The Thirteenth International Conference on Learning Representations
-
[66]
Beyond accuracy: Behavioral testing of NLP models with CheckList
Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer. Beyond accuracy: Behavioral testing of NLP models with CheckList. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
-
[67]
Red teaming language models with language models
Perez, Ethan and Huang, Saffron and Song, Francis and Cai, Trevor and Ring, Roman and Aslanides, John and Glaese, Amelia and McAleese, Nat and Irving, Geoffrey. Red teaming language models with language models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
-
[68]
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
Ganguli, Deep and Lovitt, Liane and Kernion, Jackson and Askell, Amanda and Bai, Yuntao and Kadavath, Saurav and Mann, Ben and Perez, Ethan and Schiefer, Nicholas and Ndousse, Kamal and Jones, Andy and Bowman, Sam and Chen, Anna and Conerly, Tom and DasSarma, Nova and Drain, Dawn and Elhage, Nelson and El-Showk, Sheer and Fort, Stanislav and Hatfield-Dodd...
-
[69]
Recommended Content and Format of Non-Clinical Bench Performance Testing Information in Premarket Submissions
U.S. Food and Drug Administration. Recommended Content and Format of Non-Clinical Bench Performance Testing Information in Premarket Submissions. U.S. Food and Drug Administration
-
[70]
E-valuator: Reliable agent verifiers with sequential hypothesis testing
Sadhuka, Shuvom and Prinster, Drew and Fannjiang, Clara and Scalia, Gabriele and Regev, Aviv and Wang, Hanchen. E-valuator: Reliable agent verifiers with sequential hypothesis testing. arXiv [cs.LG]
-
[71]
Testing fisher, Neyman, Pearson, and Bayes
Christensen, Ronald. Testing fisher, Neyman, Pearson, and Bayes. Am. Stat
-
[72]
Product Evals in Three Simple Steps
Yan, Eugene. Product Evals in Three Simple Steps. eugeneyan.com
-
[73]
Demystifying evals for AI agents
Anthropic. Demystifying evals for AI agents. Engineering at Anthropic: Inside the team building reliable AI systems
-
[74]
Neural network learning: theoretical foundations
Anthony, Martin and Bartlett, Peter L. Neural network learning: theoretical foundations
-
[75]
Consistency of random forests
Scornet, Erwan and Biau, Gérard and Vert, Jean-Philippe. Consistency of random forests. Ann. Stat
-
[76]
Multivariate smoothing spline functions
Cox, Dennis D. Multivariate smoothing spline functions. SIAM J. Numer. Anal
-
[77]
On the asymptotics of random forests
Scornet, Erwan. On the asymptotics of random forests. J. Multivar. Anal
-
[78]
Finite-time analysis of the multiarmed bandit problem
Auer, Peter and Cesa-Bianchi, Nicolò and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Mach. Learn
-
[79]
Reinforcement learning: An introduction, 2nd ed
Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction, 2nd ed
-
[80]
General Agent Evaluation
Bandel, Elron and Yehudai, Asaf and Eden, Lilach and Sagron, Yehoshua and Perlitz, Yotam and Venezian, Elad and Razinkov, Natalia and Ergas, Natan and Ifergan, Shlomit Shachor and Shlomov, Segev and Jacovi, Michal and Choshen, Leshem and Ein-Dor, Liat and Katz, Yoav and Shmueli-Scheuer, Michal. General Agent Evaluation. arXiv [cs.AI]