The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 03:33 UTC · model grok-4.3
The pith
AI deployment in sensitive domains should use calibrated verification instead of mechanistic interpretability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Authorization for deploying AI in high-stakes domains should rest on calibrated verification—domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable—rather than on explanations of model internals, since capability is uneven across tasks and societies govern non-transparent expertise without requiring mechanistic transparency.
What carries the argument
A calibrated verification regime, with Verification Coverage (a six-component reportable standard with a minimum-composition rule) reported alongside capability scores.
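The paper leaves the six components at the conceptual level (the referee report below presses on exactly this), so the following is only a minimal sketch of how a Verification Coverage record could be rendered, assuming the components are the six properties named in the core claim and that each is scored on [0, 1]; the non-compensatory floor in `satisfies_minimum_composition` is one assumed reading of the minimum-composition rule, not a definition from the paper.

```python
from dataclasses import dataclass, fields

@dataclass
class VerificationCoverage:
    """Hypothetical rendering of the six-component standard.

    Component names follow the six properties in the core claim;
    the paper does not fix scoring scales, so scores here are
    assumed to lie in [0, 1].
    """
    domain_scoped: float            # authorization tied to a specific use
    independently_checkable: float  # third parties can verify the claims
    monitored_after_release: float  # post-deployment surveillance in place
    accountable: float              # an identified party bears liability
    contestable: float              # affected parties can appeal decisions
    revocable: float                # authorization can be withdrawn

    def satisfies_minimum_composition(self, floor: float = 0.5) -> bool:
        """Assumed reading of the minimum-composition rule: every
        component must clear a floor; a high score on one component
        cannot compensate for a missing one."""
        return all(getattr(self, f.name) >= floor for f in fields(self))
```

The design choice worth noting is the `all(...)` check: it makes the rule non-compensatory, so a perfect monitoring score cannot offset an absent revocation mechanism.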
If this is right
- Authorization must attach to specific uses rather than to models in general because capability varies across nearby tasks.
- Post-release monitoring and surveillance become required elements of any deployment process.
- Verification Coverage reports should appear alongside performance metrics in model cards, leaderboards, and regulatory filings (a model-card sketch follows this list).
- Accountability and revocation mechanisms create concrete recourse when issues appear after deployment.
- Independent checkability enables third-party oversight without full disclosure of internal mechanisms.
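As a concrete rendering of the model-card point above, here is a sketch in which the Verification Coverage report sits beside capability scores as a peer entry; every field name and number is invented for illustration.

```python
import json

# Hypothetical model-card fragment: capability scores and the
# Verification Coverage report are peer entries, per the paper's
# proposal. All names and numbers are illustrative.
model_card = {
    "model": "example-clinical-triage-v2",
    "intended_use": "sepsis risk flagging in adult inpatient wards",  # domain-scoped
    "capability": {"auroc": 0.87, "sensitivity_at_95_specificity": 0.62},
    "verification_coverage": {
        "domain_scoped": True,
        "independently_checkable": True,   # audit protocol published
        "monitored_after_release": True,   # prospective surveillance registered
        "accountable": True,               # named deploying institution
        "contestable": True,               # clinician override and patient appeal
        "revocable": True,                 # kill switch and sunset clause
        "minimum_composition_met": True,
    },
}

print(json.dumps(model_card, indent=2))
```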
Where Pith is reading between the lines
- Regulatory agencies could develop domain-specific verification protocols that speed beneficial deployments while retaining revocation as a safety valve.
- This framework might reduce pressure on interpretability research to serve as a deployment gate, freeing it for other uses.
- Testing could compare real-world outcomes in jurisdictions adopting verification standards versus those insisting on mechanistic explanations (a minimal statistical sketch follows this list).
- The approach implies that model cards should prioritize use-case verification details over general internal explanations.
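One minimal way the jurisdiction comparison above could be operationalized, sketched here with entirely invented counts: a two-proportion z-test on post-deployment harm rates. A real study would need to address selection into jurisdictions and differences in deployment mix.

```python
from math import sqrt

# Hypothetical comparison of post-deployment harm rates between a
# jurisdiction using calibrated verification and one requiring
# mechanistic explanation. All counts are invented for illustration.
harms_a, deployments_a = 12, 400   # verification-standard jurisdiction
harms_b, deployments_b = 15, 380   # interpretability-gate jurisdiction

p_a = harms_a / deployments_a
p_b = harms_b / deployments_b
p_pool = (harms_a + harms_b) / (deployments_a + deployments_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / deployments_a + 1 / deployments_b))
z = (p_a - p_b) / se

print(f"harm rate A={p_a:.3f}, B={p_b:.3f}, z={z:.2f}")
# |z| < 1.96 for these made-up data, so they would not distinguish
# the regimes at the 5% level.
```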
Load-bearing premise
That historical methods for governing opaque human expertise through credentials, monitoring, liability, appeal, and revocation can be translated effectively to AI systems without needing mechanistic understanding.
What would settle it
A controlled comparison showing either that deployments authorized only via calibrated verification produce unmanageable harms that mechanistic understanding would have prevented, or that requiring interpretability blocks safe uses that verification would have allowed.
original abstract
AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that AI deployment in high-stakes domains should not require mechanistic interpretability as a precondition for authorization. Instead, it advocates shifting to a 'calibrated verification' regime in which authorizations are domain-scoped, independently checkable, post-release monitored, accountable, contestable, and revocable. The argument rests on two points: model capabilities are uneven across tasks, and societies have historically managed opaque human expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Supporting evidence includes a 53-percentage-point gap between internal representations and output correction, and a scoping review finding that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. The paper introduces 'Verification Coverage' as a six-component reportable standard to be placed alongside capability scores in model cards and regulatory disclosures.
Significance. If the normative position holds, the work could reorient AI governance discussions away from an over-reliance on interpretability toward practical, enforceable verification frameworks. This distinction between mechanistic understanding and deployment authority is timely and could inform regulatory design in healthcare, finance, and justice domains. The introduction of Verification Coverage offers a concrete, if high-level, metric for documentation and disclosure.
major comments (2)
- [Proposal of Verification Coverage] Abstract and closing section: The six-component standard, with its minimum-composition rule, is presented as suitable for model cards and regulatory disclosures, yet the manuscript provides neither operational definitions for the components nor methods for measuring or auditing them. This detail is load-bearing for the central claim that calibrated verification can function as a practical alternative to interpretability requirements (a hypothetical disclosure-template sketch follows this list).
- [Historical governance analogy] Section on historical governance of opaque expertise: The analogy to credentialing, monitoring, liability, appeal, and revocation for human experts is used to justify translation to AI systems, but the text does not address domain-specific differences such as the velocity of AI updates, the distributed nature of model training, or enforcement mechanisms at scale. This assumption underpins the twofold rationale and requires explicit justification or counter-example discussion to support the recommendation.
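To illustrate the gap the first comment points at, here is one hypothetical shape a per-component disclosure entry could take; none of these field names or values come from the paper.

```python
# Hypothetical disclosure-template entry for one Verification Coverage
# component. The manuscript promises operational definitions in revision,
# so every field name and value below is an assumption.
template_entry = {
    "component": "monitored_after_release",
    "operational_definition": (
        "A prospective post-market surveillance study is registered "
        "before deployment and reports on a fixed cadence."
    ),
    "measurement": "binary: registered and reporting on schedule",
    "evidence": ["surveillance_protocol.pdf", "registry_id HYPO-0042"],
    "auditor": "independent third party, identity disclosed",
    "last_audit": "2026-04-30",
}
```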
minor comments (3)
- [Abstract] The 53-percentage-point gap statistic is stated without a citation or a one-sentence description of the underlying study, reducing traceability for readers.
- [Title and abstract] The term 'Open-Box Fallacy' appears in the title but receives no explicit definition or scoping in the abstract; introduce it early, with a clear contrast to the proposed verification regime.
- [References] The manuscript would benefit from additional citations to existing AI governance frameworks (e.g., the NIST AI RMF or the EU AI Act provisions on post-market monitoring) to situate the six-component proposal relative to current standards.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The comments highlight important areas where the proposal can be strengthened, and we address each major comment below, indicating the revisions we will incorporate.
point-by-point responses
- Referee: [Proposal of Verification Coverage] Abstract and closing section: The six-component standard, with its minimum-composition rule, is presented as suitable for model cards and regulatory disclosures, yet the manuscript provides neither operational definitions for the components nor methods for measuring or auditing them. This detail is load-bearing for the central claim that calibrated verification can function as a practical alternative to interpretability requirements.
  Authors: We agree that the manuscript presents Verification Coverage at a high conceptual level without operational definitions or auditing methods. In the revised manuscript we will expand the relevant section to supply preliminary operational definitions for each of the six components and describe high-level approaches to measurement and auditing (for example, through standardized disclosure templates and independent third-party checks). We will also note that full standardization remains a matter for subsequent regulatory and technical work. revision: yes
- Referee: [Historical governance analogy] Section on historical governance of opaque expertise: The analogy to credentialing, monitoring, liability, appeal, and revocation for human experts is used to justify translation to AI systems, but the text does not address domain-specific differences such as the velocity of AI updates, the distributed nature of model training, or enforcement mechanisms at scale. This assumption underpins the twofold rationale and requires explicit justification or counter-example discussion to support the recommendation.
  Authors: The referee correctly identifies that the historical analogy would benefit from explicit treatment of AI-specific characteristics. We will revise the section to discuss differences in update velocity, distributed training, and enforcement scale, drawing on existing practices such as post-market surveillance in medical-device regulation (a minimal monitoring sketch follows these responses). We will also include brief counter-examples where the analogy is strained, thereby providing the requested justification while preserving the core argument. revision: yes
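As referenced above, a minimal sketch of what statistically grounded post-release monitoring could look like, in the spirit of medical-device post-market surveillance. The tolerance p0, the false-alarm rate alpha, and the counts are all invented, and a real monitor would need sequential corrections for repeated looks at the data.

```python
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical post-release monitor: the deployment was authorized
# against a pre-registered error tolerance p0; each reporting period
# we ask whether the observed error count is implausible under p0.
p0 = 0.02       # tolerated error rate (assumed, fixed at authorization)
alpha = 0.01    # false-alarm rate per reporting period (assumed)

n_cases, n_errors = 1_000, 34
if binom_tail(n_cases, n_errors, p0) < alpha:
    print("error rate exceeds tolerance -> trigger contestable "
          "review and possible revocation")
```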
Circularity Check
No significant circularity
full rationale
The paper advances a normative policy argument that AI deployment authorization should shift from mechanistic interpretability to calibrated verification (domain-scoped, independently checkable, post-release monitored, accountable, contestable, revocable). Its load-bearing steps rest on two external foundations: (1) the observed unevenness of model capability across tasks and (2) historical precedents for governing opaque human expertise via credentials, monitoring, liability, appeal, and revocation. These are supported by cited external statistics (53-point internal-to-output gap; 9% FDA post-market surveillance rate) and a scoping review, none of which are derived from the paper's own definitions or fitted parameters. No equations, self-referential predictions, uniqueness theorems, or ansatzes appear; the central claim does not reduce to its inputs by construction and remains independent of any self-citation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- [domain assumption] Model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general.
- [domain assumption] Societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation.
invented entities (1)
- Verification Coverage: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction (connection unclear). Matched claim: "We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures."
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (connection unclear). Matched claim: "The open-box fallacy is the stronger inference that mechanistic evidence should be the decisive deployment condition."
Reference graph
Works this paper leans on
- [1] Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, John Morgan, and Rajaie Batniji. Interpretability without actionability: Mechanistic methods cannot correct language model errors despite near-perfect internal representations. arXiv preprint arXiv:2603.18353, 2026.
- [2] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025. Anthropic Alignment Science Team.
- [3] Consumer Financial Protection Bureau. Consumer financial protection circular 2022-03: Adverse action notification requirements in connection with credit decisions based on complex algorithms. CFPB Circular 2022-03, 87 Fed. Reg. 35864, 2022. https://files.consumerfinance.gov/f/documents/cfpb_2022-03_circular_2022-05.pdf
- [4] Consumer Financial Protection Bureau. Interpretive rules, policy statements, and advisory opinions; withdrawal. 90 Federal Register 20084, 2025. https://www.federalregister.gov/documents/2025/05/12/2025-08286/interpretive-rules-policy-statements-and-advisory-opinions-withdrawal
- [5] Samantha Cruz Rivera, Xiaoxuan Liu, An-Wen Chan, Alastair K. Denniston, Melanie J. Calvert, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Nature Medicine, 26:1351–1363, 2020. doi: 10.1038/s41591-020-1037-7.
- [6] Jeffrey De Fauw, Joseph R. Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine, 24(9):1342–1350, 2018. doi: 10.1038/s41591-018-0107-6. https://www.nature.com/articles/s41591-018-0107-6
- [8] Fabrizio Dell'Acqua, Edward McFowland III, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Organization Science, 37(2):403–423, 2026.
- [9] Pavel Dolin, Weizhi Li, Gautam Dasarathy, and Visar Berisha. Statistically valid post-deployment monitoring should be standard for AI-based digital health. In Advances in Neural Information Processing Systems (NeurIPS 2025 Position Paper Track), 2025.
- [10] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017. https://arxiv.org/abs/1702.08608
- [11] Julia Dressel and Hany Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1):eaao5580, 2018. doi: 10.1126/sciadv.aao5580.
- [12] Joshua Engels, David D. Baek, Subhash Kantamneni, and Max Tegmark. Scaling laws for scalable oversight. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025 spotlight.
- [13] European Parliament and Council of the European Union. Regulation (EU) 2024/1689, Annex III: High-risk AI systems. https://eur-lex.europa.eu/eli/reg/2024/1689/oj, 2024.
- [14] European Union. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence. Official Journal of the European Union, 2024. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
- [15] Tairan Fu, Raquel Ferrando, Javier Conde, Carlos Arriaga, and Pedro Reviriego. Why do large language models (LLMs) struggle to count letters? arXiv preprint arXiv:2412.18626, 2024.
- [16] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE, 2018. doi: 10.1109/DSAA.2018.00018. https://arxiv.org/abs/1806.00069
- [17] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.
- [18] Eric Jonas and Konrad P. Kording. Could a neuroscientist understand a microprocessor? PLOS Computational Biology, 13(1):e1005268, 2017. doi: 10.1371/journal.pcbi.1005268.
- [19] Hendrik Kempt, Jan-Christoph Heilinger, and Saskia K. Nagel. Relative explainability and double standards in medical decision-making: Should medical AI be subjected to higher standards in medical decision-making than doctors? Ethics and Information Technology, 24(2):20, 2022. doi: 10.1007/s10676-022-09646-x.
- [20] Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, et al. On the biology of a large language model. Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
- [21] Xiaoxuan Liu, Samantha Cruz Rivera, David Moher, Melanie J. Calvert, Alastair K. Denniston, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nature Medicine, 26:1364–1374, 2020. doi: 10.1038/s41591-020-1034-x.
- [22] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229. ACM, 2019. doi: 10.1145/3287560.3287596.
- [23] Vijaytha Muralidharan, Boluwatife Adeleye Adewale, Caroline J. Huang, Mfon Thelma Nta, Peter Oluwaduyilemi Ademiju, Pirunthan Pathmarajah, Man Kien Hang, Oluwafolajimi Adesanya, et al. A scoping review of reporting gaps in FDA-approved AI medical devices. npj Digital Medicine, 7(1):273, 2024. doi: 10.1038/s41746-024-01270-x.
- [24] National Highway Traffic Safety Administration. ADS-equipped vehicle safety, transparency, and evaluation program (AV STEP). 90 Federal Register 4130, 2025. https://www.federalregister.gov/documents/2025/01/15/2024-30854/ads-equipped-vehicle-safety-transparency-and-evaluation-program
- [25] National Highway Traffic Safety Administration. Third amended standing general order 2021-01: Incident reporting for automated driving systems and Level 2 advanced driver assistance systems. NHTSA Standing General Order on Crash Reporting, 2025. Effective June 16, 2025. https://www.nhtsa.gov/laws-regulations/standing-general-order-crash-reporting
- [26] Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated. arXiv preprint arXiv:2505.23836, 2025. Gemini 2.5 Pro reaches AUC 0.83 on evaluation-awareness classification; human baseline 0.92.
- [27] New York City Department of Consumer and Worker Protection. Automated employment decision tools. https://www.nyc.gov/site/dca/about/automated-employment-decision-tools.page, 2023.
- [28] OpenAI. Sycophancy in GPT-4o: What happened and what we're doing about it. OpenAI blog, 29 April 2025. https://openai.com/index/sycophancy-in-gpt-4o/
- [29] OpenAI. Expanding on what we missed with sycophancy. OpenAI blog, 2025. https://openai.com/index/expanding-on-sycophancy/
- [30] Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 33–44. ACM, 2020.
- [31] Cynthia Rudin, Caroline Wang, and Beau Coker. The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2(1), 2020. doi: 10.1162/99608f92.6ed64b30.
- [32] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023.
- [33] Elham Tabassi. AI risk management framework (AI RMF 1.0). Technical Report NIST AI 100-1, National Institute of Standards and Technology, 2023.
- [34] U.S. Food and Drug Administration. Artificial intelligence-enabled medical devices. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-enabled-medical-devices, 2026. Accessed May 7, 2026.
- [35] U.S. Food and Drug Administration. Clinical decision support software: Guidance for industry and Food and Drug Administration staff. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-decision-support-software, 2026.
- [36] Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, and Francis Rhys Ward. AI sandbagging: Language models can strategically underperform on evaluations. arXiv preprint arXiv:2406.07358, 2024.
- [37] Baptiste Vasey, Myura Nagendran, Bruce Campbell, David A. Clifton, Gary S. Collins, Spiros Denaxas, Alastair K. Denniston, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nature Medicine, 28:924–933, 2022. doi: 10.1038/s41591-022-01772-9.
- [38] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2):841–887, 2018.
- [39] Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, and Shi Feng. Language models learn to mislead humans via RLHF. arXiv preprint arXiv:2409.12822, 2024.
- [40] Wisconsin Supreme Court. State v. Loomis, 2016 WI 68, 371 Wis. 2d 235, 881 N.W.2d 749 (Wis. 2016). https://www.wicourts.gov/sc/opinion/DisplayDocument.pdf?content=pdf&seqNo=171690
- [41] Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. arXiv preprint arXiv:2501.17148, 2025. ICML 2025 spotlight.
- [42] Nan Xu and Xuezhe Ma. LLM the genius paradox: A linguistic and math expert's struggle with simple word-based counting problems. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3344–3370, 2025.
- [43] John Zerilli, Alistair Knott, James Maclaurin, and Colin Gavaghan. Transparency in algorithmic and human decision-making: Is there a double standard? Philosophy & Technology, 32(4):661–683, 2019. doi: 10.1007/s13347-018-0330-6.
discussion (0)