Pith · machine review for the scientific record

arxiv: 2605.10601 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: 2 theorem links


The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

Ane Cathrine Holst Merrild, Phongsakon Mark Konrad, Rebecca De Rosa, Riccardo Terrenzi, Serkan Ayvaz, Tim Lukas Adam, Toygar Tanyel

Pith reviewed 2026-05-12 03:33 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI deployment · calibrated verification · mechanistic interpretability · post-market surveillance · AI governance · regulatory policy · verification coverage

The pith

AI deployment in sensitive domains should use calibrated verification instead of mechanistic interpretability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that requiring full mechanistic understanding of AI models before deployment in areas like healthcare, credit, employment, and criminal justice is misplaced. Societies have long authorized opaque human expertise through domain-specific credentials, ongoing monitoring, liability, appeal processes, and revocation rather than mechanism-level insight. The authors propose calibrated verification, under which authorization is tied to specific uses, independently checkable, post-release monitored, accountable, contestable, and revocable. Evidence includes a 53-percentage-point gap between internal representations and output correction, plus a finding that only 9.0% of FDA-approved AI/ML device documents contain a prospective post-market surveillance study. They introduce Verification Coverage as a six-component reporting standard to accompany capability metrics in disclosures.

Core claim

Authorization for deploying AI in high-stakes domains should rest on calibrated verification—domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable—rather than on explanations of model internals, since capability is uneven across tasks and societies govern non-transparent expertise without requiring mechanistic transparency.

What carries the argument

Calibrated verification regime, with Verification Coverage as the six-component reportable standard and minimum-composition rule that accompanies capability scores.
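The paper names the six components (domain-scoped, independently checkable, post-release monitored, accountable, contestable, revocable) but does not operationalize them, so the encoding below is a hypothetical sketch: the boolean representation, the class name, and the all-components reading of the minimum-composition rule are illustrative assumptions, not the authors' specification.

```python
from dataclasses import dataclass, fields

@dataclass
class VerificationCoverage:
    """Hypothetical sketch of a Verification Coverage report.

    Each field is one of the six components the paper names; treating
    each as a simple present/absent flag is an assumption made here
    for illustration only.
    """
    domain_scoped: bool            # authorization names a specific use, not the model in general
    independently_checkable: bool  # third parties can verify without full internals
    post_release_monitored: bool   # surveillance continues after deployment
    accountable: bool              # responsibility attaches when issues appear
    contestable: bool              # affected parties can appeal decisions
    revocable: bool                # authorization can be withdrawn

    def meets_minimum_composition(self) -> bool:
        # One possible reading of the minimum-composition rule:
        # every component must be present for the report to pass.
        return all(getattr(self, f.name) for f in fields(self))

report = VerificationCoverage(True, True, True, True, True, False)
print(report.meets_minimum_composition())  # False: revocability is missing
```

Under this reading, a Verification Coverage record would sit beside capability scores in a model card, and a missing component (here, revocability) would flag the disclosure as incomplete rather than merely lowering a score.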

If this is right

  • Authorization must attach to specific uses rather than to models in general because capability varies across nearby tasks.
  • Post-release monitoring and surveillance become required elements of any deployment process.
  • Verification Coverage reports should appear alongside performance metrics in model cards, leaderboards, and regulatory filings.
  • Accountability and revocation mechanisms create concrete recourse when issues appear after deployment.
  • Independent checkability enables third-party oversight without full disclosure of internal mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regulatory agencies could develop domain-specific verification protocols that speed beneficial deployments while retaining revocation as a safety valve.
  • This framework might reduce pressure on interpretability research to serve as a deployment gate, freeing it for other uses.
  • Testing could compare real-world outcomes in jurisdictions adopting verification standards versus those insisting on mechanistic explanations.
  • The approach implies that model cards should prioritize use-case verification details over general internal explanations.

Load-bearing premise

That historical methods for governing opaque human expertise through credentials, monitoring, liability, appeal, and revocation can be translated effectively to AI systems without needing mechanistic understanding.

What would settle it

A controlled comparison showing that deployments authorized only via calibrated verification produce unmanageable harms preventable by mechanistic understanding, or that requiring interpretability blocks safe uses that verification would have allowed.

read the original abstract

AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that AI deployment in high-stakes domains should not require mechanistic interpretability as a precondition for authorization. Instead, it advocates shifting to a 'calibrated verification' regime in which authorizations are domain-scoped, independently checkable, post-release monitored, accountable, contestable, and revocable. The argument rests on two points: model capabilities are uneven across tasks, and societies have historically managed opaque human expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Supporting evidence includes a 53-percentage-point gap between internal representations and output correction plus a 9% rate of prospective post-market surveillance studies in FDA-approved AI/ML devices. The paper introduces 'Verification Coverage' as a six-component reportable standard to be placed alongside capability scores in model cards and regulatory disclosures.

Significance. If the normative position holds, the work could reorient AI governance discussions away from an over-reliance on interpretability toward practical, enforceable verification frameworks. This distinction between mechanistic understanding and deployment authority is timely and could inform regulatory design in healthcare, finance, and justice domains. The introduction of Verification Coverage offers a concrete, if high-level, metric for documentation and disclosure.

major comments (2)
  1. [Proposal of Verification Coverage] Proposal of Verification Coverage (abstract and closing section): The six-component standard is presented as a minimum-composition rule suitable for model cards and regulatory disclosures, yet the manuscript provides neither operational definitions for the components nor methods for measuring or auditing them. This detail is load-bearing for the central claim that calibrated verification can function as a practical alternative to interpretability requirements.
  2. [Historical governance analogy] Section on historical governance of opaque expertise: The analogy to credentialing, monitoring, liability, appeal, and revocation for human experts is used to justify translation to AI systems, but the text does not address domain-specific differences such as the velocity of AI updates, the distributed nature of model training, or enforcement mechanisms at scale. This assumption underpins the twofold rationale and requires explicit justification or counter-example discussion to support the recommendation.
minor comments (3)
  1. [Abstract] Abstract: The 53-percentage-point gap statistic is stated without a citation or one-sentence description of the underlying study, reducing traceability for readers.
  2. [Title and abstract] Terminology: 'Open-Box Fallacy' appears in the title but receives no explicit definition or scoping in the abstract; ensure the term is introduced with a clear contrast to the proposed verification regime early in the manuscript.
  3. [References] References: The manuscript would benefit from additional citations to existing AI governance frameworks (e.g., NIST AI RMF or EU AI Act provisions on post-market monitoring) to situate the six-component proposal relative to current standards.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important areas where the proposal can be strengthened, and we address each major comment below, indicating the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Proposal of Verification Coverage] Proposal of Verification Coverage (abstract and closing section): The six-component standard is presented as a minimum-composition rule suitable for model cards and regulatory disclosures, yet the manuscript provides neither operational definitions for the components nor methods for measuring or auditing them. This detail is load-bearing for the central claim that calibrated verification can function as a practical alternative to interpretability requirements.

    Authors: We agree that the manuscript presents Verification Coverage at a high conceptual level without operational definitions or auditing methods. In the revised manuscript we will expand the relevant section to supply preliminary operational definitions for each of the six components and describe high-level approaches to measurement and auditing (for example, through standardized disclosure templates and independent third-party checks). We will also note that full standardization remains a matter for subsequent regulatory and technical work. revision: yes

  2. Referee: [Historical governance analogy] Section on historical governance of opaque expertise: The analogy to credentialing, monitoring, liability, appeal, and revocation for human experts is used to justify translation to AI systems, but the text does not address domain-specific differences such as the velocity of AI updates, the distributed nature of model training, or enforcement mechanisms at scale. This assumption underpins the twofold rationale and requires explicit justification or counter-example discussion to support the recommendation.

    Authors: The referee correctly identifies that the historical analogy would benefit from explicit treatment of AI-specific characteristics. We will revise the section to discuss differences in update velocity, distributed training, and enforcement scale, drawing on existing practices such as post-market surveillance in medical-device regulation. We will also include brief counter-examples where the analogy is strained, thereby providing the requested justification while preserving the core argument. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper advances a normative policy argument that AI deployment authorization should shift from mechanistic interpretability to calibrated verification (domain-scoped, independently checkable, post-release monitored, accountable, contestable, revocable). Its load-bearing steps rest on two external foundations: (1) the observed unevenness of model capability across tasks and (2) historical precedents for governing opaque human expertise via credentials, monitoring, liability, appeal, and revocation. These are supported by cited external statistics (the 53-point internal-to-output gap; the 9.0% FDA post-market surveillance rate) and a scoping review, none of which are derived from the paper's own definitions or fitted parameters. No equations, self-referential predictions, uniqueness theorems, or ansatzes appear; the central claim does not reduce to its inputs by construction and remains independent of any self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on adapting non-AI governance models to AI deployment and introduces Verification Coverage as a new reporting standard without independent empirical validation in the abstract.

axioms (2)
  • domain assumption Model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general.
    Stated explicitly as the first reason for preferring calibrated verification.
  • domain assumption Societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation.
    Stated explicitly as the second reason supporting the proposed approach.
invented entities (1)
  • Verification Coverage no independent evidence
    purpose: A six-component reportable standard that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.
    Newly proposed metric defined in the abstract as the central practical output.

pith-pipeline@v0.9.0 · 5544 in / 1382 out tokens · 63963 ms · 2026-05-12T03:33:22.877508+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 4 internal anchors

  1. [1]

    Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations

    Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, John Morgan, and Rajaie Batniji. Interpretability without actionability: Mechanistic methods cannot correct language model errors despite near-perfect internal representations. arXiv preprint arXiv:2603.18353, 2026

  2. [2]

    Reasoning Models Don't Always Say What They Think

    Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think. arXiv preprint arXiv:2505.05410, 2025. Anthropic Alignment Science Team

  3. [3]

    Consumer financial protection circular 2022-03: Adverse action notification requirements in connection with credit decisions based on complex algorithms

    Consumer Financial Protection Bureau. Consumer financial protection circular 2022-03: Adverse action notification requirements in connection with credit decisions based on complex algorithms. CFPB Circular 2022-03, 87 Fed. Reg. 35864, https://files.consumerfinance.gov/f/documents/cfpb_2022-03_circular_2022-05.pdf, 2022

  4. [4]

    Interpretive rules, policy statements, and advisory opinions; withdrawal

    Consumer Financial Protection Bureau. Interpretive rules, policy statements, and advisory opinions; withdrawal. 90 Federal Register 20084, 2025. https://www.federalregister.gov/documents/2025/05/12/2025-08286/interpretive-rules-policy-statements-and-advisory-opinions-withdrawal

  5. [5]

    Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension

    Samantha Cruz Rivera, Xiaoxuan Liu, An-Wen Chan, Alastair K. Denniston, Melanie J. Calvert, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Nature Medicine, 26:1351–1363, 2020. doi: 10.1038/s41591-020-1037-7

  6. [6]

    Clinically applicable deep learning for diagnosis and referral in retinal disease

    Jeffrey De Fauw, Joseph R. Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine, 24(9):1342–1350, 2018. doi: 10.1038/s41591-018-0107-6. URL https://www.nature.com/articles/s41591-018-0107-6

  8. [8]

    Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality

    Fabrizio Dell’Acqua, Edward McFowland III, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Organization Science, 37(2):403–423, 2026. do...

  9. [9]

    Statistically valid post-deployment monitoring should be standard for AI-based digital health

    Pavel Dolin, Weizhi Li, Gautam Dasarathy, and Visar Berisha. Statistically valid post-deployment monitoring should be standard for AI-based digital health. In Advances in Neural Information Processing Systems (NeurIPS 2025 Position Paper Track), 2025. Argues for statistically valid post-deployment monitoring for AI-based digital health. OpenReview: https://...

  10. [10]

    Towards A Rigorous Science of Interpretable Machine Learning

    Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017. URL https://arxiv.org/abs/1702.08608

  11. [11]

    The accuracy, fairness, and limits of predicting recidivism

    Julia Dressel and Hany Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1):eaao5580, 2018. doi: 10.1126/sciadv.aao5580

  12. [12]

    Scaling laws for scalable oversight

    Joshua Engels, David D. Baek, Subhash Kantamneni, and Max Tegmark. Scaling laws for scalable oversight. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025 spotlight

  13. [13]

    Regulation (EU) 2024/1689, Annex III: High-risk AI systems

    European Parliament and Council of the European Union. Regulation (EU) 2024/1689, Annex III: High-risk AI systems. https://eur-lex.europa.eu/eli/reg/2024/1689/oj, 2024

  14. [14]

    Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence

    European Union. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence. Official Journal of the European Union, 2024. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  15. [15]

    Why do large language models (LLMs) struggle to count letters?

    Tairan Fu, Raquel Ferrando, Javier Conde, Carlos Arriaga, and Pedro Reviriego. Why do large language models (LLMs) struggle to count letters? arXiv preprint arXiv:2412.18626, 2024

  16. [16]

    Explaining explanations: An overview of interpretability of machine learning

    Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE, 2018. doi: 10.1109/DSAA.2018.00018. URL https://arxiv.org/abs/1806.00069

  17. [17]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024

  18. [18]

    Could a neuroscientist understand a microprocessor?

    Eric Jonas and Konrad P. Kording. Could a neuroscientist understand a microprocessor? PLOS Computational Biology, 13(1):e1005268, 2017. doi: 10.1371/journal.pcbi.1005268

  19. [19]

    Relative explainability and double standards in medical decision-making

    Hendrik Kempt, Jan-Christoph Heilinger, and Saskia K. Nagel. Relative explainability and double standards in medical decision-making: Should medical AI be subjected to higher standards in medical decision-making than doctors? Ethics and Information Technology, 24(2):20, 2022. doi: 10.1007/s10676-022-09646-x

  20. [20]

    On the biology of a large language model

    Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, et al. On the biology of a large language model. Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html

  21. [21]

    Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension

    Xiaoxuan Liu, Samantha Cruz Rivera, David Moher, Melanie J. Calvert, Alastair K. Denniston, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nature Medicine, 26:1364–1374, 2020. doi: 10.1038/s41591-020-1034-x

  22. [22]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229. ACM, 2019. doi: 10.1145/3287560.3287596

  23. [23]

    A scoping review of reporting gaps in FDA-approved AI medical devices

    Vijaytha Muralidharan, Boluwatife Adeleye Adewale, Caroline J. Huang, Mfon Thelma Nta, Peter Oluwaduyilemi Ademiju, Pirunthan Pathmarajah, Man Kien Hang, Oluwafolajimi Adesanya, et al. A scoping review of reporting gaps in FDA-approved AI medical devices. npj Digital Medicine, 7(1):273, 2024. doi: 10.1038/s41746-024-01270-x

  24. [24]

    ADS-equipped vehicle safety, transparency, and evaluation program (AV STEP)

    National Highway Traffic Safety Administration. ADS-equipped vehicle safety, transparency, and evaluation program (AV STEP). 90 Federal Register 4130, https://www.federalregister.gov/documents/2025/01/15/2024-30854/ads-equipped-vehicle-safety-transparency-and-evaluation-program, 2025

  25. [25]

    Third amended standing general order 2021-01: Incident reporting for automated driving systems and level 2 advanced driver assistance systems

    National Highway Traffic Safety Administration. Third amended standing general order 2021-01: Incident reporting for automated driving systems and level 2 advanced driver assistance systems. NHTSA Standing General Order on Crash Reporting, 2025. Effective June 16, 2025. https://www.nhtsa.gov/laws-regulations/standing-general-order-crash-reporting

  26. [26]

    Large language models often know when they are being evaluated

    Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated. arXiv preprint arXiv:2505.23836, 2025. Gemini 2.5 Pro reaches AUC 0.83 on evaluation-awareness classification; human baseline 0.92

  27. [27]

    Automated employment decision tools

    New York City Department of Consumer and Worker Protection. Automated employment decision tools. https://www.nyc.gov/site/dca/about/automated-employment-decision-tools.page, 2023

  28. [28]

    Sycophancy in GPT-4o: What happened and what we’re doing about it

    OpenAI. Sycophancy in GPT-4o: What happened and what we’re doing about it. OpenAI blog, 29 April 2025. https://openai.com/index/sycophancy-in-gpt-4o/

  29. [29]

    Expanding on what we missed with sycophancy

    OpenAI. Expanding on what we missed with sycophancy. OpenAI blog, 2025. https://openai.com/index/expanding-on-sycophancy/

  30. [30]

    Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing

    Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 33–44. ...

  31. [31]

    The age of secrecy and unfairness in recidivism prediction

    Cynthia Rudin, Caroline Wang, and Beau Coker. The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2(1), 2020. doi: 10.1162/99608f92.6ed64b30

  32. [32]

    Towards Understanding Sycophancy in Language Models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. arXiv prep...

  33. [33]

    AI risk management framework (AI RMF 1.0)

    Elham Tabassi. AI risk management framework (AI RMF 1.0). Technical Report NIST AI 100-1, National Institute of Standards and Technology, 2023

  34. [34]

    Artificial intelligence-enabled medical devices

    U.S. Food and Drug Administration. Artificial intelligence-enabled medical devices. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-enabled-medical-devices, 2026. Accessed May 7, 2026

  35. [35]

    Clinical decision support software: Guidance for industry and Food and Drug Administration staff

    U.S. Food and Drug Administration. Clinical decision support software: Guidance for industry and Food and Drug Administration staff. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-decision-support-software, 2026

  36. [36]

    AI sandbagging: Language models can strategically underperform on evaluations

    Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, and Francis Rhys Ward. AI sandbagging: Language models can strategically underperform on evaluations. arXiv preprint arXiv:2406.07358, 2024

  37. [37]

    Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI

    Baptiste Vasey, Myura Nagendran, Bruce Campbell, David A. Clifton, Gary S. Collins, Spiros Denaxas, Alastair K. Denniston, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nature Medicine, 28:924–933, 2022. doi: 10.1038/s41591-022-01772-9

  38. [38]

    Counterfactual explanations without opening the black box: Automated decisions and the GDPR

    Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2):841–887, 2018

  39. [39]

    Language models learn to mislead humans via RLHF

    Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, and Shi Feng. Language models learn to mislead humans via RLHF. arXiv preprint arXiv:2409.12822, 2024

  40. [40]

    State v. Loomis

    Wisconsin Supreme Court. State v. Loomis, 2016 WI 68, 371 Wis. 2d 235, 881 N.W.2d 749 (Wis. 2016). https://www.wicourts.gov/sc/opinion/DisplayDocument.pdf?content=pdf&seqNo=171690, 2016

  41. [41]

    AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders

    Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. arXiv preprint arXiv:2501.17148, 2025. ICML 2025 spotlight

  42. [42]

    LLM the genius paradox: A linguistic and math expert’s struggle with simple word-based counting problems

    Nan Xu and Xuezhe Ma. LLM the genius paradox: A linguistic and math expert’s struggle with simple word-based counting problems. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3344–3370, 2025. doi: 10.18653/v1/2025.naac...

  43. [43]

    Transparency in algorithmic and human decision-making: Is there a double standard?

    John Zerilli, Alistair Knott, James Maclaurin, and Colin Gavaghan. Transparency in algorithmic and human decision-making: Is there a double standard? Philosophy & Technology, 32(4):661–683, 2019. doi: 10.1007/s13347-018-0330-6