Pith · machine review for the scientific record

arxiv: 2605.10601 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: 2 theorem links


The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

Ane Cathrine Holst Merrild, Phongsakon Mark Konrad, Rebecca De Rosa, Riccardo Terrenzi, Serkan Ayvaz, Tim Lukas Adam, Toygar Tanyel

Pith reviewed 2026-05-12 03:33 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI deployment · calibrated verification · mechanistic interpretability · post-market surveillance · AI governance · regulatory policy · verification coverage

The pith

AI deployment in sensitive domains should use calibrated verification instead of mechanistic interpretability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that requiring full mechanistic understanding of AI models before deployment in areas like healthcare, credit, employment, and criminal justice is misplaced. Societies have long authorized opaque human expertise through domain-specific credentials, ongoing monitoring, liability, appeal processes, and revocation rather than mechanism-level insight. The authors propose calibrated verification, under which authorization is tied to specific uses, independently checkable, post-release monitored, accountable, contestable, and revocable. Evidence includes a 53-percentage-point gap between internal representations and output correction, plus a finding that only 9.0% of FDA-approved AI/ML device documents contain a prospective post-market surveillance study. They introduce Verification Coverage as a six-component reporting standard to accompany capability metrics in disclosures.

Core claim

Authorization for deploying AI in high-stakes domains should rest on calibrated verification—domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable—rather than on explanations of model internals, since capability is uneven across tasks and societies govern non-transparent expertise without requiring mechanistic transparency.

What carries the argument

Calibrated verification regime, with Verification Coverage as the six-component reportable standard and minimum-composition rule that accompanies capability scores.
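The paper names the six components (domain-scoped, independently checkable, post-release monitored, accountable, contestable, revocable) but does not operationalize them, so the encoding below is a hypothetical sketch: the boolean representation, the class name, and the all-components reading of the minimum-composition rule are illustrative assumptions, not the authors' specification.

```python
from dataclasses import dataclass, fields

@dataclass
class VerificationCoverage:
    """Hypothetical sketch of a Verification Coverage report.

    Each field is one of the six components the paper names; treating
    each as a simple present/absent flag is an assumption made here
    for illustration only.
    """
    domain_scoped: bool            # authorization names a specific use, not the model in general
    independently_checkable: bool  # third parties can verify without full internals
    post_release_monitored: bool   # surveillance continues after deployment
    accountable: bool              # responsibility attaches when issues appear
    contestable: bool              # affected parties can appeal decisions
    revocable: bool                # authorization can be withdrawn

    def meets_minimum_composition(self) -> bool:
        # One possible reading of the minimum-composition rule:
        # every component must be present for the report to pass.
        return all(getattr(self, f.name) for f in fields(self))

report = VerificationCoverage(True, True, True, True, True, False)
print(report.meets_minimum_composition())  # False: revocability is missing
```

Under this reading, a Verification Coverage record would sit beside capability scores in a model card, and a missing component (here, revocability) would flag the disclosure as incomplete rather than merely lowering a score.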

If this is right

  • Authorization must attach to specific uses rather than to models in general because capability varies across nearby tasks.
  • Post-release monitoring and surveillance become required elements of any deployment process.
  • Verification Coverage reports should appear alongside performance metrics in model cards, leaderboards, and regulatory filings.
  • Accountability and revocation mechanisms create concrete recourse when issues appear after deployment.
  • Independent checkability enables third-party oversight without full disclosure of internal mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regulatory agencies could develop domain-specific verification protocols that speed beneficial deployments while retaining revocation as a safety valve.
  • This framework might reduce pressure on interpretability research to serve as a deployment gate, freeing it for other uses.
  • Testing could compare real-world outcomes in jurisdictions adopting verification standards versus those insisting on mechanistic explanations.
  • The approach implies that model cards should prioritize use-case verification details over general internal explanations.

Load-bearing premise

That historical methods for governing opaque human expertise through credentials, monitoring, liability, appeal, and revocation can be translated effectively to AI systems without needing mechanistic understanding.

What would settle it

A controlled comparison showing that deployments authorized only via calibrated verification produce unmanageable harms preventable by mechanistic understanding, or that requiring interpretability blocks safe uses that verification would have allowed.

read the original abstract

AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that AI deployment in high-stakes domains should not require mechanistic interpretability as a precondition for authorization. Instead, it advocates shifting to a 'calibrated verification' regime in which authorizations are domain-scoped, independently checkable, post-release monitored, accountable, contestable, and revocable. The argument rests on two points: model capabilities are uneven across tasks, and societies have historically managed opaque human expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Supporting evidence includes a 53-percentage-point gap between internal representations and output correction plus a 9% rate of prospective post-market surveillance studies in FDA-approved AI/ML devices. The paper introduces 'Verification Coverage' as a six-component reportable standard to be placed alongside capability scores in model cards and regulatory disclosures.

Significance. If the normative position holds, the work could reorient AI governance discussions away from an over-reliance on interpretability toward practical, enforceable verification frameworks. This distinction between mechanistic understanding and deployment authority is timely and could inform regulatory design in healthcare, finance, and justice domains. The introduction of Verification Coverage offers a concrete, if high-level, metric for documentation and disclosure.

major comments (2)
  1. [Proposal of Verification Coverage] Proposal of Verification Coverage (abstract and closing section): The six-component standard is presented as a minimum-composition rule suitable for model cards and regulatory disclosures, yet the manuscript provides neither operational definitions for the components nor methods for measuring or auditing them. This detail is load-bearing for the central claim that calibrated verification can function as a practical alternative to interpretability requirements.
  2. [Historical governance analogy] Section on historical governance of opaque expertise: The analogy to credentialing, monitoring, liability, appeal, and revocation for human experts is used to justify translation to AI systems, but the text does not address domain-specific differences such as the velocity of AI updates, the distributed nature of model training, or enforcement mechanisms at scale. This assumption underpins the twofold rationale and requires explicit justification or counter-example discussion to support the recommendation.
minor comments (3)
  1. [Abstract] Abstract: The 53-percentage-point gap statistic is stated without a citation or one-sentence description of the underlying study, reducing traceability for readers.
  2. [Title and abstract] Terminology: 'Open-Box Fallacy' appears in the title but receives no explicit definition or scoping in the abstract; ensure the term is introduced with a clear contrast to the proposed verification regime early in the manuscript.
  3. [References] References: The manuscript would benefit from additional citations to existing AI governance frameworks (e.g., NIST AI RMF or EU AI Act provisions on post-market monitoring) to situate the six-component proposal relative to current standards.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important areas where the proposal can be strengthened, and we address each major comment below, indicating the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Proposal of Verification Coverage] Proposal of Verification Coverage (abstract and closing section): The six-component standard is presented as a minimum-composition rule suitable for model cards and regulatory disclosures, yet the manuscript provides neither operational definitions for the components nor methods for measuring or auditing them. This detail is load-bearing for the central claim that calibrated verification can function as a practical alternative to interpretability requirements.

    Authors: We agree that the manuscript presents Verification Coverage at a high conceptual level without operational definitions or auditing methods. In the revised manuscript we will expand the relevant section to supply preliminary operational definitions for each of the six components and describe high-level approaches to measurement and auditing (for example, through standardized disclosure templates and independent third-party checks). We will also note that full standardization remains a matter for subsequent regulatory and technical work. revision: yes

  2. Referee: [Historical governance analogy] Section on historical governance of opaque expertise: The analogy to credentialing, monitoring, liability, appeal, and revocation for human experts is used to justify translation to AI systems, but the text does not address domain-specific differences such as the velocity of AI updates, the distributed nature of model training, or enforcement mechanisms at scale. This assumption underpins the twofold rationale and requires explicit justification or counter-example discussion to support the recommendation.

    Authors: The referee correctly identifies that the historical analogy would benefit from explicit treatment of AI-specific characteristics. We will revise the section to discuss differences in update velocity, distributed training, and enforcement scale, drawing on existing practices such as post-market surveillance in medical-device regulation. We will also include brief counter-examples where the analogy is strained, thereby providing the requested justification while preserving the core argument. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper advances a normative policy argument that AI deployment authorization should shift from mechanistic interpretability to calibrated verification (domain-scoped, independently checkable, post-release monitored, accountable, contestable, revocable). Its load-bearing steps rest on two external foundations: (1) the observed unevenness of model capability across tasks and (2) historical precedents for governing opaque human expertise via credentials, monitoring, liability, appeal, and revocation. These are supported by cited external statistics (the 53-point internal-to-output gap; the 9.0% FDA post-market surveillance rate) and a scoping review, none of which are derived from the paper's own definitions or fitted parameters. No equations, self-referential predictions, uniqueness theorems, or ansatzes appear; the central claim does not reduce to its inputs by construction and remains independent of any self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on adapting non-AI governance models to AI deployment and introduces Verification Coverage as a new reporting standard without independent empirical validation in the abstract.

axioms (2)
  • domain assumption Model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general.
    Stated explicitly as the first reason for preferring calibrated verification.
  • domain assumption Societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation.
    Stated explicitly as the second reason supporting the proposed approach.
invented entities (1)
  • Verification Coverage no independent evidence
    purpose: A six-component reportable standard that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.
    Newly proposed metric defined in the abstract as the central practical output.

pith-pipeline@v0.9.0 · 5544 in / 1382 out tokens · 63963 ms · 2026-05-12T03:33:22.877508+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 4 internal anchors

  1. [1]

    Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations

    Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, John Morgan, and Rajaie Batniji. Interpretability without actionability: Mechanistic methods cannot correct language model errors despite near-perfect internal representations. arXiv preprint arXiv:2603.18353, 2026

  2. [2]

    Reasoning Models Don't Always Say What They Think

    Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think. arXiv preprint arXiv:2505.05410, 2025. Anthropic Alignment Science Team

  3. [3]

    Consumer financial protection circular 2022-03: Adverse action notification requirements in connection with credit decisions based on complex algorithms

    Consumer Financial Protection Bureau. Consumer financial protection circular 2022-03: Adverse action notification requirements in connection with credit decisions based on complex algorithms. CFPB Circular 2022-03, 87 Fed. Reg. 35864, https://files.consumerfinance.gov/f/documents/cfpb_2022-03_circular_2022-05.pdf, 2022

  4. [4]

    Interpretive rules, policy statements, and advisory opinions; withdrawal

    Consumer Financial Protection Bureau. Interpretive rules, policy statements, and advisory opinions; withdrawal. 90 Federal Register 20084, 2025. https://www.federalregister.gov/documents/2025/05/12/2025-08286/interpretive-rules-policy-statements-and-advisory-opinions-withdrawal

  5. [5]

    Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension

    Samantha Cruz Rivera, Xiaoxuan Liu, An-Wen Chan, Alastair K. Denniston, Melanie J. Calvert, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Nature Medicine, 26:1351–1363, 2020. doi: 10.1038/s41591-020-1037-7

  6. [6]

    Clinically applicable deep learning for diagnosis and referral in retinal disease

    Jeffrey De Fauw, Joseph R. Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine, 24(9):1342–1350, 2018. doi: 10.1038/s41591-018-0107-6. URL https://www.nature.com/articles/s41591-018-0107-6

  8. [8]

    Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality

    Fabrizio Dell’Acqua, Edward McFowland III, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Organization Science, 37(2):403–423, 2026. do...

  9. [9]

    Statistically valid post-deployment monitoring should be standard for AI-based digital health

    Pavel Dolin, Weizhi Li, Gautam Dasarathy, and Visar Berisha. Statistically valid post-deployment monitoring should be standard for AI-based digital health. In Advances in Neural Information Processing Systems (NeurIPS 2025 Position Paper Track), 2025. Argues for statistically valid post-deployment monitoring for AI-based digital health. OpenReview: https://...

  10. [10]

    Towards A Rigorous Science of Interpretable Machine Learning

    Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017. URL https://arxiv.org/abs/1702.08608

  11. [11]

    The accuracy, fairness, and limits of predicting recidivism

    Julia Dressel and Hany Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1):eaao5580, 2018. doi: 10.1126/sciadv.aao5580

  12. [12]

    Scaling laws for scalable oversight

    Joshua Engels, David D. Baek, Subhash Kantamneni, and Max Tegmark. Scaling laws for scalable oversight. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025 spotlight

  13. [13]

    Regulation (EU) 2024/1689, Annex III: High-risk AI systems

    European Parliament and Council of the European Union. Regulation (EU) 2024/1689, Annex III: High-risk AI systems. https://eur-lex.europa.eu/eli/reg/2024/1689/oj, 2024

  14. [14]

    Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence

    European Union. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence. Official Journal of the European Union, 2024. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  15. [15]

    Why do large language models (LLMs) struggle to count letters?

    Tairan Fu, Raquel Ferrando, Javier Conde, Carlos Arriaga, and Pedro Reviriego. Why do large language models (LLMs) struggle to count letters? arXiv preprint arXiv:2412.18626, 2024

  16. [16]

    Explaining explanations: An overview of interpretability of machine learning

    Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE, 2018. doi: 10.1109/DSAA.2018.00018. URL https://arxiv.org/abs/1806.00069

  17. [17]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024

  18. [18]

    Could a neuroscientist understand a microprocessor?

    Eric Jonas and Konrad P. Kording. Could a neuroscientist understand a microprocessor? PLOS Computational Biology, 13(1):e1005268, 2017. doi: 10.1371/journal.pcbi.1005268

  19. [19]

    Relative explainability and double standards in medical decision-making

    Hendrik Kempt, Jan-Christoph Heilinger, and Saskia K. Nagel. Relative explainability and double standards in medical decision-making: Should medical AI be subjected to higher standards in medical decision-making than doctors? Ethics and Information Technology, 24(2):20, 2022. doi: 10.1007/s10676-022-09646-x

  20. [20]

    On the biology of a large language model

    Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, et al. On the biology of a large language model. Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html

  21. [21]

    Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension

    Xiaoxuan Liu, Samantha Cruz Rivera, David Moher, Melanie J. Calvert, Alastair K. Denniston, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nature Medicine, 26:1364–1374, 2020. doi: 10.1038/s41591-020-1034-x

  22. [22]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229. ACM, 2019. doi: 10.1145/3287560.3287596

  23. [23]

    A scoping review of reporting gaps in FDA-approved AI medical devices

    Vijaytha Muralidharan, Boluwatife Adeleye Adewale, Caroline J. Huang, Mfon Thelma Nta, Peter Oluwaduyilemi Ademiju, Pirunthan Pathmarajah, Man Kien Hang, Oluwafolajimi Adesanya, et al. A scoping review of reporting gaps in FDA-approved AI medical devices. npj Digital Medicine, 7(1):273, 2024. doi: 10.1038/s41746-024-01270-x

  24. [24]

    ADS-equipped vehicle safety, transparency, and evaluation program (AV STEP)

    National Highway Traffic Safety Administration. ADS-equipped vehicle safety, transparency, and evaluation program (AV STEP). 90 Federal Register 4130, https://www.federalregister.gov/documents/2025/01/15/2024-30854/ads-equipped-vehicle-safety-transparency-and-evaluation-program, 2025

  25. [25]

    Third amended standing general order 2021-01: Incident reporting for automated driving systems and level 2 advanced driver assistance systems

    National Highway Traffic Safety Administration. Third amended standing general order 2021-01: Incident reporting for automated driving systems and level 2 advanced driver assistance systems. NHTSA Standing General Order on Crash Reporting, 2025. Effective June 16, 2025. https://www.nhtsa.gov/laws-regulations/standing-general-order-crash-reporting

  26. [26]

    Large language models often know when they are being evaluated

    Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated. arXiv preprint arXiv:2505.23836, 2025. Gemini 2.5 Pro reaches AUC 0.83 on evaluation-awareness classification; human baseline 0.92

  27. [27]

    Automated employment decision tools

    New York City Department of Consumer and Worker Protection. Automated employment decision tools. https://www.nyc.gov/site/dca/about/automated-employment-decision-tools.page, 2023

  28. [28]

    Sycophancy in GPT-4o: What happened and what we’re doing about it

    OpenAI. Sycophancy in GPT-4o: What happened and what we’re doing about it. OpenAI blog, 29 April 2025. https://openai.com/index/sycophancy-in-gpt-4o/

  29. [29]

    Expanding on what we missed with sycophancy

    OpenAI. Expanding on what we missed with sycophancy. OpenAI blog, 2025. https://openai.com/index/expanding-on-sycophancy/

  30. [30]

    Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing

    Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 33–44. ...

  31. [31]

    The age of secrecy and unfairness in recidivism prediction

    Cynthia Rudin, Caroline Wang, and Beau Coker. The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2(1), 2020. doi: 10.1162/99608f92.6ed64b30

  32. [32]

    Towards Understanding Sycophancy in Language Models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. arXiv prep...

  33. [33]

    AI risk management framework (AI RMF 1.0)

    Elham Tabassi. AI risk management framework (AI RMF 1.0). Technical Report NIST AI 100-1, National Institute of Standards and Technology, 2023

  34. [34]

    Artificial intelligence-enabled medical devices

    U.S. Food and Drug Administration. Artificial intelligence-enabled medical devices. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-enabled-medical-devices, 2026. Accessed May 7, 2026

  35. [35]

    Clinical decision support software: Guidance for industry and Food and Drug Administration staff

    U.S. Food and Drug Administration. Clinical decision support software: Guidance for industry and Food and Drug Administration staff. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-decision-support-software, 2026

  36. [36]

    AI sandbagging: Language models can strategically underperform on evaluations

    Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, and Francis Rhys Ward. AI sandbagging: Language models can strategically underperform on evaluations. arXiv preprint arXiv:2406.07358, 2024

  37. [37]

    Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI

    Baptiste Vasey, Myura Nagendran, Bruce Campbell, David A. Clifton, Gary S. Collins, Spiros Denaxas, Alastair K. Denniston, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nature Medicine, 28:924–933, 2022. doi: 10.1038/s41591-022-01772-9

  38. [38]

    Counterfactual explanations without opening the black box: Automated decisions and the GDPR

    Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2):841–887, 2018

  39. [39]

    Language models learn to mislead humans via RLHF

    Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, and Shi Feng. Language models learn to mislead humans via RLHF. arXiv preprint arXiv:2409.12822, 2024

  40. [40]

    State v. Loomis

    Wisconsin Supreme Court. State v. Loomis, 2016 WI 68, 371 Wis. 2d 235, 881 N.W.2d 749 (Wis. 2016). https://www.wicourts.gov/sc/opinion/DisplayDocument.pdf?content=pdf&seqNo=171690, 2016

  41. [41]

    AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders

    Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. arXiv preprint arXiv:2501.17148, 2025. ICML 2025 spotlight

  42. [42]

    LLM the genius paradox: A linguistic and math expert’s struggle with simple word-based counting problems

    Nan Xu and Xuezhe Ma. LLM the genius paradox: A linguistic and math expert’s struggle with simple word-based counting problems. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3344–3370, 2025. doi: 10.18653/v1/2025.naac...

  43. [43]

    Transparency in algorithmic and human decision-making: Is there a double standard?

    John Zerilli, Alistair Knott, James Maclaurin, and Colin Gavaghan. Transparency in algorithmic and human decision-making: Is there a double standard? Philosophy & Technology, 32(4):661–683, 2019. doi: 10.1007/s13347-018-0330-6