pith. machine review for the scientific record.

arxiv: 2604.24966 · v1 · submitted 2026-04-27 · 💻 cs.CY · cs.AI

Recognition: unknown

Risk Reporting for Developers' Internal AI Model Use

Joe O'Brien, Oliver Guest, Oscar Delaney, Sambhav Maheshwari, Theo Bearman

Pith reviewed 2026-05-07 17:42 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords internal AI use · risk reporting · frontier AI · autonomous misbehavior · insider threats · regulatory compliance · AI safety testing · means, motive, opportunity

The pith

A guide gives frontier AI developers one reporting structure to meet internal-use risk obligations under California, New York, and EU rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a single reporting template that frontier AI companies can use when they first deploy advanced models internally for testing and iteration. Internal deployment creates risks that external-deployment rules may not cover, yet three major legal frameworks now require developers to plan for and report on those risks. The template organizes the required content around two threat vectors—autonomous AI misbehavior and insider threats—and requires each vector to be assessed through the factors of means, motive, and opportunity. A sympathetic reader would care because internal use is the first real-world exposure for the most capable models, external visibility into that phase remains low, and consistent reporting may be one of the few practical ways to surface problems before they cause harm.

Core claim

The paper claims that developers should produce a dedicated internal-use risk report for every substantially more capable model placed into internal deployment, and that the report can satisfy all three regulatory frameworks by structuring its analysis around two threat vectors—autonomous AI misbehavior and insider threats—each examined through the lenses of means, motive, and opportunity. The report must describe the safeguards in place and any residual risks that remain after those safeguards. The authors position the template as directly usable by evaluation and safety teams and as a reference for regulators and auditors.
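
To make the shape of that report concrete, here is a minimal sketch, assuming a plain Python encoding of the skeleton: two threat-vector sections, each carrying a means/motive/opportunity assessment plus safeguards and residual risks. The names and structure are illustrative assumptions, not the paper's actual template.

```python
# Illustrative sketch only: names and structure are assumptions,
# not the paper's reporting template.
from dataclasses import dataclass, field
from enum import Enum


class ThreatVector(Enum):
    AUTONOMOUS_MISBEHAVIOR = "autonomous AI misbehavior"
    INSIDER_THREAT = "insider threat"


@dataclass
class FactorAssessment:
    """Means / motive / opportunity narrative for one threat vector."""
    means: str        # capabilities or access the threat could draw on
    motive: str       # why harmful action might be taken or emerge
    opportunity: str  # where current controls leave room to act


@dataclass
class VectorSection:
    vector: ThreatVector
    assessment: FactorAssessment
    safeguards: list[str] = field(default_factory=list)
    residual_risks: list[str] = field(default_factory=list)


@dataclass
class InternalUseRiskReport:
    model_name: str
    internal_deployment_date: str
    sections: list[VectorSection] = field(default_factory=list)

    def residual_risk_digest(self) -> dict[str, list[str]]:
        """Residual risks per vector: the disclosure the frameworks ask for."""
        return {s.vector.value: s.residual_risks for s in self.sections}
```

One report instance per substantially more capable model, with one section per threat vector, would match the cadence the paper prescribes; everything beyond that two-by-three scaffold is framework-specific detail.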

What carries the argument

The two-threat-vector reporting framework that assesses autonomous AI misbehavior and insider threats through means, motive, and opportunity for each vector.

Load-bearing premise

That organizing analysis around autonomous misbehavior and insider threats and scoring each by means, motive, and opportunity will catch the risks that actually matter for internal model use.

What would settle it

An internal deployment incident that causes measurable harm yet was not flagged as a residual risk in a report built on this means-motive-opportunity structure.

read the original abstract

Frontier AI companies first deploy their most advanced models internally, for weeks or months of safety testing, evaluation, and iteration, before a possible public release. For example, Anthropic recently developed a new class of model with advanced cyberoffense-relevant capabilities, Mythos Preview, which was available internally for at least six weeks before it was publicly announced. This internal use creates risks that external deployment frameworks may fail to address. Legal frameworks, notably California's Transparency in Frontier Artificial Intelligence Act (SB 53), New York's Responsible AI Safety And Education (RAISE) Act, and the EU's General-Purpose AI Code of Practice, all discuss risks from internal AI use. They require frontier developers to make and implement plans for how to manage risks from internal use, and to produce internal use risk reports describing their safeguards and any residual risks. This guide provides a harmonized standard for companies to produce internal use risk reports suitable for all three regulatory frameworks. It is addressed primarily to evaluation and safety teams at frontier AI developers, and secondarily to regulators and auditors seeking to understand what good reporting looks like. Given the pace of AI R&D automation and the limited external visibility into how companies use their most capable models internally, regular and detailed risk reporting may be one of the few mechanisms available to ensure that the risks from internal AI use are identified and managed before they materialize. Whenever a substantially more capable or riskier model is deployed internally, the developer should create a risk report and argue why the model is safe to deploy. We structure the reporting framework around two threat vectors -- autonomous AI misbehavior and insider threats -- and three risk factors for each: means, motive, and opportunity.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a harmonized reporting framework for frontier AI developers' internal model use risks, intended to satisfy requirements under California's SB 53, New York's RAISE Act, and the EU GPAI Code of Practice. It structures reports around two threat vectors—autonomous AI misbehavior and insider threats—each assessed via the factors of means, motive, and opportunity, with the goal of enabling identification and management of risks during pre-release internal deployment periods.

Significance. If the proposed structure adequately covers the distinct reporting obligations across the three frameworks, the guide could serve as a practical, standardized template for evaluation and safety teams, addressing a gap where external-deployment-focused regulations leave internal-use risks under-specified. The prescriptive organization around threat vectors and risk factors offers a clear, actionable format that could improve consistency in risk identification and residual-risk disclosure.

major comments (1)
  1. [Abstract] The central claim (abstract) that the guide supplies a single reporting format usable under all three frameworks requires that every mandated element—risk identification, safeguard description, residual-risk disclosure, and evaluation/notification triggers—is addressed by the autonomous-misbehavior and insider-threat sections using means/motive/opportunity. The manuscript presents the structure but contains no explicit mapping, gap analysis, or side-by-side comparison against the statutory language of SB 53, RAISE Act, or EU GPAI Code; this omission is load-bearing for the harmonization assertion.
minor comments (2)
  1. [Abstract] The example of Mythos Preview in the abstract would benefit from a citation or reference to the source announcement to allow readers to verify the six-week internal-use timeline.
  2. Clarify whether the means-motive-opportunity factors are intended to be applied uniformly or with vector-specific adaptations, as the current description leaves this ambiguous for implementers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the paper's potential to address a regulatory gap. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] The central claim (abstract) that the guide supplies a single reporting format usable under all three frameworks requires that every mandated element—risk identification, safeguard description, residual-risk disclosure, and evaluation/notification triggers—is addressed by the autonomous-misbehavior and insider-threat sections using means/motive/opportunity. The manuscript presents the structure but contains no explicit mapping, gap analysis, or side-by-side comparison against the statutory language of SB 53, RAISE Act, or EU GPAI Code; this omission is load-bearing for the harmonization assertion.

    Authors: We agree that the absence of an explicit mapping weakens the central harmonization claim. The proposed structure was designed to encompass the required elements across the frameworks by focusing on the two primary threat vectors and the means/motive/opportunity factors, which we believe naturally cover risk identification, safeguard description, residual-risk disclosure, and relevant triggers. However, to make this coverage verifiable and to strengthen the assertion, we will revise the manuscript to include a dedicated section or appendix with a side-by-side comparison. This will map each key mandated element from SB 53, the RAISE Act, and the EU GPAI Code of Practice to the corresponding parts of the reporting framework, demonstrating how the autonomous-misbehavior and insider-threat sections address them. We will also note any minor gaps where the frameworks diverge and how our template accommodates them. revision: yes
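
The side-by-side comparison promised here could be carried as a simple crosswalk from each framework's mandated elements to the report sections that cover them, which also makes any gap mechanical to detect. A minimal sketch follows; the element names are paraphrased from the abstract and the referee report, not statutory language, and the empty entry is purely illustrative of what a flagged gap would look like.

```python
# Editorial illustration of a requirement-to-section crosswalk; element names
# are paraphrases, not statutory text, and the empty entry is hypothetical.
MANDATED_ELEMENTS = [
    "risk identification",
    "safeguard description",
    "residual-risk disclosure",
    "evaluation/notification triggers",
]

# Report sections claimed to cover each mandated element.
COVERAGE: dict[str, list[str]] = {
    "risk identification": [
        "autonomous-misbehavior (means/motive/opportunity)",
        "insider-threat (means/motive/opportunity)",
    ],
    "safeguard description": ["safeguards"],
    "residual-risk disclosure": ["residual risks"],
    "evaluation/notification triggers": [],  # illustrative gap
}


def gap_analysis(frameworks: list[str]) -> dict[str, list[str]]:
    """For each framework, list mandated elements with no covering section."""
    missing = [e for e in MANDATED_ELEMENTS if not COVERAGE.get(e)]
    return {fw: list(missing) for fw in frameworks}


print(gap_analysis(["California SB 53", "New York RAISE Act",
                    "EU GPAI Code of Practice"]))
```

An appendix built this way, with one coverage map per framework rather than a shared one, would directly answer the referee's request for a verifiable mapping.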

Circularity Check

0 steps flagged

No circularity detected in prescriptive policy guide

full rationale

The manuscript is a non-mathematical prescriptive guide that proposes a reporting template organized around two threat vectors (autonomous AI misbehavior and insider threats) each assessed via means, motive, and opportunity. It contains no equations, derivations, fitted parameters, statistical predictions, or self-citations of prior uniqueness theorems. The central harmonization claim is an assertion that the proposed structure satisfies three named regulatory instruments, but the text supplies no explicit mapping or gap analysis; this is an evidentiary limitation rather than a circular reduction in which any claimed result is definitionally equivalent to its own inputs. The document is therefore self-contained as a set of recommendations with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The document is a policy and guidance paper with no mathematical models, fitted parameters, or new postulated entities. It relies on standard domain assumptions about AI risk categories that are common in the safety literature.

pith-pipeline@v0.9.0 · 5612 in / 1130 out tokens · 61405 ms · 2026-05-07T17:42:45.494568+00:00 · methodology

discussion (0)

