Risk Reporting for Developers' Internal AI Model Use
Pith reviewed 2026-05-07 17:42 UTC · model grok-4.3
The pith
A guide gives frontier AI developers one reporting structure to meet internal-use risk obligations under California, New York, and EU rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that developers should produce a dedicated internal-use risk report for every substantially more capable or riskier model placed into internal deployment, and that the report can satisfy all three regulatory frameworks by structuring its analysis around two threat vectors (autonomous AI misbehavior and insider threats), each examined through the lenses of means, motive, and opportunity. The report must describe the safeguards in place and any residual risks that remain after those safeguards. The authors position the template as directly usable by evaluation and safety teams and as a reference for regulators and auditors.
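Read as a template, the prescribed structure is essentially a small schema: one section per threat vector, each carrying a means/motive/opportunity assessment plus its safeguards and residual risks, under a top-level argument for why internal deployment is safe. A minimal sketch of how an evaluation team might encode it, using hypothetical field names and example values that are not taken from the paper:

```python
from dataclasses import dataclass, field

# Illustrative encoding of the report structure described in the guide:
# two threat vectors, each assessed through means, motive, and opportunity,
# plus the safeguards in place and any residual risks. Field names and
# example values are assumptions, not the paper's own template.

@dataclass
class RiskFactorAssessment:
    means: str        # capabilities relevant to this threat vector
    motive: str       # why the actor (model or insider) might cause harm
    opportunity: str  # access and conditions that would let harm occur

@dataclass
class ThreatVectorSection:
    name: str                         # "autonomous AI misbehavior" or "insider threats"
    assessment: RiskFactorAssessment
    safeguards: list[str] = field(default_factory=list)
    residual_risks: list[str] = field(default_factory=list)

@dataclass
class InternalUseRiskReport:
    model_id: str
    deployment_scope: str             # what internal uses the model is approved for
    threat_vectors: list[ThreatVectorSection]
    safety_argument: str              # why the model is judged safe to deploy internally

report = InternalUseRiskReport(
    model_id="example-model-v1",
    deployment_scope="internal evaluation and R&D assistance",
    threat_vectors=[
        ThreatVectorSection(
            name="autonomous AI misbehavior",
            assessment=RiskFactorAssessment(
                means="agentic coding and cyberoffense-relevant capability results",
                motive="misaligned goals surfaced during evaluation",
                opportunity="unsupervised tool access in internal pipelines",
            ),
            safeguards=["action monitoring", "sandboxed execution"],
            residual_risks=["monitoring may miss low-frequency misbehavior"],
        ),
    ],
    safety_argument="Residual risks judged acceptable given the listed safeguards.",
)
```

A full report would add a second ThreatVectorSection for insider threats; the top-level safety_argument field corresponds to the paper's requirement to argue why the model is safe to deploy internally.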
What carries the argument
A two-threat-vector reporting framework that assesses autonomous AI misbehavior and insider threats, each examined through means, motive, and opportunity.
Load-bearing premise
That organizing analysis around autonomous misbehavior and insider threats and scoring each by means, motive, and opportunity will catch the risks that actually matter for internal model use.
What would settle it
An internal deployment incident that causes measurable harm yet was not flagged as a residual risk in a report built on this means-motive-opportunity structure.
Original abstract
Frontier AI companies first deploy their most advanced models internally, for weeks or months of safety testing, evaluation, and iteration, before a possible public release. For example, Anthropic recently developed a new class of model with advanced cyberoffense-relevant capabilities, Mythos Preview, which was available internally for at least six weeks before it was publicly announced. This internal use creates risks that external deployment frameworks may fail to address. Legal frameworks, notably California's Transparency in Frontier Artificial Intelligence Act (SB 53), New York's Responsible AI Safety And Education (RAISE) Act, and the EU's General-Purpose AI Code of Practice, all discuss risks from internal AI use. They require frontier developers to make and implement plans for how to manage risks from internal use, and to produce internal use risk reports describing their safeguards and any residual risks. This guide provides a harmonized standard for companies to produce internal use risk reports suitable for all three regulatory frameworks. It is addressed primarily to evaluation and safety teams at frontier AI developers, and secondarily to regulators and auditors seeking to understand what good reporting looks like. Given the pace of AI R&D automation and the limited external visibility into how companies use their most capable models internally, regular and detailed risk reporting may be one of the few mechanisms available to ensure that the risks from internal AI use are identified and managed before they materialize. Whenever a substantially more capable or riskier model is deployed internally, the developer should create a risk report and argue why the model is safe to deploy. We structure the reporting framework around two threat vectors -- autonomous AI misbehavior and insider threats -- and three risk factors for each: means, motive, and opportunity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a harmonized reporting framework for frontier AI developers' internal model use risks, intended to satisfy requirements under California's SB 53, New York's RAISE Act, and the EU GPAI Code of Practice. It structures reports around two threat vectors—autonomous AI misbehavior and insider threats—each assessed via the factors of means, motive, and opportunity, with the goal of enabling identification and management of risks during pre-release internal deployment periods.
Significance. If the proposed structure adequately covers the distinct reporting obligations across the three frameworks, the guide could serve as a practical, standardized template for evaluation and safety teams, addressing a gap where external-deployment-focused regulations leave internal-use risks under-specified. The prescriptive organization around threat vectors and risk factors offers a clear, actionable format that could improve consistency in risk identification and residual-risk disclosure.
Major comments (1)
- [Abstract] The central claim (abstract) that the guide supplies a single reporting format usable under all three frameworks requires that every mandated element—risk identification, safeguard description, residual-risk disclosure, and evaluation/notification triggers—is addressed by the autonomous-misbehavior and insider-threat sections using means/motive/opportunity. The manuscript presents the structure but contains no explicit mapping, gap analysis, or side-by-side comparison against the statutory language of SB 53, RAISE Act, or EU GPAI Code; this omission is load-bearing for the harmonization assertion.
Minor comments (2)
- [Abstract] The example of Mythos Preview in the abstract would benefit from a citation or reference to the source announcement to allow readers to verify the six-week internal-use timeline.
- Clarify whether the means-motive-opportunity factors are intended to be applied uniformly or with vector-specific adaptations, as the current description leaves this ambiguous for implementers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of the paper's potential to address a regulatory gap. We address the major comment point by point below.
Point-by-point responses
Referee: [Abstract] The central claim (abstract) that the guide supplies a single reporting format usable under all three frameworks requires that every mandated element—risk identification, safeguard description, residual-risk disclosure, and evaluation/notification triggers—is addressed by the autonomous-misbehavior and insider-threat sections using means/motive/opportunity. The manuscript presents the structure but contains no explicit mapping, gap analysis, or side-by-side comparison against the statutory language of SB 53, RAISE Act, or EU GPAI Code; this omission is load-bearing for the harmonization assertion.
Authors: We agree that the absence of an explicit mapping weakens the central harmonization claim. The proposed structure was designed to encompass the required elements across the frameworks by focusing on the two primary threat vectors and the means/motive/opportunity factors, which we believe naturally cover risk identification, safeguard description, residual-risk disclosure, and relevant triggers. However, to make this coverage verifiable and to strengthen the assertion, we will revise the manuscript to include a dedicated section or appendix with a side-by-side comparison. This will map each key mandated element from SB 53, the RAISE Act, and the EU GPAI Code of Practice to the corresponding parts of the reporting framework, demonstrating how the autonomous-misbehavior and insider-threat sections address them. We will also note any minor gaps where the frameworks diverge and explain how our template accommodates them; a sketch of what such a mapping could look like follows below. Revision: yes.
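As a purely illustrative sketch of the coverage mapping promised in the response, the four mandated elements named by the referee could be tabulated against the report sections meant to satisfy them; which statutory clauses actually impose each element is an assumption here, to be established in the revised manuscript:

```python
# Hypothetical coverage map for the revision the authors describe. Element
# names follow the referee's list; whether SB 53, the RAISE Act, and the
# EU GPAI Code of Practice each mandate every element is an illustrative
# assumption, not a reading of the statutes.
COVERAGE_MAP = {
    "risk identification": [
        "autonomous-misbehavior assessment",
        "insider-threat assessment",
    ],
    "safeguard description": [
        "safeguards subsection of each threat-vector section",
    ],
    "residual-risk disclosure": [
        "residual-risks subsection of each threat-vector section",
    ],
    "evaluation/notification triggers": [
        "report trigger: substantially more capable or riskier model deployed internally",
    ],
}

def coverage_gaps(coverage: dict[str, list[str]]) -> list[str]:
    """Return mandated elements with no report section assigned (the gap analysis)."""
    return [element for element, sections in coverage.items() if not sections]

assert coverage_gaps(COVERAGE_MAP) == []  # a complete mapping leaves no gaps
```

Any element whose list of report sections is empty would surface as a gap, which is the divergence the authors say the appendix will note explicitly.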
Circularity Check
No circularity detected in prescriptive policy guide
Full rationale
The manuscript is a non-mathematical prescriptive guide that proposes a reporting template organized around two threat vectors (autonomous AI misbehavior and insider threats) each assessed via means, motive, and opportunity. It contains no equations, derivations, fitted parameters, statistical predictions, or self-citations of prior uniqueness theorems. The central harmonization claim is an assertion that the proposed structure satisfies three named regulatory instruments, but the text supplies no explicit mapping or gap analysis; this is an evidentiary limitation rather than a circular reduction in which any claimed result is definitionally equivalent to its own inputs. The document is therefore self-contained as a set of recommendations with no load-bearing self-referential steps.