Risk Reporting for Developers' Internal AI Model Use
Pith reviewed 2026-05-07 17:42 UTC · model grok-4.3
The pith
A guide gives frontier AI developers one reporting structure to meet internal-use risk obligations under California, New York, and EU rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that developers should produce a dedicated internal-use risk report for every substantially more capable or riskier model placed into internal deployment, and that the report can satisfy all three regulatory frameworks by structuring its analysis around two threat vectors (autonomous AI misbehavior and insider threats), each examined through the lenses of means, motive, and opportunity. The report must describe the safeguards in place and any residual risks that remain after those safeguards. The authors position the template as directly usable by evaluation and safety teams and as a reference for regulators and auditors.
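Read as a template, the prescribed structure is essentially a small schema: one section per threat vector, each carrying a means/motive/opportunity assessment plus its safeguards and residual risks, under a top-level argument for why internal deployment is safe. A minimal sketch of how an evaluation team might encode it, using hypothetical field names and example values that are not taken from the paper:

```python
from dataclasses import dataclass, field

# Illustrative encoding of the report structure described in the guide:
# two threat vectors, each assessed through means, motive, and opportunity,
# plus the safeguards in place and any residual risks. Field names and
# example values are assumptions, not the paper's own template.

@dataclass
class RiskFactorAssessment:
    means: str        # capabilities relevant to this threat vector
    motive: str       # why the actor (model or insider) might cause harm
    opportunity: str  # access and conditions that would let harm occur

@dataclass
class ThreatVectorSection:
    name: str                         # "autonomous AI misbehavior" or "insider threats"
    assessment: RiskFactorAssessment
    safeguards: list[str] = field(default_factory=list)
    residual_risks: list[str] = field(default_factory=list)

@dataclass
class InternalUseRiskReport:
    model_id: str
    deployment_scope: str             # what internal uses the model is approved for
    threat_vectors: list[ThreatVectorSection]
    safety_argument: str              # why the model is judged safe to deploy internally

report = InternalUseRiskReport(
    model_id="example-model-v1",
    deployment_scope="internal evaluation and R&D assistance",
    threat_vectors=[
        ThreatVectorSection(
            name="autonomous AI misbehavior",
            assessment=RiskFactorAssessment(
                means="agentic coding and cyberoffense-relevant capability results",
                motive="misaligned goals surfaced during evaluation",
                opportunity="unsupervised tool access in internal pipelines",
            ),
            safeguards=["action monitoring", "sandboxed execution"],
            residual_risks=["monitoring may miss low-frequency misbehavior"],
        ),
    ],
    safety_argument="Residual risks judged acceptable given the listed safeguards.",
)
```

A full report would add a second ThreatVectorSection for insider threats; the top-level safety_argument field corresponds to the paper's requirement to argue why the model is safe to deploy internally.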
What carries the argument
A two-threat-vector reporting framework that assesses autonomous AI misbehavior and insider threats, each examined through means, motive, and opportunity.
Load-bearing premise
That organizing analysis around autonomous misbehavior and insider threats and scoring each by means, motive, and opportunity will catch the risks that actually matter for internal model use.
What would settle it
An internal deployment incident that causes measurable harm yet was not flagged as a residual risk in a report built on this means-motive-opportunity structure.
Original abstract
Frontier AI companies first deploy their most advanced models internally, for weeks or months of safety testing, evaluation, and iteration, before a possible public release. For example, Anthropic recently developed a new class of model with advanced cyberoffense-relevant capabilities, Mythos Preview, which was available internally for at least six weeks before it was publicly announced. This internal use creates risks that external deployment frameworks may fail to address. Legal frameworks, notably California's Transparency in Frontier Artificial Intelligence Act (SB 53), New York's Responsible AI Safety And Education (RAISE) Act, and the EU's General-Purpose AI Code of Practice, all discuss risks from internal AI use. They require frontier developers to make and implement plans for how to manage risks from internal use, and to produce internal use risk reports describing their safeguards and any residual risks. This guide provides a harmonized standard for companies to produce internal use risk reports suitable for all three regulatory frameworks. It is addressed primarily to evaluation and safety teams at frontier AI developers, and secondarily to regulators and auditors seeking to understand what good reporting looks like. Given the pace of AI R&D automation and the limited external visibility into how companies use their most capable models internally, regular and detailed risk reporting may be one of the few mechanisms available to ensure that the risks from internal AI use are identified and managed before they materialize. Whenever a substantially more capable or riskier model is deployed internally, the developer should create a risk report and argue why the model is safe to deploy. We structure the reporting framework around two threat vectors -- autonomous AI misbehavior and insider threats -- and three risk factors for each: means, motive, and opportunity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a harmonized reporting framework for frontier AI developers' internal model use risks, intended to satisfy requirements under California's SB 53, New York's RAISE Act, and the EU GPAI Code of Practice. It structures reports around two threat vectors—autonomous AI misbehavior and insider threats—each assessed via the factors of means, motive, and opportunity, with the goal of enabling identification and management of risks during pre-release internal deployment periods.
Significance. If the proposed structure adequately covers the distinct reporting obligations across the three frameworks, the guide could serve as a practical, standardized template for evaluation and safety teams, addressing a gap where external-deployment-focused regulations leave internal-use risks under-specified. The prescriptive organization around threat vectors and risk factors offers a clear, actionable format that could improve consistency in risk identification and residual-risk disclosure.
Major comments (1)
- [Abstract] The central claim (abstract) that the guide supplies a single reporting format usable under all three frameworks requires that every mandated element—risk identification, safeguard description, residual-risk disclosure, and evaluation/notification triggers—is addressed by the autonomous-misbehavior and insider-threat sections using means/motive/opportunity. The manuscript presents the structure but contains no explicit mapping, gap analysis, or side-by-side comparison against the statutory language of SB 53, RAISE Act, or EU GPAI Code; this omission is load-bearing for the harmonization assertion.
Minor comments (2)
- [Abstract] The example of Mythos Preview in the abstract would benefit from a citation or reference to the source announcement to allow readers to verify the six-week internal-use timeline.
- Clarify whether the means-motive-opportunity factors are intended to be applied uniformly or with vector-specific adaptations, as the current description leaves this ambiguous for implementers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of the paper's potential to address a regulatory gap. We address the major comment point by point below.
Point-by-point responses
Referee: [Abstract] The central claim (abstract) that the guide supplies a single reporting format usable under all three frameworks requires that every mandated element—risk identification, safeguard description, residual-risk disclosure, and evaluation/notification triggers—is addressed by the autonomous-misbehavior and insider-threat sections using means/motive/opportunity. The manuscript presents the structure but contains no explicit mapping, gap analysis, or side-by-side comparison against the statutory language of SB 53, RAISE Act, or EU GPAI Code; this omission is load-bearing for the harmonization assertion.
Authors: We agree that the absence of an explicit mapping weakens the central harmonization claim. The proposed structure was designed to encompass the required elements across the frameworks by focusing on the two primary threat vectors and the means/motive/opportunity factors, which we believe naturally cover risk identification, safeguard description, residual-risk disclosure, and relevant triggers. However, to make this coverage verifiable and to strengthen the assertion, we will revise the manuscript to include a dedicated section or appendix with a side-by-side comparison. This will map each key mandated element from SB 53, the RAISE Act, and the EU GPAI Code of Practice to the corresponding parts of the reporting framework, demonstrating how the autonomous-misbehavior and insider-threat sections address them. We will also note any minor gaps where the frameworks diverge and explain how our template accommodates them; a sketch of what such a mapping could look like follows below. Revision: yes.
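As a purely illustrative sketch of the coverage mapping promised in the response, the four mandated elements named by the referee could be tabulated against the report sections meant to satisfy them; which statutory clauses actually impose each element is an assumption here, to be established in the revised manuscript:

```python
# Hypothetical coverage map for the revision the authors describe. Element
# names follow the referee's list; whether SB 53, the RAISE Act, and the
# EU GPAI Code of Practice each mandate every element is an illustrative
# assumption, not a reading of the statutes.
COVERAGE_MAP = {
    "risk identification": [
        "autonomous-misbehavior assessment",
        "insider-threat assessment",
    ],
    "safeguard description": [
        "safeguards subsection of each threat-vector section",
    ],
    "residual-risk disclosure": [
        "residual-risks subsection of each threat-vector section",
    ],
    "evaluation/notification triggers": [
        "report trigger: substantially more capable or riskier model deployed internally",
    ],
}

def coverage_gaps(coverage: dict[str, list[str]]) -> list[str]:
    """Return mandated elements with no report section assigned (the gap analysis)."""
    return [element for element, sections in coverage.items() if not sections]

assert coverage_gaps(COVERAGE_MAP) == []  # a complete mapping leaves no gaps
```

Any element whose list of report sections is empty would surface as a gap, which is the divergence the authors say the appendix will note explicitly.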
Circularity Check
No circularity detected in prescriptive policy guide
Full rationale
The manuscript is a non-mathematical prescriptive guide that proposes a reporting template organized around two threat vectors (autonomous AI misbehavior and insider threats) each assessed via means, motive, and opportunity. It contains no equations, derivations, fitted parameters, statistical predictions, or self-citations of prior uniqueness theorems. The central harmonization claim is an assertion that the proposed structure satisfies three named regulatory instruments, but the text supplies no explicit mapping or gap analysis; this is an evidentiary limitation rather than a circular reduction in which any claimed result is definitionally equivalent to its own inputs. The document is therefore self-contained as a set of recommendations with no load-bearing self-referential steps.