What Should Frontier AI Developers Disclose About Internal Deployments?
Pith reviewed 2026-05-08 09:30 UTC · model grok-4.3
The pith
Frontier AI developers should disclose information about internal model deployments in four categories to enable external safety oversight.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's central claim is that companies should disclose key information about internally deployed frontier models across four categories: capabilities, usage, safety mitigations, and governance. For each category, the authors analyze the benefits for oversight, the limitations (including competitive risks), and strategies for mitigating those risks. The framework is intended to guide both public transparency documents, such as model system cards, and private reports required under emerging regulation.
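To make the framework concrete, here is a minimal sketch (not from the paper) of how a four-category disclosure report might be represented as a data structure. Every field name and example value below is an illustrative assumption, not the authors' specification.

from dataclasses import dataclass, field

# Hypothetical schema for an internal-deployment disclosure report,
# mirroring the paper's four categories. Field names and example
# values are illustrative assumptions.
@dataclass
class InternalDeploymentDisclosure:
    model_id: str
    capabilities: list[str] = field(default_factory=list)        # e.g. AI R&D benchmark results
    usage: list[str] = field(default_factory=list)               # where and how the model runs internally
    safety_mitigations: list[str] = field(default_factory=list)  # monitoring, access controls, etc.
    governance: list[str] = field(default_factory=list)          # decision rights, review processes

report = InternalDeploymentDisclosure(
    model_id="frontier-model-x",  # hypothetical model
    capabilities=["ML-engineering benchmark scores"],
    usage=["agentic code generation in internal AI R&D"],
    safety_mitigations=["automated agent monitoring", "tiered access controls"],
    governance=["internal deployment review board", "third-party audit access"],
)

A structure like this could back both a public system card (rendered selectively) and a fuller private report to regulators, which is how the paper envisions the framework being used.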
What carries the argument
The four-category disclosure framework (capabilities, usage, safety mitigations, governance) for internally deployed frontier AI models.
Load-bearing premise
That disclosures can be specific enough to enable meaningful external oversight while still withholding competitively sensitive information.
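One way to read this premise: the same underlying report could be filed in full privately while a public view redacts competitively sensitive items. A sketch reusing the hypothetical InternalDeploymentDisclosure class above, with an assumed redaction rule that is not the paper's proposal:

# Hypothetical public/private split: the private filing keeps full
# detail, while the public view redacts items the developer flags
# as competitively sensitive.
REDACTED = "[redacted: competitively sensitive]"

def public_view(report: InternalDeploymentDisclosure,
                sensitive: set[str]) -> dict[str, list[str]]:
    view: dict[str, list[str]] = {}
    for category in ("capabilities", "usage", "safety_mitigations", "governance"):
        items = getattr(report, category)
        view[category] = [REDACTED if item in sensitive else item for item in items]
    return view

# Example: benchmark scores stay private; everything else is published.
print(public_view(report, sensitive={"ML-engineering benchmark scores"}))

The premise fails exactly when the redacted set grows so large that the public (or even the private) view no longer supports real oversight.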
What would settle it
A case in which the proposed disclosures are implemented, yet external reviewers still cannot adequately evaluate the safety of internal deployments because of gaps in the information provided.
read the original abstract
Frontier AI developers are increasingly deploying highly capable models internally to automate AI R&D, but these deployments currently face limited external oversight. It is essential, therefore, that developers provide evidence that internally deployed models are safe. While recent work has highlighted the risks of internal deployments and proposed broad approaches to transparency and governance, there remains little guidance on the specific information developers should disclose about them. We address this gap by identifying key information that companies should disclose about internally deployed models across four categories: capabilities, usage, safety mitigations, and governance. For each category, we analyse the key benefits and limitations of disclosure and consider how disclosure-related risks can be mitigated. Our framework could be used by developers to inform both public transparency documents, such as model system cards, and private periodic reports required under emerging frontier AI regulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that frontier AI developers should disclose specific information about internally deployed models across four categories—capabilities, usage, safety mitigations, and governance—to enable external oversight of these deployments, which currently lack transparency. For each category, it analyzes benefits and limitations of disclosure and proposes ways to mitigate associated risks such as competitive harm, with the framework intended to inform both public model system cards and private reports under emerging regulation.
Significance. If the proposed disclosures prove workable, the framework would offer timely, structured policy guidance that directly links identified risks of internal frontier model use (e.g., for automating R&D) to concrete transparency measures. It builds usefully on prior literature by moving from broad calls for governance to category-specific recommendations, potentially aiding both voluntary industry practices and regulatory implementation.
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of the manuscript, as well as for the recommendation of minor revision. The assessment correctly identifies the paper's focus on category-specific disclosure recommendations for internal frontier model deployments and its potential utility for both voluntary and regulatory contexts. No specific major comments were provided in the report.
Circularity Check
No significant circularity; conceptual policy framework
full rationale
The paper proposes a disclosure framework across four categories (capabilities, usage, safety mitigations, governance) by analyzing the benefits, limitations, and risk mitigations of each. No equations, derivations, fitted parameters, or self-referential claims appear. The analysis draws on externally identified risks from prior literature rather than reducing to the paper's own inputs or self-citations. The work is scoped as policy guidance and is grounded in external sources rather than its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Internal deployments of frontier AI models pose risks that necessitate external oversight through specific disclosures.
- domain assumption: Disclosure risks in the four categories can be mitigated without undermining the value of the disclosures.
Reference graph
Works this paper leans on
- [1]
- [2] Acharya, A. and Delaney, O. Managing risks from internal AI systems. https://www.iaps.ai/research/managing-risks-from-internal-ai-systems, 2025.
- [3] AI Digest. 2025 AI Forecasting Survey. https://ai2025.org/, 2025.
- [4] Alaga, J., Schuett, J., and Anderljung, M. A grading rubric for AI safety frameworks, 2024. URL https://arxiv.org/abs/2409.08751.
- [5] Anthropic. Strengthening our safeguards through collaboration with US CAISI and UK AISI. https://www.anthropic.com/news/strengthening-our-safeguards-through-collaboration-with-us-caisi-and-uk-aisi, 2025.
- [6] Anthropic. Claude's constitution. https://www.anthropic.com/constitution, 2026a.
- [7] Anthropic. Alignment risk update: Claude Mythos preview. https://www.anthropic.com/claude-mythos-preview-risk-report, April 2026b.
- [8] Anthropic. System card: Claude Mythos Preview. https://cdn.sanity.io/files/4zrzovbb/website/7624816413e9b4d2e3ba620c5a5e091b98b190a5.pdf, 2026c.
- [9] Anthropic. System card: Claude Opus 4.6. https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf, 2026d.
- [10] Anthropic. Sabotage risk report: Claude Opus 4.6. https://www-cdn.anthropic.com/f21d93f21602ead5cdbecb8c8e1c765759d9e232.pdf, 2026e.
- [11] Anthropic. Risk report: February 2026. https://www-cdn.anthropic.com/08eca2757081e850ed2ad490e5253e940240ca4f.pdf, 2026f.
- [12] Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., Newman, J., Ng, K. Y., Okolo, C. T., Raji, D., Sastry, G., Seger, E., Skeadas, T., South, T., Strubell, E., ... 2025.
- [13] Bengio, Y., Clare, S., Prunkl, C., Murray, M., Andriushchenko, M., Bucknall, B., Bommasani, R., Casper, S., Davidson, T., Douglas, R., Duvenaud, D., Fox, P., Gohar, U., Hadshar, R., Ho, A., Hu, T., Jones, C., Kapoor, S., Kasirzadeh, A., Manning, S., Maslej, N., Mavroudis, V., McGlynn, C., Moulange, R., Newman, J., Ng, K. Y., Paskov, P., Rismani, S., Sastr... 2026.
- [14] Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., Khlaaf, H., Yang, J., Toner, H., Fong, R., Maharaj, T., Koh, P. W., Hooker, S., Leung, J., Trask, A., Bluemke, E., Lebensold, J., O'Keefe, C., Koren, M., Ryffel, T., Rubinovitz, J., Besiroglu, T., Carugati, F., Clark, J., Eckersley, P., de Haas, S., Johnson, M., Laurie, B., Ingerma...
- [15] Brundage, M., Dreksler, N., Homewood, A., McGregor, S., Paskov, P., Stosz, C., Sastry, G., Cooper, A. F., Balston, G., Adler, S., Casper, S., Anderljung, M., Werner, G., Mindermann, S., Mavroudis, V., Bucknall, B., Stix, C., Freund, J., Pacchiardi, L., Hernandez-Orallo, J., Pistillo, M., Chen, M., Painter, C., Ball, D. W., O'Keefe, C., Weil, G., Harack, B...
- [16] California State Legislature. Senate Bill No. 53: Transparency in Frontier Artificial Intelligence Act. https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202520260SB53, 2025.
- [17] Chan, A., Padarath, R., Kwon, J., Greaves, H., and Anderljung, M. Measuring AI R&D automation, 2026. URL https://arxiv.org/abs/2603.03992.
- [18] Chan, L. AI models can be dangerous before public deployment. https://metr.org/blog/2025-01-17-ai-models-dangerous-before-public-deployment/, January 2025.
- [19] Cloud, A., Le, M., Chua, J., Betley, J., Sztyber-Betley, A., Hilton, J., Marks, S., and Evans, O. Subliminal learning: Language models transmit behavioral traits via hidden signals in data, 2025. URL https://arxiv.org/abs/2507.14805.
- [20] Davidson, T., Finnveden, L., and Hadshar, R. AI-enabled coups: How a small group could use AI to seize power, 2025. URL https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power. Accessed 2026-04-22.
- [21] Deng, X., Da, J., Pan, E., He, Y. Y., Ide, C., Garg, K., Lauffer, N., Park, A., Pasari, N., Rane, C., Sampath, K., Krishnan, M., Kundurthy, S., Hendryx, S., Wang, Z., Bharadwaj, V., Holm, J., Aluri, R., Zhang, C. B. C., Jacobson, N., Liu, B., and Kenstler, B. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025. URL https://ar...
- [22] Eth, D. and Davidson, T. Will AI R&D automation cause a software intelligence explosion?, 2025. URL https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion. Accessed 2026-04-24.
- [23] European Commission. The general-purpose AI code of practice: Safety and security chapter. https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai, 2025.
- [24] Field, S., Douglas, R., and Krueger, D. AI researchers' views on automating AI R&D and intelligence explosions, 2026. URL https://arxiv.org/abs/2603.03338.
- [25] Financial Conduct Authority. SUP 5.6: Confidential information and privilege, 2016. URL https://www.handbook.fca.org.uk/handbook/SUP/5/6.html?date=2016-03-07. Last amended 7 March 2016; section within Chapter 5 (Reports by skilled persons) of the Supervision Manual, FCA Handbook.
- [26] Google DeepMind. Accelerating mathematical and scientific discovery with Gemini Deep Think. https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think/, 2026a.
- [27] Google DeepMind. Gemini 3.1 Pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf, 2026b.
- [28] Guan, M. Y., Wang, M., Carroll, M., Dou, Z., Wei, A. Y., Williams, M., Arnav, B., Huizinga, J., Kivlichan, I., Glaese, M., Pachocki, J., and Baker, B. Monitoring monitorability, 2025. URL https://arxiv.org/abs/2512.18311.
- [29] Homewood, A., Williams, S., Dreksler, N., Lidiard, J., Murray, M., Heim, L., Ziosi, M., hÉigeartaigh, S. Ó., Chen, M., Wei, K., Winter, C., Brundage, M., Garfinkel, B., and Schuett, J. Third-party compliance reviews for frontier AI safety frameworks, 2025. URL https://arxiv.org/abs/2505.01643.
- [30] ISO/IEC. Information security, cybersecurity and privacy protection – Information security management systems – Requirements, 2022. URL https://www.iso.org/standard/27001.
- [31] Jagadeeswari, M., Karthi, P., Nitish Kumar, V., and Ram, S. S. A secure file sharing and audit trail tracking platform with advanced encryption standard for cloud-based environments. In 2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC), pp. 540–547, 2023. doi:10.1109/ICESC57686.2023.10193389.
- [32] Kinniment, M., Nix, S., Broadley, T., Wijk, H., and Parikh, N. Early work on monitorability evaluations. https://metr.org/blog/2026-01-19-early-work-on-monitorability-evaluations/, January 2026.
- [33] Kolt, N., Anderljung, M., Barnhart, J., Brass, A., Esvelt, K., Hadfield, G. K., Heim, L., Rodriguez, M., Sandbrink, J. B., and Woodside, T. Responsible reporting for frontier AI development. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 7(1): 768–783, October 2024. doi:10.1609/aies.v7i1.31678. URL https://ojs.aaai.org/index.php/AIE...
- [34]
- [35] Kwon, J. and Casper, S. Internal deployment gaps in AI regulation, 2026. URL https://arxiv.org/abs/2601.08005.
- [36] METR. Review of the Anthropic summer 2025 pilot sabotage risk report. https://alignment.anthropic.com/2025/sabotage-risk-report/2025_pilot_risk_report_metr_review.pdf, 2025.
- [37] METR. What should companies share about risks from frontier AI models? https://metr.org/blog/2025-06-27-risk-transparency/, June 2025.
- [38] New York State Legislature. Responsible AI Safety and Education (RAISE) Act, 6453-A. https://www.nysenate.gov/legislation/bills/2025/A6453/amendment/A, 2025.
- [39] Office of the Comptroller of the Currency. Bank supervision process. Comptroller's Handbook booklet, examination process series, September 2019. URL https://www.occ.gov/publications-and-resources/publications/comptrollers-handbook/files/bank-supervision-process/pub-ch-bank-supervision-process.pdf. Transmitted vi...
- [40] OpenAI. Working with US CAISI and UK AISI to build more secure AI systems. https://openai.com/index/us-caisi-uk-aisi-ai-update/, 2025.
- [41] OpenAI. Introducing GPT-5.3 Codex. https://openai.com/index/introducing-gpt-5-3-codex/, 2026a.
- [42] OpenAI. GPT-5.3 Codex system card. https://deploymentsafety.openai.com/gpt-5-3-codex/gpt-5-3-codex.pdf, 2026b.
- [43] OpenAI. How we monitor internal coding agents for misalignment. https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/, 2026c.
- [44] OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, 2026d.
- [45] OpenAI. OpenAI's Raising Concerns Policy. https://openai.com/index/openai-raising-concerns-policy/, 2026e.
- [46] Ord, T. Inference scaling reshapes AI governance, 2025. URL https://arxiv.org/abs/2503.05705.
- [47] Rein, D. Red-teaming Anthropic's internal agent monitoring systems. https://metr.org/blog/2026-03-25-red-teaming-anthropic-agent-monitoring/, March 2026.
- [48] Shlegeris, B. The Thinking Machines Tinker API is good news for AI control and security. https://blog.redwoodresearch.org/p/the-thinking-machines-tinker-api, 2025. Redwood Research blog.
- [49] Stix, C., Pistillo, M., Sastry, G., Hobbhahn, M., Ortega, A., Balesni, M., Hallensleben, A., Goldowsky-Dill, N., and Sharkey, L. AI behind closed doors: A primer on the governance of internal deployment, 2025. URL https://arxiv.org/abs/2504.12170.
- [50] Toner, H., Beers, K., Newman, S., Khan, S. M., Shea-Blymyer, C., Yee, E., Acharya, A., Fisher, K., Scholl, K., Wildeford, P., Greenblatt, R., Albanie, S., Ballard, S., and Larsen, T. When AI builds AI: Findings from a workshop on automation of AI R&D. Technical report, Center for Security and Emerging Technology, January 2026. URL https://cset.georgetow...
- [51] U.S. Food and Drug Administration. 21 CFR 20.61: Trade secrets and commercial or financial information which is privileged or confidential. https://www.ecfr.gov/current/title-21/chapter-I/subchapter-A/part-20/subpart-D/section-20.61, 2026. Electronic Code of Federal Regulations.
- [52] Vinten, G., Neidermeyer, A. A., and Neidermeyer, P. E. Audit anticipation: Does it impact job performance? Managerial Auditing Journal, 20(1): 19–29, January 2005. ISSN 0268-6902. doi:10.1108/02686900510570669. URL https://doi.org/10.1108/02686900510570669.
- [53] Wijk, H., Lin, T. R., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J. M., Dhyani, J., Ericheva, E., Garcia, K., Goodrich, B., Jurkovic, N., Kinniment, M., Lajko, A., Nix, S., Koba Sato, L. J., Saunders, W., Taran, M., West, B., and Barnes, E. RE-Bench: Evaluating frontier AI R&D capabilities of language model agents again... 2025.
- [54] Williams, M., Raymond, C., Carroll, M., and Team, S. O. Sidestepping evaluation awareness and anticipating misalignment with production evaluations. https://alignment.openai.com/prod-evals/, 2025a. OpenAI Alignment Blog.
- [55] Williams, S., Dreksler, N., Homewood, A., Anderljung, M., and Freund, J. Assessing risk relative to competitors: An analysis of current AI company policies. https://www.governance.ai/research-paper/assessing-risk-relative-to-competitors-an-analysis-of-current-ai-company-policies, 2025b. GovAI research paper.