What Should Frontier AI Developers Disclose About Internal Deployments?
Pith reviewed 2026-05-08 09:30 UTC · model grok-4.3
The pith
Frontier AI developers should disclose information about internal model deployments in four categories to enable external safety oversight.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's central claim is that companies should disclose key information about internally deployed frontier models across four categories: capabilities, usage, safety mitigations, and governance. For each category, the authors analyze the benefits for oversight, the limitations (including competitive risks), and strategies for mitigating those risks. The framework is intended to guide both public transparency documents, such as model system cards, and private reports required under emerging regulation.
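To make the framework concrete, here is a minimal sketch (not from the paper) of how a four-category disclosure report might be represented as a data structure. Every field name and example value below is an illustrative assumption, not the authors' specification.

from dataclasses import dataclass, field

# Hypothetical schema for an internal-deployment disclosure report,
# mirroring the paper's four categories. Field names and example
# values are illustrative assumptions.
@dataclass
class InternalDeploymentDisclosure:
    model_id: str
    capabilities: list[str] = field(default_factory=list)        # e.g. AI R&D benchmark results
    usage: list[str] = field(default_factory=list)               # where and how the model runs internally
    safety_mitigations: list[str] = field(default_factory=list)  # monitoring, access controls, etc.
    governance: list[str] = field(default_factory=list)          # decision rights, review processes

report = InternalDeploymentDisclosure(
    model_id="frontier-model-x",  # hypothetical model
    capabilities=["ML-engineering benchmark scores"],
    usage=["agentic code generation in internal AI R&D"],
    safety_mitigations=["automated agent monitoring", "tiered access controls"],
    governance=["internal deployment review board", "third-party audit access"],
)

A structure like this could back both a public system card (rendered selectively) and a fuller private report to regulators, which is how the paper envisions the framework being used.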
What carries the argument
The four-category disclosure framework (capabilities, usage, safety mitigations, governance) for internally deployed frontier AI models.
Load-bearing premise
That disclosures can be specific enough to enable meaningful external oversight while still withholding competitively sensitive information.
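One way to read this premise: the same underlying report could be filed in full privately while a public view redacts competitively sensitive items. A sketch reusing the hypothetical InternalDeploymentDisclosure class above, with an assumed redaction rule that is not the paper's proposal:

# Hypothetical public/private split: the private filing keeps full
# detail, while the public view redacts items the developer flags
# as competitively sensitive.
REDACTED = "[redacted: competitively sensitive]"

def public_view(report: InternalDeploymentDisclosure,
                sensitive: set[str]) -> dict[str, list[str]]:
    view: dict[str, list[str]] = {}
    for category in ("capabilities", "usage", "safety_mitigations", "governance"):
        items = getattr(report, category)
        view[category] = [REDACTED if item in sensitive else item for item in items]
    return view

# Example: benchmark scores stay private; everything else is published.
print(public_view(report, sensitive={"ML-engineering benchmark scores"}))

The premise fails exactly when the redacted set grows so large that the public (or even the private) view no longer supports real oversight.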
What would settle it
A case in which the proposed disclosures are implemented, yet external reviewers still cannot adequately evaluate the safety of internal deployments because of gaps in the information provided.
read the original abstract
Frontier AI developers are increasingly deploying highly capable models internally to automate AI R&D, but these deployments currently face limited external oversight. It is essential, therefore, that developers provide evidence that internally deployed models are safe. While recent work has highlighted the risks of internal deployments and proposed broad approaches to transparency and governance, there remains little guidance on the specific information developers should disclose about them. We address this gap by identifying key information that companies should disclose about internally deployed models across four categories: capabilities, usage, safety mitigations, and governance. For each category, we analyse the key benefits and limitations of disclosure and consider how disclosure-related risks can be mitigated. Our framework could be used by developers to inform both public transparency documents, such as model system cards, and private periodic reports required under emerging frontier AI regulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that frontier AI developers should disclose specific information about internally deployed models across four categories—capabilities, usage, safety mitigations, and governance—to enable external oversight of these deployments, which currently lack transparency. For each category, it analyzes benefits and limitations of disclosure and proposes ways to mitigate associated risks such as competitive harm, with the framework intended to inform both public model system cards and private reports under emerging regulation.
Significance. If the proposed disclosures prove workable, the framework would offer timely, structured policy guidance that directly links identified risks of internal frontier model use (e.g., for automating R&D) to concrete transparency measures. It builds usefully on prior literature by moving from broad calls for governance to category-specific recommendations, potentially aiding both voluntary industry practices and regulatory implementation.
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of the manuscript, as well as for the recommendation of minor revision. The assessment correctly identifies the paper's focus on category-specific disclosure recommendations for internal frontier model deployments and its potential utility for both voluntary and regulatory contexts. No specific major comments were provided in the report.
Circularity Check
No significant circularity; conceptual policy framework
full rationale
The paper proposes a disclosure framework across four categories (capabilities, usage, safety mitigations, governance) by analyzing the benefits, limitations, and risk mitigations of each. No equations, derivations, fitted parameters, or self-referential claims appear. The analysis draws on externally identified risks from prior literature rather than reducing to the paper's own inputs or self-citations. The work is scoped as policy guidance and is grounded in external sources rather than its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Internal deployments of frontier AI models pose risks that necessitate external oversight through specific disclosures.
- domain assumption: Disclosure risks in the four categories can be mitigated without undermining the value of the disclosures.
Reference graph
Works this paper leans on
- [1]
- [2] Acharya, A. and Delaney, O. Managing risks from internal AI systems. https://www.iaps.ai/research/managing-risks-from-internal-ai-systems, 2025.
- [3] AI Digest. 2025 AI Forecasting Survey. https://ai2025.org/, 2025.
- [4] Alaga, J., Schuett, J., and Anderljung, M. A grading rubric for AI safety frameworks, 2024. URL https://arxiv.org/abs/2409.08751.
- [5] Anthropic. Strengthening our safeguards through collaboration with US CAISI and UK AISI. https://www.anthropic.com/news/strengthening-our-safeguards-through-collaboration-with-us-caisi-and-uk-aisi, 2025.
- [6] Anthropic. Claude's constitution. https://www.anthropic.com/constitution, 2026a.
- [7] Anthropic. Alignment risk update: Claude Mythos preview. https://www.anthropic.com/claude-mythos-preview-risk-report, April 2026b.
- [8] Anthropic. System card: Claude Mythos Preview. https://cdn.sanity.io/files/4zrzovbb/website/7624816413e9b4d2e3ba620c5a5e091b98b190a5.pdf, 2026c.
- [9] Anthropic. System card: Claude Opus 4.6. https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf, 2026d.
- [10] Anthropic. Sabotage risk report: Claude Opus 4.6. https://www-cdn.anthropic.com/f21d93f21602ead5cdbecb8c8e1c765759d9e232.pdf, 2026e.
- [11] Anthropic. Risk report: February 2026. https://www-cdn.anthropic.com/08eca2757081e850ed2ad490e5253e940240ca4f.pdf, 2026f.
- [12] Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., Newman, J., Ng, K. Y., Okolo, C. T., Raji, D., Sastry, G., Seger, E., Skeadas, T., South, T., Strubell, E., ... 2025.
- [13] Bengio, Y., Clare, S., Prunkl, C., Murray, M., Andriushchenko, M., Bucknall, B., Bommasani, R., Casper, S., Davidson, T., Douglas, R., Duvenaud, D., Fox, P., Gohar, U., Hadshar, R., Ho, A., Hu, T., Jones, C., Kapoor, S., Kasirzadeh, A., Manning, S., Maslej, N., Mavroudis, V., McGlynn, C., Moulange, R., Newman, J., Ng, K. Y., Paskov, P., Rismani, S., Sastr... 2026.
- [14] Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., Khlaaf, H., Yang, J., Toner, H., Fong, R., Maharaj, T., Koh, P. W., Hooker, S., Leung, J., Trask, A., Bluemke, E., Lebensold, J., O'Keefe, C., Koren, M., Ryffel, T., Rubinovitz, J., Besiroglu, T., Carugati, F., Clark, J., Eckersley, P., de Haas, S., Johnson, M., Laurie, B., Ingerma...
- [15] Brundage, M., Dreksler, N., Homewood, A., McGregor, S., Paskov, P., Stosz, C., Sastry, G., Cooper, A. F., Balston, G., Adler, S., Casper, S., Anderljung, M., Werner, G., Mindermann, S., Mavroudis, V., Bucknall, B., Stix, C., Freund, J., Pacchiardi, L., Hernandez-Orallo, J., Pistillo, M., Chen, M., Painter, C., Ball, D. W., O'Keefe, C., Weil, G., Harack, B...
- [16] California State Legislature. Senate Bill No. 53: Transparency in Frontier Artificial Intelligence Act. https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202520260SB53, 2025.
- [17] Chan, A., Padarath, R., Kwon, J., Greaves, H., and Anderljung, M. Measuring AI R&D automation, 2026. URL https://arxiv.org/abs/2603.03992.
- [18] Chan, L. AI models can be dangerous before public deployment. https://metr.org/blog/2025-01-17-ai-models-dangerous-before-public-deployment/, January 2025.
- [19] Cloud, A., Le, M., Chua, J., Betley, J., Sztyber-Betley, A., Hilton, J., Marks, S., and Evans, O. Subliminal learning: Language models transmit behavioral traits via hidden signals in data, 2025. URL https://arxiv.org/abs/2507.14805.
- [20] Davidson, T., Finnveden, L., and Hadshar, R. AI-enabled coups: How a small group could use AI to seize power, 2025. URL https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power. Accessed 2026-04-22.
- [21] Deng, X., Da, J., Pan, E., He, Y. Y., Ide, C., Garg, K., Lauffer, N., Park, A., Pasari, N., Rane, C., Sampath, K., Krishnan, M., Kundurthy, S., Hendryx, S., Wang, Z., Bharadwaj, V., Holm, J., Aluri, R., Zhang, C. B. C., Jacobson, N., Liu, B., and Kenstler, B. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025. URL https://ar...
- [22] Eth, D. and Davidson, T. Will AI R&D automation cause a software intelligence explosion?, 2025. URL https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion. Accessed 2026-04-24.
- [23] European Commission. The general-purpose AI code of practice: Safety and security chapter. https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai, 2025.
- [24] Field, S., Douglas, R., and Krueger, D. AI researchers' views on automating AI R&D and intelligence explosions, 2026. URL https://arxiv.org/abs/2603.03338.
- [25] Financial Conduct Authority. SUP 5.6: Confidential information and privilege, 2016. URL https://www.handbook.fca.org.uk/handbook/SUP/5/6.html?date=2016-03-07. Last amended 7 March 2016; section within Chapter 5 (Reports by skilled persons) of the Supervision Manual, FCA Handbook.
- [26] Google DeepMind. Accelerating mathematical and scientific discovery with Gemini Deep Think. https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think/, 2026a.
- [27] Google DeepMind. Gemini 3.1 Pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf, 2026b.
- [28] Guan, M. Y., Wang, M., Carroll, M., Dou, Z., Wei, A. Y., Williams, M., Arnav, B., Huizinga, J., Kivlichan, I., Glaese, M., Pachocki, J., and Baker, B. Monitoring monitorability, 2025. URL https://arxiv.org/abs/2512.18311.
- [29] Homewood, A., Williams, S., Dreksler, N., Lidiard, J., Murray, M., Heim, L., Ziosi, M., hÉigeartaigh, S. Ó., Chen, M., Wei, K., Winter, C., Brundage, M., Garfinkel, B., and Schuett, J. Third-party compliance reviews for frontier AI safety frameworks, 2025. URL https://arxiv.org/abs/2505.01643.
- [30] ISO/IEC. Information security, cybersecurity and privacy protection – Information security management systems – Requirements, 2022. URL https://www.iso.org/standard/27001.
- [31] Jagadeeswari, M., Karthi, P., Nitish Kumar, V., and Ram, S. S. A secure file sharing and audit trail tracking platform with advanced encryption standard for cloud-based environments. In 2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC), pp. 540–547, 2023. doi:10.1109/ICESC57686.2023.10193389.
- [32] Kinniment, M., Nix, S., Broadley, T., Wijk, H., and Parikh, N. Early work on monitorability evaluations. https://metr.org/blog/2026-01-19-early-work-on-monitorability-evaluations/, January 2026.
- [33] Kolt, N., Anderljung, M., Barnhart, J., Brass, A., Esvelt, K., Hadfield, G. K., Heim, L., Rodriguez, M., Sandbrink, J. B., and Woodside, T. Responsible reporting for frontier AI development. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 7(1): 768–783, October 2024. doi:10.1609/aies.v7i1.31678. URL https://ojs.aaai.org/index.php/AIE...
- [34]
- [35] Kwon, J. and Casper, S. Internal deployment gaps in AI regulation, 2026. URL https://arxiv.org/abs/2601.08005.
- [36] METR. Review of the Anthropic summer 2025 pilot sabotage risk report. https://alignment.anthropic.com/2025/sabotage-risk-report/2025_pilot_risk_report_metr_review.pdf, 2025.
- [37] METR. What should companies share about risks from frontier AI models? https://metr.org/blog/2025-06-27-risk-transparency/, June 2025.
- [38] New York State Legislature. Responsible AI Safety and Education (RAISE) Act, 6453-A. https://www.nysenate.gov/legislation/bills/2025/A6453/amendment/A, 2025.
- [39] Office of the Comptroller of the Currency. Bank supervision process. Comptroller's Handbook booklet, examination process series, September 2019. URL https://www.occ.gov/publications-and-resources/publications/comptrollers-handbook/files/bank-supervision-process/pub-ch-bank-supervision-process.pdf. Transmitted vi...
- [40] OpenAI. Working with US CAISI and UK AISI to build more secure AI systems. https://openai.com/index/us-caisi-uk-aisi-ai-update/, 2025.
- [41] OpenAI. Introducing GPT-5.3 Codex. https://openai.com/index/introducing-gpt-5-3-codex/, 2026a.
- [42] OpenAI. GPT-5.3 Codex system card. https://deploymentsafety.openai.com/gpt-5-3-codex/gpt-5-3-codex.pdf, 2026b.
- [43] OpenAI. How we monitor internal coding agents for misalignment. https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/, 2026c.
- [44] OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, 2026d.
- [45] OpenAI. OpenAI's Raising Concerns Policy. https://openai.com/index/openai-raising-concerns-policy/, 2026e.
- [46] Ord, T. Inference scaling reshapes AI governance, 2025. URL https://arxiv.org/abs/2503.05705.
- [47] Rein, D. Red-teaming Anthropic's internal agent monitoring systems. https://metr.org/blog/2026-03-25-red-teaming-anthropic-agent-monitoring/, March 2026.
- [48] Shlegeris, B. The Thinking Machines Tinker API is good news for AI control and security. https://blog.redwoodresearch.org/p/the-thinking-machines-tinker-api, 2025. Redwood Research blog.
- [49] Stix, C., Pistillo, M., Sastry, G., Hobbhahn, M., Ortega, A., Balesni, M., Hallensleben, A., Goldowsky-Dill, N., and Sharkey, L. AI behind closed doors: A primer on the governance of internal deployment, 2025. URL https://arxiv.org/abs/2504.12170.
- [50] Toner, H., Beers, K., Newman, S., Khan, S. M., Shea-Blymyer, C., Yee, E., Acharya, A., Fisher, K., Scholl, K., Wildeford, P., Greenblatt, R., Albanie, S., Ballard, S., and Larsen, T. When AI builds AI: Findings from a workshop on automation of AI R&D. Technical report, Center for Security and Emerging Technology, January 2026. URL https://cset.georgetow...
- [51] U.S. Food and Drug Administration. 21 CFR 20.61: Trade secrets and commercial or financial information which is privileged or confidential. https://www.ecfr.gov/current/title-21/chapter-I/subchapter-A/part-20/subpart-D/section-20.61, 2026. Electronic Code of Federal Regulations.
- [52] Vinten, G., Neidermeyer, A. A., and Neidermeyer, P. E. Audit anticipation: Does it impact job performance? Managerial Auditing Journal, 20(1): 19–29, January 2005. ISSN 0268-6902. doi:10.1108/02686900510570669. URL https://doi.org/10.1108/02686900510570669.
- [53] Wijk, H., Lin, T. R., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J. M., Dhyani, J., Ericheva, E., Garcia, K., Goodrich, B., Jurkovic, N., Kinniment, M., Lajko, A., Nix, S., Koba Sato, L. J., Saunders, W., Taran, M., West, B., and Barnes, E. RE-Bench: Evaluating frontier AI R&D capabilities of language model agents again... 2025.
- [54] Williams, M., Raymond, C., Carroll, M., and Team, S. O. Sidestepping evaluation awareness and anticipating misalignment with production evaluations. https://alignment.openai.com/prod-evals/, 2025a. OpenAI Alignment Blog.
- [55] Williams, S., Dreksler, N., Homewood, A., Anderljung, M., and Freund, J. Assessing risk relative to competitors: An analysis of current AI company policies. https://www.governance.ai/research-paper/assessing-risk-relative-to-competitors-an-analysis-of-current-ai-company-policies, 2025b. GovAI research paper.