From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI

Jianbing Ni; Jie Cao; Lingshuang Liu; Qi Li; Zelin Zhang

arxiv: 2605.16471 · v1 · pith:UMJJF7HYnew · submitted 2026-05-15 · 💻 cs.CR

Zelin Zhang, Qi Li, Jie Cao, Lingshuang Liu, Jianbing Ni

Pith reviewed 2026-05-20 18:02 UTC · model grok-4.3

classification 💻 cs.CR

keywords generative AI · security threats · agentic AI · AI safety · tool use · countermeasures · attack surface · governance

The pith

As generative AI shifts from content creation to executing actions, security threats expand faster than defenses can keep up.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative AI is moving beyond producing text or images to retrieving data, calling tools, and taking real-world actions through external systems. This paper reviews the security and safety threats that arise at content, model, and agent levels, tracking how attacker access needs, system autonomy, and potential damage all increase in this progression. It evaluates defenses such as detection, watermarking, alignment, and new agentic safeguards, and observes that many of these require coordination across institutions that current arrangements do not support. The core observation is that new capabilities and larger attack surfaces consistently appear ahead of effective protections.

Core claim

Generative AI systems are increasingly used not only to produce content but also to retrieve data, invoke tools, and execute actions. This work examines the security and safety implications of that shift across content-level, model-level, and agentic threats. It analyzes how attacker access requirements, system autonomy, and the scope of potential harm change as models move from generating artifacts to executing operations through tool chains and external APIs. It then assesses technical countermeasures including detection, watermarking, alignment, and emerging agentic safeguards, and shows that several depend on forms of institutional coordination that current governance arrangements do not yet provide.

What carries the argument

Staged threat analysis progressing from content-level to model-level to agentic threats, which maps rising attacker requirements, autonomy, and harm scope while checking whether defenses advance at the same rate.

Load-bearing premise

The specific cases reviewed stand in for the general move toward agentic systems and the listed countermeasures cover the main technical options now available.

What would settle it

A report or study documenting large-scale agentic AI deployments where new safeguards have measurably reduced security incidents would test whether defenses are truly lagging.

Figures

Figures reproduced from arXiv: 2605.16471 by Jianbing Ni, Jie Cao, Lingshuang Liu, Qi Li, Zelin Zhang.

**Figure 2.** AIGC threat taxonomy. The adversarial manipulation branch (highlighted) spans direct prompt injection, ...

**Figure 3.** Layered defense architecture for AIGC security. Maturity decreases from top to bottom. Maturity levels ...

**Figure 4.** Observed ordering across capability deployment, defensive response, and governance frameworks. The 14- ...

Original abstract

Generative AI systems are increasingly used not only to produce content but also to retrieve data, invoke tools, and execute actions. This work examines the security and safety implications of that shift across content-level, model-level, and agentic threats. We analyze how attacker access requirements, system autonomy, and the scope of potential harm change as models move from generating artifacts to executing operations through tool chains and external APIs. We then assess technical countermeasures including detection, watermarking, alignment, and emerging agentic safeguards, and show that several depend on forms of institutional coordination that current governance arrangements do not yet provide. Across the cases examined, capability deployment and attack-surface expansion repeatedly outpace defensive responses as systems move from generating content to executing real-world actions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript surveys the security and safety implications of generative AI shifting from content generation to agentic systems that retrieve data, invoke tools, and execute real-world actions via APIs. It categorizes threats at content, model, and agentic levels, analyzes changes in attacker access requirements, autonomy, and harm scope, evaluates countermeasures such as detection, watermarking, alignment, and agentic safeguards, and concludes that capability deployment and attack-surface expansion repeatedly outpace defensive responses across the examined cases, with many countermeasures depending on institutional coordination not yet provided by current governance.

Significance. If the observed patterns hold, the paper offers a timely synthesis of escalating risks in the move toward agentic AI, which could help frame future empirical work and policy discussions in AI security. Its structured progression from content-level to action-level threats provides a useful organizing framework. As a qualitative survey without new empirical data, derivations, or falsifiable predictions, its primary value is in highlighting gaps rather than resolving them; no machine-checked proofs or reproducible code are present.

major comments (2)

[Abstract] Abstract: The load-bearing claim that 'capability deployment and attack-surface expansion repeatedly outpace defensive responses' across examined cases is presented as an observational conclusion but lacks documented case selection criteria, quantitative metrics (e.g., timelines or capability deltas), or explicit counter-examples considered and rejected. This directly affects the generalizability of the central thesis.
[Countermeasures discussion] Countermeasures assessment: The statement that several defenses 'depend on forms of institutional coordination that current governance arrangements do not yet provide' is central to the safety implications but is not supported by concrete references to specific governance mechanisms, existing standards, or failed coordination attempts, leaving the practical barrier claim under-specified.
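
The timeline metric requested in the first major comment could be made concrete along the following lines. This is a minimal sketch, not the authors' method: the `Case` record, the `summarize` helper, and every row in `CASES` are hypothetical placeholders invented for illustration, and none of the dates come from the manuscript.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Case:
    """One examined case (hypothetical schema, not taken from the paper)."""
    name: str
    capability_deployed: date
    defense_available: Optional[date]  # None means no defense identified yet

    def lag_days(self) -> Optional[int]:
        """Days from capability deployment to the first defense, if any."""
        if self.defense_available is None:
            return None
        return (self.defense_available - self.capability_deployed).days


def summarize(cases):
    """Count how often defenses trail deployment and report the median lag."""
    lags = [c.lag_days() for c in cases if c.lag_days() is not None]
    return {
        "cases": len(cases),
        "undefended": sum(c.defense_available is None for c in cases),
        "defense_lagged": sum(lag > 0 for lag in lags),
        "median_lag_days": median(lags) if lags else float("nan"),
    }


# Placeholder rows only: these dates are invented, not findings.
CASES = [
    Case("case-a", date(2024, 1, 1), date(2024, 7, 1)),  # defense trails
    Case("case-b", date(2024, 3, 1), date(2024, 2, 1)),  # defense precedes
    Case("case-c", date(2025, 1, 1), None),              # no defense yet
]

summary = summarize(CASES)
print(summary)
```

Recording cases where the defense precedes deployment (a negative lag) alongside the lagging ones is what would let the "repeatedly outpace" claim be checked against counter-examples rather than asserted.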

minor comments (2)

[Introduction] The manuscript would benefit from an explicit early definition or taxonomy of 'agentic action' versus prior generative capabilities to improve accessibility for readers outside the immediate subfield.
[References] Some citations on rapidly evolving topics (e.g., alignment techniques and agentic safeguards) appear to stop short of the most recent preprints; updating the reference list would strengthen the survey character.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's constructive report. We appreciate the feedback on strengthening the presentation of our qualitative survey and have addressed each major comment below with planned revisions to improve transparency and support for key claims.

Referee: [Abstract] Abstract: The load-bearing claim that 'capability deployment and attack-surface expansion repeatedly outpace defensive responses' across examined cases is presented as an observational conclusion but lacks documented case selection criteria, quantitative metrics (e.g., timelines or capability deltas), or explicit counter-examples considered and rejected. This directly affects the generalizability of the central thesis.

Authors: We thank the referee for this observation. As a qualitative survey synthesizing existing literature rather than an empirical study, the manuscript does not introduce new quantitative metrics or a formal meta-analysis. To address the concern, we will revise the abstract and introduction to explicitly document case selection criteria, focusing on representative, high-profile examples from peer-reviewed works and public reports (2022-2024) that illustrate the shift to agentic systems. We will also add a brief discussion of considered counter-examples, such as partial successes in content detection that have not extended to tool-using agents, to clarify the observational scope and limits on generalizability.
revision: yes
Referee: [Countermeasures discussion] Countermeasures assessment: The statement that several defenses 'depend on forms of institutional coordination that current governance arrangements do not yet provide' is central to the safety implications but is not supported by concrete references to specific governance mechanisms, existing standards, or failed coordination attempts, leaving the practical barrier claim under-specified.

Authors: We agree this claim requires more concrete grounding to support its policy relevance. In the revised manuscript, we will expand the countermeasures section with specific references, including the EU AI Act's systemic risk obligations for general-purpose models, the NIST AI Risk Management Framework's emphasis on multi-stakeholder coordination, and documented challenges in ISO/IEC standards development for AI watermarking. We will also cite examples of coordination shortfalls, such as inconsistent provider adoption of safety standards despite public commitments, to better substantiate the institutional barrier without overstating the claim.
revision: yes

Circularity Check

0 steps flagged

No circularity: qualitative survey without derivations or fitted inputs

The manuscript is a survey paper that reviews security and safety threats as generative AI shifts from content generation to agentic actions. It presents observational patterns across content-level, model-level, and agentic threats plus countermeasures, without any equations, parameter fitting, uniqueness theorems, or self-citation chains that reduce claims to prior results by construction. The central statement that capability deployment outpaces defenses is framed as a summary of examined cases rather than a derived prediction or self-defined quantity. No load-bearing step equates to its own inputs; the analysis remains self-contained as general trend observation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a qualitative survey paper. It introduces no free parameters, mathematical axioms, or new postulated entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


Across the cases examined, capability deployment and attack-surface expansion repeatedly outpace defensive responses as systems move from generating content to executing real-world actions.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


We organize this analysis as a complexity progression defined by three escalation dimensions (Table 3)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
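
For readers who process these tags programmatically, the legend above can be written down as a small enumeration. This is a sketch under stated assumptions: `RelationTag`, `parse_tag`, and `backs_claim` are hypothetical names, not part of any Pith API, and treating only `matches` and `supports` as backing a claim is an illustrative reading of the legend rather than a documented rule.

```python
from enum import Enum


class RelationTag(Enum):
    """Relation between a paper passage and a cited theorem (per the legend)."""
    MATCHES = "matches"          # claim directly supported by a theorem
    SUPPORTS = "supports"        # theorem supports part of the argument
    EXTENDS = "extends"          # paper goes beyond the formal theorem
    USES = "uses"                # paper relies on the theorem as machinery
    CONTRADICTS = "contradicts"  # claim conflicts with the canon
    UNCLEAR = "unclear"          # connection too broad or ambiguous to judge


def parse_tag(label: str) -> RelationTag:
    """Map a displayed tag label to its enum member; raises ValueError if unknown."""
    return RelationTag(label.strip().lower())


# Assumption for illustration: only these two tags count as formal backing.
_BACKING = {RelationTag.MATCHES, RelationTag.SUPPORTS}


def backs_claim(label: str) -> bool:
    """True when the tag indicates the theorem supports the paper's claim."""
    return parse_tag(label) in _BACKING


print(backs_claim("unclear"))
```

Under that reading, both theorem links listed for this paper carry the `unclear` tag and would not count as formal support for its claims.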

Reference graph

Works this paper leans on

162 extracted references · 162 canonical work pages · 25 internal anchors

[1]

Automated malware source code generation via uncensored llms and adversarial evasion of censored model

Acosta-Bermejo, R., Terrazas-Chavez, J.A., Aguirre-Anaya, E., 2025. Automated malware source code generation via uncensored llms and adversarial evasion of censored model. Applied Sciences 15, 9252

work page 2025
[2]

Gptzero: Robust detection of llm-generated texts

Adam, G.A., Cui, A., Thomas, E., Napier, E., Shmatko, N., Schnell, J., Tian, J.J., Dronavalli, A., Tian, E., Lee, D., 2026. Gptzero: Robust detection of llm-generated texts. arXiv preprint arXiv:2602.13042

work page arXiv 2026
[3]

Many-shot jailbreaking

Anil, C., Durmus, E., Panickssery, N., Sharma, M., Benton, J., Kundu, S., Batson, J., Tong, M., Mu, J., Ford, D., et al., 2024. Many-shot jailbreaking. Advances in Neural Information Processing Systems 37, 129696–129742

work page 2024
[4]

Introducing Computer Use, a New Claude 3.5 Sonnet, and Claude 3.5 Haiku.https://www.anthropic.com/news/ 3-5-models-and-computer-use

Anthropic, 2024a. Introducing Computer Use, a New Claude 3.5 Sonnet, and Claude 3.5 Haiku.https://www.anthropic.com/news/ 3-5-models-and-computer-use. Accessed: 2026-03-29

work page 2026
[5]

Introducing the Model Context Protocol.https://www.anthropic.com/news/model-context-protocol

Anthropic, 2024b. Introducing the Model Context Protocol.https://www.anthropic.com/news/model-context-protocol. Accessed: 2026-03-29

work page 2026
[6]

Anthropic Acquires Bun as Claude Code Reaches $1B Milestone.https://www.anthropic.com/news/ anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone

Anthropic, 2025a. Anthropic Acquires Bun as Claude Code Reaches $1B Milestone.https://www.anthropic.com/news/ anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone. Claude Code GA May 2025; $1B run-rate revenue by November 2025. Accessed: 2026-03-29

work page 2025
[7]

Disrupting the First Reported AI-Orchestrated Cyber Espionage Campaign.https://www.anthropic.com/news/ disrupting-AI-espionage

Anthropic, 2025b. Disrupting the First Reported AI-Orchestrated Cyber Espionage Campaign.https://www.anthropic.com/news/ disrupting-AI-espionage. Accessed: 2026-03-29

work page 2026
[8]

Donating the Model Context Protocol and Establishing the Agentic AI Foundation.https://www.anthropic

Anthropic, 2025c. Donating the Model Context Protocol and Establishing the Agentic AI Foundation.https://www.anthropic. com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation. Accessed: 2026- 03-29

work page 2026
[9]

Model Context Protocol Specification (v2025-11-25).https://modelcontextprotocol.io/specification/ 2025-11-25

Anthropic, 2025d. Model Context Protocol Specification (v2025-11-25).https://modelcontextprotocol.io/specification/ 2025-11-25. Now maintained by the Agentic AI Foundation under the Linux Foundation. Accessed: 2026-03-29

work page 2025
[10]

Refusal in language models is mediated by a single direction

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., Nanda, N., 2024. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems 37, 136037–136083

work page 2024
[11]

Constitutional AI: Harmlessness from AI Feedback

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al., 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

International ai safety report 2026

Bengio, Y., Clare, S., Prunkl, C., Andriushchenko, M., Bucknall, B., Murray, M., Bommasani, R., Casper, S., Davidson, T., Douglas, R., et al., 2026. International ai safety report 2026. arXiv preprint arXiv:2602.21012

work page arXiv 2026
[13]

Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., et al.,

work page
[14]

arXiv preprint arXiv:2501.17805

International ai safety report. arXiv preprint arXiv:2501.17805

work page arXiv
[15]

Video generation models as world simulators

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A., 2024. Video generation models as world simulators. URL:https://openai.com/index/ video-generation-models-as-world-simulators/

work page 2024
[16]

Bullwinkel, B., Russinovich, M., Salem, A., Zanella-Beguelin, S., Jones, D., Severi, G., Kim, E., Hines, K., Minnich, A., Zunger, Y., et al.,

work page
[17]

arXiv preprint arXiv:2507.02956

A representation engineering perspective on the effectiveness of multi-turn jailbreaks. arXiv preprint arXiv:2507.02956

work page arXiv
[18]

Secure and robust watermarking for ai-generated images: A comprehensive survey

Cao, J., Li, Q., Zhang, Z., Ni, J., 2025. Secure and robust watermarking for ai-generated images: A comprehensive survey. arXiv preprint arXiv:2510.02384

work page arXiv 2025
[19]

Marksweep: A no-box removal attack on ai-generated image watermarking via noise intensification and frequency-aware denoising

Cao, J., Zhang, Z., Li, Q., Ni, J., 2026. Marksweep: A no-box removal attack on ai-generated image watermarking via noise intensification and frequency-aware denoising. arXiv preprint arXiv:2602.15364

work page arXiv 2026
[20]

Poisoning web-scale training datasets is practical, in: 2024 IEEE Symposium on Security and Privacy (SP), IEEE

Carlini, N., Jagielski, M., Choquette-Choo, C.A., Paleka, D., Pearce, W., Anderson, H., Terzis, A., Thomas, K., Tramèr, F., 2024. Poisoning web-scale training datasets is practical, in: 2024 IEEE Symposium on Security and Privacy (SP), IEEE. pp. 407–425

work page 2024
[21]

arXiv preprint arXiv:2503.02857

Chandra,N.A.,Murtfeldt,R.,Qiu,L.,Karmakar,A.,Lee,H.,Tanumihardja,E.,Farhat,K.,Caffee,B.,Paik,S.,Lee,C.,etal.,2025.Deepfake- eval-2024: A multi-modal in-the-wild benchmark of deepfakes circulated in 2024. arXiv preprint arXiv:2503.02857

work page arXiv 2025
[22]

Jailbreakbench: An open robustness benchmark for jailbreaking large language models

Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramer, F., et al., 2024. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems 37, 55005–55029

work page 2024
[23]

Jailbreakingblackboxlargelanguagemodelsintwentyqueries, in: 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), IEEE

Chao,P.,Robey,A.,Dobriban,E.,Hassani,H.,Pappas,G.J.,Wong,E.,2025. Jailbreakingblackboxlargelanguagemodelsintwentyqueries, in: 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), IEEE. pp. 23–42

work page 2025
[24]

Audiojailbreak: Jailbreak attacks against end-to-end large audio-language models

Chen, G., Song, F., Zhao, Z., Jia, X., Liu, Y., Qiao, Y., Zhang, W., Tu, W., Yang, Y., Du, B., 2026. Audiojailbreak: Jailbreak attacks against end-to-end large audio-language models. IEEE Transactions on Dependable and Secure Computing

work page 2026
[25]

Demamba: Ai-generated video detection on million-scale genvideo benchmark

Chen, H., Hong, Y., Huang, Z., Xu, Z., Gu, Z., Li, Y., Lan, J., Zhu, H., Zhang, J., Wang, W., et al., 2024. Demamba: Ai-generated video detection on million-scale genvideo benchmark. arXiv preprint arXiv:2405.19707

work page arXiv 2024
[26]

2383–2400

Chen, S., Piet, J., Sitawarin, C., Wagner, D., 2025.{StruQ}: Defending against prompt injection with structured queries, in: 34th USENIX Security Symposium (USENIX Security 25), pp. 2383–2400

work page 2025
[27]

Revealing weaknesses in text watermarking through self-information rewrite attacks

Cheng, Y., Guo, H., Li, Y., Sigal, L., 2025. Revealing weaknesses in text watermarking through self-information rewrite attacks. arXiv preprint arXiv:2505.05190

work page arXiv 2025
[28]

Deep fakes: A looming challenge for privacy, democracy, and national security

Chesney, B., Citron, D., 2019. Deep fakes: A looming challenge for privacy, democracy, and national security. Calif. L. Rev. 107, 1753

work page 2019
[29]

How to backdoor diffusion models?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Chou, S.Y., Chen, P.Y., Ho, T.Y., 2023. How to backdoor diffusion models?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4015–4024

work page 2023
[30]

Undetectablewatermarksforlanguagemodels,in:TheThirtySeventhAnnualConferenceonLearning Theory, PMLR

Christ,M.,Gunn,S.,Zamir,O.,2024. Undetectablewatermarksforlanguagemodels,in:TheThirtySeventhAnnualConferenceonLearning Theory, PMLR. pp. 1125–1139

work page 2024
[31]

EvaluatingsecurityriskinDeepSeekandotherfrontierreasoningmodels

CiscoRobustIntelligence,2025. EvaluatingsecurityriskinDeepSeekandotherfrontierreasoningmodels. CiscoSecurityBlog.https:// blogs.cisco.com/security/evaluating-security-risk-in-deepseek-and-other-frontier-reasoning-models. In col- laboration with the University of Pennsylvania. Accessed: 2026-03-29. First Author et al.:Preprint submitted to ElsevierPage 20 ...

work page 2025
[32]

C2PA Specification v2.2.https://c2pa.org/specifications/ specifications/2.2/specs/C2PA_Specification.html

Coalition for Content Provenance and Authenticity, 2025. C2PA Specification v2.2.https://c2pa.org/specifications/ specifications/2.2/specs/C2PA_Specification.html. Open standard for digital content provenance. Accessed: 2026-03-29

work page 2025
[33]

InterimMeasuresfortheManagementofGenerativeArtificialIntelligenceServices

CyberspaceAdministrationofChinaandothers,2023. InterimMeasuresfortheManagementofGenerativeArtificialIntelligenceServices. http://www.cac.gov.cn/. Effective August 15, 2023

work page 2023
[34]

Dathathri, S., See, A., Ghaisas, S., Huang, P.S., McAdam, R., Welbl, J., Bachani, V., Kaskasoli, A., Stanforth, R., Matejovicova, T., et al.,

work page
[35]

Nature 634, 818–823

Scalable watermarking for identifying large language model outputs. Nature 634, 818–823

work page
[36]

Asvspoof2021:Automaticspeakerverificationspoofingandcountermeasureschallengeevaluationplan

Delgado, H., Evans, N., Kinnunen, T., Lee, K.A., Liu, X., Nautsch, A., Patino, J., Sahidullah, M., Todisco, M., Wang, X., et al., 2021. Asvspoof2021:Automaticspeakerverificationspoofingandcountermeasureschallengeevaluationplan. arXivpreprintarXiv:2109.00535

work page arXiv 2021
[37]

Multilingualjailbreakchallengesinlargelanguagemodels

Deng,Y.,Zhang,W.,Pan,S.J.,Bing,L.,2023. Multilingualjailbreakchallengesinlargelanguagemodels. arXivpreprintarXiv:2310.06474

work page arXiv 2023
[38]

Dugan,L.,Hwang,A.,Trhlík,F.,Zhu,A.,Ludan,J.M.,Xu,H.,Ippolito,D.,Callison-Burch,C.,2024. Raid:Asharedbenchmarkforrobust evaluationofmachine-generatedtextdetectors,in:Proceedingsofthe62ndAnnualMeetingoftheAssociationforComputationalLinguistics (Volume 1: Long Papers), pp. 12463–12492

work page 2024
[39]

KTO: Model Alignment as Prospect Theoretic Optimization

Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., Kiela, D., 2024. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Regulation(EU)2024/1689–TheAIAct.https://digital-strategy

EuropeanParliamentandCounciloftheEuropeanUnion,2024. Regulation(EU)2024/1689–TheAIAct.https://digital-strategy. ec.europa.eu/en/policies/regulatory-framework-ai. Entered into force August 1, 2024. Accessed: 2026-03-29

work page 2024
[41]

Thestablesignature:Rootingwatermarksinlatentdiffusionmodels,in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

Fernandez,P.,Couairon,G.,Jégou,H.,Douze,M.,Furon,T.,2023. Thestablesignature:Rootingwatermarksinlatentdiffusionmodels,in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22466–22477

work page 2023
[42]

Video seal: Open and efficient video watermarking

Fernandez, P., Elsahar, H., Yalniz, I.Z., Mourachko, A., 2024. Video seal: Open and efficient video watermarking. arXiv preprint arXiv:2412.09492

work page arXiv 2024
[43]

Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b

Gade, P., Lermen, S., Rogers-Smith, C., Ladish, J., 2023. Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b. arXiv preprint arXiv:2311.00117

work page arXiv 2023
[44]

Scaling laws for reward model overoptimization, in: International Conference on Machine Learning, PMLR

Gao, L., Schulman, J., Hilton, J., 2023. Scaling laws for reward model overoptimization, in: International Conference on Machine Learning, PMLR. pp. 10835–10866

work page 2023
[45]

Artificial intelligence - carrying us into the future

Ghosh, S., Frase, H., Williams, A., Luger, S., Röttger, P., Barez, F., McGregor, S., Fricklas, K., Kumar, M., Bollacker, K., et al., 2025. Ailuminate: Introducing v1. 0 of the ai risk and reliability benchmark from mlcommons. arXiv preprint arXiv:2503.05731

work page arXiv 2025
[46]

Figstep: Jailbreaking large vision-language models via typographic visual prompts, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp

Gong, Y., Ran, D., Liu, J., Wang, C., Cong, T., Wang, A., Duan, S., Wang, X., 2025. Figstep: Jailbreaking large vision-language models via typographic visual prompts, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 23951–23959

work page 2025
[47]

Lessons from Defending Gemini Against Indirect Prompt Injections

Google DeepMind, 2025. Advancing Gemini’s Security Safeguards.https://deepmind.google/blog/ advancing-geminis-security-safeguards/. Accompanied by white paper “Lessons from Defending Gemini Against Indirect Prompt Injections”. Accessed: 2026-03-29

work page 2025
[48]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al., 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., Fritz, M., 2023. Not what you’ve signed up for: Compromising real-world llm-integratedapplicationswithindirectpromptinjection,in:Proceedingsofthe16thACMworkshoponartificialintelligenceandsecurity, pp. 79–90

work page 2023
[50]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo,D.,Yang,D.,Zhang,H.,Song,J.,Wang,P.,Zhu,Q.,Xu,R.,Zhang,R.,Ma,S.,Bi,X.,etal.,2025. Deepseek-r1:Incentivizingreasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

CursorVulnerability(CVE-2025-59944):HowaCase-SensitivityBugExposedtheRisksofAgenticDeveloperTools

Gustafson,B.,2025. CursorVulnerability(CVE-2025-59944):HowaCase-SensitivityBugExposedtheRisksofAgenticDeveloperTools. LakeraBlog.https://www.lakera.ai/blog/cursor-vulnerability-cve-2025-59944. CVE-2025-59944.Accessed:2026-03-29

work page 2025
[52]

Spotting llms with binoculars: Zero-shot detection of machine-generated text, 2024

Hans, A., Schwarzschild, A., Cherepanova, V., Kazemi, H., Saha, A., Goldblum, M., Geiping, J., Goldstein, T., 2024. Spotting llms with binoculars: Zero-shot detection of machine-generated text. arXiv preprint arXiv:2401.12070

work page arXiv 2024
[53]

Spear phishing with large language models,

Hazell, J., 2023. Spear phishing with large language models. arXiv preprint arXiv:2305.06972

work page arXiv 2023
[54]

Lora:Low-rankadaptationoflargelanguage models

Hu,E.J.,Shen,Y.,Wallis,P.,Allen-Zhu,Z.,Li,Y.,Wang,S.,Wang,L.,Chen,W.,etal.,2022. Lora:Low-rankadaptationoflargelanguage models. Iclr 1, 3

work page 2022
[55]

Videoshield: Regulating diffusion-based video generation models via watermarking

Hu, R., Zhang, J., Li, Y., Li, J., Guo, Q., Qiu, H., Zhang, T., 2025. Videoshield: Regulating diffusion-based video generation models via watermarking. arXiv preprint arXiv:2501.14195

work page arXiv 2025
[56]

Radar: Robust ai-text detection via adversarial learning

Hu, X., Chen, P.Y., Ho, T.Y., 2023. Radar: Robust ai-text detection via adversarial learning. Advances in neural information processing systems 36, 15077–15095

work page 2023
[57]

Safety tax: Safety alignment makes your large reasoning models less reasonable

Huang, T., Hu, S., Ilhan, F., Tekin, S.F., Yahn, Z., Xu, Y., Liu, L., 2025. Safety tax: Safety alignment makes your large reasoning models less reasonable. arXiv preprint arXiv:2503.00555

work page arXiv 2025
[58]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Hubinger,E.,Denison,C.,Mu,J.,Lambert,M.,Tong,M.,MacDiarmid,M.,Lanham,T.,Ziegler,D.M.,Maxwell,T.,Cheng,N.,etal.,2024. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Safetensors:ASimple,SafeWaytoStoreandDistributeTensors.https://huggingface.co/docs/safetensors

HuggingFace,2023. Safetensors:ASimple,SafeWaytoStoreandDistributeTensors.https://huggingface.co/docs/safetensors. Accessed: 2026-03-29

work page 2023
[60]

Security at Hugging Face.https://huggingface.co/docs/hub/security

Hugging Face, 2024. Security at Hugging Face.https://huggingface.co/docs/hub/security. Accessed: 2026-03-29

work page 2024
[61]

Qwen2.5-Coder Technical Report

Hui,B.,Yang,J.,Cui,Z.,Yang,J.,Liu,D.,Zhang,L.,Liu,T.,Zhang,J.,Yu,B.,Lu,K.,etal.,2024. Qwen2.5-codertechnicalreport. arXiv preprint arXiv:2409.12186

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., et al., 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023
[63]

Model AI Governance Framework for Agentic AI

Infocomm Media Development Authority (IMDA), 2026. Model AI Governance Framework for Agentic AI. Technical Report. Government of Singapore. URL:https://www.imda.gov.sg/-/media/imda/files/about/emerging-tech-and-research/ First Author et al.:Preprint submitted to ElsevierPage 21 of 25 Security and Safety Threats in Generative AI artificial-intelligence/mgf...

work page 2026
[64]

Ai-generatedvideodetectionviaperceptualstraightening

Internò,C.,Geirhos,R.,Olhofer,M.,Liu,S.,Hammer,B.,Klindt,D.,2025. Ai-generatedvideodetectionviaperceptualstraightening. arXiv preprint arXiv:2507.00583

work page arXiv 2025
[65]

MCP Security Notification: Tool Poisoning Attacks.https://invariantlabs.ai/blog/ mcp-security-notification-tool-poisoning-attacks

Invariant Labs, 2025a. MCP Security Notification: Tool Poisoning Attacks.https://invariantlabs.ai/blog/ mcp-security-notification-tool-poisoning-attacks. Accessed: 2026-03-29

work page 2026
[66] Invariant Labs, 2025b. WhatsApp MCP Exploited: Exfiltrating Your Message History via MCP. https://invariantlabs.ai/blog/whatsapp-mcp-exploited. Accessed: 2026-03-29

[67] JFrog Security Research, 2025. Critical mcp-remote RCE Vulnerability (CVE-2025-6514). JFrog Blog. https://jfrog.com/blog/2025-6514-critical-mcp-remote-rce-vulnerability/. CVSS 9.6. Accessed: 2026-03-29

[68] Jiang, F., Xu, Z., Niu, L., Xiang, Z., Ramasubramanian, B., Li, B., Poovendran, R., 2024. Artprompt: Ascii art-based jailbreak attacks against aligned llms, in: Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pp. 15157–15173

[69] Kaleli, B., Farooqi, S., Starov, O., Mohamed, N., 2026. Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild. Palo Alto Networks Unit 42. https://unit42.paloaltonetworks.com/ai-agent-prompt-injection/. Accessed: 2026-03-29

[70] Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., Goldstein, T., 2023. A watermark for large language models, in: International conference on machine learning, PMLR. pp. 17061–17084

[71] Li, X., Wu, X., Li, Q., Ni, J., Lu, R., 2025. Safellm: Unlearning harmful outputs from large language models against jailbreak attacks. arXiv preprint arXiv:2508.15182

[72] Li, Y., Guo, H., Zhou, K., Zhao, W.X., Wen, J.R., 2024a. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models, in: European Conference on Computer Vision, Springer. pp. 174–189

[73] Li, Z., Hua, W., Wang, H., Zhu, H., Zhang, Y., 2024b. Formal-llm: Integrating formal language and natural language for controllable llm-based agents. arXiv preprint arXiv:2402.00798

[74] Liu, X., Xu, N., Chen, M., Xiao, C., 2023. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451

[75] Lu, P., Li, Q., Zhu, H., Sovernigo, G., Lin, X., 2021. Voxstructor: Voice reconstruction from voiceprint, in: International Conference on Information Security, Springer. pp. 374–397

[76] Lupinacci, M., Pironti, F.A., Blefari, F., Romeo, F., Arena, L., Furfaro, A., 2025. The dark side of llms: Agent-based attacks for complete computer takeover. arXiv preprint arXiv:2507.06850

[77] Macko, D., Moro, R., Uchendu, A., Lucas, J., Yamashita, M., Pikuliak, M., Srba, I., Le, T., Lee, D., Simko, J., et al., 2023. Multitude: Large-scale multilingual machine-generated text detection benchmark, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9960–9987

[78] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al., 2024. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249

[79] Mei, K., Zhu, X., Xu, W., Hua, W., Jin, M., Li, Z., Xu, S., Ye, R., Ge, Y., Zhang, Y., 2024. Aios: Llm agent operating system. arXiv preprint arXiv:2403.16971

[80] Meng, Y., Xia, M., Chen, D., 2024. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems 37, 124198–124235
