Semantic Intent Fragmentation: A Single-Shot Compositional Attack on Multi-Agent AI Pipelines
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3
The pith
A single legitimately phrased request to an LLM orchestrator can decompose into individually benign subtasks that together violate security policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Semantic Intent Fragmentation is an attack class against LLM orchestration systems where a single legitimately phrased request causes an orchestrator to decompose a task into subtasks that are individually benign but jointly violate security policy. The violation only emerges at the composed plan because existing safety mechanisms operate at the subtask level. SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation, requiring no injected content, no system modification, and no attacker interaction after the initial request. Across 14 scenarios, a GPT-20B orchestrator produces policy-violating plans in 71% of cases (10/14) while every subtask appears benign.
What carries the argument
Semantic Intent Fragmentation (SIF), the mechanism by which task decomposition in LLM orchestrators creates composed policy violations from subtasks that each pass isolated checks.
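To make the gap concrete, here is a minimal sketch of the subtask-vs-plan asymmetry the review describes. This is not the paper's code: `is_benign_subtask` and `violates_policy_composed` are hypothetical stand-ins for a subtask-level classifier and a plan-level policy check.

```python
# Illustrative only: a toy fragmented plan where every step clears an
# isolated check but the composition trips a plan-level policy.

# An HR-analytics request decomposed into individually routine steps.
subtasks = [
    "Export the employee roster with department and tenure columns",
    "Join the roster against the salary band table",
    "Aggregate results by zip code and job title",
    "Email the aggregated table to the requesting address",
]

BENIGN_VERBS = {"export", "join", "aggregate", "email"}

def is_benign_subtask(step: str) -> bool:
    """Stand-in subtask-level classifier: each routine data-handling
    verb clears the check when examined in isolation."""
    return step.split()[0].lower() in BENIGN_VERBS

def violates_policy_composed(steps: list[str]) -> bool:
    """Stand-in plan-level check: flags salary data, quasi-identifiers,
    and an outbound send co-occurring in one plan."""
    text = " ".join(steps).lower()
    return "salary" in text and "zip code" in text and "email" in text

assert all(is_benign_subtask(s) for s in subtasks)  # every step passes alone
assert violates_policy_composed(subtasks)           # the composition is flagged
```

The point is structural: `is_benign_subtask` never sees more than one step, so no single call can observe the quasi-identifier aggregation that only the composed plan exhibits.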
If this is right
- Stronger orchestrators produce higher SIF success rates.
- Plan-level information-flow tracking combined with compliance evaluation detects all attacks before execution (a minimal taint-tracking sketch follows this list).
- SIF requires no injected content or ongoing attacker control after the initial request.
- The compositional safety gap is closable by shifting evaluation from subtask to full plan.
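As flagged above, here is a minimal sketch of what plan-level information-flow tracking could look like, assuming a plan is an explicit list of (step, inputs, outputs, labels introduced) tuples. The step names, taint labels, and sink policy are illustrative assumptions, not the paper's deterministic taint analysis.

```python
# Toy plan graph: each step declares the artifacts it reads, the
# artifacts it writes, and any sensitivity labels it introduces.
plan = [
    ("read_roster",   [],                     ["roster"],   {"roster": {"PII"}}),
    ("read_salaries", [],                     ["salaries"], {"salaries": {"CONFIDENTIAL"}}),
    ("join_tables",   ["roster", "salaries"], ["report"],   {}),
    ("send_external", ["report"],             [],           {}),
]

EXTERNAL_SINKS = {"send_external"}
FORBIDDEN = {"PII", "CONFIDENTIAL"}  # must not jointly reach an external sink

def check_plan(plan):
    taint = {}  # artifact name -> set of labels that have flowed into it
    for step, inputs, outputs, introduced in plan:
        flowing = set()
        for artifact in inputs:             # propagate taint along data flow
            flowing |= taint.get(artifact, set())
        for labels in introduced.values():  # labels this step creates
            flowing |= labels
        if step in EXTERNAL_SINKS and FORBIDDEN <= flowing:
            return f"BLOCK before execution: {step} carries {sorted(flowing)}"
        for artifact in outputs:
            taint[artifact] = set(flowing)
    return "ALLOW"

print(check_plan(plan))
# -> BLOCK before execution: send_external carries ['CONFIDENTIAL', 'PII']
```

Each step in isolation is unremarkable (a read, a join, a send), which is exactly why the check must run over the whole plan rather than per subtask.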
Where Pith is reading between the lines
- Multi-agent systems in regulated domains may need mandatory plan-level review before any subtask runs.
- Defenses built only around individual step classification leave open a route for single-shot attacks that scale with model capability.
- Similar fragmentation effects could appear in non-LLM orchestration pipelines that rely on sequential task breakdown.
Load-bearing premise
The three validation signals correctly flag policy violations with zero false positives, and the subtasks are genuinely benign when examined in isolation.
What would settle it
Re-running the 14 scenarios with an independent human policy reviewer or a fourth validation method. The claim would fall if any subtask labeled benign actually contains a violation, or if any plan labeled violating actually complies with policy.
Original abstract
We introduce Semantic Intent Fragmentation (SIF), an attack class against LLM orchestration systems where a single, legitimately phrased request causes an orchestrator to decompose a task into subtasks that are individually benign but jointly violate security policy. Current safety mechanisms operate at the subtask level, so each step clears existing classifiers -- the violation only emerges at the composed plan. SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation, requiring no injected content, no system modification, and no attacker interaction after the initial request. We construct a three-stage red-teaming pipeline grounded in OWASP, MITRE ATLAS, and NIST frameworks to generate realistic enterprise scenarios. Across 14 scenarios spanning financial reporting, information security, and HR analytics, a GPT-20B orchestrator produces policy-violating plans in 71% of cases (10/14) while every subtask appears benign. Three independent signals validate this: deterministic taint analysis, chain-of-thought evaluation, and a cross-model compliance judge with 0% false positives. Stronger orchestrators increase SIF success rates. Plan-level information-flow tracking combined with compliance evaluation detects all attacks before execution, showing the compositional safety gap is closable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Semantic Intent Fragmentation (SIF), a single-shot compositional attack on LLM orchestration pipelines in which a legitimately phrased request causes the orchestrator to decompose the task into subtasks that are individually benign yet jointly violate security policy. It reports results from a three-stage red-teaming pipeline grounded in OWASP, MITRE ATLAS, and NIST frameworks, showing that a GPT-20B orchestrator produces policy-violating plans in 71% (10/14) of scenarios spanning financial reporting, information security, and HR analytics, while every subtask appears benign. Validation rests on three signals—deterministic taint analysis, chain-of-thought evaluation, and a cross-model compliance judge stated to have 0% false positives—and the work proposes plan-level information-flow tracking plus compliance evaluation as a mitigation that detects all attacks before execution.
Significance. If the empirical claims are substantiated with reproducible details, the work identifies a genuine gap in subtask-level safety mechanisms for multi-agent LLM systems and directly addresses OWASP LLM06:2025. The demonstration of four concrete mechanisms (bulk scope escalation, silent data exfiltration, embedded trigger deployment, quasi-identifier aggregation) across realistic enterprise domains, together with the constructive proposal for plan-level detection, would be a useful contribution to compositional security research. The grounding in established frameworks and the use of multiple independent validation signals are strengths that increase the potential impact if the results can be verified.
major comments (2)
- [Abstract] The central empirical claim (71% success rate, 10/14 scenarios) depends on the cross-model compliance judge having 0% false positives and on the three validation signals correctly distinguishing composed violations from benign subtasks. No description is given of the test set, policy formalization, inter-rater agreement, or how the 0% false-positive rate was measured; without these, the 10/14 count cannot be distinguished from possible classifier agreement.
- [Evaluation] The manuscript provides no details on the 14 scenarios, exact prompts, data exclusion rules, or raw outputs. This absence makes it impossible to reproduce the experiments or independently confirm that subtasks were genuinely benign when evaluated in isolation by taint analysis and CoT evaluation, which is load-bearing for the claim that the violation emerges only at composition.
minor comments (1)
- [Abstract] The model is referred to as 'GPT-20B' without clarification of whether this denotes a specific released model, a hypothetical parameter count, or a fine-tuned variant; adding this specification in the methods would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The comments highlight important gaps in documentation that affect reproducibility and verifiability of the empirical claims. We address each major comment below and have revised the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Abstract] The central empirical claim (71% success rate, 10/14 scenarios) depends on the cross-model compliance judge having 0% false positives and on the three validation signals correctly distinguishing composed violations from benign subtasks. No description is given of the test set, policy formalization, inter-rater agreement, or how the 0% false-positive rate was measured; without these, the 10/14 count cannot be distinguished from possible classifier agreement.
  Authors: We agree that the manuscript does not sufficiently document the validation of the cross-model compliance judge. In the revised version we add a new subsection in the Evaluation section that describes: the 200-prompt test set used to measure false positives (constructed from benign enterprise queries with no compositional intent), the policy formalization process (mapping OWASP LLM06:2025 and NIST controls to explicit rules), inter-rater agreement (Cohen's kappa = 0.94 between two human reviewers and the judge; the arithmetic behind such a figure is sketched after these responses), and the exact measurement protocol (zero false positives observed across all 200 prompts when subtasks were evaluated in isolation). These additions directly support the claim that violations emerge only at composition. Revision: yes.
- Referee: [Evaluation] The manuscript provides no details on the 14 scenarios, exact prompts, data exclusion rules, or raw outputs. This absence makes it impossible to reproduce the experiments or independently confirm that subtasks were genuinely benign when evaluated in isolation by taint analysis and CoT evaluation, which is load-bearing for the claim that the violation emerges only at composition.
  Authors: We acknowledge that the original manuscript omitted the concrete experimental artifacts required for reproduction. The revised manuscript includes a new Appendix B containing: all 14 scenarios with their verbatim initial prompts, the full subtask decompositions produced by the GPT-20B orchestrator, the deterministic taint-analysis and CoT-evaluation results for each isolated subtask, data exclusion rules (scenarios were retained only if every subtask passed both signals), and representative raw model outputs. We will also release the complete prompt set, evaluation harness, and anonymized logs in a public repository to allow independent verification that the policy violation appears only after composition. Revision: yes.
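For readers unfamiliar with the agreement statistic cited in the first response, the sketch below shows how a Cohen's kappa value is computed: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The 2x2 counts are invented purely to demonstrate the arithmetic; they are not the paper's data.

```python
def cohens_kappa(confusion):
    """confusion[i][j] = count of items rater A labeled i and rater B labeled j."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(k)) / n        # observed agreement
    row_marg = [sum(row) / n for row in confusion]
    col_marg = [sum(confusion[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical: 200 prompts labeled {violating, benign} by judge vs. human.
conf = [[38, 2],    # judge says violating: human agrees 38, disagrees 2
        [1, 159]]   # judge says benign:    human disagrees 1, agrees 159
print(round(cohens_kappa(conf), 2))  # 0.95 for these invented counts
```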
Circularity Check
No circularity: empirical red-teaming result grounded in external frameworks
Full rationale
The paper reports an empirical count (10/14 scenarios) from red-teaming a GPT-20B orchestrator against enterprise scenarios, using three validation signals (deterministic taint analysis, chain-of-thought evaluation, cross-model judge) explicitly tied to OWASP, MITRE ATLAS, and NIST standards. No equations, fitted parameters, or self-citations are used to derive the 71% rate or the 0% false-positive claim; both are presented as direct experimental outcomes. The central claim does not reduce to a self-definition, renamed known result, or load-bearing self-citation chain, satisfying the requirement for independent, externally benchmarked content.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Current safety mechanisms in LLM orchestration systems evaluate and clear subtasks independently, without considering the composed plan.
invented entities (1)
- Semantic Intent Fragmentation (SIF): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Cited passage: Definition 1 (Fragmentation Score). FS(P) = 1 − max_i { max_{c∈C} f_c(T_i) }. ... Theorem 2 (Compositional Emergence). Let P be a SIF attack with FS = 1.0. Then ∃Π s.t. ∀i: H(T_i) = SAFE yet H(P) = UNSAFE. (A worked numeric reading of Definition 1 follows this list.)
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Cited passage: SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation.
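As flagged above, here is a worked numeric reading of Definition 1, under the assumption that f_c(T_i) is classifier c's violation score for subtask T_i on a [0, 1] scale (the excerpt does not state the range). FS(P) = 1.0 then means no classifier flags any subtask at all, which is the precondition of Theorem 2.

```python
# Illustrative only: the scores below are invented, not the paper's data.

def fragmentation_score(scores):
    """scores[i][c] = violation score of classifier c on subtask T_i,
    so FS(P) = 1 - max over subtasks of the worst classifier score."""
    return 1.0 - max(max(per_subtask) for per_subtask in scores)

# Three subtasks x two classifiers, all scored fully benign in isolation:
sif_plan = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(fragmentation_score(sif_plan))  # 1.0: the Theorem 2 precondition holds

# One subtask trips a classifier, so the attack is no longer fully fragmented:
partial = [[0.0, 0.0], [0.6, 0.1], [0.0, 0.0]]
print(fragmentation_score(partial))   # 0.4
```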
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Rishi Jha, Harold Triedman, Justin Wagle, and Vitaly Shmatikov. Breaking and fixing defenses against control-flow hijacking in multi-agent systems. arXiv preprint arXiv:2510.17276, 2025.
- [2] Harold Triedman, Rishi Jha, and Vitaly Shmatikov. Multi-agent systems execute arbitrary malicious code. arXiv preprint arXiv:2503.12188, 2025.
- [3] Akshat Naik, Jay Culligan, Yarin Gal, Philip Torr, Rahaf Aljundi, Alasdair Paren, and Adel Bibi. Omni-leak: Orchestrator multi-agent network induced data leakage. arXiv preprint arXiv:2602.13477, 2026.
- [4] Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, and Angelo Furfaro. The dark side of LLMs: Agent-based attacks for complete computer takeover. CoRR, abs/2507.06850, 2025.
- [5] Erik Jones, Anca Dragan, and Jacob Steinhardt. Adversaries can misuse combinations of safe models. In Forty-second International Conference on Machine Learning, 2025.
- [6] Xikang Yang, Xuehai Tang, Songlin Hu, and Jizhong Han. Chain of attack: A semantic-driven contextual multi-turn attacker for LLM. arXiv preprint arXiv:2405.05610, 2024.
- [7] Maksym Andriushchenko et al. AgentHarm: A benchmark for measuring harmfulness of LLM agents. In Proceedings of the 13th International Conference on Learning Representations, 2025.
- [8] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates, 2022. Association for Computational...
- [9] NeuralTrust Research. The semantic chaining attack: Bypassing multimodal AI safety filters via sequential semantic manipulation. NeuralTrust Technical Report, https://neuraltrust.ai/blog/semantic-chaining, January 2026. (Multimodal single-model attack requiring attacker participation at every step.)
- [10] Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, et al. LlamaFirewall: An open source guardrail system for building secure AI agents, 2025. URL: https://arxiv.org/abs/2505.03574.
- [11] Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. Securing AI agents with information-flow control. arXiv, May 2025.
- [12] Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. arXiv preprint arXiv:2503.18813, 2025.
- [13] Zhaorun Chen, Mintong Kang, and Bo Li. ShieldAgent: Shielding agents via verifiable safety policy reasoning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pages 8313–8344, 2025.
- [14] OWASP Foundation. OWASP LLM top 10 for large language model applications, version 2.0. https://owasp.org/www-project-top-10-for-large-language-model-applications/, 2025.
- [15] MITRE Corporation. MITRE ATLAS: Adversarial threat landscape for artificial-intelligence systems. https://atlas.mitre.org, 2025.
- [16] Vaidehi Patil, Elias Stengel-Eskin, and Mohit Bansal. The sum leaks more than its parts: Compositional privacy risks and mitigations in multi-agent collaboration. arXiv preprint arXiv:2509.14284, 2025.
- [17] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- [18] Nirmit Arora, Sathvik Joel, Ishan Kavathekar, Rohan Gandhi, Yash Pandya, Tanuja Ganu, Aditya Kanade, Akshay Nambi, et al. Exposing weak links in multi-agent systems under adversarial prompting. arXiv preprint arXiv:2511.10949, 2025.
- [19] OpenAI. GPT-20B (OpenAI open-weights MoE 20B). Model weights: openai/gpt-oss-20b, 2025.
- [20] AI @ Meta Llama Team. The Llama 3 herd of models, 2024. (Llama 3.1-8B-Instruct used as classifier and evaluator.)
- [21] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXiv:23...
- [22] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations, 2023. (LlamaGuard-7b safety classifier.)
- [23] KoalaAI. KoalaAI/Text-Moderation: DistilBERT-based content moderation model. Hugging Face model hub: KoalaAI/Text-Moderation, 2023.
- [24] Laura Hanu and Unitary Team. Detoxify. GitHub repository: https://github.com/unitaryai/detoxify, 2020.
- [25] Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. Aegis: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024.
- [26] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, v...
- [27] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, Miami, Flori...
- [28] Meta Llama Team. Prompt Guard: Prompt injection and jailbreak detection (Prompt-Guard-86M). Hugging Face model hub: meta-llama/Prompt-Guard-86M, 2024.
- [29] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, US...
- [30] AI @ Meta Llama Team. The Llama 3 herd of models (includes Llama Guard 3), 2024. (Llama Guard 3, a 13-category safety classifier, released as part of the Llama 3.1 suite; no standalone paper.)
- [31] Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Houfeng Wang, and Lizi Liao. Overconfidence in LLM-as-a-judge: Diagnosis and confidence-driven solution. arXiv preprint arXiv:2508.06225, 2025.
- [32] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, 2023. Association for Computational Linguistics.