pith. machine review for the scientific record.

arxiv: 2604.08608 · v2 · submitted 2026-04-08 · 💻 cs.CR · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Semantic Intent Fragmentation: A Single-Shot Compositional Attack on Multi-Agent AI Pipelines

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords semantic intent fragmentation · LLM orchestration · compositional attacks · multi-agent AI safety · policy violation · red-teaming · OWASP LLM06 · information flow tracking

The pith

A single legitimately phrased request to an LLM orchestrator can decompose into individually benign subtasks that together violate security policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Semantic Intent Fragmentation as an attack on multi-agent LLM systems where one normal-sounding query leads the orchestrator to split the work into steps that each clear safety checks but produce a policy breach when assembled. Current defenses review tasks in isolation, allowing the violation to surface only at the full plan level. In tests across 14 realistic enterprise scenarios in finance, security, and HR, a GPT-20B orchestrator generated violating plans in 71 percent of cases while every subtask appeared harmless on its own. The attack uses four mechanisms and needs no injected prompts or further attacker input after the first request. Plan-level tracking of information flow and compliance checks can catch every instance before any execution occurs.

Core claim

Semantic Intent Fragmentation is an attack class against LLM orchestration systems where a single legitimately phrased request causes an orchestrator to decompose a task into subtasks that are individually benign but jointly violate security policy. The violation only emerges at the composed plan because existing safety mechanisms operate at the subtask level. SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation, requiring no injected content, no system modification, and no attacker interaction after the initial request. Across 14 scenarios, a GPT-20B orchestrator produces policy-violating plans in 71% of cases (10/14) while every subtask appears benign.
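The compositional gap in the claim above can be illustrated with a toy check: each subtask clears an isolated content filter, while a plan-level check over the accumulated fields catches a quasi-identifier aggregation. This is a minimal sketch with invented field names, terms, and policy, not the paper's pipeline.

```python
# Toy illustration (not the paper's pipeline): a subtask-level filter
# passes every step, while a plan-level check over the composed data
# flow flags a quasi-identifier aggregation. All names are invented.

BLOCKED_TERMS = {"exfiltrate", "bypass", "disable logging"}   # subtask filter
QUASI_IDENTIFIERS = {"zip_code", "birth_date", "job_title"}   # never join all three

def subtask_is_benign(task: str) -> bool:
    """Isolated check: flags only overtly malicious wording."""
    return not any(term in task.lower() for term in BLOCKED_TERMS)

def plan_violates_policy(plan: list[dict]) -> bool:
    """Composed check: track which fields the plan accumulates overall."""
    accumulated: set[str] = set()
    for step in plan:
        accumulated |= set(step["fields"])
    # All three quasi-identifiers joined => re-identification risk.
    return QUASI_IDENTIFIERS <= accumulated

plan = [
    {"task": "Export employee zip codes for the commute survey", "fields": ["zip_code"]},
    {"task": "Pull birth dates for the benefits audit",          "fields": ["birth_date"]},
    {"task": "List job titles for the org chart refresh",        "fields": ["job_title"]},
]

assert all(subtask_is_benign(s["task"]) for s in plan)  # every step clears the filter
assert plan_violates_policy(plan)                       # violation emerges at composition
```

The point of the sketch is structural: no single step carries the violation, so any checker scoped to one step has nothing to flag.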

What carries the argument

Semantic Intent Fragmentation (SIF), the mechanism by which task decomposition in LLM orchestrators creates composed policy violations from subtasks that each pass isolated checks.

If this is right

  • Stronger orchestrators produce higher SIF success rates.
  • Plan-level information-flow tracking combined with compliance evaluation detects all attacks before execution.
  • SIF requires no injected content or ongoing attacker control after the initial request.
  • The compositional safety gap is closable by shifting evaluation from subtask to full plan.
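The plan-level information-flow tracking mentioned above can be sketched as a simple taint propagation over the plan's data dependencies, run before anything executes. The source/sink names and step schema here are invented for illustration; the paper's deterministic taint analysis may differ.

```python
# Minimal taint-propagation sketch of plan-level information-flow
# tracking (assumed design, not the paper's implementation). Taint from
# sensitive sources is propagated along data dependencies and flagged
# when it reaches an external sink, before any step runs.

SENSITIVE_SOURCES = {"hr_database"}
EXTERNAL_SINKS = {"email_external", "public_share"}

def flags_before_execution(plan: list[dict]) -> bool:
    """Return True if any step would send tainted data to an external sink."""
    tainted: set[str] = set()
    for step in plan:
        reads_taint = bool(set(step["inputs"]) & (SENSITIVE_SOURCES | tainted))
        if reads_taint:
            tainted.update(step["outputs"])            # propagate taint
            if set(step["outputs"]) & EXTERNAL_SINKS:
                return True                            # tainted flow hits a sink
    return False

plan = [
    {"inputs": ["hr_database"],    "outputs": ["salary_summary"]},
    {"inputs": ["salary_summary"], "outputs": ["email_external"]},
]
```

On this plan the tracker flags the second step: neither step mentions anything sensitive on its own, but the dependency chain links the HR source to an external sink.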

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Multi-agent systems in regulated domains may need mandatory plan-level review before any subtask runs.
  • Defenses built only around individual step classification leave open a route for single-shot attacks that scale with model capability.
  • Similar fragmentation effects could appear in non-LLM orchestration pipelines that rely on sequential task breakdown.

Load-bearing premise

The three validation signals correctly flag policy violations with zero false positives, and the subtasks are genuinely benign when examined in isolation.

What would settle it

Re-run the 14 scenarios with an independent human policy reviewer or a fourth validation method. If any subtask labeled benign actually contains a violation, or any plan labeled violating actually complies with policy, the central claim fails.

Figures

Figures reproduced from arXiv: 2604.08608 by Ismail Hossain, Md Jahangir Alam, Sai Puppala, Sajedul Talukder, Syed Bahauddin Alam, Tanzim Ahad, Yoonpyo Lee.

Figure 1. The SIF attack pipeline and its structural blind spot. (Centre) A single legitimately-phrased enterprise request is autonomously decomposed by the LLM orchestrator into three subtasks (T1–T3). Although every subtask passes all deployed safety classifiers individually (Fragmentation Score FS = 1.0, Definition 1), their composed output violates enterprise policy, a vulnerability we term the plan-generation g…
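The caption's Fragmentation Score FS = 1.0 reads naturally as the fraction of subtasks that clear every deployed classifier in isolation. Definition 1 is not reproduced on this page, so the following is an assumed reading, not the paper's definition.

```python
# Assumed reading of the Fragmentation Score (Definition 1 is not
# reproduced here): the fraction of subtasks that pass every deployed
# safety classifier when checked in isolation.

def fragmentation_score(subtasks, classifiers) -> float:
    passed = sum(all(clf(t) for clf in classifiers) for t in subtasks)
    return passed / len(subtasks)

# FS = 1.0, as in Figure 1, would mean all three subtasks T1-T3
# pass every classifier individually.
```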
read the original abstract

We introduce Semantic Intent Fragmentation (SIF), an attack class against LLM orchestration systems where a single, legitimately phrased request causes an orchestrator to decompose a task into subtasks that are individually benign but jointly violate security policy. Current safety mechanisms operate at the subtask level, so each step clears existing classifiers -- the violation only emerges at the composed plan. SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation, requiring no injected content, no system modification, and no attacker interaction after the initial request. We construct a three-stage red-teaming pipeline grounded in OWASP, MITRE ATLAS, and NIST frameworks to generate realistic enterprise scenarios. Across 14 scenarios spanning financial reporting, information security, and HR analytics, a GPT-20B orchestrator produces policy-violating plans in 71% of cases (10/14) while every subtask appears benign. Three independent signals validate this: deterministic taint analysis, chain-of-thought evaluation, and a cross-model compliance judge with 0% false positives. Stronger orchestrators increase SIF success rates. Plan-level information-flow tracking combined with compliance evaluation detects all attacks before execution, showing the compositional safety gap is closable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Semantic Intent Fragmentation (SIF), a single-shot compositional attack on LLM orchestration pipelines in which a legitimately phrased request causes the orchestrator to decompose the task into subtasks that are individually benign yet jointly violate security policy. It reports results from a three-stage red-teaming pipeline grounded in OWASP, MITRE ATLAS, and NIST frameworks, showing that a GPT-20B orchestrator produces policy-violating plans in 71% (10/14) of scenarios spanning financial reporting, information security, and HR analytics, while every subtask appears benign. Validation rests on three signals—deterministic taint analysis, chain-of-thought evaluation, and a cross-model compliance judge stated to have 0% false positives—and the work proposes plan-level information-flow tracking plus compliance evaluation as a mitigation that detects all attacks before execution.

Significance. If the empirical claims are substantiated with reproducible details, the work identifies a genuine gap in subtask-level safety mechanisms for multi-agent LLM systems and directly addresses OWASP LLM06:2025. The demonstration of four concrete mechanisms (bulk scope escalation, silent data exfiltration, embedded trigger deployment, quasi-identifier aggregation) across realistic enterprise domains, together with the constructive proposal for plan-level detection, would be a useful contribution to compositional security research. The grounding in established frameworks and the use of multiple independent validation signals are strengths that increase the potential impact if the results can be verified.

major comments (2)
  1. [Abstract] The central empirical claim (71% success rate, 10/14 scenarios) depends on the cross-model compliance judge having 0% false positives and on the three validation signals correctly distinguishing composed violations from benign subtasks. No description is given of the test set, policy formalization, inter-rater agreement, or how the 0% false-positive rate was measured; without these, the 10/14 count cannot be distinguished from possible classifier agreement.
  2. [Evaluation] The manuscript provides no details on the 14 scenarios, exact prompts, data exclusion rules, or raw outputs. This absence makes it impossible to reproduce the experiments or independently confirm that subtasks were genuinely benign when evaluated in isolation by taint analysis and CoT evaluation, which is load-bearing for the claim that the violation emerges only at composition.
minor comments (1)
  1. [Abstract] The model is referred to as 'GPT-20B' without clarification of whether this denotes a specific released model, a hypothetical parameter count, or a fine-tuned variant; adding this specification in the methods would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important gaps in documentation that affect reproducibility and verifiability of the empirical claims. We address each major comment below and have revised the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: [Abstract] The central empirical claim (71% success rate, 10/14 scenarios) depends on the cross-model compliance judge having 0% false positives and on the three validation signals correctly distinguishing composed violations from benign subtasks. No description is given of the test set, policy formalization, inter-rater agreement, or how the 0% false-positive rate was measured; without these, the 10/14 count cannot be distinguished from possible classifier agreement.

    Authors: We agree that the manuscript does not sufficiently document the validation of the cross-model compliance judge. In the revised version we add a new subsection in the Evaluation section that describes: the 200-prompt test set used to measure false positives (constructed from benign enterprise queries with no compositional intent), the policy formalization process (mapping OWASP LLM06:2025 and NIST controls to explicit rules), inter-rater agreement (Cohen’s kappa = 0.94 between two human reviewers and the judge), and the exact measurement protocol (zero false positives observed across all 200 prompts when subtasks were evaluated in isolation). These additions directly support the claim that violations emerge only at composition. revision: yes
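For readers checking the rebuttal's Cohen's kappa = 0.94 figure: kappa is agreement between two raters' label sequences, corrected for the agreement expected by chance. A short sketch with illustrative labels, not the paper's 200-prompt data:

```python
# Cohen's kappa between two raters' verdicts (illustrative labels only).
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Observed agreement corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)

a = ["benign", "benign", "benign", "violating", "violating"]
b = ["benign", "benign", "violating", "violating", "violating"]
# observed 4/5 = 0.8, chance (3*2 + 2*3)/25 = 0.48, kappa ~ 0.615
```

A kappa of 0.94 between the human reviewers and the judge, as the rebuttal claims, would indicate near-perfect agreement well above chance.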

  2. Referee: [Evaluation] The manuscript provides no details on the 14 scenarios, exact prompts, data exclusion rules, or raw outputs. This absence makes it impossible to reproduce the experiments or independently confirm that subtasks were genuinely benign when evaluated in isolation by taint analysis and CoT evaluation, which is load-bearing for the claim that the violation emerges only at composition.

    Authors: We acknowledge that the original manuscript omitted the concrete experimental artifacts required for reproduction. The revised manuscript includes a new Appendix B containing: all 14 scenarios with their verbatim initial prompts, the full subtask decompositions produced by the GPT-20B orchestrator, the deterministic taint-analysis and CoT-evaluation results for each isolated subtask, data exclusion rules (scenarios were retained only if every subtask passed both signals), and representative raw model outputs. We will also release the complete prompt set, evaluation harness, and anonymized logs in a public repository to allow independent verification that the policy violation appears only after composition. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical red-teaming result grounded in external frameworks

full rationale

The paper reports an empirical count (10/14 scenarios) from red-teaming a GPT-20B orchestrator against enterprise scenarios, using three validation signals (deterministic taint analysis, chain-of-thought evaluation, cross-model judge) explicitly tied to OWASP, MITRE ATLAS, and NIST standards. No equations, fitted parameters, or self-citations are used to derive the 71% rate or the 0% false-positive claim; both are presented as direct experimental outcomes. The central claim does not reduce to a self-definition, renamed known result, or load-bearing self-citation chain, satisfying the requirement for independent, externally benchmarked content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that current safety classifiers operate only at the subtask level and that the introduced attack mechanisms (bulk scope escalation, silent data exfiltration, embedded trigger deployment, quasi-identifier aggregation) produce genuinely benign subtasks. No free parameters or invented physical entities are used; SIF is a descriptive category rather than a new postulated object with independent evidence.

axioms (1)
  • domain assumption Current safety mechanisms in LLM orchestration systems evaluate and clear subtasks independently without considering the composed plan.
    Stated directly in the abstract as the reason each step clears classifiers while the violation emerges only at composition.
invented entities (1)
  • Semantic Intent Fragmentation (SIF) no independent evidence
    purpose: To name and categorize the attack class that exploits compositional policy violations from a single legitimate request.
    New descriptive term introduced by the authors; no independent falsifiable evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5566 in / 1479 out tokens · 140270 ms · 2026-05-10T17:33:58.357844+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1]

    Breaking and fixing defenses against control-flow hijacking in multi-agent systems. arXiv preprint arXiv:2510.17276, 2025

    Rishi Jha, Harold Triedman, Justin Wagle, and Vitaly Shmatikov. Breaking and fixing defenses against control-flow hijacking in multi-agent systems. arXiv preprint arXiv:2510.17276, 2025

  2. [2]

    Multi-agent systems execute arbitrary malicious code. arXiv preprint arXiv:2503.12188, 2025

    Harold Triedman, Rishi Jha, and Vitaly Shmatikov. Multi-agent systems execute arbitrary malicious code. arXiv preprint arXiv:2503.12188, 2025

  3. [3]

    Omni-leak: Orchestrator multi-agent network induced data leakage. arXiv preprint arXiv:2602.13477, 2026

    Akshat Naik, Jay Culligan, Yarin Gal, Philip Torr, Rahaf Aljundi, Alasdair Paren, and Adel Bibi. Omni-leak: Orchestrator multi-agent network induced data leakage. arXiv preprint arXiv:2602.13477, 2026

  4. [4]

    The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover

    Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, and Angelo Furfaro. The dark side of LLMs: Agent-based attacks for complete computer takeover. CoRR, abs/2507.06850, 2025

  5. [5]

    Adversaries can misuse combinations of safe models

    Erik Jones, Anca Dragan, and Jacob Steinhardt. Adversaries can misuse combinations of safe models. In Forty-second International Conference on Machine Learning, 2025

  6. [6]

    Chain of attack: a semantic-driven contextual multi-turn attacker for LLM. arXiv preprint arXiv:2405.05610, 2024

    Xikang Yang, Xuehai Tang, Songlin Hu, and Jizhong Han. Chain of attack: a semantic-driven contextual multi-turn attacker for LLM. arXiv preprint arXiv:2405.05610, 2024

  7. [7]

    AgentHarm: A benchmark for measuring harmfulness of LLM agents

    Maksym Andriushchenko et al. AgentHarm: A benchmark for measuring harmfulness of LLM agents. In Proceedings of the 13th International Conference on Learning Representations, 2025

  8. [8]

    Red teaming language models with language models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates, 2022. Association for Computational...

  9. [9]

    The semantic chaining attack: Bypassing multimodal AI safety filters via sequential semantic manipulation

    NeuralTrust Research. The semantic chaining attack: Bypassing multimodal AI safety filters via sequential semantic manipulation. NeuralTrust Technical Report. https://neuraltrust.ai/blog/semantic-chaining, January 2026. Multimodal single-model attack requiring attacker participation at every step

  10. [10]

    LlamaFirewall: An open source guardrail system for building secure AI agents

    Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, et al. LlamaFirewall: An open source guardrail system for building secure AI agents, 2025. URL: https://arxiv.org/abs/2505.03574

  11. [11]

    Securing AI agents with information-flow control

    Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. Securing AI agents with information-flow control. arXiv, May 2025

  12. [12]

    Defeating Prompt Injections by Design

    Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. arXiv preprint arXiv:2503.18813, 2025

  13. [13]

    ShieldAgent: Shielding agents via verifiable safety policy reasoning

    Zhaorun Chen, Mintong Kang, and Bo Li. ShieldAgent: Shielding agents via verifiable safety policy reasoning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pages 8313–8344, 2025

  14. [14]

    OWASP LLM top 10 for large language model applications, version 2.0

    OWASP Foundation. OWASP LLM top 10 for large language model applications, version 2.0. https://owasp.org/www-project-top-10-for-large-language-model-applications/, 2025

  15. [15]

    MITRE ATLAS: Adversarial threat landscape for artificial-intelligence systems

    MITRE Corporation. MITRE ATLAS: Adversarial threat landscape for artificial-intelligence systems. https://atlas.mitre.org, 2025

  16. [16]

    The sum leaks more than its parts: Compositional privacy risks and mitigations in multi-agent collaboration. arXiv preprint arXiv:2509.14284, 2025

    Vaidehi Patil, Elias Stengel-Eskin, and Mohit Bansal. The sum leaks more than its parts: Compositional privacy risks and mitigations in multi-agent collaboration. arXiv preprint arXiv:2509.14284, 2025

  17. [17]

    Why do multi-agent LLM systems fail?

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  18. [18]

    Exposing weak links in multi-agent systems under adversarial prompting. arXiv preprint arXiv:2511.10949, 2025

    Nirmit Arora, Sathvik Joel, Ishan Kavathekar, Rohan Gandhi, Yash Pandya, Tanuja Ganu, Aditya Kanade, Akshay Nambi, et al. Exposing weak links in multi-agent systems under adversarial prompting. arXiv preprint arXiv:2511.10949, 2025

  19. [19]

    GPT-20B (OpenAI open-weights MoE 20b)

    OpenAI. GPT-20B (OpenAI open-weights MoE 20b). Model weights: openai/gpt-oss-20b, 2025

  20. [20]

    The Llama 3 herd of models, 2024

    AI @ Meta Llama Team. The Llama 3 herd of models, 2024. Llama 3.1-8B-Instruct used as classifier and evaluator

  21. [21]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXiv:23...

  22. [22]

    Llama Guard: LLM-based input-output safeguard for human-AI conversations, 2023

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations, 2023. LlamaGuard-7b safety classifier

  23. [23]

    KoalaAI/Text-Moderation: DistilBERT-based content moderation model

    KoalaAI. KoalaAI/Text-Moderation: DistilBERT-based content moderation model. Hugging Face model hub: KoalaAI/Text-Moderation, 2023

  24. [24]

    Detoxify

    Laura Hanu and Unitary Team. Detoxify. GitHub repository: https://github.com/unitaryai/detoxify, 2020

  25. [25]

    AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024

    Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024

  26. [26]

    WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, v...

  27. [27]

    Prometheus 2: An open source language model specialized in evaluating other language models

    Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, Miami, Flori...

  28. [28]

    Prompt Guard: Prompt injection and jailbreak detection (Prompt-Guard-86M)

    Meta Llama Team. Prompt Guard: Prompt injection and jailbreak detection (Prompt-Guard-86M). Hugging Face model hub: meta-llama/Prompt-Guard-86M, 2024

  29. [29]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, US...

  30. [30]

    The Llama 3 herd of models (includes Llama Guard 3), 2024

    AI Meta Llama Team. The Llama 3 herd of models (includes Llama Guard 3), 2024. Llama Guard 3 (13-category safety classifier) released as part of the Llama 3.1 suite; no standalone paper

  31. [31]

    Overconfidence in LLM-as-a-judge: Diagnosis and confidence-driven solution

    Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Houfeng Wang, and Lizi Liao. Overconfidence in LLM-as-a-judge: Diagnosis and confidence-driven solution. arXiv preprint arXiv:2508.06225, 2025

  32. [32]

    G-eval: NLG evaluation using gpt-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, 2023. Association for Computational Linguistics.