A Theoretical Game of Attacks via Compositional Skills
Pith reviewed 2026-05-09 19:03 UTC · model grok-4.3
The pith
Modeling attacks on language models as a game between attacker and defender yields a provably optimal defense.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The interaction between an attacker and a defender is formalized as a game whose moves are built from compositional skill decompositions. The attacker’s best-response strategy in this game is shown to be closely related to many published adversarial prompting methods. The resulting game possesses equilibria that favor the attacker. The same analysis produces a defense strategy that is provably optimal against best-response attacks. A practical version of the attack derived from the theory outperforms prior methods across multiple models and benchmarks.
What carries the argument
The attacker-defender game formalized with compositional skill decompositions that enable derivation of best-response attacks and optimal defenses.
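The skill-composition machinery can be made concrete with a toy sketch. Everything below is invented for illustration and is not taken from the paper: the skill names, the multiplicative evasion model, and the detection budget are assumptions. The attacker composes a prompt from atomic skills, the defender assigns per-skill detection probabilities, and the attacker's best response is found by enumeration.

```python
import itertools

# Toy sketch of the attacker-defender game over compositional skills.
# All names and numbers are illustrative; the paper's actual skill space,
# payoffs, and defense budget are not specified here.
SKILLS = ["roleplay", "encoding", "hypothetical", "persona"]

def success_prob(composition, detection):
    """The attack succeeds only if every composed skill evades detection."""
    p = 1.0
    for s in composition:
        p *= 1.0 - detection[s]
    return p

def attacker_best_response(detection, max_len=2):
    """Enumerate all compositions up to max_len; return the highest-payoff one."""
    best, best_p = None, -1.0
    for r in range(1, max_len + 1):
        for comp in itertools.combinations(SKILLS, r):
            p = success_prob(comp, detection)
            if p > best_p:
                best, best_p = comp, p
    return best, best_p

# A uniform defense: total detection budget 1.2 spread evenly over four skills.
detection = {s: 0.3 for s in SKILLS}
comp, p = attacker_best_response(detection)  # single-skill attacks win here
```

Under this multiplicative evasion model, stacking skills only hurts the attacker; payoff models in which skills synergize are what make the composition question, and hence the equilibrium analysis, non-trivial.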
If this is right
- Many existing adversarial prompting techniques are approximations to the theoretical best-response attack.
- The game equilibria confer inherent advantages on the attacker regardless of the defender’s choice.
- A provably optimal defense can be obtained directly from the game analysis rather than from trial-and-error tuning.
- Practical instantiations of the optimal attack achieve stronger performance than prior methods on diverse LLMs and benchmarks.
Where Pith is reading between the lines
- Safety work should prioritize disrupting the attacker’s ability to assemble skills compositionally rather than only filtering final outputs.
- The same game structure could be applied to multi-turn conversations or to other generative systems beyond text models.
- Empirical validation would require running the optimal defense on live models against attacks that explicitly follow the compositional best-response construction.
Load-bearing premise
That adversarial prompting interactions can be captured accurately by a well-defined game possessing best-response strategies, equilibria, and decomposable compositional skills.
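As a minimal numeric check of what that premise requires, consider a 2x2 zero-sum toy game and its closed-form mixed equilibrium; a strictly positive game value is the toy analogue of the attacker-favoring equilibria the paper claims. The payoff matrix here is invented for illustration, not taken from the paper.

```python
# Minimal check that a 2x2 zero-sum game of this kind has a well-defined
# mixed equilibrium. Rows are attacker moves, columns defender moves,
# entries are the attacker's expected success (illustrative values).
A = [[0.6, 0.2],
     [0.1, 0.5]]
(a, b), (c, d) = A

# Closed-form mixed equilibrium for a 2x2 zero-sum game with no saddle point
# (here the row maximin 0.2 differs from the column minimax 0.5).
denom = a - b - c + d
p = (d - c) / denom              # attacker plays row 0 with prob p = 0.5
q = (d - b) / denom              # defender plays col 0 with prob q = 0.375
value = (a * d - b * c) / denom  # expected attacker payoff = 0.35

# At equilibrium the attacker's mix yields the same payoff against either column:
payoff_col0 = p * a + (1 - p) * c  # 0.35
payoff_col1 = p * b + (1 - p) * d  # 0.35
```

Even against the defender's optimal mix, the attacker in this toy retains expected payoff 0.35, which is the shape of the "inherent attacker advantage" claim, stated here only for a hand-picked matrix.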
What would settle it
An empirical test in which the derived optimal defense, applied to a live model, fails to block attacks constructed according to the paper’s best-response rule.
Original abstract
As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still be circumvented through carefully designed adversarial prompts. In this work, we introduce a theoretical framework that formalizes a game between an attacker and a defender. Within this framework, we design a theoretical best-response attack strategy and show that it is closely related to many existing adversarial prompting methods. We further analyze the resulting game, characterize its equilibria, and reveal inherent advantages for the attacker. Drawing on our theoretical analysis, we also derive a provably optimal defense strategy. Empirically, we evaluate a practical instantiation of the theoretically optimal attack and observe stronger performance relative to existing adversarial prompting approaches in diverse settings encompassing different LLMs and benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a theoretical game-theoretic framework that models adversarial prompting as a game between an attacker and a defender, where strategies are expressed via compositional skill decompositions. It defines a theoretical best-response attack strategy shown to relate to existing adversarial methods, characterizes the resulting equilibria (highlighting inherent attacker advantages), derives a provably optimal defense strategy from the analysis, and reports empirical results where a practical instantiation of the optimal attack outperforms baselines across multiple LLMs and benchmarks.
Significance. If the central claims hold, the work offers a formal unification of adversarial prompting techniques under a game model with explicit equilibria and optimality results, which could inform more principled defenses in LLM safety. The explicit linkage between the theoretical best-response and practical methods, along with the derivation of a defense strategy, provides a potential foundation for future theoretical work in this area.
major comments (3)
- [§3] Game Formulation and Skill Decomposition: The derivation of best-response strategies and equilibria assumes that all adversarial prompts admit a complete and unique decomposition into compositional skills. No argument or proof is given that this decomposition covers the full open-ended prompt space or that it is unique; if either fails, the computed best response and subsequent equilibria are only optimal inside the restricted model, not for real LLMs.
- [§4] Optimal Defense Derivation: The proof that the derived defense is provably optimal rests on the same completeness and uniqueness assumptions for the skill decomposition. Without a demonstration that every possible attack can be expressed (and only in one way) as a skill combination, the optimality claim does not transfer to the actual attack surface.
- [Empirical Evaluation] The reported performance gains of the practical instantiation are presented as support for the theoretical framework, yet the experiments do not include controls that isolate whether gains arise from the game-theoretic construction versus other implementation choices (e.g., prompt engineering heuristics). This weakens the link between theory and observed results.
minor comments (2)
- [§3] Notation for skill vectors and payoff functions is introduced without an explicit table or running example, making it difficult to track how concrete prompts map to the abstract game.
- [§3.2] The abstract states that the best-response attack is 'closely related to many existing adversarial prompting methods,' but the manuscript does not include a systematic mapping or citation table showing which methods correspond to which skill combinations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, clarifying the scope of our theoretical model and outlining planned revisions to the manuscript.
Point-by-point responses
- Referee: [§3] Game Formulation and Skill Decomposition: The derivation of best-response strategies and equilibria assumes that all adversarial prompts admit a complete and unique decomposition into compositional skills. No argument or proof is given that this decomposition covers the full open-ended prompt space or that it is unique; if either fails, the computed best response and subsequent equilibria are only optimal inside the restricted model, not for real LLMs.
Authors: We appreciate this observation on the modeling assumptions. Our framework is a theoretical abstraction in which adversarial strategies are represented via compositional skill decompositions. The best-response attack and resulting equilibria are derived strictly within this modeled strategy space, and we show that this representation relates to and unifies many existing adversarial prompting methods. We do not claim that every prompt in the open-ended space admits a complete or unique decomposition; the results characterize the game under the proposed decomposition. We will revise §3 to explicitly articulate these modeling assumptions and their implications for the applicability of the derived strategies. revision: partial
- Referee: [§4] Optimal Defense Derivation: The proof that the derived defense is provably optimal rests on the same completeness and uniqueness assumptions for the skill decomposition. Without a demonstration that every possible attack can be expressed (and only in one way) as a skill combination, the optimality claim does not transfer to the actual attack surface.
Authors: We agree that the provable optimality of the defense is established relative to attacks expressible within the skill-decomposition model. The defense is optimal against any strategy in the defined game. We will revise §4 to emphasize that the optimality result holds inside the proposed theoretical framework and to discuss how it can inform practical defenses for attacks that admit such decompositions. revision: partial
- Referee: [Empirical Evaluation] The reported performance gains of the practical instantiation are presented as support for the theoretical framework, yet the experiments do not include controls that isolate whether gains arise from the game-theoretic construction versus other implementation choices (e.g., prompt engineering heuristics). This weakens the link between theory and observed results.
Authors: We acknowledge that additional controls would strengthen the empirical linkage to the theoretical construction. The practical instantiation is directly derived from the theoretical best-response strategy, yet we agree that isolating the contribution from other prompt-engineering factors would be beneficial. We will revise the Empirical Evaluation section to incorporate further ablations or controls that better attribute performance gains to the game-theoretic elements. revision: yes
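The defense discussed above rests on the paper's defender update: one projected (sub)gradient ascent step on a non-smooth objective F(r), with subgradient mass split uniformly across tied minimizers for each intent, followed by projection onto the budget set {r ≥ 0 : Σ r = c}. A minimal numerical sketch follows; the concrete objective F(r) = Σ_i min_s r[i, s] and the problem sizes are illustrative stand-ins, not the paper's exact F.

```python
import numpy as np

# Hedged sketch of a projected subgradient ascent step, assuming the
# illustrative objective F(r) = sum_i min_s r[i, s].

def project_simplex(v, c=1.0):
    """Euclidean projection of flat vector v onto {x >= 0, sum(x) = c}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - c))[0][-1]
    theta = (css[rho] - c) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def defender_step(r, lr=0.1, c=1.0):
    """One projected subgradient ascent step on F(r) = sum_i min_s r[i, s]."""
    g = np.zeros_like(r)
    for i in range(r.shape[0]):
        ties = np.flatnonzero(r[i] == r[i].min())  # tied minimizers for intent i
        g[i, ties] = 1.0 / len(ties)               # uniform split over ties
    return project_simplex((r + lr * g).ravel(), c).reshape(r.shape)

# Two intents x two skills, total budget c = 1.
r = np.array([[0.5, 0.1],
              [0.2, 0.2]])
r = defender_step(r)  # raises the weakest coverage while respecting the budget
```

The uniform split over tied minimizers is one valid subgradient of the min operator; any convex combination over the tie set would also be valid.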
Circularity Check
No circularity: the derivations follow from explicitly stated game-theoretic assumptions and the compositional decomposition rather than presupposing their own conclusions.
Full rationale
The abstract and described framework introduce a game between attacker and defender, define best-response attack strategies, characterize equilibria, and derive a provably optimal defense directly from the stated model of compositional skill decompositions. No result is made to depend on its own fitted parameters, on self-referential definitions, or on unverified self-citations. The empirical instantiation is presented as a separate practical evaluation, not as the source of the theoretical optimality. The derivation chain remains self-contained relative to the paper's own premises.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Adversarial prompting can be formalized as a two-player game with well-defined strategies, payoffs, and best responses based on compositional skills.
Reference graph
Works this paper leans on
- [1] Albert, A. Jailbreak chat. https://www.jailbreakchat.com, 2023.
- [2] Andriushchenko, M., Croce, F., and Flammarion, N. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. In The Thirteenth International Conference on Learning Representations, 2025.
- [3] Arora, S. and Goyal, A. A theory for emergence of complex skills in language models. arXiv:2307.15936 [cs.LG].
- [4] Bianchi, F., Suzgun, M., Attanasio, G., Röttger, P., Jurafsky, D., Hashimoto, T., and Zou, J. Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions. arXiv preprint arXiv:2309.07875.
- [5] Bisconti, P., Prandi, M., Pierucci, F., Giarrusso, F., Bracale, M., Galisai, M., Suriani, V., Sorokoletova, O., Sartore, F., and Nardi, D. Adversarial poetry as a universal single-turn jailbreak mechanism in large language models. arXiv preprint arXiv:2511.15304.
- [6] Chang, Z., Li, M., Liu, Y., Wang, J., Wang, Q., and Liu, Y. Play guessing game with LLM: Indirect jailbreak attack with implicit clues. arXiv preprint arXiv:2402.09091.
- [7] Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., and Wong, E. Jailbreaking black box large language models in twenty queries. arXiv:2310.08419 [cs.LG].
- [8] Elesedy, H., Esperanca, P., Oprea, S. V., and Ozay, M. LoRA-Guard: Parameter-efficient guardrail adaptation for content moderation of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11746–11765, 2024.
- [9] Liu, F., Feng, Y., Xu, Z., Su, L., Ma, X., Yin, D., and Liu, H. JAILJUDGE: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework. arXiv:2410.12855 [cs.CL].
- [10] Liu, X., Xu, N., Chen, M., and Xiao, C. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.
- [11] Luo, X., Wang, Y., He, Z., Tu, G., Li, J., and Xu, R. A simple and efficient jailbreak method exploiting LLMs’ helpfulness. arXiv preprint arXiv:2509.14297.
- [12] OpenAI et al. GPT-4 technical report. arXiv:2303.08774 [cs.CL].
- [13] Röttger, P., Kirk, H., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5377–5400, 2024.
- [14] Sun, H., Wu, Y., Cheng, Y., and Chu, X. Game theory meets large language models: A systematic survey. In Kwok, J. (ed.), Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pp. 10669–10677. International Joint Conferences on Artificial Intelligence Organization, August 2025a. doi: 10.24963/ijcai.2025/1184. …
- [15] Wang, Z., Cao, Y., and Liu, P. Hidden you malicious goal into benign narratives: Jailbreak large language models through logic chain injection. arXiv preprint arXiv:2404.04849.
- [16]
- [17] Enhancing jailbreak attacks on LLMs via persona prompts · Yu, D., Kaur, S., Gupta, A., Brown-Cohen, J., Goyal, A., and Arora, S. Skill-mix: A flexible and expandable family of evaluations for AI models. In International Conference on Learning Representations (ICLR), 2024a. Yu, Z., Liu, X., Liang, S., Cameron, Z., Xiao, C., and Zhang, N. Don’t listen to me: Understanding and exploring jailbreak prompts of large la…
- [18] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043 [cs.CL].
discussion (0)