Jailbroken: How Does LLM Safety Training Fail?

2023

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

cs.CR · 2026-04-13 · unverdicted · novelty 6.0

Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

cs.CR · 2026-04-11 · unverdicted · novelty 6.0

PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.

citing papers explorer

Showing 2 of 2 citing papers.

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

cs.CR · 2026-04-13 · unverdicted · none · ref 3

PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

cs.CR · 2026-04-11 · unverdicted · none · ref 18
