This SoK paper introduces a nine-stage taxonomy for LLM guardrail breaches in phishing, characterizes evasion and manipulation tactics, and identifies a dynamic-offense versus static-defense asymmetry.
hub Canonical reference
Spear phishing with large language models
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 5representative citing papers
Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
Agentic AI eliminates the fidelity-scale tradeoff in deception, enabling the Infinite Impostor attack that hijacks trusted relationships at mass scale and requiring a shift to suspect-by-default security based on evaluating actions rather than actors.
A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning improving mimicry but limited cross-task transfer.
Adaptive Stealing improves watermark theft efficiency from LLMs via Position-Based Seal Construction and Adaptive Selection modules that dynamically choose optimal attack perspectives.
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than baselines while bypassing perplexity defenses.
Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.
Multi-turn prompts in Afrikaans, Kiswahili, isiXhosa and isiZulu achieve 52-83% harmful response rates across GPT, Claude, Gemini and others, rising further with native-speaker red-teaming, showing translation quality limits jailbreak success.
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.
citing papers explorer
-
The End of Trust: How Agentic AI Breaks Security Assumptions
Agentic AI eliminates the fidelity-scale tradeoff in deception, enabling the Infinite Impostor attack that hijacks trusted relationships at mass scale and requiring a shift to suspect-by-default security based on evaluating actions rather than actors.
-
Process Matters more than Output for Distinguishing Humans from Machines
A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning improving mimicry but limited cross-task transfer.
-
Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models
Adaptive Stealing improves watermark theft efficiency from LLMs via Position-Based Seal Construction and Adaptive Selection modules that dynamically choose optimal attack perspectives.
-
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
-
An Independent Safety Evaluation of Kimi K2.5
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
-
Multilingual jailbreaking of LLMs using low-resource languages
Multi-turn prompts in Afrikaans, Kiswahili, isiXhosa and isiZulu achieve 52-83% harmful response rates across GPT, Claude, Gemini and others, rising further with native-speaker red-teaming, showing translation quality limits jailbreak success.
-
From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.
- PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks