Large language models can be used to effectively scale spear phishing campaigns
7 Pith papers cite this work.
Citing papers
- PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks
  PASA is a semantic-level watermarking method for LLM text that uses embedding-space clusters and synchronized randomness to remain detectable after paraphrasing while preserving text quality (a minimal detection sketch follows the list).
- Process Matters more than Output for Distinguishing Humans from Machines
  A battery of 30 cognitive tasks shows that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88); process-specific fine-tuning improves mimicry but transfers poorly across tasks (see the AUC sketch after the list).
- Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models
  Adaptive Stealing makes watermark theft from LLMs more efficient via Position-Based Seal Construction and Adaptive Selection modules that dynamically choose optimal attack perspectives (a generic frequency-contrast sketch follows the list).
- Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
  Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that is distinct from benign capabilities, which helps explain emergent misalignment (an attribute-and-ablate sketch follows the list).
- An Independent Safety Evaluation of Kimi K2.5
  Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
  AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than baselines while bypassing perplexity-based defenses (a search skeleton follows the list).
- Jailbroken: How Does LLM Safety Training Fail?
  LLM safety training fails because of competing objectives and mismatched generalization, enabling new jailbreaks that succeed on every unsafe prompt from red-teaming evaluation sets against GPT-4 and Claude.
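
Below are short, hedged sketches for the technique-focused entries above. First, the PASA entry: a minimal sketch of cluster-based semantic watermark detection, assuming precomputed sentence embeddings, fixed cluster centroids, and a shared key. The keyed hashing scheme, the 50/50 green split, and the z-test are illustrative assumptions, not the paper's exact construction.

```python
import hashlib
import numpy as np

def detect_semantic_watermark(sentence_embeddings, centroids, key, z_threshold=4.0):
    """Score a text for a cluster-based semantic watermark (sketch).

    Each sentence embedding is assigned to its nearest centroid; a keyed
    hash of the previous cluster id seeds a PRNG that marks half the
    clusters "green". Watermarked text should land in green clusters far
    more often than the 50% chance rate, and that bias survives
    paraphrases that preserve sentence-level semantics.
    """
    hits, prev = 0, 0
    for emb in sentence_embeddings:
        # Nearest-centroid assignment in embedding space.
        cluster = int(np.argmin(np.linalg.norm(centroids - emb, axis=1)))
        # Synchronized randomness: seed derived from (key, previous cluster).
        seed = int.from_bytes(hashlib.sha256(f"{key}:{prev}".encode()).digest()[:8], "big")
        rng = np.random.default_rng(seed)
        green = rng.permutation(len(centroids))[: len(centroids) // 2]
        hits += cluster in green
        prev = cluster
    n = len(sentence_embeddings)
    # One-proportion z-test against the 0.5 chance rate.
    z = (hits - 0.5 * n) / np.sqrt(0.25 * n)
    return z, z > z_threshold
```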
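For the process-vs-output entry: a toy version of the evaluation recipe, i.e. train a classifier on process-level features and report ROC AUC. The features here are random stand-ins (the real battery uses behavioral signals such as response latencies and error patterns); only the classifier-plus-AUC pipeline mirrors the summary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical process-level features per session (e.g. inter-response
# latencies, error-correction rate, exploration entropy), not raw scores.
rng = np.random.default_rng(0)
X_human = rng.normal(0.0, 1.0, size=(200, 5))
X_agent = rng.normal(0.6, 1.0, size=(200, 5))   # agents shifted in feature space
X = np.vstack([X_human, X_agent])
y = np.array([0] * 200 + [1] * 200)              # 0 = human, 1 = AI agent

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"process-feature AUC: {auc:.2f}")  # the paper reports mean AUC 0.88 on real data
```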
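For the watermark-stealing entry: the summary does not detail Position-Based Seal Construction or Adaptive Selection, so this sketch only shows the frequency-contrast core that stealing attacks generally build on, i.e. tokens over-represented in watermarked output relative to a baseline corpus are likely "green". The function names, smoothing, and cutoff are assumptions, not the paper's algorithm.

```python
from collections import Counter

def estimate_green_tokens(watermarked_texts, baseline_texts, tokenize, top_k=500):
    """Crude watermark-stealing baseline via frequency contrast (sketch)."""
    wm = Counter(t for s in watermarked_texts for t in tokenize(s))
    base = Counter(t for s in baseline_texts for t in tokenize(s))
    n_wm, n_base = sum(wm.values()), sum(base.values())
    # Smoothed relative-frequency ratio per token; high ratio suggests "green".
    score = {t: (wm[t] / n_wm) / ((base[t] + 1) / (n_base + 1)) for t in wm}
    return sorted(score, key=score.get, reverse=True)[:top_k]
```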
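For the distinct-mechanism entry: a toy attribute-then-ablate loop on a small MLP. The paper's claim concerns LLM weights mediating harmful generation; here random tensors stand in for harmful and benign data, so this demonstrates only the methodology (rank weights by |grad x weight| on one behaviour, zero the top slice, compare both behaviours), not the finding.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))

# Toy stand-ins: a real study would use harmful vs. benign prompts
# through an LLM, not random tensors (assumption for self-containment).
x_harm, y_harm = torch.randn(64, 16), torch.randint(0, 4, (64,))
x_benign, y_benign = torch.randn(64, 16), torch.randint(0, 4, (64,))
loss_fn = torch.nn.CrossEntropyLoss()

# 1. Attribute: score each weight by |grad * weight| on the "harmful" loss.
loss_fn(model(x_harm), y_harm).backward()
scores = torch.cat([(p.grad * p).abs().flatten() for p in model.parameters()])
threshold = scores.quantile(0.99)  # keep only the top 1% of weights

# 2. Ablate that compact subset and compare both behaviours; a "distinct,
#    unified mechanism" would predict the harmful-task loss rising sharply
#    while the benign-task loss stays roughly flat.
with torch.no_grad():
    for p in model.parameters():
        p[(p.grad * p).abs() >= threshold] = 0.0
    print("harmful-task loss:", loss_fn(model(x_harm), y_harm).item())
    print("benign-task loss :", loss_fn(model(x_benign), y_benign).item())
```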
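For the AutoDAN entry: a skeleton of a hierarchical genetic search over prompts, with sentence-level crossover and word-level synonym mutation as the two hierarchy levels. The fitness function, seed prompts, and synonym table are abstracted away as caller-supplied inputs; this is a sketch of the search structure described in the summary, not the authors' implementation.

```python
import random

def hierarchical_ga(seed_prompts, fitness, synonyms, generations=50, pop_size=32):
    """AutoDAN-style hierarchical genetic search (sketch).

    Assumes at least a handful of seed prompts. Crossover swaps whole
    sentences between two parents (paragraph level); mutation swaps
    individual words for listed synonyms (word level), so candidates
    remain fluent text with low perplexity. `fitness` is assumed to
    score how strongly a prompt elicits the target behaviour.
    """
    pop = list(seed_prompts)[:pop_size]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            sa, sb = a.split(". "), b.split(". ")
            # Sentence-level crossover: prefix of one parent, suffix of the other.
            cut = random.randint(1, min(len(sa), len(sb)))
            child = ". ".join(sa[:cut] + sb[cut:])
            # Word-level mutation: occasionally replace a word with a synonym.
            words = child.split()
            for i, w in enumerate(words):
                if w in synonyms and random.random() < 0.1:
                    words[i] = random.choice(synonyms[w])
            children.append(" ".join(words))
        pop = elite + children
    return max(pop, key=fitness)
```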