hub Canonical reference

Spear phishing with large language models

· 2023 · arXiv 2305.06972

Canonical reference. 80% of citing Pith papers cite this work as background.

14 Pith papers citing it

Background 80% of classified citations

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5

citation-polarity summary

background 4 support 1

representative citing papers

SoK: Exposing the Generation and Detection Gaps in LLM-Generated Phishing

cs.CR · 2025-08-29 · unverdicted · novelty 7.0

This SoK paper introduces a nine-stage taxonomy for LLM guardrail breaches in phishing, characterizes evasion and manipulation tactics, and identifies a dynamic-offense versus static-defense asymmetry.

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

cs.CL · 2023-10-10 · conditional · novelty 7.0

Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.

The End of Trust: How Agentic AI Breaks Security Assumptions

cs.CR · 2026-05-14 · unverdicted · novelty 6.0

Agentic AI eliminates the fidelity-scale tradeoff in deception, enabling the Infinite Impostor attack that hijacks trusted relationships at mass scale and requiring a shift to suspect-by-default security based on evaluating actions rather than actors.

Process Matters more than Output for Distinguishing Humans from Machines

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning improving mimicry but limited cross-task transfer.

Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models

cs.CR · 2026-04-13 · unverdicted · novelty 6.0

Adaptive Stealing improves watermark theft efficiency from LLMs via Position-Based Seal Construction and Adaptive Selection modules that dynamically choose optimal attack perspectives.

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

cs.CL · 2026-04-10 · unverdicted · novelty 6.0

Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

cs.CL · 2023-10-03 · conditional · novelty 6.0

AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than baselines while bypassing perplexity defenses.

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

cs.CR · 2023-08-07 · unverdicted · novelty 6.0

Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.

Jailbroken: How Does LLM Safety Training Fail?

cs.LG · 2023-07-05 · unverdicted · novelty 6.0

LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.

Multilingual jailbreaking of LLMs using low-resource languages

cs.CL · 2026-05-18 · unverdicted · novelty 5.0

Multi-turn prompts in Afrikaans, Kiswahili, isiXhosa and isiZulu achieve 52-83% harmful response rates across GPT, Claude, Gemini and others, rising further with native-speaker red-teaming, showing translation quality limits jailbreak success.

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.

From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI

cs.CR · 2026-05-15 · unverdicted · novelty 3.0

The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

cs.CR · 2026-05-09

citing papers explorer

Showing 1 of 1 citing paper after filters.

TrustLLM: Trustworthiness in Large Language Models cs.CL · 2024-01-10 · unverdicted · none · ref 248
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.

Spear phishing with large language models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer