{"total":14,"items":[{"citing_arxiv_id":"2605.18239","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multilingual jailbreaking of LLMs using low-resource languages","primary_cat":"cs.CL","submitted_at":"2026-05-18T11:33:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-turn prompts in Afrikaans, Kiswahili, isiXhosa and isiZulu achieve 52-83% harmful response rates across GPT, Claude, Gemini and others, rising further with native-speaker red-teaming, showing translation quality limits jailbreak success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16471","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI","primary_cat":"cs.CR","submitted_at":"2026-05-15T13:53:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"These are distinct signal domains that resist theoretical unification. 
Modality | Paradigm | Methods | Mechanism | Key Limitation || Text | Zero-shot | DetectGPT [80], Binoculars [49] | Probability curvature | Requires generator access; paraphrase fragile || Text | Supervised | GPTZero [2], RADAR [53] | Encoder fine-tuning | Cross-generator drop to <60% [74] || Image | Frequency | Spectral analysis | GAN upsampling artifacts | Mitigated by post-processing || Image | Representation | UnivFD [87], DIRE [132] | Feature-space classification | GAN→diffusion: 3.05% accuracy || Image | Reconstruction | AEROBLADE [107] | Diffusion reconstruction error | Computationally expensive || Audio | Challenge-based | ASVspoof [139, 33] | Spectral/prosodic features | Codec artifacts confound detection"},{"citing_arxiv_id":"2605.16436","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The End of Trust: How Agentic AI Breaks Security Assumptions","primary_cat":"cs.CR","submitted_at":"2026-05-14T21:30:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic AI eliminates the fidelity-scale tradeoff in deception, enabling the Infinite Impostor attack that hijacks trusted relationships at mass scale and requiring a shift to suspect-by-default security based on evaluating actions rather than actors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10977","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks","primary_cat":"cs.CR","submitted_at":"2026-05-09T01:09:01+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", 2024; Touvron et al., 2023; Yang 
et al., 2025a). As LLMs become increasingly powerful, the distinction between machine-generated and human-authored text has become blurred. This raises significant concerns around misuse, including large-scale disinformation (Vykopal et al., 2024; Zhu et al., 2025b), automated spear phishing and targeted deception (Hazell, 2023), amplified threats to organizational security (Mirsky et al., 2023), and challenges to academic evaluation systems (Balalle & Pannilage, 2025). These concerns motivate the need for verifiable provenance and accountable attribution. Recent work has focused on active provenance via LLM watermarking (Kirchenbauer et al., 2023; Liu et al., 2024c; Yang et al."},{"citing_arxiv_id":"2605.06524","ref_index":63,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Process Matters more than Output for Distinguishing Humans from Machines","primary_cat":"cs.AI","submitted_at":"2026-05-07T16:30:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning improving mimicry but limited cross-task transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10893","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-04-13T01:46:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adaptive Stealing improves watermark theft efficiency from LLMs via Position-Based Seal Construction and Adaptive Selection modules that dynamically choose optimal attack 
perspectives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09544","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism","primary_cat":"cs.CL","submitted_at":"2026-04-10T17:58:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03121","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Independent Safety Evaluation of Kimi K2.5","primary_cat":"cs.CR","submitted_at":"2026-04-03T15:45:35+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"marking indirect prompt injections in tool-integrated llm agents. arXiv preprint arXiv:2403.02691, 2024. [78] L. Lanya Yu, William Anderson, Andy Wang, and Tao Yang. LLM agent security risks: Demos and best practices. In TREO Talks, 46th International Conference on Information Systems (ICIS 2025), Nashville, TN, 2025. URL https://aisel.aisnet.org/treos_icis2025/104. [79] Ryan Greenblatt, Buck Shlegeris, Mrinank Sachan, and Fabien Roger. Alignment faking in large language models. arXiv preprint arXiv:2412.14093, 2024. [80] Alexander Meinke, Mislav Balesni, Rusheb Shah, and Fabien Roger. 
Frontier models are capable of in-context scheming. arXiv preprint arXiv:2412.04984, 2024. [81] Joe Benton, Jérémy Scheurer, Ryan Greenblatt, Cem Anil, and Ethan Perez."},{"citing_arxiv_id":"2508.21457","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SoK: Exposing the Generation and Detection Gaps in LLM-Generated Phishing","primary_cat":"cs.CR","submitted_at":"2025-08-29T09:39:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This SoK paper introduces a nine-stage taxonomy for LLM guardrail breaches in phishing, characterizes evasion and manipulation tactics, and identifies a dynamic-offense versus static-defense asymmetry.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.05561","ref_index":248,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TrustLLM: Trustworthiness in Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-01-10T22:07:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"malicious intent exploit LLMs for nefarious objectives [251]. Prior research has shown that LLMs are susceptible to various forms of misuse. 
Specifically, they have been implicated in the propagation of misinformation [227, 226, 515], the endorsement of conspiracy theories [516], the sophisticated cyberattacks [517], the amplification of spear phishing attacks [248], and the facilitation of hate-driven campaigns [518, 519] through LLM's outstanding abilities. Dataset. There are already many datasets on the misuse of LLMs [73, 193]. In a recent study, a Do-Not-Answer [73] dataset is released, which contains various types of misuse actions. When discussing the misuse of LLMs, we mainly refer to dangerous or inappropriate uses, such as asking how to make a bomb."},{"citing_arxiv_id":"2310.06987","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation","primary_cat":"cs.CL","submitted_at":"2023-10-10T20:15:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.04451","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-10-03T19:44:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than baselines while bypassing perplexity 
defenses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.03825","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"\"Do Anything Now\": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models","primary_cat":"cs.CR","submitted_at":"2023-08-07T16:55:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.02483","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Jailbroken: How Does LLM Safety Training Fail?","primary_cat":"cs.LG","submitted_at":"2023-07-05T17:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[26] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you've asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. arXiv preprint arXiv:2302.12173, 2023. [27] Alexey Guzey. A two sentence jailbreak for GPT-4 and Claude & why nobody knows how to fix it. https://guzey.com/ai/two-sentence-universal-jailbreak/, 2023. [28] Julian Hazell. Large language models can be used to effectively scale spear phishing campaigns. arXiv preprint arXiv:2305.06972, 2023. 
[29] Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023. [30] Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto."}],"limit":50,"offset":0}