{"total":12,"items":[{"citing_arxiv_id":"2606.27632","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety","primary_cat":"cs.CL","submitted_at":"2026-06-26T01:12:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22643","ref_index":92,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","primary_cat":"cs.CL","submitted_at":"2026-05-21T15:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10639","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks","primary_cat":"cs.AI","submitted_at":"2026-05-11T14:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Following Human; ToxiGen [9] 2022 
Implicit Toxicity, Hate Speech Text Generation, Classification HateBERT [1], ToxDectRoBERTa [36]; AdvBench [37] 2023 Adversarial Robustness Text Generation, Instr. Following Automated Metrics; DoNotAnswer [30] 2023 Harmfulness QA, Instr. Following Human, LLM-as-a-Judge; MaliciousInstruct [11] 2023 Jailbreak Robustness Instruction Following Automated Classifier; SafetyBench [34] 2023 Comprehensive Safety Multiple-Choice QA Accuracy; SimpleSafetyTests [29] 2023 Harmfulness QA, Instr. Following Human, Automated Classifiers; ToxicChat [17] 2023 Toxicity Toxicity Classification Classification Metrics; XSTest [24] 2023 Over-refusal QA, Instr. Following Human, Rule-based, LLM-as-a-Judge; BeHonest [3] 2024 Honesty, Misinformation QA Rule-based, LLM-as-a-Judge; HarmBench [20] 2024 Red Teaming, Robust Refusal Instruction Following LLM-as-a-Judge; JailbreakBench [2] 2024 Jailbreak Robustness Instruction Following Rule-based, LLM-as-a-Judge; SALAD-Bench [14] 2024 Comprehensive Safety QA, Instr."},{"citing_arxiv_id":"2605.06652","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24074","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Sensitive Are Safety Benchmarks to Judge Configuration 
Choices?","primary_cat":"cs.CL","submitted_at":"2026-04-27T05:59:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16659","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs","primary_cat":"cs.CR","submitted_at":"2026-04-17T19:28:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14548","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","primary_cat":"cs.SD","submitted_at":"2026-04-16T02:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"chinese large language models. arXiv preprint arXiv:2304.10436, 2023. [12] Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 
Figstep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23951-23959, 2025. [13] Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, and Minlie Huang. Safetybench: Evaluating the safety of large language models, 2024. URL https://arxiv.org/abs/2309.07045. [14] Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen,"},{"citing_arxiv_id":"2604.02713","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts","primary_cat":"cs.CL","submitted_at":"2026-04-03T04:10:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"More interactive systems such as Ψ-Arena [50] incorporate triadic feedback loops and psychologically profiled clients to optimize LLM counselors. Other frameworks assess socio-emotional capacity more abstractly. BOLT [7] links behavioral patterns to therapeutic orientations, EQ-Bench [26] evaluates emotional intelligence upstream of conversation, and SPHERE [46] provides standards for ensuring methodological transparency and evaluation validity. Although these efforts broaden multi-turn evaluation, they share two assumptions that limit their coverage of alignment dynamics. 
First, they usually evaluate systems interacting with cooperative or support-seeking users, emphasizing capacities such as reflective"},{"citing_arxiv_id":"2512.21110","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Context: Large Language Models' Failure to Grasp Users' Intent","primary_cat":"cs.AI","submitted_at":"2025-12-24T11:15:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"structed contextual scenarios with empirical testing of exploitation techniques across multiple model architectures [2], [3]. Through controlled experiments, we demonstrate how benign-looking prompts can reliably circumvent safety mechanisms across diverse application domains, from mental health support systems to content moderation platforms [11], [12]. The significance of this study extends beyond academic interest, revealing immediate concerns for AI deployment [13], [14]. As LLMs become increasingly integrated into sensitive applications, understanding and addressing these fundamental limitations becomes essential for ensuring safe and reliable AI systems [15], [16]. 
Our findings suggest that technical safe-"},{"citing_arxiv_id":"2508.06471","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models","primary_cat":"cs.CL","submitted_at":"2025-08-08T17:21:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"On the IFEval benchmark, GLM-4.5 outperforms DeepSeek R1. In the Sysbench evaluation, GLM-4.5 surpasses GPT-4.1, DeepSeek V3, and Kimi K2. Additionally, on the MultiChallenge benchmark, it demonstrates superior performance compared to both GPT-4.1 and DeepSeek R1. 4.2.5 Evaluation of Safety To systematically assess the safety alignment of our model, we utilized SafetyBench [51], a comprehensive benchmark designed to evaluate the safety of large language models. SafetyBench consists of 11,435 multiple-choice questions covering seven distinct categories of safety concerns, with data in both English and Chinese. 
This benchmark enables a standardized and scalable evaluation of a model's ability to handle potentially harmful or sensitive topics."},{"citing_arxiv_id":"2407.04295","ref_index":115,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Jailbreak Attacks and Defenses Against Large Language Models: A Survey","primary_cat":"cs.CR","submitted_at":"2024-07-05T06:57:30+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"AdvBench [125] English 1000 8 Harmful strings and harmful behaviors; SafeBench [30] English 500 10 Unsafe questions; Do-Not-Answer [96] English 939 5 Harmful instructions; TechHazardQA [7] English 1850 7 Instruction-centric questions; SC-Safety [86] Chinese 4912 20+ Multi-round conversations; LatentJailbreak [69] Chinese English 416 3 Translation tasks; SafetyBench [115] Chinese English 11435 7 Multiple choice questions; StrongREJECT [84] English 346 6 Unsafe questions; AttackEval [80] English 390 13 Unsafe questions; HarmBench [63] English 510 18 Harmful behaviors; Safety-Prompts [86] Chinese 100000 14 Harmful behaviors; JailbreakBench [14] English 200 10 Harmful behaviors and benign behaviors; DoAnythingNow [79] English 107250 13 Forbidden questions"},{"citing_arxiv_id":"2406.12793","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools","primary_cat":"cs.CL","submitted_at":"2024-06-18T16:58:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, 
and Chinese alignment while adding autonomous tool use for web, code, and image generation.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Browser Information Seeking 78.08 67.12 4 Safety and Risks We are committed to ensuring that GLM-4 operates as a safe, responsible, and unbiased model. In addition to addressing common ethical and fairness concerns, we carefully assess and mitigate potential harms that the model may pose to users in real-world scenarios. Table 10: GLM-4 performance on SafetyBench [56], compared to GPT-4 models and Claude 3 Opus. Ethics & Morality Illegal Activities Mental Health Offensiveness Physical Health Privacy & Property Unfairness & Bias Overall; GPT-4 (0613) 92.7 93.3 93.0 87.7 96.7 91.3 73.3 89.7; GPT-4 Turbo (1106) 91.0 92.0 93.0 86.0 92.0 88.7 74.3 88.1; GPT-4 Turbo (2024-04-09) 90.3 91.3 91.7 85.3 92.0 89.3 75.0 87.9"}],"limit":50,"offset":0}