Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
hub Baseline reference
SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions
Baseline reference. 57% of citing Pith papers use this work as a benchmark or comparison.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
Item Response Theory enables adaptive and fixed-subset item selection that reduces safety benchmark costs by 80-99.9% while preserving high correlation with full rankings.
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.
Bounded performance metrics always favor convergence of AI capabilities to meek models while unbounded metrics allow frontier models to maintain leads indefinitely, with policy implications for capability concentration.
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
citing papers explorer
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts
Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.
-
Beyond Context: Large Language Models' Failure to Grasp Users' Intent
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.