Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
What’s in your" safe" data?: Identifying benign data that breaks safety
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.
Benign fine-tuning drifts LLM parameters toward danger directions; SQSD scores each sample by the projection difference of its induced update onto safety versus danger vectors.
Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.
Benign fine-tuning of foundation models induces large, heterogeneous, and often contradictory changes in safety metrics across general and domain-specific benchmarks.
Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
citing papers explorer
-
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
-
Why Do Safety Guardrails Degrade Across Languages?
A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.
-
From Parameter Dynamics to Risk Scoring : Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
Benign fine-tuning drifts LLM parameters toward danger directions; SQSD scores each sample by the projection difference of its induced update onto safety versus danger vectors.
-
The Power of Order: Fooling LLMs with Adversarial Table Permutations
Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.
-
Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains
Benign fine-tuning of foundation models induces large, heterogeneous, and often contradictory changes in safety metrics across general and domain-specific benchmarks.
-
Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
-
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.