VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
A holistic approach to undesired content detection
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.
citing papers explorer
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
Test-Time Safety Alignment
Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
-
Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
Ignore Previous Prompt: Attack Techniques For Language Models
PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.
-
Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
-
ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction
ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.