Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
arXiv preprint arXiv:2308.09662 , year=
15 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
CREST-Search is a red-teaming framework that crafts seemingly benign search queries to induce unsafe citations from web-augmented LLMs, backed by a new WebSearch-Harm dataset for fine-tuning a specialized attacker model.
Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.
A participatory red-teaming project in the Global South created the PLACES dataset of 26k T2I failure examples that reveal unique cultural and linguistic harms missed by existing safety frameworks.
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
ToxSearch-S applies unsupervised speciation to evolutionary prompt search, maintaining capacity-limited species with exemplar leaders and species-aware selection to achieve higher peak toxicity and broader semantic coverage than standard methods.
Personalization through long-term memory in LLM agents increases harmful query success rates by 15.8-243.7% via intent legitimation, measured on the new PS-Bench benchmark across frameworks.
CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.
Phonetic perturbations fragment safety-critical tokens in LLMs, suppressing attribution scores while preserving input understanding and causing safety mechanisms to fail despite good comprehension.
SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
citing papers explorer
-
Evalet: Evaluating Large Language Models through Functional Fragmentation
Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.