Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
arXiv preprint arXiv:2308.09662 , year=
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
citing papers explorer
-
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
-
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
-
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.