Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs.CoRR abs/2602.08621
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
citing papers explorer
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.