Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs. CoRR abs/2602.08621

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background (3)

citation-polarity summary: background (3)

fields: cs.CR (3), cs.LG (2)

years: 2026 (5)

representative citing papers

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

MESA: Improving MoE Safety Alignment via Decentralized Expertise

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.
