pith. machine review for the scientific record.

arxiv: 2605.04446 · v1 · submitted 2026-05-06 · 💻 cs.CR

Recognition: 2 Lean theorem links

Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

Jianing Geng, Ruiqi He, Weijie Liu, Xiaofeng Wang, Zekun Fei, Zheli Liu, Zihao Wang

Pith reviewed 2026-05-08 18:01 UTC · model grok-4.3

classification 💻 cs.CR
keywords mixture-of-experts · adversarial attacks · llm safety · routing mechanisms · input-only attacks · model transfer · expert alignment

The pith

Adversarial inputs optimized on open-source MoE models can be transferred to steer routing in proprietary API services and bypass safety alignments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Mixture-of-Experts LLMs expose a new attack surface through their routing mechanisms, which can be manipulated indirectly via input perturbations alone. By training attacks in a white-box setting on accessible surrogate models and transferring them to closed services in the same family, the work shows how attackers can direct inputs toward weakly aligned experts while favoring capable ones for coherent harmful outputs. This matters because real-world LLM deployments are remote and input-only, so any successful transfer demonstrates that existing safety measures fail against routing-targeted attacks without requiring model weights or local deployment. The approach uses a two-phase process to first stabilize routing toward target experts and then refine outputs while keeping the routing path intact. If the transfer succeeds consistently, it implies that sparse activation patterns create transferable vulnerabilities across model versions.

Core claim

Misrouter jointly optimizes inputs to influence routing decisions and expert selection in MoE architectures. It first identifies weakly aligned experts by observing activations on harmful queries paired with unsafe continuations, then steers routing toward those experts and away from strongly aligned ones while also favoring high-capability general experts. A two-phase optimization first fixes the routing path and then refines the harmful content generation without destabilizing the chosen experts.
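The two-phase structure can be sketched in toy form. Everything below (the linear router, the expert partitions `WEAK` and `STRONG`, the continuous gradient search) is illustrative scaffolding, not the paper's implementation, which optimizes discrete token inputs against a real open-source MoE surrogate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a surrogate MoE router: logits over E experts as a
# linear function of a continuous input vector. Shapes and partitions
# are invented for illustration.
E, D = 8, 16
W_router = rng.normal(size=(E, D))
WEAK, STRONG = {2, 5}, {0, 1}  # hypothetical weakly/strongly aligned experts

def routing_probs(x):
    logits = W_router @ x
    z = np.exp(logits - logits.max())
    return z / z.sum()

def routing_loss(x):
    # Phase-1 objective: push probability mass onto weakly aligned
    # experts and off strongly aligned ones.
    p = routing_probs(x)
    return -sum(p[i] for i in WEAK) + sum(p[j] for j in STRONG)

def steer_routing(x, steps=200, lr=0.5, eps=1e-4):
    # Central-difference gradient descent stands in for the paper's
    # discrete input optimization.
    for _ in range(steps):
        g = np.zeros_like(x)
        for k in range(len(x)):
            d = np.zeros_like(x)
            d[k] = eps
            g[k] = (routing_loss(x + d) - routing_loss(x - d)) / (2 * eps)
        x = x - lr * g
    return x

x0 = rng.normal(size=D)
x1 = steer_routing(x0)
p0, p1 = routing_probs(x0), routing_probs(x1)
```

After the loop, routing mass concentrates on `WEAK`; the paper's phase 2 (refining the harmful output while holding this routing path fixed) is omitted here.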

What carries the argument

Misrouter, the input-only attack framework that identifies weakly aligned experts from activation patterns and uses two-phase optimization to steer routing toward them while preserving output quality.

If this is right

  • Safety alignments applied during training can be bypassed in deployed MoE services without any direct access to model internals.
  • Routing control and output generation can be decoupled enough to allow both safety bypass and coherent harmful responses through input perturbations alone.
  • Safety alignment must be enforced uniformly at the expert level; relying on a small set of strongly aligned experts to absorb all risky inputs is insufficient.
  • Production MoE systems that share model families between open and closed versions inherit transferable attack surfaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar routing-based attacks may succeed on other sparse or conditional computation architectures beyond current MoE designs.
  • API providers could detect such attacks by monitoring for unusual routing patterns or expert activation distributions on submitted queries.
  • Future safety evaluations for MoE models should include surrogate-to-target transfer tests rather than only white-box or black-box local evaluations.
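The detection idea in the second bullet reduces to a distribution check over expert selections. A minimal sketch, in which the eight-expert setup, all frequencies, and the threshold are made up rather than taken from the paper:

```python
import numpy as np

def kl(p, q, eps=1e-9):
    # KL(p || q) over expert-selection frequencies, with smoothing.
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical per-query expert-selection frequencies over 8 experts.
# `baseline` would be estimated from benign traffic; `suspect` piles
# mass on experts 2 and 5, the signature a routing-steering input
# would leave.
baseline = [0.20, 0.18, 0.05, 0.15, 0.12, 0.04, 0.14, 0.12]
benign   = [0.19, 0.17, 0.06, 0.16, 0.11, 0.05, 0.13, 0.13]
suspect  = [0.02, 0.02, 0.45, 0.03, 0.02, 0.40, 0.03, 0.03]

THRESHOLD = 0.5  # illustrative; a provider would calibrate on real traffic

def flag(query_dist):
    return kl(query_dist, baseline) > THRESHOLD
```

Here `flag(suspect)` trips while `flag(benign)` does not; a real deployment would need per-layer baselines and a calibrated false-positive budget.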

Load-bearing premise

Routing decisions and expert behaviors observed on open-source surrogate models are similar enough to those in proprietary API versions for the optimized adversarial inputs to transfer effectively.

What would settle it

Apply the transferred adversarial inputs to the target public API service and measure whether the rate of harmful or unsafe outputs rises significantly above the rate produced by random or benign inputs of similar length.
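That settling experiment is a two-sample comparison of unsafe-output rates. A minimal sketch with a pooled one-sided z-test; the counts are invented placeholders, not results from the paper:

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    # One-sided pooled z-test: is rate k1/n1 greater than k2/n2?
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical outcome: 200 transferred adversarial prompts vs. 200
# length-matched benign controls, each labeled unsafe/safe by an
# independent judge.
z = two_proportion_z(k1=62, n1=200, k2=4, n2=200)
significant = z > 1.645  # one-sided alpha = 0.05
```

A rise from 2% to 31% unsafe outputs would clear the bar easily; the interesting failure mode is when the transferred rate sits only marginally above the control.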

Figures

Figures reproduced from arXiv: 2605.04446 by Jianing Geng, Ruiqi He, Weijie Liu, Xiaofeng Wang, Zekun Fei, Zheli Liu, Zihao Wang.

Figure 1
Figure 1: Overview of Misrouter. Weighted routing contrast. The routing score for each expert i is computed as ΔU_i = λ₁ Ū_i(D_harm) − λ₂ Ū_i(D_comp) − λ₃ Ū_i(D_benign) (19), where λ₁, λ₂, λ₃ > 0 control the relative importance of suppressing strongly aligned experts, promoting weakly aligned experts, and preserving generation utility, respectively. This sign convention is important. Experts frequently activated by D_har… view at source ↗
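Equation (19) from the caption is a straightforward weighted contrast. A literal transcription, where the λ values and the mean-activation arrays are placeholders (the excerpt does not give the paper's settings, and the caption's interpretation of high vs. low scores is truncated):

```python
import numpy as np

def routing_score(U_harm, U_comp, U_benign, lam1=1.0, lam2=1.0, lam3=0.5):
    # Eq. (19): Delta_U_i = lam1 * Ubar_i(D_harm)
    #                      - lam2 * Ubar_i(D_comp)
    #                      - lam3 * Ubar_i(D_benign)
    # All lambdas are positive; only the arithmetic is reproduced here.
    return (lam1 * np.asarray(U_harm, float)
            - lam2 * np.asarray(U_comp, float)
            - lam3 * np.asarray(U_benign, float))

# Placeholder mean activations Ubar_i over four experts on each dataset.
delta = routing_score(U_harm=[0.8, 0.1, 0.6, 0.2],
                      U_comp=[0.1, 0.7, 0.2, 0.5],
                      U_benign=[0.4, 0.2, 0.2, 0.3])
```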
original abstract

Mixture-of-Experts (MoE) architectures have emerged as a leading paradigm for scaling large language models through sparse, routing-based computation. However, this design introduces a new attack surface: the routing mechanism that determines which experts process each input. Prior work shows that manipulating routing can bypass safety alignment, but existing attacks require model modification and thus apply only to locally deployed models. By contrast, real-world LLM services are remotely hosted and accessible only through input queries. This raises a fundamental question: can MoE routing be exploited through input-only attacks to induce stronger unsafe behaviors in real-world services? Our key insight is to optimize attacks in a white-box setting on open-source surrogate MoE models and transfer the resulting adversarial inputs to public API services within the same model family. This setting presents three main challenges: routing can be influenced only indirectly through input perturbations, routing control and output generation are tightly coupled, and even a successful safety bypass may still produce low-quality responses. To address these challenges, we propose Misrouter, an input-only attack framework that jointly targets routing behavior and expert functionality. Misrouter identifies weakly aligned experts that are willing to produce target harmful content by analyzing expert activations under harmful queries paired with unsafe continuations. It then optimizes adversarial inputs to steer routing toward these experts and away from strongly aligned ones. It further biases routing toward highly capable general-purpose experts identified from benign question-answering tasks. Finally, because routing and output objectives can conflict, Misrouter uses a two-phase optimization strategy that first steers routing and then optimizes harmful outputs while preserving routing stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Misrouter, an input-only attack framework targeting the routing mechanisms of Mixture-of-Experts LLMs. It optimizes adversarial inputs in a white-box setting on open-source surrogate models to steer token routing toward weakly-aligned experts (identified via activation analysis on harmful queries) and away from strongly-aligned ones, while also biasing toward capable general-purpose experts. A two-phase optimization strategy first stabilizes routing control and then optimizes for harmful outputs. The resulting prompts are transferred to proprietary API services within the same model family to induce unsafe behaviors without direct model access.

Significance. If the transferability claims hold under empirical validation, the work would be significant for exposing a new attack surface in remotely hosted MoE LLMs, which are increasingly deployed in production services. It systematically addresses three challenges (indirect routing influence, coupled routing/output objectives, and response quality) with a concrete two-phase strategy and expert-identification method. This could inform both attack research and defenses focused on routing robustness. The framework's reliance on surrogate-to-target transfer is a practical contribution if demonstrated.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method): The central claim that optimized inputs transfer effectively to induce safety bypasses in proprietary API models rests on an unverified similarity assumption between surrogate and target routers/expert behaviors. No analysis, ablation, or router-activation comparison is provided to show that perturbations steering routing on open-source models (e.g., Mixtral) produce equivalent expert selection in API versions, which differ in training data, post-training, and gating networks. This assumption is load-bearing for the transfer-based attack.
  2. [Abstract and §4] Abstract and §4 (Experiments): No quantitative results, attack success rates, transfer success metrics, or baseline comparisons are reported. The abstract and method description outline the framework but provide no validation that the two-phase optimization achieves routing stability or harmful output generation after transfer, undermining assessment of whether the approach supports the stated claims.
minor comments (2)
  1. [§3] Clarify notation for expert activation metrics and the precise optimization objectives (e.g., loss terms for routing steering vs. output harm) to improve reproducibility.
  2. [§5] Add discussion of potential failure modes, such as when routing cannot be sufficiently controlled via input perturbations alone.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive report. The feedback highlights important aspects of our transfer-based attack claims that require clarification and strengthening in the manuscript. We address each major comment below and outline the revisions we will make.

point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that optimized inputs transfer effectively to induce safety bypasses in proprietary API models rests on an unverified similarity assumption between surrogate and target routers/expert behaviors. No analysis, ablation, or router-activation comparison is provided to show that perturbations steering routing on open-source models (e.g., Mixtral) produce equivalent expert selection in API versions, which differ in training data, post-training, and gating networks. This assumption is load-bearing for the transfer-based attack.

    Authors: We agree that direct verification of router similarity is not feasible, as proprietary APIs do not expose internal gating networks or expert activations. Our approach relies on the empirical observation that adversarial inputs optimized on open-source surrogates from the same model family (e.g., Mixtral variants) successfully induce unsafe behaviors when transferred to the corresponding API endpoints. We will revise §3 to explicitly discuss this limitation, add a new subsection on surrogate selection rationale and expected behavioral similarity within model families, and include additional transfer experiments with multiple surrogates to strengthen the evidence. We cannot provide router-level ablations on the target APIs. revision: partial

  2. Referee: [Abstract and §4] Abstract and §4 (Experiments): No quantitative results, attack success rates, transfer success metrics, or baseline comparisons are reported. The abstract and method description outline the framework but provide no validation that the two-phase optimization achieves routing stability or harmful output generation after transfer, undermining assessment of whether the approach supports the stated claims.

    Authors: We apologize for the insufficient visibility of the experimental results in the abstract and early sections. Quantitative attack success rates, transfer metrics (including success on API targets), routing stability measurements, and baseline comparisons are presented in §4. We will revise the abstract to summarize key quantitative findings (e.g., attack success rates and transfer rates) and update §3 to cross-reference the empirical validation of the two-phase strategy and expert identification method. This will make the validation more prominent. revision: yes

standing simulated objections not resolved
  • Direct router-activation comparisons or ablations on proprietary API models, due to lack of access to internal model states.

Circularity Check

0 steps flagged

No circularity: empirical attack framework with independent transfer testing

full rationale

The paper presents an empirical attack method (Misrouter) that optimizes adversarial inputs on open-source MoE surrogates and evaluates transfer to API models. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The central claim rests on experimental transfer results rather than any definitional equivalence or reduction of outputs to inputs by construction. The transferability assumption is an empirical hypothesis subject to external falsification, not a circular step. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unverified assumption that surrogate models share routing and expert alignment properties with target API models; beyond the newly proposed Misrouter framework itself, no free parameters, axioms, or invented entities are explicitly defined in the abstract.

invented entities (1)
  • Misrouter framework (no independent evidence)
    purpose: jointly targets routing and expert functionality for input-only safety bypass
    Newly proposed attack method described in the abstract.

pith-pipeline@v0.9.0 · 5609 in / 1037 out tokens · 42443 ms · 2026-05-08T18:01:50.842500+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

44 extracted references · 21 canonical work pages · 8 internal anchors

  1. [1]

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905 (2024)

  2. [2]

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. 2025. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925 (2025)

  3. [3]

    Rishabh Bhardwaj, Duc Anh Do, and Soujanya Poria. 2024. Language models are Homer Simpson! Safety re-alignment of fine-tuned language models through task arithmetic. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 14138–14149

  4. [4]

    Rishabh Bhardwaj and Soujanya Poria. 2023. Red-teaming large language models using chain of utterances for safety-alignment.arXiv preprint arXiv:2308.09662 (2023)

  5. [5]

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2025. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering (2025)

  7. [7]

    Yufei Chen, Chao Shen, Cong Wang, and Yang Zhang. 2022. Teacher model fingerprinting attacks against transfer learning. In 31st USENIX Security Symposium (USENIX Security 22). 3593–3610

  8. [8]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457(2018)

  9. [9]

    Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. 2024. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1280–1297

  10. [10]

    Yi Dong, Ronghui Mu, Gaojie Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, and Xiaowei Huang. 2024. Building guardrails for large language models.arXiv preprint arXiv:2402.01822(2024)

  11. [11]

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning. PMLR, 5547–5569

  13. [13]

    Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, and Nanyun Peng. 2025. Steering moe llms via expert (de) activation.arXiv preprint arXiv:2509.09660(2025)

  14. [14]

    Jamie Hayes, Ilia Shumailov, and Itay Yona. 2024. Buffer overflow in mixture of experts.arXiv preprint arXiv:2402.05526(2024)

  15. [15]

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674(2023)

  16. [16]

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts.Neural computation3, 1 (1991), 79–87

  17. [17]

    Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, and Yang Zhang. 2026. Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs.arXiv preprint arXiv:2602.08621(2026)

  18. [18]

    Torsten Krauß, Hamid Dashtbani, and Alexandra Dmitrienko. 2025. TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts. In 34th USENIX Security Symposium (USENIX Security 25). 2343–2362

  19. [19]

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics7 (2019), 453–466

  20. [20]

    Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, and Jianqiang Li. 2025. Safex: Analyzing vulnerabilities of moe-based llms via stable safety-critical expert identification.arXiv preprint arXiv:2506.17368 (2025)

  21. [21]

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434 (2024)

  22. [22]

    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. 2024. JailbreakV: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. arXiv preprint arXiv:2404.03027 (2024)

  23. [23]

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 conference on empirical methods in natural language processing. 2381–2391

  24. [24]

    Mistral AI. 2023. Mixtral of Experts. https://mistral.ai/news/mixtral-of-experts

  25. [25]

    OpenAI. 2025. Introducing gpt-oss. https://openai.com/index/introducing-gpt-oss/

  26. [26]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.Advances in neural information processing systems35 (2022), 27730–27744

  27. [27]

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv preprint arXiv:2310.03693(2023)

  28. [28]

    Qwen Team. 2024. Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters. https://qwen.ai/blog?id=qwen-moe

  29. [29]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems36 (2023), 53728–53741

  30. [30]

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale.Commun. ACM 64, 9 (2021), 99–106

  31. [31]

    Shuo Shao, Yiming Li, Yu He, Hongwei Yao, Wenyuan Yang, Dacheng Tao, and Zhan Qin. 2025. Sok: Large language model copyright auditing via fingerprinting. arXiv preprint arXiv:2508.19843(2025)

  32. [32]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538(2017)

  33. [33]

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security. 1671–1685

  34. [34]

    Abhay Sheshadri, Aidan Ewart, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. 2024. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs. Transactions on Machine Learning Research (2024)

  35. [35]

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. 2024. A StrongREJECT for empty jailbreaks. Advances in Neural Information Processing Systems 37 (2024), 125416–125440

  36. [36]

    Haochun Tang, Yuliang Yan, Jiahua Lu, Huaxiao Liu, and Enyan Dai. 2026. Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization.arXiv preprint arXiv:2604.15022(2026)

  37. [37]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP workshop Black- boxNLP: Analyzing and interpreting neural networks for NLP. 353–355

  38. [38]

    Qingyue Wang, Qi Pang, Xixun Lin, Shuai Wang, and Daoyuan Wu. 2025. BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts. arXiv preprint arXiv:2504.18598 (2025)

  39. [39]

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail?Advances in neural information processing systems 36 (2023), 80079–80110

  40. [40]

    Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami, Stjepan Picek, and Ahmad-Reza Sadeghi. 2025. GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs. arXiv preprint arXiv:2512.21008 (2025)

  41. [41]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

  42. [42]

    Itay Yona, Ilia Shumailov, Jamie Hayes, and Nicholas Carlini. 2024. Stealing user prompts from mixture of experts.arXiv preprint arXiv:2410.22884(2024)

  43. [43]

    Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, and Yang Zhang. 2024. Large language models are involuntary truth-tellers: Exploiting fallacy failure for jailbreak attacks. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 13293–13304

  44. [44]

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)
    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043(2023). Ethical Considerations In this work, we introduceMisrouter, an input-only attack frame- work that exploits the routing mechanism of MoE LLMs to induce...