Recognition: 2 theorem links · Lean Theorem
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Pith reviewed 2026-05-08 18:01 UTC · model grok-4.3
The pith
Adversarial inputs optimized on open-source MoE models can be transferred to steer routing in proprietary API services and bypass safety alignments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Misrouter jointly optimizes inputs to target both routing behavior and expert functionality in MoE architectures. It first identifies weakly aligned experts by observing their activations on harmful queries paired with unsafe continuations, then steers routing toward those experts and away from strongly aligned ones, while also favoring high-capability general-purpose experts. A two-phase optimization first fixes the routing path and then optimizes harmful content generation without destabilizing the chosen expert selection.
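For concreteness, a minimal sketch of the two-phase idea on a toy continuous relaxation, not the paper's token-level attack: the gating weights, the choice of "weak" experts, and both loss functions below are synthetic stand-ins rather than the authors' objectives.

```python
# Minimal sketch (not the authors' code): a toy gate routes an input embedding
# to experts. Phase 1 perturbs the embedding to concentrate gate mass on a
# chosen set of "weakly aligned" experts; phase 2 optimizes a stand-in output
# objective while penalizing drift away from the routing fixed in phase 1.
import numpy as np

rng = np.random.default_rng(0)
D, E = 16, 8                      # embedding dim, number of experts (toy sizes)
W_gate = rng.normal(size=(E, D))  # stand-in for a real MoE gating network
weak = [2, 5]                     # indices of "weakly aligned" experts (assumed known)

def gate_probs(x):
    logits = W_gate @ x
    z = np.exp(logits - logits.max())
    return z / z.sum()

def routing_loss(x):
    return -gate_probs(x)[weak].sum()        # more mass on target experts -> lower loss

def output_loss(x):
    return float(np.sum((x - 1.0) ** 2))     # placeholder for the harmful-output objective

def num_grad(f, x, eps=1e-4):
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

x = rng.normal(size=D)

# Phase 1: steer routing toward the target experts.
for _ in range(200):
    x -= 0.1 * num_grad(routing_loss, x)
p_fixed = gate_probs(x).copy()

def phase2_loss(v):
    # output objective plus a penalty for drifting from the phase-1 routing
    return output_loss(v) + 10.0 * float(np.sum((gate_probs(v) - p_fixed) ** 2))

# Phase 2: pursue the output objective while keeping routing stable.
for _ in range(200):
    x -= 0.05 * num_grad(phase2_loss, x)

print("gate mass on target experts:", round(float(gate_probs(x)[weak].sum()), 3))
```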
What carries the argument
Misrouter, the input-only attack framework that identifies weakly aligned experts from activation patterns and uses two-phase optimization to steer routing toward them while preserving output quality.
If this is right
- Safety alignments applied during training can be bypassed in deployed MoE services without any direct access to model internals.
- Routing control and output generation can be decoupled enough to allow both safety bypass and coherent harmful responses through input perturbations alone.
- Expert-level alignment must be uniform across all experts rather than relying on a small set of strongly aligned ones to handle all inputs.
- Production MoE systems that share model families between open and closed versions inherit transferable attack surfaces.
Where Pith is reading between the lines
- Similar routing-based attacks may succeed on other sparse or conditional computation architectures beyond current MoE designs.
- API providers could detect such attacks by monitoring for unusual routing patterns or expert activation distributions on submitted queries (a minimal monitoring sketch follows this list).
- Future safety evaluations for MoE models should include surrogate-to-target transfer tests rather than only white-box or black-box local evaluations.
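On the detection point in the second bullet above, one hedged illustration of what provider-side monitoring might look like, assuming per-request expert-usage statistics are available to the operator; the baseline profile, the query profile, and the threshold below are all invented numbers.

```python
# Hypothetical monitoring check (not from the paper): flag a request whose
# expert-usage distribution diverges sharply from a profile built on benign traffic.
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

baseline_usage = np.array([0.14, 0.12, 0.13, 0.12, 0.12, 0.13, 0.12, 0.12])  # benign traffic profile
query_usage    = np.array([0.02, 0.03, 0.45, 0.02, 0.02, 0.40, 0.03, 0.03])  # routing concentrated on two experts

THRESHOLD = 0.5  # assumed operating point; would be tuned on held-out traffic
score = kl_divergence(query_usage, baseline_usage)
print(f"routing divergence = {score:.3f} -> {'flag' if score > THRESHOLD else 'pass'}")
```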
Load-bearing premise
Routing decisions and expert behaviors observed on open-source surrogate models are similar enough to those in proprietary API versions for the optimized adversarial inputs to transfer effectively.
What would settle it
Apply the transferred adversarial inputs to the target public API service and measure whether the rate of harmful or unsafe outputs rises significantly above the rate produced by random or benign inputs of similar length.
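A sketch of that settling experiment, assuming binary unsafe-output judgments are already in hand for adversarial prompts and for length-matched control prompts sent to the target API; the counts below are illustrative placeholders, since no results are reported in this excerpt.

```python
# Two-proportion z-test comparing unsafe-output rates for adversarial vs. control
# prompts. All counts are placeholders, not reported results.
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) under the null
    return p_a, p_b, z, p_one_sided

p_adv, p_ctrl, z, pval = two_proportion_z(hits_a=62, n_a=100, hits_b=4, n_b=100)
print(f"unsafe rate: adversarial {p_adv:.2f} vs control {p_ctrl:.2f}; z = {z:.2f}, one-sided p = {pval:.1e}")
```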
Original abstract
Mixture-of-Experts (MoE) architectures have emerged as a leading paradigm for scaling large language models through sparse, routing-based computation. However, this design introduces a new attack surface: the routing mechanism that determines which experts process each input. Prior work shows that manipulating routing can bypass safety alignment, but existing attacks require model modification and thus apply only to locally deployed models. By contrast, real-world LLM services are remotely hosted and accessible only through input queries. This raises a fundamental question: can MoE routing be exploited through input-only attacks to induce stronger unsafe behaviors in real-world services? Our key insight is to optimize attacks in a white-box setting on open-source surrogate MoE models and transfer the resulting adversarial inputs to public API services within the same model family. This setting presents three main challenges: routing can be influenced only indirectly through input perturbations, routing control and output generation are tightly coupled, and even a successful safety bypass may still produce low-quality responses. To address these challenges, we propose Misrouter, an input-only attack framework that jointly targets routing behavior and expert functionality. Misrouter identifies weakly aligned experts that are willing to produce target harmful content by analyzing expert activations under harmful queries paired with unsafe continuations. It then optimizes adversarial inputs to steer routing toward these experts and away from strongly aligned ones. It further biases routing toward highly capable general-purpose experts identified from benign question-answering tasks. Finally, because routing and output objectives can conflict, Misrouter uses a two-phase optimization strategy that first steers routing and then optimizes harmful outputs while preserving routing stability.
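The expert-identification step in the abstract could, in outline, reduce to comparing how often each expert fires on harmful queries paired with unsafe continuations versus refusals on a surrogate model. A minimal sketch, with synthetic counts standing in for logged expert selections:

```python
# Synthetic sketch of ranking "weakly aligned" experts by how much their usage
# skews toward (harmful query, unsafe continuation) pairs relative to refusals.
import numpy as np

rng = np.random.default_rng(1)
unsafe_counts  = rng.poisson(lam=[5, 4, 30, 6, 5, 25, 4, 5])      # expert selections on unsafe pairs
refusal_counts = rng.poisson(lam=[20, 18, 6, 19, 21, 7, 20, 19])  # expert selections on refusal pairs

unsafe_freq   = unsafe_counts / unsafe_counts.sum()
refusal_freq  = refusal_counts / refusal_counts.sum()
alignment_gap = unsafe_freq - refusal_freq   # positive -> fires more on unsafe content

weakly_aligned = np.argsort(alignment_gap)[::-1][:2]
print("candidate weakly aligned experts:", weakly_aligned.tolist())
```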
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Misrouter, an input-only attack framework targeting the routing mechanisms of Mixture-of-Experts LLMs. It optimizes adversarial inputs in a white-box setting on open-source surrogate models to steer token routing toward weakly-aligned experts (identified via activation analysis on harmful queries) and away from strongly-aligned ones, while also biasing toward capable general-purpose experts. A two-phase optimization strategy first stabilizes routing control and then optimizes for harmful outputs. The resulting prompts are transferred to proprietary API services within the same model family to induce unsafe behaviors without direct model access.
Significance. If the transferability claims hold under empirical validation, the work would be significant for exposing a new attack surface in remotely hosted MoE LLMs, which are increasingly deployed in production services. It systematically addresses three challenges (indirect routing influence, coupled routing/output objectives, and response quality) with a concrete two-phase strategy and expert-identification method. This could inform both attack research and defenses focused on routing robustness. The framework's reliance on surrogate-to-target transfer is a practical contribution if demonstrated.
major comments (2)
- [Abstract and §3 (Method)] The central claim that optimized inputs transfer effectively to induce safety bypasses in proprietary API models rests on an unverified similarity assumption between surrogate and target routers/expert behaviors. No analysis, ablation, or router-activation comparison is provided to show that perturbations steering routing on open-source models (e.g., Mixtral) produce equivalent expert selection in API versions, which differ in training data, post-training, and gating networks. This assumption is load-bearing for the transfer-based attack.
- [Abstract and §4 (Experiments)] No quantitative results, attack success rates, transfer success metrics, or baseline comparisons are reported. The abstract and method description outline the framework but provide no validation that the two-phase optimization achieves routing stability or harmful output generation after transfer, undermining assessment of whether the approach supports the stated claims.
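The router-activation comparison asked for in the first major comment cannot be run against the API target, but a proxy between two open checkpoints of the same family is straightforward; the sketch below scores top-k expert agreement on shared prompts, with Dirichlet samples standing in for real gate outputs.

```python
# Proxy check for routing similarity between two open surrogates: mean Jaccard
# overlap of their per-token top-k expert sets. Gate outputs here are simulated.
import numpy as np

def topk_agreement(gate_a, gate_b, k=2):
    top_a = np.argsort(gate_a, axis=-1)[:, -k:]
    top_b = np.argsort(gate_b, axis=-1)[:, -k:]
    scores = []
    for a, b in zip(top_a, top_b):
        sa, sb = set(a.tolist()), set(b.tolist())
        scores.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(scores))

rng = np.random.default_rng(2)
tokens, experts = 64, 8
gate_a = rng.dirichlet(np.ones(experts), size=tokens)                       # surrogate A (simulated)
gate_b = 0.7 * gate_a + 0.3 * rng.dirichlet(np.ones(experts), size=tokens)  # correlated surrogate B

print(f"top-2 routing agreement: {topk_agreement(gate_a, gate_b):.2f}")
```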
minor comments (2)
- [§3] Clarify notation for expert activation metrics and the precise optimization objectives (e.g., loss terms for routing steering vs. output harm) to improve reproducibility.
- [§5] Add discussion of potential failure modes, such as when routing cannot be sufficiently controlled via input perturbations alone.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The feedback highlights important aspects of our transfer-based attack claims that require clarification and strengthening in the manuscript. We address each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [Abstract and §3 (Method)] The central claim that optimized inputs transfer effectively to induce safety bypasses in proprietary API models rests on an unverified similarity assumption between surrogate and target routers/expert behaviors. No analysis, ablation, or router-activation comparison is provided to show that perturbations steering routing on open-source models (e.g., Mixtral) produce equivalent expert selection in API versions, which differ in training data, post-training, and gating networks. This assumption is load-bearing for the transfer-based attack.
Authors: We agree that direct verification of router similarity is not feasible, as proprietary APIs do not expose internal gating networks or expert activations. Our approach relies on the empirical observation that adversarial inputs optimized on open-source surrogates from the same model family (e.g., Mixtral variants) successfully induce unsafe behaviors when transferred to the corresponding API endpoints. We will revise §3 to explicitly discuss this limitation, add a new subsection on surrogate selection rationale and expected behavioral similarity within model families, and include additional transfer experiments with multiple surrogates to strengthen the evidence. We cannot provide router-level ablations on the target APIs.
Revision: partial
- Referee: [Abstract and §4 (Experiments)] No quantitative results, attack success rates, transfer success metrics, or baseline comparisons are reported. The abstract and method description outline the framework but provide no validation that the two-phase optimization achieves routing stability or harmful output generation after transfer, undermining assessment of whether the approach supports the stated claims.
Authors: We apologize for the insufficient visibility of the experimental results in the abstract and early sections. Quantitative attack success rates, transfer metrics (including success on API targets), routing stability measurements, and baseline comparisons are presented in §4. We will revise the abstract to summarize key quantitative findings (e.g., attack success rates and transfer rates) and update §3 to cross-reference the empirical validation of the two-phase strategy and expert identification method. This will make the validation more prominent.
Revision: yes
- Not provided: direct router-activation comparisons or ablations on proprietary API models, owing to lack of access to internal model states.
Circularity Check
No circularity: empirical attack framework with independent transfer testing
Full rationale
The paper presents an empirical attack method (Misrouter) that optimizes adversarial inputs on open-source MoE surrogates and evaluates transfer to API models. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The central claim rests on experimental transfer results rather than any definitional equivalence or reduction of outputs to inputs by construction. The transferability assumption is an empirical hypothesis subject to external falsification, not a circular step. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
invented entities (1)
- Misrouter framework: no independent evidence
Lean theorems connected to this paper
- Cost.FunctionalEquation / Foundation.AlphaCoordinateFixation (J(x) = ½(x + x⁻¹) − 1)
  washburn_uniqueness_aczel · unclear
  Paper expression: L_route = Σ ΔU_i · p_i(x̃), where ΔU_i = λ1 U^harm_i − λ2 U^comp_i − λ3 U^benign_i
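Taken literally, the routing expression above can be evaluated as below, assuming per-expert utility scores U^harm, U^comp, U^benign and gate probabilities p_i(x̃) from the surrogate; the λ weights and all values are placeholders, since the excerpt does not specify them.

```python
# Literal evaluation of L_route = sum_i dU_i * p_i(x~), with
# dU_i = lam1*U_harm_i - lam2*U_comp_i - lam3*U_benign_i. All inputs are placeholders.
import numpy as np

def routing_objective(p, u_harm, u_comp, u_benign, lam1=1.0, lam2=0.5, lam3=0.5):
    delta_u = lam1 * u_harm - lam2 * u_comp - lam3 * u_benign
    return float(np.dot(delta_u, p))

rng = np.random.default_rng(3)
p = np.array([0.05, 0.05, 0.40, 0.05, 0.05, 0.30, 0.05, 0.05])  # gate probs for a perturbed input
u_harm, u_comp, u_benign = rng.uniform(size=(3, 8))             # per-expert utility scores (synthetic)

print("L_route =", round(routing_objective(p, u_harm, u_comp, u_benign), 4))
```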
Reference graph
Works this paper leans on
- [1] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905 (2024).
- [2] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. 2025. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925 (2025).
- [3] Rishabh Bhardwaj, Duc Anh Do, and Soujanya Poria. 2024. Language models are Homer Simpson! Safety re-alignment of fine-tuned language models through task arithmetic. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 14138–14149.
- [4]
- [5] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2025. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering (2025).
- [6]
- [7] Yufei Chen, Chao Shen, Cong Wang, and Yang Zhang. 2022. Teacher model fingerprinting attacks against transfer learning. In 31st USENIX Security Symposium (USENIX Security 22). 3593–3610.
- [8] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018).
- [9] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. 2024. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1280–1297.
- [10]
- [11] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning. PMLR, 5547–5569.
- [12]
- [13]
- [14]
- [15] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674 (2023).
- [16] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural Computation 3, 1 (1991), 79–87.
- [17]
- [18] Torsten Krauß, Hamid Dashtbani, and Alexandra Dmitrienko. 2025. TwinBreak: Jailbreaking LLM security alignments based on twin prompts. In 34th USENIX Security Symposium (USENIX Security 25). 2343–2362.
- [19] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466.
- [20]
- [21] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434 (2024).
- [22]
- [23] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2381–2391.
- [24] Mistral AI. 2023. Mixtral of Experts. https://mistral.ai/news/mixtral-of-experts
- [25] OpenAI. 2025. Introducing gpt-oss. https://openai.com/index/introducing-gpt-oss/
- [26] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
- [27] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693 (2023).
- [28] Qwen Team. 2024. Qwen1.5-MoE: Matching 7B model performance with 1/3 activated parameters. https://qwen.ai/blog?id=qwen-moe
- [29] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741.
- [30] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd Schema Challenge at scale. Commun. ACM 64, 9 (2021), 99–106.
- [31]
- [32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).
- [33] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security. 1671–1685.
- [34] Abhay Sheshadri, Aidan Ewart, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. 2024. Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. Transactions on Machine Learning Research (2024).
- [35] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. 2024. A StrongREJECT for empty jailbreaks. Advances in Neural Information Processing Systems 37 (2024), 125416–125440.
- [36] Haochun Tang, Yuliang Yan, Jiahua Lu, Huaxiao Liu, and Enyan Dai. 2026. Route to Rome Attack: Directing LLM routers to expensive models via adversarial suffix optimization. arXiv preprint arXiv:2604.15022 (2026).
- [37] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. 353–355.
- [38]
- [39] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems 36 (2023), 80079–80110.
- [40]
- [41] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).
- [42]
- [43] Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, and Yang Zhang. 2024. Large language models are involuntary truth-tellers: Exploiting fallacy failure for jailbreak attacks. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 13293–13304.
- [44] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).