
arxiv: 2606.00651 · v1 · pith:XEHWE6FYnew · submitted 2026-05-30 · 💻 cs.LG · cs.AI · cs.CL

MESA: Improving MoE Safety Alignment via Decentralized Expertise

Pith reviewed 2026-06-28 18:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords mixture of experts, safety alignment, optimal transport, large language models, adversarial robustness, decentralized expertise

The pith

MESA uses optimal transport to spread safety duties across multiple experts in MoE LLMs, reducing bypass vulnerability while keeping helpfulness intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts models concentrate safety capabilities in a small number of experts, creating an easy target for adversarial attacks that bypass those experts. Standard alignment techniques update all parameters uniformly and can degrade the model's performance on helpful tasks. MESA counters this by applying optimal transport to reallocate safety responsibilities to a broader set of cost-effective experts and by refining the router to activate those experts on demand. The result is claimed to be stronger defense on harmful benchmarks without the usual loss in utility. Readers would care because it targets the structural weakness of MoE scaling without forcing a broad performance trade-off.
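
One way to make the concentration described above concrete is to summarize how expert activations on safety-related prompts are distributed. The sketch below is an illustration only, not the paper's metric: the function name `concentration_stats`, the choice of top-k share and normalized entropy, and both count vectors are assumptions made for this example.

```python
import numpy as np

def concentration_stats(counts, k=2):
    """Return (top-k share, normalized entropy) for per-expert activation counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                  # empirical activation distribution
    top_k_share = np.sort(p)[::-1][:k].sum()   # mass held by the k busiest experts
    nz = p[p > 0]                              # skip zeros so log() is defined
    entropy = -(nz * np.log(nz)).sum()
    norm_entropy = entropy / np.log(len(p))    # 1.0 means perfectly uniform
    return float(top_k_share), float(norm_entropy)

# Hypothetical counts for 8 experts on safety-related prompts.
sparse_counts = [900, 60, 10, 10, 5, 5, 5, 5]          # safety held by few experts
spread_counts = [130, 125, 125, 125, 125, 125, 125, 120]  # safety spread out

sparse_share, sparse_h = concentration_stats(sparse_counts)
spread_share, spread_h = concentration_stats(spread_counts)
print(f"sparse: top-2 share={sparse_share:.3f}, normalized entropy={sparse_h:.3f}")
print(f"spread: top-2 share={spread_share:.3f}, normalized entropy={spread_h:.3f}")
```

A high top-k share together with a low normalized entropy is the pattern the summary calls an easy target. A decentralized allocation would be expected to push both numbers toward the second example.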

Core claim

MESA is a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport theory, it operates through two mechanisms: Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness.

What carries the argument

Expert Capacity Reallocation via a transport cost matrix combined with Dynamic Routing Refinement, both derived from Optimal Transport theory, to assign and activate safety duties across experts.
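
The paper's cost function and transport targets are not given on this page, so the following is only a generic sketch of how an entropic optimal transport plan can be computed from a cost matrix and an empirical distribution, using standard Sinkhorn iterations. The function `sinkhorn_plan`, the regularization value, the target distribution, and the per-expert `receive_cost` values are all hypothetical.

```python
import numpy as np

def sinkhorn_plan(cost, source, target, reg=0.2, n_iters=1000):
    """Entropic OT plan between two histograms via Sinkhorn iterations."""
    cost = np.asarray(cost, dtype=float)
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    kernel = np.exp(-cost / reg)           # Gibbs kernel of the cost matrix
    u = np.ones_like(source)
    for _ in range(n_iters):
        v = target / (kernel.T @ u)        # rescale to match column marginals
        u = source / (kernel @ v)          # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]

# All numbers below are hypothetical and chosen only for illustration.
p_emp = np.array([0.85, 0.05, 0.05, 0.05])     # observed safety load per expert
target = np.array([0.40, 0.25, 0.25, 0.10])    # assumed desired safety load
receive_cost = np.array([0.0, 0.2, 0.3, 1.0])  # assumed cost of assigning duty to each expert
cost = np.tile(receive_cost, (4, 1))           # cost[i, j] = receive_cost[j]
np.fill_diagonal(cost, 0.0)                    # leaving load in place is free

plan = sinkhorn_plan(cost, p_emp, target)
moved_out_of_expert_0 = plan[0, 1:].sum()      # load shifted away from the dominant expert
print(np.round(plan, 3))
```

Each entry `plan[i, j]` is the share of safety load moved from expert `i` to expert `j`. In this toy setup the expensive fourth expert receives little because its target share is small, which is the kind of trade-off a cost matrix is meant to encode.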

If this is right

  • Safety capabilities become distributed rather than sparse, reducing the chance that a single expert attack succeeds.
  • The alignment process avoids uniform parameter changes that would interfere with utility-focused experts.
  • The router learns to activate the reallocated safety modules on relevant inputs.
  • Defensive performance improves across multiple harmful benchmarks while helpfulness metrics remain stable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cost-matrix approach could be tested for decentralizing other specialized behaviors such as factual recall or domain expertise.
  • If the cost matrix is poorly calibrated on a new model, the reallocation step itself might introduce unintended routing patterns.
  • Scaling experiments on larger MoE models would test whether the two mechanisms remain effective when the number of experts grows.

Load-bearing premise

The optimal transport cost matrix can identify experts that will reliably carry decentralized safety duties without degrading utility or creating new bypass routes.

What would settle it

A demonstration that, after MESA is applied, an adversary can still bypass safety by targeting the newly assigned experts, or that helpfulness scores on standard benchmarks drop substantially.
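
A bypass test of this kind can be approximated by disabling chosen experts at inference time and checking where the router sends tokens instead. The sketch below shows only the masking step on toy data. The helper names `mask_and_route` and `top_experts_by_score`, the router probabilities, and the per-expert safety scores are hypothetical and do not come from the paper's code.

```python
import numpy as np

def mask_and_route(router_probs, masked_experts, top_k=2):
    """Zero the routing probability of masked experts, renormalize, pick top-k."""
    probs = np.array(router_probs, dtype=float)       # copy so the input is untouched
    probs[..., list(masked_experts)] = 0.0            # masked experts cannot be activated
    probs = probs / probs.sum(axis=-1, keepdims=True)
    chosen = np.argsort(-probs, axis=-1)[..., :top_k]
    return probs, chosen

def top_experts_by_score(scores, n):
    """Indices of the n experts with the highest score."""
    return np.argsort(-np.asarray(scores, dtype=float))[:n].tolist()

# Hypothetical router output: 2 tokens routed over 6 experts.
router_probs = [
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.35, 0.05, 0.25, 0.15, 0.10, 0.10],
]
safety_scores = [0.9, 0.8, 0.1, 0.2, 0.05, 0.05]     # assumed per-expert safety relevance

# Targeted masking: disable the experts that look most safety-relevant.
targeted = top_experts_by_score(safety_scores, n=2)
masked_probs, chosen = mask_and_route(router_probs, targeted)

# Random masking baseline with a fixed seed for repeatability.
rng = np.random.default_rng(0)
random_mask = rng.choice(6, size=2, replace=False).tolist()
random_probs, random_chosen = mask_and_route(router_probs, random_mask)
```

In a real evaluation the masked routing would be applied inside the model and refusal rates compared with and without masking. If safety is well decentralized, targeted masking should hurt little more than random masking.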

Figures

Figures reproduced from arXiv: 2606.00651 by Hui Xue, Ranjie Duan, Teng Li, Xingjun Ma, Xingxing Wei, Yao Huang, Yichi Zhang, Yitong Sun.

Figure 1: Illustration of MESA. By redistributing safety capabilities via Optimal Transport (OT), MESA decentralizes specialized experts to form robust safety distribution across MoE experts, significantly improving jailbreak resistance. Meanwhile, MESA preserves models’ general capabilities.
Figure 2: Overview of the MESA framework. I. Expert Capacity Reallocation: Leveraging empirical frequencies, we first design a cost function based on two theorems, then we compute an OT plan π* that considers both cost matrix C and initial distribution P_emp to identify the optimal expert subset. II. Dynamic Routing Refinement: An online OT mechanism adjusts routing targets based on input type, ensuring effective ac…
Figure 3: Expert Adaptation Cost and Distributional Rank Shift Analysis. Top: Fine-tuning performance across critical and dormant expert groups categorized by R_safe and R_gen dimensions. Bottom: Rank distribution in R_mix of critical and dormant experts categorized by R_safe and R_gen dimensions.
Figure 4: Robustness evaluation on Strong-Reject and Strata. We assess structural resilience against two inference-time masking strategies: (1) Random Masking, where a specific number of experts are randomly selected and disabled; (2) Highest-Activation Masking, which identifies the Top-5 and Top-10 experts based on R_safe, and forces their routing probabilities to zero to prevent activation.
Figure 5: Head expert overlap between diverse data sources. After safety fine-tuning, the head expert overlap between the original safety experts (Mix) and two harmful datasets (Strata and SR) decreases, suggesting decentralization. Meanwhile, overlap between head experts for coding (HE) and both harmful datasets increases, indicating that general critical experts play a more prominent safety role. Notably, larger…
Figure 6: Effect of expert ratio w on fine-tuning performance.
Figure 7: Visualization of expert activation frequencies in Qwen3-30B-A3B.
Figure 8: Detailed activation map of expert overlap, which is the visualized result of Section 5.
Original abstract

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes MESA, a targeted safety alignment framework for Mixture-of-Experts LLMs that uses Optimal Transport to decentralize safety responsibilities across experts via Expert Capacity Reallocation (based on a transport cost matrix) and Dynamic Routing Refinement (to constrain the router). The central claim is that this approach mitigates Safety Sparsity, yielding robust defense on harmful benchmarks while preserving helpfulness, unlike uniform alignment methods.

Significance. If the empirical results hold, the work is significant for addressing safety vulnerabilities specific to MoE scaling, where safety concentrates in few experts. The OT-based decentralization is a principled alternative to full-parameter adaptation and could inform efficient alignment techniques. Reproducibility is strengthened by the public code release.

minor comments (3)
  1. [Abstract] The phrase 'varied harmful benchmarks' is vague; the introduction or §4 should explicitly list the benchmarks (e.g., specific datasets or attack types) and report quantitative deltas versus baselines to support the 'robust defensive performance' claim.
  2. [§3] The definition of the transport cost matrix and 'cost-effective experts' should be stated with explicit formulas (ideally in §3) so readers can verify how the OT objective avoids creating new bypass routes.
  3. [Experiments] Figures and tables in the experiments section should report error bars or statistical significance tests for the helpfulness vs. safety trade-off to strengthen the 'preserving helpfulness' result.
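
For reference on comment 2, the textbook discrete optimal transport problem and its entropic relaxation are given below. This is the standard formulation, not the paper's own objective: identifying the source marginal a with the empirical distribution P_emp from the Figure 2 caption is an assumption, and the paper's definitions of the cost matrix C and the target marginal b are not available on this page.

```latex
% Discrete optimal transport between histograms a (n entries) and b (m entries)
\pi^{*} = \arg\min_{\pi \in \Pi(a,b)} \langle C, \pi \rangle
        = \arg\min_{\pi \in \Pi(a,b)} \sum_{i=1}^{n} \sum_{j=1}^{m} C_{ij}\, \pi_{ij},
\qquad
\Pi(a,b) = \left\{ \pi \in \mathbb{R}_{+}^{n \times m} :
           \pi \mathbf{1}_{m} = a,\; \pi^{\top} \mathbf{1}_{n} = b \right\}.

% Entropic relaxation with regularization strength \varepsilon > 0,
% which is the problem solved by Sinkhorn iterations
\pi^{*}_{\varepsilon} = \arg\min_{\pi \in \Pi(a,b)}
    \langle C, \pi \rangle - \varepsilon H(\pi),
\qquad
H(\pi) = -\sum_{i,j} \pi_{ij} \left( \log \pi_{ij} - 1 \right).
```

Under this reading, a 'cost-effective expert' would be a column j that receives mass at low cost C_ij. Whether that matches the authors' definition is exactly what the comment asks them to state.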

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of MESA and for recommending minor revision. The referee's summary correctly captures our central claim regarding safety sparsity in MoE models and the use of optimal transport for decentralized alignment. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The abstract and available description frame MESA as applying external Optimal Transport theory to construct a cost matrix for expert reallocation and router constraints. No equations, self-citations, or fitted parameters are shown that reduce the safety/utility outcomes to quantities defined by the method itself. The derivation chain relies on OT as an independent input and reports external benchmark results, making the central claim self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The transport cost matrix and router constraint are described as mechanisms but their concrete implementation details are absent.


