MESA: Improving MoE Safety Alignment via Decentralized Expertise

Hui Xue; Ranjie Duan; Teng Li; Xingjun Ma; Xingxing Wei; Yao Huang; Yichi Zhang; Yitong Sun

arXiv: 2606.00651 · v1 · pith:XEHWE6FYnew · submitted 2026-05-30 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-06-28 18:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords mixture of experts · safety alignment · optimal transport · large language models · adversarial robustness · decentralized expertise

The pith

MESA uses optimal transport to spread safety duties across multiple experts in MoE LLMs, reducing bypass vulnerability while keeping helpfulness intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts models concentrate safety capabilities in a small number of experts, creating an easy target for adversarial attacks that bypass those experts. Standard alignment techniques update all parameters uniformly and can degrade the model's performance on helpful tasks. MESA counters this by applying optimal transport to reallocate safety responsibilities to a broader set of cost-effective experts and by refining the router to activate those experts on demand. The result is claimed to be stronger defense on harmful benchmarks without the usual loss in utility. Readers would care because it targets the structural weakness of MoE scaling without forcing a broad performance trade-off.

Core claim

MESA is a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport theory, it operates through two mechanisms: Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness.
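For orientation, the generic entropy-regularized optimal transport problem that a cost matrix $C$ and an empirical expert distribution $P_{\mathrm{emp}}$ would plug into can be written as below. This is the textbook Sinkhorn-style form, not necessarily the paper's exact objective; the target marginal $q$ and the regularization weight $\varepsilon$ are assumptions introduced here for illustration.

```latex
\[
\pi^{*} \;=\; \operatorname*{arg\,min}_{\pi \in \Pi(P_{\mathrm{emp}},\, q)}
\; \langle \pi, C \rangle \;-\; \varepsilon\, H(\pi),
\qquad
H(\pi) \;=\; -\sum_{i,j} \pi_{ij} \log \pi_{ij},
\]
\[
\Pi(P_{\mathrm{emp}}, q) \;=\;
\bigl\{ \pi \in \mathbb{R}_{\ge 0}^{n \times n} \;:\;
\pi \mathbf{1} = P_{\mathrm{emp}},\;\; \pi^{\top} \mathbf{1} = q \bigr\}.
\]
```

Here $\pi_{ij}$ is the share of safety duty moved from expert $i$ to expert $j$, and $\langle \pi, C \rangle = \sum_{i,j} \pi_{ij} C_{ij}$ is the total transport cost.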

What carries the argument

Expert Capacity Reallocation via a transport cost matrix combined with Dynamic Routing Refinement, both derived from Optimal Transport theory, to assign and activate safety duties across experts.
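A minimal sketch of how a transport plan over experts could be computed with Sinkhorn iterations, assuming a concentrated empirical safety load and a flatter target load. The function name `sinkhorn_plan`, the target distribution, the `recv_cost` values, and the regularization `eps` are all hypothetical; the paper's actual cost function and solver are not specified in this summary.

```python
import math

def sinkhorn_plan(cost, source, target, eps=0.5, iters=200):
    """Entropy-regularized OT plan between two discrete distributions."""
    n, m = len(source), len(target)
    # Gibbs kernel: cheap moves get large weights.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [source[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [target[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical numbers, for illustration only.
p_emp = [0.70, 0.20, 0.05, 0.05]    # safety load concentrated in expert 0
target = [0.25, 0.25, 0.25, 0.25]   # assumed flatter target load
recv_cost = [0.1, 0.2, 0.9, 0.3]    # assumed cost of placing duty on expert j
cost = [[0.0 if i == j else recv_cost[j] for j in range(4)] for i in range(4)]

plan = sinkhorn_plan(cost, p_emp, target)
for row in plan:
    print([round(x, 3) for x in row])
```

Each row of `plan` shows where one expert's safety load is sent. Because both marginals are fixed in this sketch, the cost matrix only shapes which source feeds which receiver; selecting a subset of receivers, as the paper describes, would need an additional rule that is not reproduced here.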

If this is right

Safety capabilities become distributed rather than sparse, reducing the chance that a single expert attack succeeds.
The alignment process avoids uniform parameter changes that would interfere with utility-focused experts.
The router learns to activate the reallocated safety modules on relevant inputs.
Defensive performance improves across multiple harmful benchmarks while helpfulness metrics remain stable.
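One hedged way to quantify "distributed rather than sparse" is to compare how concentrated expert activation is on harmful prompts before and after alignment. The sketch below uses normalized entropy and top-k share; both metrics, the function names, and the activation counts are assumptions for illustration, not measurements or definitions from the paper.

```python
import math

def normalized_entropy(freqs):
    """Entropy of the activation distribution, scaled to [0, 1]."""
    if len(freqs) < 2:
        return 0.0
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(freqs))

def top_k_share(freqs, k):
    """Fraction of all activations that land on the k busiest experts."""
    total = sum(freqs)
    return sum(sorted(freqs, reverse=True)[:k]) / total

# Hypothetical activation counts on harmful prompts.
sparse = [90, 5, 3, 2]      # safety handled mostly by one expert
spread = [30, 25, 25, 20]   # safety handled by several experts

print(normalized_entropy(sparse), top_k_share(sparse, 1))
print(normalized_entropy(spread), top_k_share(spread, 1))
```

A higher normalized entropy and a lower top-k share both indicate that disabling a handful of experts removes less of the safety behavior.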

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same cost-matrix approach could be tested for decentralizing other specialized behaviors such as factual recall or domain expertise.
If the cost matrix is poorly calibrated on a new model, the reallocation step itself might introduce unintended routing patterns.
Scaling experiments on larger MoE models would test whether the two mechanisms remain effective when the number of experts grows.

Load-bearing premise

The optimal transport cost matrix can identify experts that will reliably carry decentralized safety duties without degrading utility or creating new bypass routes.

What would settle it

Demonstrating that after MESA, an adversary can still bypass safety by targeting the newly assigned experts or that helpfulness scores on standard benchmarks drop substantially.
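Such a bypass test can be approximated with inference-time expert masking of the kind the paper's Figure 4 caption describes: disable either random experts or the experts with the highest safety score, then route as usual. The sketch below is a toy router under assumed conventions (softmax over live experts, then top-k with renormalization); the function names, logits, and scores are hypothetical and do not reproduce any specific model's routing code.

```python
import math
import random

def route_top_k(logits, k, masked=frozenset()):
    """Softmax over unmasked experts, keep the top-k, renormalize weights."""
    live = [i for i in range(len(logits)) if i not in masked]
    if not live:
        raise ValueError("all experts are masked")
    peak = max(logits[i] for i in live)
    exps = {i: math.exp(logits[i] - peak) for i in live}
    z = sum(exps.values())
    probs = {i: e / z for i, e in exps.items()}
    chosen = sorted(live, key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def highest_activation_mask(safety_scores, n):
    """Indices of the n experts with the largest safety score."""
    order = sorted(range(len(safety_scores)),
                   key=lambda i: safety_scores[i], reverse=True)
    return frozenset(order[:n])

def random_mask(num_experts, n, seed=0):
    """Indices of n experts chosen uniformly at random."""
    rng = random.Random(seed)
    return frozenset(rng.sample(range(num_experts), n))

# Hypothetical router logits and per-expert safety scores.
logits = [2.0, 1.0, 0.5, 0.1, -1.0, 0.3]
safety_scores = [0.9, 0.1, 0.7, 0.2, 0.05, 0.3]

targeted = highest_activation_mask(safety_scores, 2)
weights = route_top_k(logits, k=2, masked=targeted)
print(sorted(targeted), weights)
```

To probe the premise above, the same masking would be applied to the experts that received safety duty after alignment, and refusal rates compared against the unmasked model.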

Figures

Figures reproduced from arXiv: 2606.00651 by Hui Xue, Ranjie Duan, Teng Li, Xingjun Ma, Xingxing Wei, Yao Huang, Yichi Zhang, Yitong Sun.

**Figure 1.** Illustration of MESA. By redistributing safety capabilities via Optimal Transport (OT), MESA decentralizes specialized experts to form robust safety distribution across MoE experts, significantly improving jailbreak resistance. Meanwhile, MESA preserves models’ general capabilities.

**Figure 2.** Overview of the MESA framework. I. Expert Capacity Reallocation: Leveraging empirical frequencies, we first design a cost function based on two theorems, then we compute an OT plan π* that considers both cost matrix C and initial distribution P_emp to identify the optimal expert subset. II. Dynamic Routing Refinement: An online OT mechanism adjusts routing targets based on input type, ensuring effective ac…

**Figure 3.** Expert Adaptation Cost and Distributional Rank Shift Analysis. Top: Fine-tuning performance across critical and dormant expert groups categorized by R_safe and R_gen dimensions. Bottom: Rank distribution in R_mix of critical and dormant experts categorized by R_safe and R_gen dimensions.

**Figure 4.** Robustness evaluation on Strong-Reject and Strata. We assess structural resilience against two inference-time masking strategies: (1) Random Masking, where a specific number of experts are randomly selected and disabled; (2) Highest-Activation Masking, which identifies the Top-5 and Top-10 experts based on R_safe, and forces their routing probabilities to zero to prevent activation.

**Figure 5.** Head expert overlap between diverse data sources. After safety fine-tuning, the head expert overlap between the original safety experts (Mix) and two harmful datasets (Strata and SR) decreases, suggesting decentralization. Meanwhile, overlap between head experts for coding (HE) and both harmful datasets increases, indicating that general critical experts play a more prominent safety role. Notably, larger…

**Figure 6.** Effect of expert ratio w on fine-tuning performance.

**Figure 7.** Visualization of expert activation frequencies in Qwen3-30B-A3B.

**Figure 8.** Detailed activation map of expert overlap, which is the visualized result of Section 5.
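The head expert overlap discussed in the Figure 5 caption can be illustrated with a small set-overlap computation. The sketch assumes that "head experts" means the k most frequently activated experts for a data source and that overlap is measured as Jaccard similarity; both choices, along with the function names and frequencies, are assumptions made here, since the paper's exact definition is not given in this summary.

```python
def head_experts(freqs, k):
    """Indices of the k most frequently activated experts."""
    order = sorted(range(len(freqs)), key=lambda i: freqs[i], reverse=True)
    return set(order[:k])

def overlap(a, b):
    """Jaccard similarity between two sets of expert indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical activation frequencies for two data sources.
freqs_harmful = [0.40, 0.30, 0.10, 0.10, 0.05, 0.05]
freqs_coding = [0.05, 0.35, 0.30, 0.10, 0.10, 0.10]

heads_harmful = head_experts(freqs_harmful, k=2)
heads_coding = head_experts(freqs_coding, k=2)
print(heads_harmful, heads_coding, overlap(heads_harmful, heads_coding))
```

Under this reading, a falling overlap between the original safety experts and harmful datasets, together with a rising overlap for coding experts, is what the caption interprets as decentralization.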

Original abstract

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.
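To make "constrains the router" concrete, one plausible reading is a penalty that pulls the router's expert distribution toward a target distribution over the reallocated experts. The sketch below uses a KL-divergence penalty; this loss form, the function names, and the target values are assumptions for illustration and are not taken from the paper, which describes an online OT mechanism whose details are not given here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(target, probs, tiny=1e-12):
    """KL(target || probs), skipping zero-mass target entries."""
    return sum(t * math.log(t / max(p, tiny))
               for t, p in zip(target, probs) if t > 0)

def routing_penalty(router_logits, target):
    """Penalty that is zero when the router matches the target exactly."""
    return kl_divergence(target, softmax(router_logits))

# Hypothetical target load over four experts for a harmful input.
target_load = [0.1, 0.2, 0.3, 0.4]
aligned_logits = [math.log(t) for t in target_load]   # router already on target
collapsed_logits = [3.0, 0.0, 0.0, 0.0]               # router favors one expert

print(routing_penalty(aligned_logits, target_load))
print(routing_penalty(collapsed_logits, target_load))
```

In a training loop, a term like this would be added to the alignment loss with a weight, so the router is nudged toward the decentralized experts only on the inputs where the target applies.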

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MESA applies optimal transport to spread safety across MoE experts, delivering better defense on harmful benchmarks without hurting utility.

MESA's key idea is to treat safety alignment in Mixture-of-Experts models as a problem of redistributing duties rather than blanket updates. They use optimal transport to move safety capabilities to experts that can handle them efficiently, then adjust the router to use those experts more.

The novelty lies in combining OT with MoE routing constraints for safety. Most prior work either updates all parameters uniformly or focuses on single experts. The paper does well by showing results on multiple harmful benchmarks and keeping helpfulness scores stable. Releasing the code is a plus for checking the details.

The math seems straightforward: the cost matrix guides the transport, and the router refinement adds a constraint. No sign of the results being circular or just fitted to the method. The stress test confirms the construction is consistent.

Soft spots are minor. The definition of cost-effective experts relies on the matrix they build, and if that matrix misses some interactions between experts, the decentralization might not hold as strongly in practice. They probably need more ablations on different model sizes or routing strategies to strengthen that. The benchmarks are varied but real-world attacks could still find gaps.

This paper is for researchers focused on efficient LLM scaling and safety alignment. Readers who work with MoE architectures will get practical value from the method and the reported numbers.

It deserves a serious referee because the central claim is testable and the approach is novel enough to warrant review.

Referee Report

0 major / 3 minor

Summary. The paper proposes MESA, a targeted safety alignment framework for Mixture-of-Experts LLMs that uses Optimal Transport to decentralize safety responsibilities across experts via Expert Capacity Reallocation (based on a transport cost matrix) and Dynamic Routing Refinement (to constrain the router). The central claim is that this approach mitigates Safety Sparsity, yielding robust defense on harmful benchmarks while preserving helpfulness, unlike uniform alignment methods.

Significance. If the empirical results hold, the work is significant for addressing safety vulnerabilities specific to MoE scaling, where safety concentrates in few experts. The OT-based decentralization is a principled alternative to full-parameter adaptation and could inform efficient alignment techniques. Reproducibility is strengthened by the public code release.

minor comments (3)

[Abstract] The phrase 'varied harmful benchmarks' is vague; the introduction or §4 should explicitly list the benchmarks (e.g., specific datasets or attack types) and report quantitative deltas versus baselines to support the 'robust defensive performance' claim.
[§3] The definition of the transport cost matrix and 'cost-effective experts' should be stated with explicit formulas (ideally in §3) so readers can verify how the OT objective avoids creating new bypass routes.
[Experiments] Figure or table captions in the experiments section should include error bars or statistical significance tests for the helpfulness vs. safety trade-off to strengthen the 'preserving helpfulness' result.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of MESA and for recommending minor revision. The referee's summary correctly captures our central claim regarding safety sparsity in MoE models and the use of optimal transport for decentralized alignment. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The abstract and available description frame MESA as applying external Optimal Transport theory to construct a cost matrix for expert reallocation and router constraints. No equations, self-citations, or fitted parameters are shown that reduce the safety/utility outcomes to quantities defined by the method itself. The derivation chain relies on OT as an independent input and reports external benchmark results, making the central claim self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The transport cost matrix and router constraint are described as mechanisms but their concrete implementation details are absent.

pith-pipeline@v0.9.1-grok · 5734 in / 1073 out tokens · 16971 ms · 2026-06-28T18:49:54.721381+00:00 · methodology

