Defending against Backdoor Attacks via Module Switching

Ansh Arora; Mark Dras; Qiongkai Xu; Weijun Li; Xuanli He

arxiv: 2504.05902 · v2 · submitted 2025-04-08 · 💻 cs.CR · cs.CL

Pith reviewed 2026-05-22 20:23 UTC · model grok-4.3

keywords: backdoor attacks, model merging, post-training defense, module switching, collusive attacks, transformer, cnn

The pith

Module switching defense disrupts backdoor triggers more effectively than weight averaging, especially with fewer models and collusive attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes module-switching defense as a post-training method to protect deep neural networks from backdoor attacks without needing training data or attack details. Backdoor attacks implant triggers that cause hidden malicious outputs at inference, and current merging approaches like weight averaging require many models to work well. MSD instead switches between modules drawn from different available models to break the associations that backdoors rely on. Experiments on simple networks show greater backdoor divergence than averaging while keeping clean performance, and an evolutionary optimizer finds good switching patterns for Transformers and CNNs. The approach also holds up better when some models share the same backdoors.
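To make the contrast between the two merging styles concrete, here is a minimal sketch on toy two-layer "models". It is not the authors' implementation: the module names (`layer1`, `layer2`), the weights, and the `assignment` format are illustrative assumptions. Weight averaging blends every module element-wise, while switching copies each module whole from a single source model.

```python
# Toy two-layer "models": each is a dict of named modules holding flat weight lists.
# Module names and values are illustrative assumptions, not taken from the paper.

def weight_average(models):
    """WAG-style merge: element-wise mean of every module across all models."""
    merged = {}
    for name in models[0]:
        columns = zip(*(m[name] for m in models))
        merged[name] = [sum(col) / len(models) for col in columns]
    return merged


def module_switch(models, assignment):
    """Switching-style merge: each module is copied whole from one source model.

    `assignment` maps module name -> index of the model that supplies it.
    """
    return {name: list(models[src][name]) for name, src in assignment.items()}


model_a = {"layer1": [1.0, 2.0], "layer2": [3.0, 4.0]}
model_b = {"layer1": [5.0, 6.0], "layer2": [7.0, 8.0]}

averaged = weight_average([model_a, model_b])
switched = module_switch([model_a, model_b], {"layer1": 0, "layer2": 1})

print(averaged)  # {'layer1': [3.0, 4.0], 'layer2': [5.0, 6.0]}
print(switched)  # {'layer1': [1.0, 2.0], 'layer2': [7.0, 8.0]}
```

In the switched model no single source supplies both layers, which is the intuition behind breaking a trigger-to-output path that any one backdoored model learned end to end.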

Core claim

By selectively switching modules across multiple models, MSD increases divergence between backdoor behaviors while preserving utility on clean inputs, delivering stronger defense than weight averaging with fewer models and superior robustness against collusive attacks where models share triggers.

What carries the argument

Module-switching defense (MSD), which selectively activates modules from different models according to optimized fusion strategies to disrupt backdoor shortcuts.

If this is right

MSD produces higher backdoor divergence than weight averaging on two-layer networks while maintaining utility.
An evolutionary algorithm can identify effective selective fusion strategies for deep Transformer and CNN models.
Switching strategies remain robust even when multiple models share the same backdoors in collusive attack settings.
Fewer models suffice for practical defense compared with averaging methods.
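The figure captions on this page mention Euclidean distances between normalized output vectors. As a rough sketch of how such a divergence score could be computed, assuming L2 normalization and that a larger distance from the backdoored model's output means the backdoor behavior has been disrupted more (the paper's exact definition is not given here):

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def output_distance(out_a, out_b):
    """Euclidean distance between two output vectors after L2 normalization."""
    a, b = l2_normalize(out_a), l2_normalize(out_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical outputs on a triggered input.
backdoored = [3.0, 4.0]
merged_same_direction = [6.0, 8.0]   # same direction, so no divergence
merged_orthogonal = [-4.0, 3.0]      # orthogonal direction, so large divergence

print(output_distance(backdoored, merged_same_direction))  # 0.0
print(output_distance(backdoored, merged_orthogonal))      # approximately 1.414
```

Normalizing first means the score reflects a change in output direction rather than a change in scale.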

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Defenders with small sets of models could achieve reliable protection by running the evolutionary search once at deployment time.
The same selective-switching idea might target shared malicious features in other poisoning or trojan scenarios beyond backdoors.
Evaluating MSD on additional architectures or attack variants would test whether the divergence effect generalizes.

Load-bearing premise

The evolutionary algorithm will reliably discover module combinations that break backdoors without degrading accuracy on clean data.
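A minimal sketch of what such a search could look like, assuming a strategy is a per-layer list that records which model supplies each module. The module labels are the ones listed in the Figure 8 caption; the truncation selection, mutation operator, hyperparameters, and `toy_fitness` are placeholders chosen for illustration and are not the paper's algorithm or objective.

```python
import random

MODULES = ["Q", "K", "V", "O", "I", "P"]  # module labels listed in Figure 8


def random_strategy(num_layers, num_models, rng):
    """One strategy: for every layer, the source-model index of each module."""
    return [[rng.randrange(num_models) for _ in MODULES] for _ in range(num_layers)]


def mutate(strategy, num_models, rate, rng):
    """Resample each module's source model with probability `rate`."""
    return [
        [rng.randrange(num_models) if rng.random() < rate else src for src in layer]
        for layer in strategy
    ]


def evolve(fitness, num_layers, num_models, population_size=20,
           generations=30, mutation_rate=0.1, seed=0):
    """Elitist loop: keep the fitter half, refill with mutated copies of it."""
    rng = random.Random(seed)
    population = [random_strategy(num_layers, num_models, rng)
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = [mutate(rng.choice(parents), num_models, mutation_rate, rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(strategy):
    """Placeholder objective: count neighbouring modules with different sources."""
    return sum(a != b for layer in strategy for a, b in zip(layer, layer[1:]))


best = evolve(toy_fitness, num_layers=4, num_models=2)
print(toy_fitness(best))
```

Whether a search like this finds backdoor-breaking combinations depends entirely on what `fitness` measures, which is exactly the premise flagged above.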

What would settle it

A test showing that the best MSD strategy found by the evolutionary search reduces backdoor attack success rate no more than weight averaging while also lowering clean accuracy would falsify the claim.
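The check described above can be written down directly. This sketch assumes attack success rate is the fraction of triggered inputs classified as the attacker's target label, and that "lowering clean accuracy" is judged relative to weight averaging; both are assumptions, and the predictions and accuracies below are made-up numbers for illustration only.

```python
def attack_success_rate(triggered_predictions, target_label):
    """Fraction of triggered inputs that are classified as the target label."""
    if not triggered_predictions:
        raise ValueError("need at least one triggered input")
    hits = sum(1 for p in triggered_predictions if p == target_label)
    return hits / len(triggered_predictions)


def claim_falsified(msd, wag):
    """True when MSD reduces ASR no more than WAG and also has lower clean accuracy."""
    return msd["asr"] >= wag["asr"] and msd["clean_acc"] < wag["clean_acc"]


# Hypothetical results for one attack setting (not from the paper).
msd_result = {"asr": attack_success_rate([1, 0, 1, 1], target_label=1), "clean_acc": 0.90}
wag_result = {"asr": attack_success_rate([1, 0, 0, 1], target_label=1), "clean_acc": 0.92}

print(claim_falsified(msd_result, wag_result))  # True
```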

Figures

Figures reproduced from arXiv: 2504.05902 by Ansh Arora, Mark Dras, Qiongkai Xu, Weijun Li, Xuanli He.

**Figure 2.** Euclidean distances between the normalized output …

**Figure 4.** By identifying three types of module adjacency in Transformers, we can formulate the cost and optimize switching …

**Figure 5.** The Euclidean distances between the normalized output vectors of simulated …

**Figure 7.** Early stopping strategy (fitness score -94.2).

**Figure 8.** Strategy of ablating rule 1. (Per-model grids of module labels Q, K, V, O, I, P across layers 1 to 24.)

**Figure 11.** Adopted merging strategy (fitness score -26.2).

**Figure 12.** Adopted merging strategy (fitness score -39.5).

Original abstract

Backdoor attacks pose a serious threat to deep neural networks (DNNs), allowing adversaries to implant triggers for hidden behaviors in inference. Defending against such vulnerabilities is especially difficult in the post-training setting, since end-users lack training data or prior knowledge of the attacks. Model merging offers a cost-effective defense; however, latest methods like weight averaging (WAG) provide reasonable protection when multiple homologous models are available, but are less effective with fewer models and place heavy demands on defenders. We propose a module-switching defense (MSD) for disrupting backdoor shortcuts. We first validate its theoretical rationale and empirical effectiveness on two-layer networks, showing its capability of achieving higher backdoor divergence than WAG, and preserving utility. For deep models, we evaluate MSD on Transformer and CNN architectures and design an evolutionary algorithm to optimize fusion strategies with selective mechanisms to identify the most effective combinations. Experiments show that MSD achieves stronger defense with fewer models in practical settings, and even under an underexplored case of collusive attacks among multiple models--where some models share the same backdoors--switching strategies by MSD deliver superior robustness against diverse attacks. Code is available at https://github.com/weijun-l/module-switching-defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Module switching beats averaging with few models and collusive backdoors, but the evolutionary optimizer's fitness function is not explained and may need attack knowledge the threat model forbids.

The paper's core claim is that switching modules across a handful of models disrupts backdoors better than weight averaging, especially when models share backdoors. They show this first on two-layer nets with a divergence argument, then scale to Transformers and CNNs via an evolutionary search over fusion choices. The collusive-attack experiments are the clearest addition to prior merging work. Code release is also useful for anyone wanting to test it directly. That is the part worth paying attention to if you work on post-training defenses.

The evolutionary algorithm is the soft spot. The abstract gives no description of the fitness function. In the stated setting (no clean data, no attack knowledge), a fitness that only tracks clean accuracy cannot distinguish backdoor-breaking switches from neutral ones. The two-layer math shows higher divergence than averaging but supplies no obvious way to turn that into an objective the optimizer can use on deep models. Without an explicit backdoor-sensitive term or some other proxy that works without attack labels, the reported robustness on deep nets and collusive cases could be an artifact of the particular experimental setup rather than a general property of module switching. If the full paper has a fitness that stays within the threat model, the results would land more solidly. Otherwise the extension from small nets to the EA search remains a gap.

This is for readers already following model-merging defenses or practical backdoor mitigation. It is narrow enough that most people outside that niche will not need it, but the practical angle with few models makes it worth a referee's time to check the optimizer details and the quantitative numbers that are missing from the abstract.

Referee Report

2 major / 2 minor

Summary. The paper proposes Module Switching Defense (MSD) as a post-training defense against backdoor attacks on DNNs. It validates the approach on two-layer networks with a theoretical argument for higher backdoor divergence than weight averaging (WAG) while preserving clean utility, then extends to deep Transformer and CNN models via an evolutionary algorithm that searches over fusion strategies incorporating selective mechanisms. Experiments claim MSD delivers stronger robustness than baselines with fewer models and remains effective even under collusive attacks where multiple models share backdoors.

Significance. If the central claims hold, MSD would be a practical, low-cost defense that reduces the number of models required compared to prior merging methods and handles an underexplored collusive-attack setting. The public code release supports reproducibility.

major comments (2)

[Evolutionary algorithm for deep models] Evolutionary algorithm description (deep-model section): the fitness function used to guide the search over fusion strategies is not specified. Under the stated threat model (no training data, no attack knowledge), a fitness based solely on clean accuracy cannot distinguish backdoor-disrupting switches from neutral ones; without an explicit backdoor-sensitive term or other mechanism, the reported robustness gains on Transformers and CNNs rest on an unverified assumption.
[Theoretical analysis] Theoretical analysis (two-layer networks): the proof of higher backdoor divergence is not connected by any concrete mechanism or reduction to the evolutionary search procedure employed for deep models, so the two-layer result does not directly support the headline claims on complex architectures.
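The first major comment can be illustrated with a toy comparison. In this sketch, two candidate strategies have the same clean accuracy, so a clean-accuracy-only fitness cannot rank them, while a fitness with a structural term can. The "intra-layer adjacency" name is borrowed from a paper passage quoted elsewhere on this page; the formula, the weight, and the way it is combined with clean accuracy are assumptions made for illustration, not the paper's objective.

```python
def clean_only_fitness(candidate):
    """Fitness that looks at clean accuracy and nothing else."""
    return candidate["clean_acc"]


def intra_layer_adjacency(strategy):
    """Count neighbouring modules within a layer that come from the same model."""
    return sum(a == b for layer in strategy for a, b in zip(layer, layer[1:]))


def structural_fitness(candidate, penalty_weight=0.01):
    """Clean accuracy minus a hypothetical penalty for same-source neighbours."""
    penalty = penalty_weight * intra_layer_adjacency(candidate["strategy"])
    return candidate["clean_acc"] - penalty


# Two hypothetical candidates with identical clean accuracy.
single_source = {"clean_acc": 0.93, "strategy": [[0, 0, 0], [0, 0, 0]]}
interleaved = {"clean_acc": 0.93, "strategy": [[0, 1, 0], [1, 0, 1]]}

# The clean-only fitness ties, so the search gets no signal between the two.
print(clean_only_fitness(single_source) == clean_only_fitness(interleaved))  # True
# The structural term breaks the tie in favour of the interleaved strategy.
print(structural_fitness(interleaved) > structural_fitness(single_source))   # True
```

A structural term of this kind needs no data or attack labels, but the sketch says nothing about whether it actually tracks backdoor disruption, which is the referee's open question.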

minor comments (2)

Abstract and experimental sections lack quantitative numbers, error bars, or explicit description of how backdoor success rate was measured, hindering assessment of the empirical claims.
Notation for selective mechanisms and fusion strategies should be defined more clearly before the evolutionary-algorithm description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide point-by-point responses to the major comments below and will make revisions to address the concerns raised.

Referee: [Evolutionary algorithm for deep models] Evolutionary algorithm description (deep-model section): the fitness function used to guide the search over fusion strategies is not specified. Under the stated threat model (no training data, no attack knowledge), a fitness based solely on clean accuracy cannot distinguish backdoor-disrupting switches from neutral ones; without an explicit backdoor-sensitive term or other mechanism, the reported robustness gains on Transformers and CNNs rest on an unverified assumption.

Authors: We acknowledge that the description of the evolutionary algorithm requires additional detail on the fitness function. In the revised manuscript we will explicitly state that the fitness is clean accuracy on a small held-out validation set (consistent with the post-training threat model). The selective mechanisms incorporated in the fusion strategies guide the search toward module combinations that preserve utility while producing the observed backdoor disruption; we will add pseudocode, a description of the selection operators, and further empirical diagnostics showing that the discovered strategies systematically increase backdoor divergence relative to random or weight-averaged baselines.
revision: yes

Referee: [Theoretical analysis] Theoretical analysis (two-layer networks): the proof of higher backdoor divergence is not connected by any concrete mechanism or reduction to the evolutionary search procedure employed for deep models, so the two-layer result does not directly support the headline claims on complex architectures.

Authors: The two-layer analysis establishes that module switching can provably achieve higher backdoor divergence than weight averaging while preserving clean accuracy; this supplies the core intuition motivating the design of selective fusion strategies. We agree there is no formal reduction linking the proof to the evolutionary procedure on deep models. In revision we will clarify this distinction, present the theory as foundational motivation rather than direct support, and emphasize that the headline claims for Transformers and CNNs rest on the empirical results.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on held-out attacks is independent of fitted parameters or self-citations

The paper's core claims rest on (1) a two-layer theoretical comparison of backdoor divergence between MSD and WAG, presented as direct validation rather than a fitted result, and (2) an evolutionary search over fusion strategies whose effectiveness is measured empirically on held-out attack scenarios and clean data. No equation reduces a claimed prediction to a parameter fit by construction, no load-bearing premise is justified solely by self-citation, and the EA is described as an optimizer rather than a definitional renaming of its own objective. The derivation chain therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Because only the abstract is available, the ledger is necessarily incomplete. The evolutionary optimizer likely introduces free parameters (population size, mutation rate, fitness function weights) that are fitted to the defense task. No new physical entities are postulated.

free parameters (1)

evolutionary optimizer hyperparameters
Population size, mutation rate, and selection criteria for the fusion strategy search are chosen to maximize backdoor disruption on the evaluated models.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose a novel module-switching strategy to break such spurious correlations within the model’s propagation path... evolutionary algorithm to optimize fusion strategies with selective mechanisms

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Intra-layer adjacency penalty... Consecutive-layer adjacency penalty... Residual-path adjacency penalty... Balance penalty... Diversity reward

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[63]

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/ 2015/file/250cf8b51c773f3f8dc8b4be867a9a0...

work page 2015
[64]

Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. 2022. Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2022 , Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguis- tics, Abu Dhabi, United Arab Emirates, 355–372....

work page doi:10.18653/v1/2022.findings- 2022
[65]

Xingyi Zhao, Depeng Xu, and Shuhan Yuan. 2024. Defense against Backdoor Attack on Pre-trained Language Models via Head Pruning and Attention Normal- ization. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235) , Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuri...

work page 2024
[66]

Xun Zhou, A. K. Qin, Maoguo Gong, and Kay Chen Tan. 2021. A Survey on Evolutionary Construction of Deep Neural Networks. IEEE Transactions on Evolutionary Computation 25, 5 (2021), 894–912. doi:10.1109/TEVC.2021.3079985 10 Defending Deep Neural Networks against Backdoor Attacks via Module Switching A Results of a Two-Layer Network with Varying Activations...

work page doi:10.1109/tevc.2021.3079985 2021
[67]

𝐹intra (𝑆) = − |𝑆 |∑︁ 𝑙=1 IntraViolation(𝑆 [𝑙]) (6) Here, IntraViolation quantifies the number of adjacent module pairs from the same source model within layer 𝑆 [𝑙]

Intra-layer Adjacency ( 𝐹intra). 𝐹intra (𝑆) = − |𝑆 |∑︁ 𝑙=1 IntraViolation(𝑆 [𝑙]) (6) Here, IntraViolation quantifies the number of adjacent module pairs from the same source model within layer 𝑆 [𝑙]

work page
[68]

Consecutive-layer Adjacency ( 𝐹consec). 𝐹consec (𝑆) = − |𝑆 | −1∑︁ 𝑙=1 ConsecViolation(𝑆 [𝑙], 𝑆[𝑙 + 1]) (7) The function ConsecViolation counts module pairs from the same source model that are directly connected between consecutive layers

work page
[69]

Residual Connections ( 𝐹residual). 𝐹residual(𝑆) = − |𝑆 |∑︁ 𝑙=1 |𝑆 |∑︁ 𝑘=𝑙+1 ResidualViolation(𝑆 [𝑙], 𝑆[𝑘]) × ( 0.5)𝑘 −𝑙 (8) This term evaluates residual connections between layers 𝑆 [𝑙] and 𝑆 [𝑘], with ResidualViolation weighted by (0.5)𝑘 −𝑙 to reduce the impact of long-range connections

work page
[70]

Module Balance ( 𝐹balance). 𝐹balance (𝑆) = − 𝑛models∑︁ 𝑖=1 ∑︁ 𝑚∈ M |count𝑖,𝑚 − countideal| (9) where count𝑖,𝑚 is the count of module type 𝑚 from model 𝑖, M = {𝑄, 𝐾, 𝑉 , 𝑂, 𝐼, 𝑃} is the set of module types, and countideal = |𝑆 |/𝑛models represents the ideal count per module type per model

work page
[71]

𝐹diversity(𝑆) = |unique(𝑆)| (10) where unique(𝑆) is the set of unique layer compositions in strategy 𝑆

Layer Diversity ( 𝐹diversity). 𝐹diversity(𝑆) = |unique(𝑆)| (10) where unique(𝑆) is the set of unique layer compositions in strategy 𝑆. 12 Defending Deep Neural Networks against Backdoor Attacks via Module Switching C Additional Experiment Setup C.1 Dataset Statistics We evaluate our method on four datasets. The statistical information of each dataset and ...

work page

[1] Abien Fred Agarap. 2018. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)

[2] Ansh Arora, Xuanli He, Maximilian Mozes, Srinibas Swain, Mark Dras, and Qiongkai Xu. 2024. Here’s a Free Lunch: Sanitizing Backdoored Models with Model Merge. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 15059–1507...

[3] Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris, France, August 22-27, 2010, Keynote, Invited and Contributed Papers. Springer, 177–186

[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

[5] Chen Chen, Yuchen Sun, Xueluan Gong, Jiaxin Gao, and Kwok-Yan Lam. 2024. Neutralizing Backdoors through Information Conflicts for Large Language Models. arXiv preprint arXiv:2411.18280 (2024)

[6] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. 2017. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. CoRR abs/1712.05526 (2017). arXiv:1712.05526 http://arxiv.org/abs/1712.05526

[7] Jiazhu Dai, Chuanshuai Chen, and Yufeng Li. 2019. A Backdoor Attack Against LSTM-Based Text Classification Systems. IEEE Access 7 (2019), 138872–138878. https://api.semanticscholar.org/CorpusID:168170110

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy...

[9] Jacob Dumford and Walter Scheirer. 2020. Backdooring Convolutional Neural Networks via Targeted Weight Perturbations. In 2020 IEEE International Joint Conference on Biometrics (IJCB). 1–9. doi:10.1109/IJCB48548.2020.9304875

[10] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah....

[11] Chong Fu, Xuhong Zhang, Shouling Ji, Ting Wang, Peng Lin, Yanghe Feng, and Jianwei Yin. 2023. FreeEagle: Detecting Complex Neural Trojans in Data-Free Cases. In 32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 6399–6416. https://www.usenix.org/conference/usenixsecurity23/presentation/fu-chong

[12] Matt Gardner, William Merrill, Jesse Dodge, Matthew Peters, Alexis Ross, Sameer Singh, and Noah A. Smith. 2021. Competency Problems: On Finding and Removing Artifacts in Language Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (E...

[13] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. 2017. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. CoRR abs/1708.06733 (2017). arXiv:1708.06733 http://arxiv.org/abs/1708.06733

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

[15] Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 (2021)

[16] Xuanli He, Qiongkai Xu, Jun Wang, Benjamin Rubinstein, and Trevor Cohn. 2023. Mitigating Backdoor Poisoning Attacks through the Lens of Spurious Correlation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 953–...

[17] Xuanli He, Qiongkai Xu, Jun Wang, Benjamin IP Rubinstein, and Trevor Cohn. 2024. SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks. Transactions of the Association for Computational Linguistics 12 (2024), 996–1010

[19] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407 (2018)

[20] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980

[21] Alex Krizhevsky et al. 2009. Learning multiple layers of features from tiny images. (2009)

[22] Keita Kurita, Paul Michel, and Graham Neubig. 2020. Weight Poisoning Attacks on Pretrained Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 2793–2806. doi:10.18653/v1/2020.acl-main.249

[23] Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger,...

[24] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. 2021. Anti-Backdoor Learning: Training Clean Models on Poisoned Data. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 14900–14912. https://proceedings.neurips.cc/paper_files/paper/2021/file/7d38b1e9bd793d3f45e0e212a729a93c-Paper.pdf

[26] Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, and Radha Poovendran. 2024. CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models. arXiv preprint arXiv:2406.12257 (2024)

[27] Yiming Li, Mengxi Ya, Yang Bai, Yong Jiang, and Shu-Tao Xia. 2023. BackdoorBox: A Python Toolbox for Backdoor Learning. In ICLR Workshop

[28] Yiming Li, Tongqing Zhai, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. 2021. Backdoor attack in the physical world. arXiv preprint arXiv:2104.02361 (2021)

[29] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. 2018. Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks. CoRR abs/1805.12185 (2018). arXiv:1805.12185 http://arxiv.org/abs/1805.12185

[30] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. 2019. ABS: Scanning neural networks for back-doors by artificial brain stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. 1265–1282

[31] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692

[32] Michael S Matena and Colin A Raffel. 2022. Merging Models with Fisher-Weighted Averaging. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 17703–17716. https://proceedings.neurips.cc/paper_files/paper/2022/file/70c26937fbf3d4600b69a129031b...

[33] Geoffrey F Miller, Peter M Todd, and Shailesh U Hegde. 1989. Designing Neural Networks Using Genetic Algorithms. In ICGA, Vol. 89. 379–384

[34] Anh Nguyen and Anh Tran. 2021. WaNet – imperceptible warping-based backdoor attack. arXiv preprint arXiv:2102.10369 (2021)

[35] Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2021. ONION: A Simple and Effective Defense Against Textual Backdoor Attacks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 9558–9566. doi:10.18653/v1...

[37] Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. 2021. Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long P...

[38] Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. 2021. Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:235196099

[39] Fanchao Qi, Yuan Yao, Sophia Xu, Zhiyuan Liu, and Maosong Sun. 2021. Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:235417102

[40] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized Evolution for Image Classifier Architecture Search. Proceedings of the AAAI Conference on Artificial Intelligence 33, 01 (Jul. 2019), 4780–4789. doi:10.1609/aaai.v33i01.33014780

[41] David So, Quoc Le, and Chen Liang. 2019. The Evolved Transformer. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 5877–5886. https://proceedings.mlr.press/v97/so19a.html

[42] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, a...

[43] Yanghao Su, Jie Zhang, Ting Xu, Tianwei Zhang, Weiming Zhang, and Nenghai Yu. 2024. Model X-ray: Detecting Backdoored Models via Decision Boundary. In Proceedings of the 32nd ACM International Conference on Multimedia. 10296–10305

[44] Yao Tong, Weijun Li, Xuanli He, Haolan Zhan, and Qiongkai Xu. 2024. Cut the Deadwood Out: Post-Training Model Purification with Selective Module Substitution. arXiv preprint arXiv:2412.20476 (2024)

[45] TorchVision maintainers and contributors. 2016. TorchVision: PyTorch’s Computer Vision library. https://github.com/pytorch/vision. GitHub repository

[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://pr...

[47] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. 2019. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In 2019 IEEE Symposium on Security and Privacy (SP). 707–723. doi:10.1109/SP.2019.00031

[48] Hang Wang, Zhen Xiang, David J Miller, and George Kesidis. 2024. MM-BD: Post-training detection of backdoor attacks with arbitrary backdoor pattern types using a maximum margin statistic. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 1994–2012

[49] Ren Wang, Gaoyuan Zhang, Sijia Liu, Pin-Yu Chen, Jinjun Xiong, and Meng Wang. 2020. Practical detection of trojan neural networks: Data-limited and data-free cases. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer, 222–238

[50] Colin White, Mahmoud Safari, Rhea Sukthanker, Binxin Ru, Thomas Elsken, Arber Zela, Debadeepta Dey, and Frank Hutter. 2023. Neural architecture search: Insights from 1000 papers. arXiv preprint arXiv:2301.08727 (2023)

[51] Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (New Orleans, Louisiana). Association for Computa...

[52] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual Transformers: Token-based Image Representation and Processing for Computer Vision. arXiv:2006.03677 [cs.CV]

[53] Dongxian Wu and Yisen Wang. 2021. Adversarial Neuron Pruning Purifies Backdoored Deep Models. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 16913–16925. https://proceedings.neurips.cc/paper_files/paper/2021/file/8cbe9ce23f42628c98...

[54] Tong Xu, Yiming Li, Yong Jiang, and Shu-Tao Xia. 2023. BATT: Backdoor attack with transformation-based triggers. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5

[55] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. 2023. TIES-Merging: Resolving Interference When Merging Models. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 7093–7115. https://proceedings.neurips.cc/paper_files/paper/2023/file/1644c9af28ab7916874f6fd6228a9bcf-Paper-Conference.pdf

[57] Jinluan Yang, Anke Tang, Didi Zhu, Zhengyu Chen, Li Shen, and Fei Wu. 2024. Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace. arXiv preprint arXiv:2410.13910 (2024)

[58] Wenqian Ye, Guangtao Zheng, Xu Cao, Yunsheng Ma, and Aidong Zhang. 2024. Spurious correlations in machine learning: A survey. arXiv preprint arXiv:2402.12715 (2024)

[60] Biao Yi, Sishuo Chen, Yiming Li, Tong Li, Baolei Zhang, and Zheli Liu. 2024. BadActs: A Universal Backdoor Defense in the Activation Space. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 5339–5352. doi:10.18653/v1/2...

[61] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. 2024. Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=fq0NaiU8Ex

[62] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. 2019. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6023–6032

[63] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a0...

[64] Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. 2022. Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 355–372....

[65] Xingyi Zhao, Depeng Xu, and Shuhan Yuan. 2024. Defense against Backdoor Attack on Pre-trained Language Models via Head Pruning and Attention Normalization. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuri...

[66] Xun Zhou, A. K. Qin, Maoguo Gong, and Kay Chen Tan. 2021. A Survey on Evolutionary Construction of Deep Neural Networks. IEEE Transactions on Evolutionary Computation 25, 5 (2021), 894–912. doi:10.1109/TEVC.2021.3079985

A Results of a Two-Layer Network with Varying Activations...

Intra-layer Adjacency ($F_{\text{intra}}$).

$$F_{\text{intra}}(S) = -\sum_{l=1}^{|S|} \text{IntraViolation}(S[l]) \tag{6}$$

Here, IntraViolation counts the adjacent module pairs within layer $S[l]$ that come from the same source model.
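The intra-layer term can be illustrated with a short Python sketch. It is not the authors' implementation: the dictionary representation of a layer, the `make_layer` helper, and above all the `INTRA_PAIRS` adjacency list are assumptions made for illustration, since the text here does not say which module types count as adjacent.

```python
from typing import Dict, List, Sequence, Tuple

MODULE_TYPES = ("Q", "K", "V", "O", "I", "P")  # module types M named in the text

# ASSUMPTION: which module types count as "adjacent" inside one layer.
# This chain is hypothetical; substitute the pairs your architecture defines.
INTRA_PAIRS: Sequence[Tuple[str, str]] = (
    ("Q", "K"), ("K", "V"), ("V", "O"), ("O", "I"), ("I", "P"),
)

Layer = Dict[str, int]  # module type -> index of the source model


def make_layer(assignment: str) -> Layer:
    """Hypothetical helper: "010101" -> Q:0, K:1, V:0, O:1, I:0, P:1."""
    return {m: int(c) for m, c in zip(MODULE_TYPES, assignment)}


def intra_violation(layer: Layer, pairs=INTRA_PAIRS) -> int:
    """Count adjacent module pairs drawn from the same source model."""
    return sum(1 for a, b in pairs if layer[a] == layer[b])


def f_intra(strategy: List[Layer], pairs=INTRA_PAIRS) -> int:
    """Eq. (6): negated sum of intra-layer violations over all layers."""
    return -sum(intra_violation(layer, pairs) for layer in strategy)


strategy = [make_layer("010101"), make_layer("001101")]
print(f_intra(strategy))  # -2
```

The first layer alternates source models and incurs no violation. The second has two same-source adjacent pairs (Q-K and V-O), so the score is -2.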

Consecutive-layer Adjacency ($F_{\text{consec}}$).

$$F_{\text{consec}}(S) = -\sum_{l=1}^{|S|-1} \text{ConsecViolation}(S[l], S[l+1]) \tag{7}$$

The function ConsecViolation counts module pairs from the same source model that are directly connected between consecutive layers.
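A minimal Python sketch of the consecutive-layer term follows. The `CONSEC_PAIRS` list, which treats module P of one layer as feeding Q, K, and V of the next, is a hypothetical wiring chosen for illustration; the layer representation and `make_layer` helper are likewise assumptions rather than the paper's code.

```python
from typing import Dict, List, Sequence, Tuple

MODULE_TYPES = ("Q", "K", "V", "O", "I", "P")  # module types M named in the text

# ASSUMPTION: (module in layer l, module in layer l+1) pairs that are
# "directly connected". This wiring is hypothetical, not taken from the paper.
CONSEC_PAIRS: Sequence[Tuple[str, str]] = (("P", "Q"), ("P", "K"), ("P", "V"))

Layer = Dict[str, int]  # module type -> index of the source model


def make_layer(assignment: str) -> Layer:
    """Hypothetical helper: "010101" -> Q:0, K:1, V:0, O:1, I:0, P:1."""
    return {m: int(c) for m, c in zip(MODULE_TYPES, assignment)}


def consec_violation(lower: Layer, upper: Layer, pairs=CONSEC_PAIRS) -> int:
    """Count connected cross-layer pairs that share a source model."""
    return sum(1 for a, b in pairs if lower[a] == upper[b])


def f_consec(strategy: List[Layer], pairs=CONSEC_PAIRS) -> int:
    """Eq. (7): negated sum of violations over consecutive layer pairs."""
    return -sum(
        consec_violation(strategy[l], strategy[l + 1], pairs)
        for l in range(len(strategy) - 1)
    )


strategy = [make_layer("010101"), make_layer("011010"), make_layer("100101")]
print(f_consec(strategy))  # -4
```

Each of the two layer boundaries in this example contributes two violations (P matches the next layer's K and V), giving -4. A single-layer strategy has no boundaries and scores 0.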

Residual Connections ($F_{\text{residual}}$).

$$F_{\text{residual}}(S) = -\sum_{l=1}^{|S|} \sum_{k=l+1}^{|S|} \text{ResidualViolation}(S[l], S[k]) \times (0.5)^{k-l} \tag{8}$$

This term evaluates residual connections between layers $S[l]$ and $S[k]$, with ResidualViolation weighted by $(0.5)^{k-l}$ to reduce the impact of long-range connections.
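The distance-decayed double sum in the residual term can be sketched as below. The text here does not define ResidualViolation, so the default `same_type_matches` function (counting module types that two layers take from the same source model) is a placeholder assumption, passed in as a parameter so it can be swapped out. The layer representation and `make_layer` helper are also illustrative.

```python
from typing import Callable, Dict, List

MODULE_TYPES = ("Q", "K", "V", "O", "I", "P")  # module types M named in the text

Layer = Dict[str, int]  # module type -> index of the source model


def make_layer(assignment: str) -> Layer:
    """Hypothetical helper: "010101" -> Q:0, K:1, V:0, O:1, I:0, P:1."""
    return {m: int(c) for m, c in zip(MODULE_TYPES, assignment)}


def same_type_matches(lower: Layer, upper: Layer) -> int:
    """ASSUMED stand-in for ResidualViolation: number of module types
    that both layers draw from the same source model."""
    return sum(1 for m in MODULE_TYPES if lower[m] == upper[m])


def f_residual(
    strategy: List[Layer],
    violation: Callable[[Layer, Layer], int] = same_type_matches,
    decay: float = 0.5,
) -> float:
    """Eq. (8): negated violations over all layer pairs l < k,
    each weighted by decay ** (k - l)."""
    total = 0.0
    for l in range(len(strategy)):
        for k in range(l + 1, len(strategy)):
            total += violation(strategy[l], strategy[k]) * decay ** (k - l)
    return -total


strategy = [make_layer("010101"), make_layer("010110"), make_layer("101010")]
print(f_residual(strategy))  # -3.0
```

In this example the adjacent pairs contribute 4 × 0.5 and 2 × 0.5, while the distance-two pair shares no module sources and contributes nothing, for a total of -3.0.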

Module Balance ($F_{\text{balance}}$).

$$F_{\text{balance}}(S) = -\sum_{i=1}^{n_{\text{models}}} \sum_{m \in \mathcal{M}} \left| \text{count}_{i,m} - \text{count}_{\text{ideal}} \right| \tag{9}$$

where $\text{count}_{i,m}$ is the count of module type $m$ from model $i$, $\mathcal{M} = \{Q, K, V, O, I, P\}$ is the set of module types, and $\text{count}_{\text{ideal}} = |S| / n_{\text{models}}$ represents the ideal count per module type per model.
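As a rough illustration of the balance term, the following Python sketch tallies how often each source model supplies each module type and penalizes deviation from the ideal count |S| / n_models. The dictionary layer representation and `make_layer` helper are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List

MODULE_TYPES = ("Q", "K", "V", "O", "I", "P")  # module types M named in the text

Layer = Dict[str, int]  # module type -> index of the source model


def make_layer(assignment: str) -> Layer:
    """Hypothetical helper: "010101" -> Q:0, K:1, V:0, O:1, I:0, P:1."""
    return {m: int(c) for m, c in zip(MODULE_TYPES, assignment)}


def f_balance(strategy: List[Layer], n_models: int) -> float:
    """Eq. (9): negated total deviation of count_{i,m} from the ideal count."""
    ideal = len(strategy) / n_models
    # counts[(i, m)] = number of layers in which model i supplies module type m
    counts = Counter((layer[m], m) for layer in strategy for m in MODULE_TYPES)
    return -sum(
        abs(counts[(i, m)] - ideal)
        for i in range(n_models)
        for m in MODULE_TYPES
    )


unbalanced = [make_layer("010101"), make_layer("010110")]
print(f_balance(unbalanced, n_models=2))  # -8.0
```

Here Q, K, V, and O each come from a single model in both layers, costing 2 apiece, while I and P are split evenly and cost nothing. A strategy such as `[make_layer("010101"), make_layer("101010")]` is perfectly balanced and incurs no penalty.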

Layer Diversity ($F_{\text{diversity}}$).

$$F_{\text{diversity}}(S) = |\text{unique}(S)| \tag{10}$$

where $\text{unique}(S)$ is the set of unique layer compositions in strategy $S$.
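The layer diversity term reduces to counting distinct layer compositions, which a set makes straightforward. The sketch below assumes a layer is a mapping from module type to source-model index; that representation and the `make_layer` helper are illustrative choices, not the paper's code.

```python
from typing import Dict, List

MODULE_TYPES = ("Q", "K", "V", "O", "I", "P")  # module types M named in the text

Layer = Dict[str, int]  # module type -> index of the source model


def make_layer(assignment: str) -> Layer:
    """Hypothetical helper: "010101" -> Q:0, K:1, V:0, O:1, I:0, P:1."""
    return {m: int(c) for m, c in zip(MODULE_TYPES, assignment)}


def f_diversity(strategy: List[Layer]) -> int:
    """Eq. (10): number of distinct layer compositions in the strategy."""
    # Dicts are unhashable, so each layer is frozen into a tuple first.
    return len({tuple(layer[m] for m in MODULE_TYPES) for layer in strategy})


strategy = [make_layer("010101"), make_layer("101010"), make_layer("010101")]
print(f_diversity(strategy))  # 2
```

Unlike the four penalty terms, this score is non-negative and grows as layers differ from one another.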

C Additional Experiment Setup

C.1 Dataset Statistics

We evaluate our method on four datasets. The statistical information of each dataset and ...