pith. machine review for the scientific record.

arxiv: 2605.04209 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.AI · cs.LG

Recognition: 3 theorem links


Undetectable Backdoors in Model Parameters: Hiding Sparse Secrets in High Dimensions

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:46 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords backdoor attack · undetectable backdoor · sparse perturbation · Sparse PCA · supply chain attack · image classifiers · provable security · model parameters

The pith

A sparse perturbation masked by Gaussian dither embeds backdoors whose detection reduces to the hard Sparse PCA problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sparse Backdoor, which plants a backdoor by adding a structured sparse perturbation along a random direction to a small subset of columns in each fully connected layer of a pre-trained classifier. The perturbation is masked with independent isotropic Gaussian dither that creates a clean reference distribution. Under a mild margin condition the dithered reference matches the original classifier's behavior on normal inputs. The authors prove that any probabilistic polynomial-time algorithm distinguishing the backdoored weights from this reference would also solve Sparse PCA detection, which is computationally infeasible under standard hardness assumptions. The result applies to convolutional networks and Vision Transformers and holds even when the distinguisher has white-box access to the published parameters.
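The injection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the layer shape, sparsity level `k`, spike magnitude, and dither scale are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_sparse_backdoor(W, k=8, spike=0.5, dither_std=0.05):
    """Illustrative Sparse Backdoor step: add a spike along a random unit
    direction to k columns of one FC weight matrix, then mask the whole
    matrix with independent isotropic Gaussian dither."""
    d_out, d_in = W.shape
    cols = rng.choice(d_in, size=k, replace=False)   # small column subset
    s = rng.standard_normal(d_out)                   # random direction
    s /= np.linalg.norm(s)
    W_bd = W + rng.normal(0.0, dither_std, W.shape)  # isotropic dither mask
    W_bd[:, cols] += spike * s[:, None]              # structured sparse spike
    return W_bd, cols, s

W = rng.standard_normal((64, 128))                   # stand-in FC layer
W_bd, cols, s = inject_sparse_backdoor(W)
```

The point of the dither is visible here: the sparse spike is buried inside a matrix-wide Gaussian perturbation, so the defender must separate a low-rank sparse signal from isotropic noise, which is exactly the Sparse PCA setting.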

Core claim

The paper establishes that a backdoor can be planted by injecting a sparse perturbation along a randomly chosen direction into selected columns at each fully connected layer, then masking the change with isotropic Gaussian dither. This produces a backdoored model that remains functionally equivalent to the original classifier under a mild margin condition while routing a chosen trigger to an adversary-specified target class. Distinguishing the injected model from the dithered reference distribution is at least as hard as Sparse PCA detection for any probabilistic polynomial-time algorithm with white-box parameter access.

What carries the argument

The Sparse Backdoor construction: a random-direction sparse perturbation applied to a subset of columns in fully connected layers, masked by independent isotropic Gaussian dither to anchor a reference distribution for the undetectability reduction.
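The undetectability claim is phrased as a standard indistinguishability game (cf. the paper's Figure 1). A hedged sketch of one round, with `attack` standing in for the backdoor-injection step and all names ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def security_game(W, attack, distinguisher, dither_std=0.05):
    """One round: the challenger flips a bit b and hands the distinguisher
    either the dithered clean reference (b=0) or the attacked weights (b=1).
    The distinguisher wins iff it guesses b from white-box access."""
    b = int(rng.integers(0, 2))
    if b == 0:
        W_star = W + rng.normal(0.0, dither_std, W.shape)  # clean reference
    else:
        W_star = attack(W)                                  # backdoored
    return distinguisher(W_star) == b
```

The paper's guarantee, in these terms, is that no probabilistic polynomial-time `distinguisher` wins noticeably more than half the rounds under the Sparse PCA hardness assumption.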

If this is right

  • The attack applies to both convolutional networks and Vision Transformers.
  • The backdoor activates only on the adversary-chosen trigger while preserving accuracy on clean inputs.
  • Undetectability holds against any white-box probabilistic polynomial-time distinguisher.
  • The construction works by reducing detection directly to the Sparse PCA decision problem.
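For context, the hardness target is Sparse PCA detection: decide whether samples come from an isotropic Gaussian or from one whose covariance carries a weak k-sparse rank-one spike. A self-contained sampler for both hypotheses (parameters illustrative, deliberately in an easy regime so the spike is visible):

```python
import numpy as np

rng = np.random.default_rng(2)

def sparse_pca_instance(n, d, k, theta, planted):
    """n samples from N(0, I_d) under H0, or from N(0, I_d + theta*v v^T)
    with a k-sparse unit vector v under H1 (the planted spike)."""
    X = rng.standard_normal((n, d))
    if not planted:
        return X
    v = np.zeros(d)
    v[rng.choice(d, size=k, replace=False)] = 1.0 / np.sqrt(k)
    z = rng.standard_normal(n)           # spike coefficients
    return X + np.sqrt(theta) * np.outer(z, v)
```

In the computationally hard regime (spike strength small relative to the sparsity and sample size), no known polynomial-time test beats chance; that conjectured hardness is what the paper's reduction leans on.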

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model supply chains may need to enforce trusted training pipelines rather than relying on post-publication parameter inspection.
  • The same masking idea could be explored for hiding modifications in other high-dimensional model components such as attention matrices.
  • Practical deployment would still require the attacker to know the exact pre-trained weights to align the perturbation direction.

Load-bearing premise

The dithered reference model remains functionally equivalent to the original pre-trained classifier under a mild margin condition on its decision boundaries.
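This premise can be made concrete: if an example's top-1/top-2 logit gap exceeds twice the largest logit shift the dither can cause, its prediction is provably unchanged. A sketch of that check, with stand-in logits and shifts of our choosing:

```python
import numpy as np

rng = np.random.default_rng(3)

def margins(logits):
    """Top-1 minus top-2 logit per example: the decision-boundary margin."""
    top2 = np.sort(logits, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

logits = 5.0 * rng.standard_normal((200, 10))  # stand-in clean logits
shift = rng.normal(0.0, 0.01, logits.shape)    # stand-in dither-induced shift
safe = margins(logits) > 2.0 * np.abs(shift).max(axis=1)
# On every 'safe' row the argmax cannot move: the winner loses at most
# max|shift| while any rival gains at most max|shift|.
```

The "mild margin condition" asserts, in effect, that essentially all clean inputs are `safe` at the dither scale the attack uses.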

What would settle it

An efficient probabilistic polynomial-time algorithm that distinguishes backdoor-injected parameters from the corresponding dithered reference with non-negligible advantage would falsify the undetectability claim.
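That falsification criterion is directly measurable: estimate a candidate detector's advantage as the gap between its positive rate on backdoored weights and on dithered references. A toy Monte Carlo harness (all names ours, not the paper's evaluation code):

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_advantage(detector, draw_reference, draw_backdoored, trials=200):
    """|Pr[detector flags backdoored] - Pr[detector flags reference]|,
    estimated by sampling. A detector with non-negligible advantage
    would falsify the undetectability claim."""
    p_bd = sum(bool(detector(draw_backdoored())) for _ in range(trials)) / trials
    p_ref = sum(bool(detector(draw_reference())) for _ in range(trials)) / trials
    return abs(p_bd - p_ref)
```

For example, a detector that flags everything has advantage 0 even though it "catches" every backdoor, which is why the claim is stated as a distinguishing advantage rather than a detection rate.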

Figures

Figures reproduced from arXiv: 2605.04209 by Ashish Hooda, Atharv Singh Patlan, Kassem Fawaz, Nils Palumbo, Sarthak Choudhary, Somesh Jha.

Figure 1. Security game for backdoor undetectability. The challenger C samples a random bit b and either forwards a freshly drawn clean classifier f ∼ Fclean or runs the attack Atk on f and forwards the result. The distinguisher G is given white-box access to f* and attempts to guess b. view at source ↗
Figure 2. Sparse Backdoor pipeline. The trigger ∆* (red dots) blended into input x is passed through a frozen feature encoder fenc, producing an embedding with a high component along a sparse direction s1 (red coordinates). Each perturbed FC layer Wfi combines a Gaussian dither with a structured spike along si, propagating the signal to a sparser direction si+1 as embeddings shrink with depth. The final layer rout… view at source ↗
Figure 3. Effect of fine-tuning on Sparse Backdoor. Top: ASR (%); bottom: BA (%), with dashed lines showing the baseline clean accuracy CA (%). The defender fine-tunes FC layers on 1% clean held-out data; shaded bands are ±1 std over 10 seeds. Clean accuracy recovers quickly while ASR persistence varies, making fine-tuning unreliable as a standalone mitigation. view at source ↗
Figure 4. Representative clean images and corresponding trigger-corrupted … view at source ↗
Original abstract

We present Sparse Backdoor, a supply-chain attack that plants a \emph{provably undetectable} backdoor in pre-trained image classifiers, including convolutional networks and Vision Transformers. The attack injects a structured sparse perturbation along a randomly chosen direction into a small subset of columns at each fully connected layer, propagating a trigger signal to an adversary-chosen target class, and masks the perturbation with an independent isotropic Gaussian dither. The dither serves a single technical purpose: it induces a clean reference distribution anchored at the pre-trained weights, against which undetectability can be formalized. Under a mild margin condition on the pre-trained classifier, we show that the dithered reference is functionally equivalent to the original classifier. We prove that distinguishing the backdoor-injected model from this reference is at least as hard as Sparse PCA detection, which is computationally infeasible under standard hardness assumptions. The guarantee holds against any probabilistic polynomial-time distinguisher with white-box access to the parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Sparse Backdoor, a supply-chain attack that injects a structured sparse perturbation along a randomly chosen direction into a small subset of columns at each fully connected layer of pre-trained image classifiers (CNNs and ViTs). This perturbation propagates a trigger to a target class and is masked by an independent isotropic Gaussian dither. Under a mild margin condition on the pre-trained classifier, the dithered reference is claimed to be functionally equivalent to the original model. The central technical claim is a reduction showing that distinguishing the backdoor-injected model from this reference is at least as hard as Sparse PCA detection, which is computationally infeasible for probabilistic polynomial-time distinguishers with white-box parameter access.

Significance. If the reduction and variance selection are rigorously established, the result would be significant: it supplies a formal, assumption-based undetectability guarantee for a backdoor attack in the white-box setting, reducing security to a standard computational hardness problem (Sparse PCA) rather than heuristic arguments. This strengthens the literature on provable security for model poisoning by demonstrating how sparse secrets can be hidden in high-dimensional parameter spaces while preserving functionality.

major comments (2)
  1. [Undetectability proof and margin condition] The undetectability reduction requires the isotropic Gaussian dither variance to be sufficiently large to embed the structured sparse perturbation in the computationally hard regime of Sparse PCA detection. Simultaneously, the mild margin condition (under which the dithered reference remains functionally equivalent) constrains the allowable dither magnitude. The manuscript provides no explicit bounds, existence argument, or parameter regime showing that a single variance satisfies both requirements at once; this is load-bearing for the central claim that the backdoor is provably undetectable.
  2. [Abstract and main theorem] The abstract states a reduction to Sparse PCA hardness and equivalence under a margin condition, but the provided text does not include the full derivation, error analysis, or the precise statement of the margin condition. Without these, the support for the equivalence and hardness claims cannot be verified.
minor comments (1)
  1. [Attack construction] Clarify the exact subset of columns modified per FC layer and how the random direction is chosen; this affects both trigger propagation and the sparsity pattern used in the Sparse PCA reduction.
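One plausible reading of the detail the referee requests, matching Figure 2's description of per-layer directions s_i growing sparser with depth: per layer, a random column subset and a k_i-sparse random unit direction. The geometric sparsity schedule below is our assumption, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(5)

def layer_directions(layer_dims, start_frac=0.10, decay=0.5):
    """Per FC layer i, draw a k_i-sparse random unit direction s_i, with
    k_i shrinking geometrically with depth (assumed schedule)."""
    dirs = []
    frac = start_frac
    for d in layer_dims:
        k = max(1, int(frac * d))                    # sparsity at this depth
        s = np.zeros(d)
        support = rng.choice(d, size=k, replace=False)
        s[support] = rng.standard_normal(k)
        s /= np.linalg.norm(s)                       # unit-norm direction
        dirs.append(s)
        frac *= decay
    return dirs
```

Pinning down these choices matters because the support sizes k_i determine which Sparse PCA regime the reduction lands in.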

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We appreciate the recognition that a rigorous reduction to Sparse PCA would be a significant contribution. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Undetectability proof and margin condition] The undetectability reduction requires the isotropic Gaussian dither variance to be sufficiently large to embed the structured sparse perturbation in the computationally hard regime of Sparse PCA detection. Simultaneously, the mild margin condition (under which the dithered reference remains functionally equivalent) constrains the allowable dither magnitude. The manuscript provides no explicit bounds, existence argument, or parameter regime showing that a single variance satisfies both requirements at once; this is load-bearing for the central claim that the backdoor is provably undetectable.

    Authors: We agree that an explicit combined parameter regime is necessary to fully substantiate the central undetectability claim. The manuscript derives the functional-equivalence bound (via the margin condition) and the Sparse-PCA hardness regime separately but does not explicitly exhibit a non-empty interval for the dither variance that satisfies both simultaneously. In the revised version we will add a dedicated subsection deriving explicit bounds: under the mild margin assumption, the allowable perturbation size is O(1/sqrt(d)) with high probability; for layer dimension d sufficiently larger than the sparsity level k, the Sparse-PCA detection threshold permits variances up to poly(log d). We will prove that these two intervals overlap for standard hyper-parameter choices, thereby establishing existence of a suitable variance. revision: yes

  2. Referee: [Abstract and main theorem] The abstract states a reduction to Sparse PCA hardness and equivalence under a margin condition, but the provided text does not include the full derivation, error analysis, or the precise statement of the margin condition. Without these, the support for the equivalence and hardness claims cannot be verified.

    Authors: The full manuscript states the margin condition precisely as Assumption 3.1, presents the main reduction as Theorem 4.2, and supplies the complete proof together with error analysis in Appendix B. The abstract is written as a concise summary and therefore omits these details. To improve readability we will (i) add a parenthetical pointer in the abstract to Assumption 3.1 and Theorem 4.2, (ii) include a one-paragraph high-level statement of the margin condition in the introduction, and (iii) move a condensed error-bound paragraph into the main body. revision: partial

Circularity Check

0 steps flagged

No circularity: undetectability reduces to external Sparse PCA hardness and a mild margin condition

Full rationale

The paper's central claim reduces distinguishing the backdoored model from the dithered reference to the computational hardness of Sparse PCA detection (an external standard assumption) and shows functional equivalence of the reference under a mild margin condition on the pre-trained classifier. Neither step is defined in terms of the paper's own fitted quantities, self-citations, or internal constructions; the isotropic Gaussian dither is introduced explicitly to create an anchor distribution for the reduction, and the proofs do not collapse by construction to the inputs. This is a standard hardness reduction with no load-bearing self-referential elements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the computational hardness of Sparse PCA (external) and the mild margin condition (domain assumption) for functional equivalence; no free parameters or new invented entities with independent evidence are introduced.

axioms (1)
  • domain assumption Mild margin condition on the pre-trained classifier ensures the dithered reference is functionally equivalent to the original
    Invoked to establish that the backdoored model behaves identically on clean inputs.
invented entities (2)
  • Structured sparse perturbation along a randomly chosen direction no independent evidence
    purpose: To propagate the trigger signal to the target class while remaining sparse
    Introduced as the core mechanism of the backdoor injection.
  • Independent isotropic Gaussian dither no independent evidence
    purpose: To mask the perturbation and anchor a clean reference distribution
    Added specifically to enable the undetectability formalization.

pith-pipeline@v0.9.0 · 5490 in / 1466 out tokens · 44426 ms · 2026-05-08T17:46:52.224990+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 8 canonical work pages · 2 internal anchors


    theclean reference modelf ′ (with logit mapg ′), obtained by perturbing each weight columnj∈ I i of every targeted layeriby an isotropic Gaussian ditherη (j) i ∼ N(0, τ 2 i ·I di), with the same per-layerτ i used by our attack andnobackdoor signal. We sweep the test set of the corresponding dataset and, for every test input x, record the four per-sample q...