pith. machine review for the scientific record.

arxiv: 2605.04209 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.AI · cs.LG

Recognition: 3 theorem links


Undetectable Backdoors in Model Parameters: Hiding Sparse Secrets in High Dimensions

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:46 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords backdoor attack · undetectable backdoor · sparse perturbation · Sparse PCA · supply chain attack · image classifiers · provable security · model parameters

The pith

A sparse perturbation masked by Gaussian dither embeds backdoors whose detection reduces to the hard Sparse PCA problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sparse Backdoor, which plants a backdoor by adding a structured sparse perturbation along a random direction to a small subset of columns in each fully connected layer of a pre-trained classifier. The perturbation is masked with independent isotropic Gaussian dither that creates a clean reference distribution. Under a mild margin condition the dithered reference matches the original classifier's behavior on normal inputs. The authors prove that any probabilistic polynomial-time algorithm distinguishing the backdoored weights from this reference would also solve Sparse PCA detection, which is computationally infeasible under standard hardness assumptions. The result applies to convolutional networks and Vision Transformers and holds even when the distinguisher has white-box access to the published parameters.
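The injection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the layer shape, sparsity level `k`, spike magnitude, and dither scale are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_sparse_backdoor(W, k=8, spike=0.5, dither_std=0.05):
    """Illustrative Sparse Backdoor step: add a spike along a random unit
    direction to k columns of one FC weight matrix, then mask the whole
    matrix with independent isotropic Gaussian dither."""
    d_out, d_in = W.shape
    cols = rng.choice(d_in, size=k, replace=False)   # small column subset
    s = rng.standard_normal(d_out)                   # random direction
    s /= np.linalg.norm(s)
    W_bd = W + rng.normal(0.0, dither_std, W.shape)  # isotropic dither mask
    W_bd[:, cols] += spike * s[:, None]              # structured sparse spike
    return W_bd, cols, s

W = rng.standard_normal((64, 128))                   # stand-in FC layer
W_bd, cols, s = inject_sparse_backdoor(W)
```

The point of the dither is visible here: the sparse spike is buried inside a matrix-wide Gaussian perturbation, so the defender must separate a low-rank sparse signal from isotropic noise, which is exactly the Sparse PCA setting.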

Core claim

The paper establishes that a backdoor can be planted by injecting a sparse perturbation along a randomly chosen direction into selected columns at each fully connected layer, then masking the change with isotropic Gaussian dither. This produces a backdoored model that remains functionally equivalent to the original classifier under a mild margin condition while routing a chosen trigger to an adversary-specified target class. Distinguishing the injected model from the dithered reference distribution is at least as hard as Sparse PCA detection for any probabilistic polynomial-time algorithm with white-box parameter access.

What carries the argument

The Sparse Backdoor construction: a random-direction sparse perturbation applied to a subset of columns in fully connected layers, masked by independent isotropic Gaussian dither to anchor a reference distribution for the undetectability reduction.
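The undetectability claim is phrased as a standard indistinguishability game (cf. the paper's Figure 1). A hedged sketch of one round, with `attack` standing in for the backdoor-injection step and all names ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def security_game(W, attack, distinguisher, dither_std=0.05):
    """One round: the challenger flips a bit b and hands the distinguisher
    either the dithered clean reference (b=0) or the attacked weights (b=1).
    The distinguisher wins iff it guesses b from white-box access."""
    b = int(rng.integers(0, 2))
    if b == 0:
        W_star = W + rng.normal(0.0, dither_std, W.shape)  # clean reference
    else:
        W_star = attack(W)                                  # backdoored
    return distinguisher(W_star) == b
```

The paper's guarantee, in these terms, is that no probabilistic polynomial-time `distinguisher` wins noticeably more than half the rounds under the Sparse PCA hardness assumption.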

If this is right

  • The attack applies to both convolutional networks and Vision Transformers.
  • The backdoor activates only on the adversary-chosen trigger while preserving accuracy on clean inputs.
  • Undetectability holds against any white-box probabilistic polynomial-time distinguisher.
  • The construction works by reducing detection directly to the Sparse PCA decision problem.
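For context, the hardness target is Sparse PCA detection: decide whether samples come from an isotropic Gaussian or from one whose covariance carries a weak k-sparse rank-one spike. A self-contained sampler for both hypotheses (parameters illustrative, deliberately in an easy regime so the spike is visible):

```python
import numpy as np

rng = np.random.default_rng(2)

def sparse_pca_instance(n, d, k, theta, planted):
    """n samples from N(0, I_d) under H0, or from N(0, I_d + theta*v v^T)
    with a k-sparse unit vector v under H1 (the planted spike)."""
    X = rng.standard_normal((n, d))
    if not planted:
        return X
    v = np.zeros(d)
    v[rng.choice(d, size=k, replace=False)] = 1.0 / np.sqrt(k)
    z = rng.standard_normal(n)           # spike coefficients
    return X + np.sqrt(theta) * np.outer(z, v)
```

In the computationally hard regime (spike strength small relative to the sparsity and sample size), no known polynomial-time test beats chance; that conjectured hardness is what the paper's reduction leans on.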

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model supply chains may need to enforce trusted training pipelines rather than relying on post-publication parameter inspection.
  • The same masking idea could be explored for hiding modifications in other high-dimensional model components such as attention matrices.
  • Practical deployment would still require the attacker to know the exact pre-trained weights to align the perturbation direction.

Load-bearing premise

The dithered reference model remains functionally equivalent to the original pre-trained classifier under a mild margin condition on its decision boundaries.
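This premise can be made concrete: if an example's top-1/top-2 logit gap exceeds twice the largest logit shift the dither can cause, its prediction is provably unchanged. A sketch of that check, with stand-in logits and shifts of our choosing:

```python
import numpy as np

rng = np.random.default_rng(3)

def margins(logits):
    """Top-1 minus top-2 logit per example: the decision-boundary margin."""
    top2 = np.sort(logits, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

logits = 5.0 * rng.standard_normal((200, 10))  # stand-in clean logits
shift = rng.normal(0.0, 0.01, logits.shape)    # stand-in dither-induced shift
safe = margins(logits) > 2.0 * np.abs(shift).max(axis=1)
# On every 'safe' row the argmax cannot move: the winner loses at most
# max|shift| while any rival gains at most max|shift|.
```

The "mild margin condition" asserts, in effect, that essentially all clean inputs are `safe` at the dither scale the attack uses.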

What would settle it

An efficient probabilistic polynomial-time algorithm that distinguishes backdoor-injected parameters from the corresponding dithered reference with non-negligible advantage would falsify the undetectability claim.
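That falsification criterion is directly measurable: estimate a candidate detector's advantage as the gap between its positive rate on backdoored weights and on dithered references. A toy Monte Carlo harness (all names ours, not the paper's evaluation code):

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_advantage(detector, draw_reference, draw_backdoored, trials=200):
    """|Pr[detector flags backdoored] - Pr[detector flags reference]|,
    estimated by sampling. A detector with non-negligible advantage
    would falsify the undetectability claim."""
    p_bd = sum(bool(detector(draw_backdoored())) for _ in range(trials)) / trials
    p_ref = sum(bool(detector(draw_reference())) for _ in range(trials)) / trials
    return abs(p_bd - p_ref)
```

For example, a detector that flags everything has advantage 0 even though it "catches" every backdoor, which is why the claim is stated as a distinguishing advantage rather than a detection rate.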

Figures

Figures reproduced from arXiv: 2605.04209 by Ashish Hooda, Atharv Singh Patlan, Kassem Fawaz, Nils Palumbo, Sarthak Choudhary, Somesh Jha.

Figure 1. Security game for backdoor undetectability. The challenger C samples a random bit b and either forwards a freshly drawn clean classifier f ∼ Fclean or runs the attack Atk on f and forwards the result. The distinguisher G is given white-box access to f* and attempts to guess b. view at source ↗
Figure 2. Sparse Backdoor pipeline. The trigger ∆* (red dots) blended into input x is passed through a frozen feature encoder fenc, producing an embedding with a high component along a sparse direction s1 (red coordinates). Each perturbed FC layer Wfi combines a Gaussian dither with a structured spike along si, propagating the signal to a sparser direction si+1 as embeddings shrink with depth. The final layer rout… view at source ↗
Figure 3. Effect of fine-tuning on Sparse Backdoor. Top: ASR (%); bottom: BA (%), with dashed lines showing the baseline clean accuracy CA (%). The defender fine-tunes FC layers on 1% clean held-out data; shaded bands are ±1 std over 10 seeds. Clean accuracy recovers quickly while ASR persistence varies, making fine-tuning unreliable as a standalone mitigation. view at source ↗
Figure 4. Representative clean images and corresponding trigger-corrupted … view at source ↗
Original abstract

We present Sparse Backdoor, a supply-chain attack that plants a \emph{provably undetectable} backdoor in pre-trained image classifiers, including convolutional networks and Vision Transformers. The attack injects a structured sparse perturbation along a randomly chosen direction into a small subset of columns at each fully connected layer, propagating a trigger signal to an adversary-chosen target class, and masks the perturbation with an independent isotropic Gaussian dither. The dither serves a single technical purpose: it induces a clean reference distribution anchored at the pre-trained weights, against which undetectability can be formalized. Under a mild margin condition on the pre-trained classifier, we show that the dithered reference is functionally equivalent to the original classifier. We prove that distinguishing the backdoor-injected model from this reference is at least as hard as Sparse PCA detection, which is computationally infeasible under standard hardness assumptions. The guarantee holds against any probabilistic polynomial-time distinguisher with white-box access to the parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Sparse Backdoor, a supply-chain attack that injects a structured sparse perturbation along a randomly chosen direction into a small subset of columns at each fully connected layer of pre-trained image classifiers (CNNs and ViTs). This perturbation propagates a trigger to a target class and is masked by an independent isotropic Gaussian dither. Under a mild margin condition on the pre-trained classifier, the dithered reference is claimed to be functionally equivalent to the original model. The central technical claim is a reduction showing that distinguishing the backdoor-injected model from this reference is at least as hard as Sparse PCA detection, which is computationally infeasible for probabilistic polynomial-time distinguishers with white-box parameter access.

Significance. If the reduction and variance selection are rigorously established, the result would be significant: it supplies a formal, assumption-based undetectability guarantee for a backdoor attack in the white-box setting, reducing security to a standard computational hardness problem (Sparse PCA) rather than heuristic arguments. This strengthens the literature on provable security for model poisoning by demonstrating how sparse secrets can be hidden in high-dimensional parameter spaces while preserving functionality.

major comments (2)
  1. [Undetectability proof and margin condition] The undetectability reduction requires the isotropic Gaussian dither variance to be sufficiently large to embed the structured sparse perturbation in the computationally hard regime of Sparse PCA detection. Simultaneously, the mild margin condition (under which the dithered reference remains functionally equivalent) constrains the allowable dither magnitude. The manuscript provides no explicit bounds, existence argument, or parameter regime showing that a single variance satisfies both requirements at once; this is load-bearing for the central claim that the backdoor is provably undetectable.
  2. [Abstract and main theorem] The abstract states a reduction to Sparse PCA hardness and equivalence under a margin condition, but the provided text does not include the full derivation, error analysis, or the precise statement of the margin condition. Without these, the support for the equivalence and hardness claims cannot be verified.
minor comments (1)
  1. [Attack construction] Clarify the exact subset of columns modified per FC layer and how the random direction is chosen; this affects both trigger propagation and the sparsity pattern used in the Sparse PCA reduction.
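One plausible reading of the detail the referee requests, matching Figure 2's description of per-layer directions s_i growing sparser with depth: per layer, a random column subset and a k_i-sparse random unit direction. The geometric sparsity schedule below is our assumption, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(5)

def layer_directions(layer_dims, start_frac=0.10, decay=0.5):
    """Per FC layer i, draw a k_i-sparse random unit direction s_i, with
    k_i shrinking geometrically with depth (assumed schedule)."""
    dirs = []
    frac = start_frac
    for d in layer_dims:
        k = max(1, int(frac * d))                    # sparsity at this depth
        s = np.zeros(d)
        support = rng.choice(d, size=k, replace=False)
        s[support] = rng.standard_normal(k)
        s /= np.linalg.norm(s)                       # unit-norm direction
        dirs.append(s)
        frac *= decay
    return dirs
```

Pinning down these choices matters because the support sizes k_i determine which Sparse PCA regime the reduction lands in.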

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We appreciate the recognition that a rigorous reduction to Sparse PCA would be a significant contribution. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Undetectability proof and margin condition] The undetectability reduction requires the isotropic Gaussian dither variance to be sufficiently large to embed the structured sparse perturbation in the computationally hard regime of Sparse PCA detection. Simultaneously, the mild margin condition (under which the dithered reference remains functionally equivalent) constrains the allowable dither magnitude. The manuscript provides no explicit bounds, existence argument, or parameter regime showing that a single variance satisfies both requirements at once; this is load-bearing for the central claim that the backdoor is provably undetectable.

    Authors: We agree that an explicit combined parameter regime is necessary to fully substantiate the central undetectability claim. The manuscript derives the functional-equivalence bound (via the margin condition) and the Sparse-PCA hardness regime separately but does not explicitly exhibit a non-empty interval for the dither variance that satisfies both simultaneously. In the revised version we will add a dedicated subsection deriving explicit bounds: under the mild margin assumption, the allowable perturbation size is O(1/sqrt(d)) with high probability; for layer dimension d sufficiently larger than the sparsity level k, the Sparse-PCA detection threshold permits variances up to poly(log d). We will prove that these two intervals overlap for standard hyper-parameter choices, thereby establishing existence of a suitable variance. revision: yes

  2. Referee: [Abstract and main theorem] The abstract states a reduction to Sparse PCA hardness and equivalence under a margin condition, but the provided text does not include the full derivation, error analysis, or the precise statement of the margin condition. Without these, the support for the equivalence and hardness claims cannot be verified.

    Authors: The full manuscript states the margin condition precisely as Assumption 3.1, presents the main reduction as Theorem 4.2, and supplies the complete proof together with error analysis in Appendix B. The abstract is written as a concise summary and therefore omits these details. To improve readability we will (i) add a parenthetical pointer in the abstract to Assumption 3.1 and Theorem 4.2, (ii) include a one-paragraph high-level statement of the margin condition in the introduction, and (iii) move a condensed error-bound paragraph into the main body. revision: partial

Circularity Check

0 steps flagged

No circularity: undetectability reduces to external Sparse PCA hardness and a mild margin condition

Full rationale

The paper's central claim reduces distinguishing the backdoored model from the dithered reference to the computational hardness of Sparse PCA detection (an external standard assumption) and shows functional equivalence of the reference under a mild margin condition on the pre-trained classifier. Neither step is defined in terms of the paper's own fitted quantities, self-citations, or internal constructions; the isotropic Gaussian dither is introduced explicitly to create an anchor distribution for the reduction, and the proofs do not collapse by construction to the inputs. This is a standard hardness reduction with no load-bearing self-referential elements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the computational hardness of Sparse PCA (external) and the mild margin condition (domain assumption) for functional equivalence; no free parameters or new invented entities with independent evidence are introduced.

axioms (1)
  • domain assumption Mild margin condition on the pre-trained classifier ensures the dithered reference is functionally equivalent to the original
    Invoked to establish that the backdoored model behaves identically on clean inputs.
invented entities (2)
  • Structured sparse perturbation along a randomly chosen direction no independent evidence
    purpose: To propagate the trigger signal to the target class while remaining sparse
    Introduced as the core mechanism of the backdoor injection.
  • Independent isotropic Gaussian dither no independent evidence
    purpose: To mask the perturbation and anchor a clean reference distribution
    Added specifically to enable the undetectability formalization.

pith-pipeline@v0.9.0 · 5490 in / 1466 out tokens · 44426 ms · 2026-05-08T17:46:52.224990+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 8 canonical work pages · 2 internal anchors


    theclean reference modelf ′ (with logit mapg ′), obtained by perturbing each weight columnj∈ I i of every targeted layeriby an isotropic Gaussian ditherη (j) i ∼ N(0, τ 2 i ·I di), with the same per-layerτ i used by our attack andnobackdoor signal. We sweep the test set of the corresponding dataset and, for every test input x, record the four per-sample q...