Out-of-the-box: Black-box Causal Attacks on Object Detectors
Pith reviewed 2026-05-17 02:30 UTC · model grok-4.3
The pith
BlackCAtt identifies minimal causally sufficient pixel sets to generate smaller, explainable attacks on object detectors from black-box outputs alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BlackCAtt identifies minimal causally sufficient pixel sets from black-box detector outputs to construct explainable, imperceptible, reproducible, and architecture-agnostic attacks. With access only to bounding-box position and label, the attacks are comparable to or better than those from other black-box methods. With added access to model confidence, it functions as a meta-algorithm that reduces perturbation size, for instance lowering the average L0 norm from 0.987 to 0.072 when paired with SquareAttack while maintaining a similar success rate.
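The L0 figures quoted above lie between 0 and 1, which suggests a normalized count of changed pixels. The sketch below shows one way such a number could be computed. It is an assumption that the norm is taken per pixel (a pixel counts as changed if any channel differs) and divided by the total pixel count; the paper may normalize differently. The function name `normalized_l0` and the toy images are hypothetical.

```python
from typing import Sequence

# Assumed layout: rows x columns x channels, integer intensities.
Image = Sequence[Sequence[Sequence[int]]]


def normalized_l0(original: Image, perturbed: Image) -> float:
    """Fraction of pixels that differ in at least one channel."""
    total = 0
    changed = 0
    for row_a, row_b in zip(original, perturbed):
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            if tuple(px_a) != tuple(px_b):
                changed += 1
    return changed / total if total else 0.0


# Toy 2x2 RGB image in which exactly one pixel is altered.
clean = [[(0, 0, 0), (10, 10, 10)], [(20, 20, 20), (30, 30, 30)]]
attacked = [[(0, 0, 0), (10, 10, 10)], [(20, 25, 20), (30, 30, 30)]]
print(normalized_l0(clean, attacked))  # 0.25
```

Under this reading, 0.987 would mean that nearly every pixel is touched, while 0.072 would mean that roughly 7% of pixels are.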
What carries the argument
Minimal causally sufficient pixel sets, identified from detector outputs and perturbed to produce targeted failures in detection results.
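To make "minimal" concrete, here is a minimal sketch of a deletion-based pruning pass over a candidate set. It is not the paper's algorithm. The oracle `sufficient` is a hypothetical stand-in for "query the detector on an image that keeps only these pixels and compare the output", and the toy oracle at the bottom is invented for illustration.

```python
from typing import Callable, FrozenSet, Iterable


def prune_to_one_minimal(
    candidates: Iterable[int],
    sufficient: Callable[[FrozenSet[int]], bool],
) -> FrozenSet[int]:
    """Drop elements one at a time while the set stays sufficient.

    If the oracle is monotone (every superset of a sufficient set is
    sufficient), no single element of the result can be removed.
    That is weaker than being the smallest sufficient set.
    """
    current = frozenset(candidates)
    if not sufficient(current):
        raise ValueError("starting set is not sufficient")
    for item in sorted(current):
        trial = current - {item}
        if sufficient(trial):  # one black-box query per element
            current = trial
    return current


# Hypothetical oracle: pixel 0 alone suffices, and so do pixels 1, 2, 3 together.
def toy_oracle(pixels: FrozenSet[int]) -> bool:
    return 0 in pixels or {1, 2, 3} <= pixels


result = prune_to_one_minimal(range(4), toy_oracle)
print(sorted(result))  # [1, 2, 3]
```

The toy run ends at `{1, 2, 3}` even though `{0}` is also sufficient and smaller, because pixel 0 is removed first. The outcome of such a pass depends on visiting order, so it yields a set that cannot be shrunk by one element rather than a guaranteed smallest one.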
If this is right
- Attacks become fully explainable because they manipulate only the pixels that cause the detector's output.
- Using only position and label information yields attacks that are comparable to or better than those from other black-box techniques.
- When model confidence is available, BlackCAtt reduces the size of perturbations from standard methods such as SquareAttack while keeping similar success rates.
- Ablation studies show that each component of the algorithm contributes measurably to attack quality.
Where Pith is reading between the lines
- The causal-pixel approach could be applied to design defenses that protect the most influential image regions rather than the entire input.
- The same identification process might be tested on other vision tasks such as segmentation or classification to check whether causal sets remain small and useful.
- Developers could use the method to audit deployed detectors for hidden causal weaknesses without needing model weights or gradients.
Load-bearing premise
Minimal causally sufficient pixel sets can be reliably identified and effectively altered using only black-box outputs such as bounding-box position, label, or confidence.
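The premise can be pictured as an intervention test: keep a candidate pixel set, replace everything else with a baseline value, and check whether the detector still reports the same label at roughly the same position. The sketch below is a toy illustration under stated assumptions, not the paper's procedure. The detector `toy_detect`, the zero baseline, the grayscale grid, and the IoU threshold of 0.5 are all hypothetical choices.

```python
from typing import Callable, List, Optional, Set, Tuple

Pixel = Tuple[int, int]                # (row, col)
Box = Tuple[int, int, int, int]        # (row0, col0, row1, col1), inclusive
Detection = Optional[Tuple[str, Box]]  # (label, box), or None if nothing found
Image = List[List[int]]                # grayscale grid, for illustration only


def box_area(box: Box) -> int:
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two inclusive pixel boxes."""
    rows = min(a[2], b[2]) - max(a[0], b[0]) + 1
    cols = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(0, rows) * max(0, cols)
    return inter / (box_area(a) + box_area(b) - inter)


def keep_only(image: Image, keep: Set[Pixel], baseline: int = 0) -> Image:
    """Intervention: pixels outside `keep` are set to the baseline value."""
    return [[v if (r, c) in keep else baseline for c, v in enumerate(row)]
            for r, row in enumerate(image)]


def toy_detect(image: Image) -> Detection:
    """Hypothetical detector: boxes bright pixels if there are at least 3."""
    bright = [(r, c) for r, row in enumerate(image)
              for c, v in enumerate(row) if v > 128]
    if len(bright) < 3:
        return None
    rows = [r for r, _ in bright]
    cols = [c for _, c in bright]
    return ("object", (min(rows), min(cols), max(rows), max(cols)))


def is_sufficient(detect: Callable[[Image], Detection], image: Image,
                  keep: Set[Pixel], min_iou: float = 0.5) -> bool:
    """Uses only label and box position, i.e. black-box outputs."""
    ref, out = detect(image), detect(keep_only(image, keep))
    if ref is None or out is None:
        return False
    return ref[0] == out[0] and iou(ref[1], out[1]) >= min_iou


image = [[0, 0, 0, 0], [0, 200, 200, 0], [0, 200, 200, 0], [0, 0, 0, 0]]
print(is_sufficient(toy_detect, image, {(1, 1), (1, 2), (2, 1)}))  # True
print(is_sufficient(toy_detect, image, {(1, 1), (1, 2)}))          # False
```

The hard threshold inside `toy_detect` mirrors the discreteness of real detector outputs: the answer flips abruptly as pixels are removed, which is what makes identification from black-box queries possible but also coarse.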
What would settle it
An experiment in which perturbing the identified causal pixel sets produces no change in the detector's outputs or in which the resulting attacks are larger or less successful than those generated by baseline black-box methods.
Original abstract
Adversarial perturbations are a useful way to expose vulnerabilities in object detectors. Existing perturbation methods are frequently white-box, architecture-specific and use a loss function. More importantly, while they are often successful, it is rarely clear why they work. Insights into the mechanism of this success would allow developers to understand and analyze these attacks, as well as fine-tune the model to prevent them. This paper presents BlackCAtt, a black-box algorithm and tool, which uses minimal, causally sufficient pixel sets to construct explainable, imperceptible, reproducible, architecture-agnostic attacks on object detectors. We evaluate BlackCAtt on standard benchmarks and compare it to other black-box adversarial attack methods. When BlackCAtt has access only to the position and label of a bounding box, it produces attacks that are comparable to or better than those produced by other black-box methods. When BlackCAtt has access to the model confidence as well, it can work as a meta-algorithm, improving the ability of standard black-box techniques to construct smaller, less perceptible attacks. As BlackCAtt attacks manipulate causes only, the attacks become fully explainable. We compare the performance of BlackCAtt with other black-box attack methods and show that targeting causal pixels leads to smaller and less perceptible attacks. For example, when using BlackCAtt with SquareAttack, it reduces the average distance ($L_0$ norm) of the attack from the original input from $0.987$ to $0.072$, while maintaining a similar success rate. We perform ablation studies on the BlackCAtt algorithm and analyze the effect of different components on its performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BlackCAtt, a black-box algorithm for adversarial attacks on object detectors that identifies minimal causally sufficient pixel sets to produce explainable perturbations. It claims that with access only to bounding-box position and label the method matches or exceeds other black-box attacks, and that when model confidence is also available it functions as a meta-algorithm that improves existing techniques (e.g., reducing average L0 norm from 0.987 to 0.072 while preserving success rate when combined with SquareAttack). The work is evaluated on standard benchmarks with ablation studies on algorithmic components.
Significance. If the central claim that minimal causal pixel sets can be recovered and edited from black-box detector outputs holds, the result would be significant for adversarial robustness research in computer vision. It offers an architecture-agnostic, gradient-free route to smaller, more interpretable attacks and could supply concrete diagnostic information for hardening detectors.
major comments (2)
- [§3] §3 (causal identification procedure): the algorithm that recovers minimal causally sufficient pixel sets via black-box interventions (masking/perturbation followed by output comparison) is not shown to guarantee minimality. Because detector outputs are discrete and thresholded, the procedure can return supersets or pixels whose effect is merely correlated; this directly affects the reported L0 reduction and the claim that attacks are 'fully explainable' because they manipulate causes only.
- [Experimental results] Experimental results (L0-norm comparison paragraph and associated table/figure): the reduction from 0.987 to 0.072 when BlackCAtt is used with SquareAttack is presented without the number of images, number of independent runs, variance, or statistical test for the success-rate equivalence. These details are load-bearing for the meta-algorithm claim.
minor comments (2)
- [Abstract] Abstract and §1: 'standard benchmarks' and the exact object detectors used should be named explicitly rather than left generic.
- [Notation] Notation: the formal definition of 'causally sufficient' for a pixel set relative to a detector output (position, label, or score) should be stated once, preferably with a short equation or set notation.
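One possible shape for the definition requested in the notation comment is sketched below. The notation is an assumption and is not taken from the paper: $f$ is the detector, $x$ the input image, $P$ its set of pixel indices, $b$ a baseline (masking) image, and $\tau$ an overlap threshold. IoU is used as the overlap measure for illustration; another overlap score such as the Dice coefficient would slot in the same way.

```latex
% Assumed notation, not the paper's: f detector, x image, P pixel indices,
% b baseline image, \tau overlap threshold, (\ell, B) label and bounding box.
\[
  (x_S)_p =
  \begin{cases}
    x_p & \text{if } p \in S, \\
    b_p & \text{if } p \in P \setminus S,
  \end{cases}
  \qquad S \subseteq P .
\]
\[
  S \text{ is sufficient for } f(x) = (\ell, B)
  \iff
  f(x_S) = (\ell', B') \text{ with } \ell' = \ell
  \text{ and } \mathrm{IoU}(B, B') \ge \tau .
\]
\[
  S \text{ is minimal}
  \iff
  S \text{ is sufficient and no } S' \subsetneq S \text{ is sufficient.}
\]
```

When confidence is also observable, the sufficiency condition could additionally require the score of $f(x_S)$ to stay above a chosen threshold; that extension is likewise an assumption.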
Simulated Author's Rebuttal
We are grateful to the referee for their detailed and insightful comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below, indicating the revisions we plan to make.
- Referee: [§3] §3 (causal identification procedure): the algorithm that recovers minimal causally sufficient pixel sets via black-box interventions (masking/perturbation followed by output comparison) is not shown to guarantee minimality. Because detector outputs are discrete and thresholded, the procedure can return supersets or pixels whose effect is merely correlated; this directly affects the reported L0 reduction and the claim that attacks are 'fully explainable' because they manipulate causes only.
Authors: We acknowledge that the causal identification procedure described in Section 3 comes with no formal proof or guarantee of minimality. Because object detector outputs are discrete and thresholded, the identified pixel sets may be supersets of a minimal set or may include pixels whose effect is merely correlated. This is a valid point, and it weakens our claims about L0 reduction and full explainability. In the revised version, Section 3 will state explicitly that the procedure identifies empirically sufficient sets through interventions but does not guarantee strict minimality. We will also temper the 'fully explainable' language to say that the attacks manipulate pixels that our black-box interventions show to be sufficient to change the detector output. The revision is partial because we clarify the algorithm rather than fundamentally alter it.
Revision: partial
- Referee: [Experimental results] Experimental results (L0-norm comparison paragraph and associated table/figure): the reduction from 0.987 to 0.072 when BlackCAtt is used with SquareAttack is presented without the number of images, number of independent runs, variance, or statistical test for the success-rate equivalence. These details are load-bearing for the meta-algorithm claim.
Authors: We agree that the experimental details for the L0-norm comparison are insufficient as presented. The results come from evaluations on the standard COCO dataset or similar benchmarks used in the paper, but we did not report the number of images tested, the number of independent runs, or any variance measure, and we performed no statistical test to support the equivalence of success rates. We will revise the relevant paragraph, table, and figure captions to include the number of images (e.g., 1000 images from the validation set), the number of independent runs (e.g., 3 runs with different random seeds), the standard deviation or variance of the L0 norms, and a statistical test (such as the Wilcoxon signed-rank test) of whether the success rate remains statistically equivalent while L0 is significantly reduced. This will strengthen the meta-algorithm claim.
Revision: yes
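The success-rate comparison discussed in this exchange involves paired binary outcomes: each image is attacked once by the baseline and once by the combined method. The sketch below uses an exact McNemar test for that setting. This is a swapped-in alternative, not the Wilcoxon signed-rank test the authors name and not anything the paper reports. The function name `mcnemar_exact` and the outcome lists are made-up placeholders, not results.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline: Sequence[int], combined: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value for paired success flags (1 or 0)."""
    # Only discordant pairs carry information about a difference in rates.
    only_baseline = sum(1 for x, y in zip(baseline, combined) if x and not y)
    only_combined = sum(1 for x, y in zip(baseline, combined) if y and not x)
    n = only_baseline + only_combined
    if n == 0:
        return 1.0
    k = min(only_baseline, only_combined)
    # Under the null, discordant pairs split like fair coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder per-image outcomes (1 = attack succeeded), invented for illustration.
baseline_success = [1, 1, 1, 0, 1, 1, 0, 1]
combined_success = [1, 1, 0, 1, 1, 1, 0, 1]
print(mcnemar_exact(baseline_success, combined_success))  # 1.0
```

A large p-value here only means that no difference was detected. It is not evidence that the success rates are equivalent, so an equivalence claim would need an explicit margin and an equivalence-style test, which this sketch does not implement.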
Circularity Check
No circularity; algorithmic method evaluated on external benchmarks
The paper presents BlackCAtt as a black-box algorithmic procedure for identifying and manipulating minimal causally sufficient pixel sets using only detector outputs such as bounding box position, label, or confidence. It evaluates performance on standard benchmarks, reports empirical comparisons to other black-box methods (e.g., L0-norm reduction when combined with SquareAttack), and performs ablation studies. No equations, derivations, or claims are shown that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The method is self-contained against external benchmarks with no load-bearing steps that loop back to the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Causal relationships between pixel changes and detector outputs can be identified using only black-box access to bounding box position, label, or confidence.
Forward citations
Cited by 1 Pith paper
- If It's Good Enough for You, It's Good Enough for Me: Transferability of Audio Sufficiencies across Models
  Transferability analysis finds that minimal sufficient signals transfer across audio models at rates varying by task, around 26% for music genre classification, with some deepfake models showing distinct behaviors not...
Supplementary Materials
Models. The different models used were (i) single-stage real-time detectors (YOLOv11 checkpoint: “yolo11n.pt”), (ii) two-stage (Faster-R-CNN using the standard CO...