arxiv: 2408.06509 · v1 · pith:7N6NTEERnew · submitted 2024-08-12 · 💻 cs.LG · cs.AI · cs.CR

Fooling SHAP with Output Shuffling Attacks

classification 💻 cs.LG cs.AI cs.CR
keywords attacks, shap, model, shapley, shuffling, adversarial, attack, detect

Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models. If the method reveals a significant attribution from a "protected feature" (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial attacks can subvert the detection of XAI methods. Previous approaches to constructing such an adversarial model require access to the underlying data distribution, which may not be possible in many practical scenarios. We relax this constraint and propose a novel family of attacks, called shuffling attacks, that are data-agnostic. The proposed attack strategies can adapt any trained machine learning model to fool Shapley value-based explanations. We prove that Shapley values cannot detect shuffling attacks. However, algorithms that estimate Shapley values, such as linear SHAP and SHAP, can detect these attacks with varying degrees of effectiveness. We demonstrate the efficacy of the attack strategies by comparing the performance of linear SHAP and SHAP using real-world datasets.
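
To make the fairness check described in the abstract concrete, here is a minimal sketch of an exact Shapley value computation on a toy model. It does not implement the paper's shuffling attacks or the SHAP library. The scoring function `toy_model`, its three features, the all-zero `baseline`, and the choice to model "absent" features by substituting baseline values are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at point `x`.

    Assumption: a feature that is absent from a coalition is replaced by
    its value in `baseline`. This is one common convention, not
    necessarily the one used in the paper.
    """
    n = len(x)

    def value(coalition):
        # Evaluate the model with only the coalition's features taken from x.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical scorer over [income, tenure, protected]; the large
    # coefficient on the protected feature makes the model unfair by design.
    income, tenure, protected = z
    return 2.0 * income + 1.0 * tenure + 3.0 * protected


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
# For this linear model and zero baseline, phi is [2.0, 1.0, 3.0]
# up to floating-point rounding.
print(phi)
```

In this toy case the protected feature receives the largest attribution, which is the signal an auditor would read as unfairness. According to the abstract, a shuffling attack adapts a trained model so that this kind of Shapley-based signal no longer exposes the dependence.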

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research

    cs.CE 2026-06 unverdicted novelty 7.0

    Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.

  2. The Unseen Hand: Manipulating Model Fairness and SHAP with Targeted Identity Re-Association Attacks

    cs.LG 2026-06 unverdicted novelty 6.0

    TIRA attacks with PMiS and PRSMP push fairness metrics to ideal values and reduce SHAP attribution for protected features to zero in black-box settings.