The Unseen Hand: Manipulating Model Fairness and SHAP with Targeted Identity Re-Association Attacks

Muhammad U. S. Khan; Sannaan Khan

arxiv: 2606.22858 · v1 · pith:PV3GCTKXnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 09:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords adversarial attacks · algorithmic fairness · SHAP explanations · model manipulation · data perturbation · explainable AI · machine learning security

The pith

Targeted identity re-association attacks can drive fairness metrics to ideal values while reducing SHAP attribution for protected features to zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Targeted Identity Re-Association (TIRA) attacks, built on two probabilistic algorithms. Probabilistic Micro-Shuffling (PMiS) performs localized adjacent swaps, and Probabilistic Rank-Shift Micro-Perturbation (PRSMP) introduces small randomized rank shifts. Both manipulate model outputs iteratively, without access to model internals or feature representations. According to the abstract, they push fairness metrics toward ideal values and leave effectively zero residual SHAP attribution for protected features, avoiding the detectable artifacts left by earlier attacks. A sympathetic reader would care because current fairness audits and explainability tools could be systematically misled by stealthy output manipulation.

Core claim

TIRA attacks, through the algorithms Probabilistic Micro-Shuffling (PMiS) and Probabilistic Rank-Shift Micro-Perturbation (PRSMP), iteratively and probabilistically manipulate model outputs to push fairness metrics toward ideal values and confound SHAP-based explanations, leaving effectively zero residual attribution for protected features, without requiring access to the model's internals or feature representations and without leaving detectable artifacts.

What carries the argument

Targeted Identity Re-Association (TIRA) attacks, which apply localized adjacent swaps or small randomized rank shifts to re-associate identities in the input data and thereby alter model outputs and attributions.
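The two primitives named above can be illustrated on a plain ranked list. This is a minimal sketch, not the authors' PMiS or PRSMP: the page gives no pseudocode, so the function names `micro_shuffle` and `rank_shift`, the swap probability `p`, and the `max_shift` bound are all assumptions made for illustration. It only shows what "localized adjacent swaps" and "small randomized rank shifts" could look like as permutations of an ordered sequence.

```python
import random

def micro_shuffle(items, p, rng):
    """One pass of adjacent swaps, each applied with probability p (assumed form)."""
    out = list(items)
    i = 0
    while i < len(out) - 1:
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the swapped pair so no element moves more than one slot
        else:
            i += 1
    return out

def rank_shift(items, p, max_shift, rng):
    """With probability p, move the element at each rank by up to max_shift slots (assumed form)."""
    out = list(items)
    n = len(out)
    for i in range(n):
        if rng.random() < p:
            shift = rng.randint(-max_shift, max_shift)
            j = min(max(i + shift, 0), n - 1)  # clamp the target rank to the list bounds
            out.insert(j, out.pop(i))
    return out

rng = random.Random(0)
ranking = list(range(20))  # hypothetical ranked outputs, identified by original rank
swapped = micro_shuffle(ranking, 0.5, rng)
shifted = rank_shift(ranking, 0.3, 2, rng)
```

Both functions return a permutation of their input, so the multiset of values is unchanged and only the association between position and value moves. How the paper chooses which positions to perturb, and how it iterates, is not stated on this page.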

Load-bearing premise

The attacks can be performed iteratively and probabilistically to manipulate outputs without requiring access to the model's internals or feature representations while leaving no detectable artifacts.

What would settle it

Apply a TIRA attack to a dataset with known protected features, retrain or query the model, compute SHAP values, and check whether the attribution scores for the protected features remain at or near zero.
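The final step of that check, measuring the attribution assigned to a protected feature, can be sketched without any library. The code below computes exact Shapley values by brute-force enumeration of coalitions, filling absent features from a background set. It is an illustration under assumptions: it does not use the SHAP package, the two toy linear models and the data are hypothetical, and feature index 0 is arbitrarily treated as the protected feature.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values of model at x; absent features are drawn from background rows."""
    n = len(x)

    def value(coalition):
        # Mean model output when features in `coalition` come from x
        # and all other features come from each background row in turn.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in coalition else row[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical data: feature 0 plays the role of the protected feature.
background = [[0, 0, 0], [1, 2, 4]]
x = [1, 1, 1]
phi_biased = shapley_values(lambda r: 3 * r[0] + 2 * r[1] + r[2], x, background)
phi_blind = shapley_values(lambda r: 2 * r[1] + r[2], x, background)
print("protected-feature attribution, biased model:", phi_biased[0])
print("protected-feature attribution, blind model: ", phi_blind[0])
```

A model that never reads the protected feature receives zero attribution for it, which is the reading an auditor would hope to trust. The paper's claim is that the same near-zero reading can appear after an attack even when the protected feature does influence the underlying model.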

Figures

Figures reproduced from arXiv: 2606.22858 by Muhammad U. S. Khan, Sannaan Khan.

**Figure 2.** Tradeoff curves between Disparate Impact (x-axis) and Demographic Parity (y-axis) across ...

**Figure 3.** SHAP values of the protected feature post-attack for the Bangladeshi Diabetes dataset (NN ...
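The Figure 2 caption names Disparate Impact and Demographic Parity as its axes. This page does not give the paper's exact definitions, so the sketch below assumes the common ones: disparate impact as the ratio of positive-prediction rates (unprivileged over privileged) and demographic parity as the difference of those rates. Under these assumed definitions the ideal values are 1 and 0 respectively. The function names, group labels, and toy predictions are hypothetical.

```python
def positive_rate(preds, groups, group):
    """Fraction of positive (1) predictions among members of `group`."""
    members = [p for p, g in zip(preds, groups) if g == group]
    return sum(members) / len(members)

def disparate_impact(preds, groups, unprivileged, privileged):
    """Ratio of positive rates; 1.0 is the ideal value under this definition."""
    return positive_rate(preds, groups, unprivileged) / positive_rate(preds, groups, privileged)

def demographic_parity_difference(preds, groups, unprivileged, privileged):
    """Difference of positive rates; 0.0 is the ideal value under this definition."""
    return positive_rate(preds, groups, unprivileged) - positive_rate(preds, groups, privileged)

# Hypothetical binary predictions for two groups of four people each.
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

di = disparate_impact(preds, groups, unprivileged="b", privileged="a")
dp = demographic_parity_difference(preds, groups, unprivileged="b", privileged="a")
print(f"disparate impact = {di:.3f}, demographic parity difference = {dp:.3f}")
```

Both metrics depend only on which predictions are paired with which group labels. That is why re-associating outputs across individuals can move them without touching the model.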

Original abstract

As machine learning models grow more influential and opaque, algorithmic fairness and explainability are critical for ensuring accountability. However, we demonstrate that these auditing mechanisms are themselves vulnerable to subtle manipulation, camouflaging the influence of protected features. While prior work on data-agnostic attacks has exposed this vulnerability, they leave behind detectable artifacts that compromise their stealth. We introduce Targeted Identity Re-Association (TIRA) attacks, a novel family of attacks that iteratively and probabilistically manipulate a model's outputs without requiring access to the model's internals or feature representations. We formalize two algorithms: Probabilistic Micro-Shuffling (PMiS), which applies localized adjacent swaps, and Probabilistic Rank-Shift Micro-Perturbation (PRSMP), which introduces small, randomized rank shifts. We empirically demonstrate that TIRA attacks are highly effective at pushing fairness metrics towards ideal values. Crucially, TIRA attacks successfully confound SHAP-based explanations, leaving effectively zero residual attribution for protected features, a major improvement over prior work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper names a new TIRA attack family with two algorithms but the abstract supplies zero datasets, results, or protocol, so the effectiveness claims cannot be checked.

The main takeaway is that this paper introduces Targeted Identity Re-Association attacks, with Probabilistic Micro-Shuffling and Probabilistic Rank-Shift Micro-Perturbation as the two concrete methods, positioned as stealthier than earlier data-agnostic attacks because they work iteratively and probabilistically without model internals. That framing is the actual new piece.

The work does a reasonable job laying out why prior attacks are limited: they leave detectable artifacts. The authors correctly note that fairness metrics and SHAP are used for accountability, so showing they can be gamed is worth attention if the demonstration is solid.

The soft spots are straightforward and central. The abstract claims TIRA drives fairness metrics to ideal values and reduces protected-feature SHAP attribution to effectively zero, yet it lists no datasets, no models, no quantitative tables, and no experimental setup. Without those, there is no way to verify whether the attacks succeed, whether they improve on prior work, or whether they truly leave no artifacts. The weakest assumption in the abstract is that these probabilistic manipulations can be applied repeatedly while remaining undetectable; that step is asserted rather than evidenced.

This paper is aimed at people who study attacks on ML auditing tools. A reader might pick up the high-level idea of a new attack surface, but the lack of any empirical backing means it is not ready to cite or build on. I would not bring it to a reading group. It does not yet deserve peer review; the authors need to add the missing experimental section with reproducible results before an editor should send it out.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Targeted Identity Re-Association (TIRA) attacks, formalized as Probabilistic Micro-Shuffling (PMiS) and Probabilistic Rank-Shift Micro-Perturbation (PRSMP). These are claimed to iteratively and probabilistically alter model outputs to drive fairness metrics to ideal values and reduce protected-feature attributions in SHAP explanations to effectively zero, without requiring access to model internals or feature representations, and without the detectable artifacts left by prior data-agnostic attacks.

Significance. If the empirical claims were substantiated, the work would be significant for exposing vulnerabilities in both fairness auditing and post-hoc explainability methods, with potential implications for regulatory and deployment practices. The absence of any supporting experiments, however, prevents any assessment of whether those implications are warranted.

major comments (1)

[Abstract] The central claim that TIRA attacks are 'highly effective at pushing fairness metrics towards ideal values' and 'successfully confound SHAP-based explanations, leaving effectively zero residual attribution for protected features' is presented as an empirical result, yet the manuscript supplies no datasets, models, quantitative metrics, tables, figures, or experimental protocol. This absence makes the primary contribution impossible to evaluate, and the claim is load-bearing for the paper's thesis.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for highlighting this critical issue. We address the major comment below.

Referee: [Abstract] The central claim that TIRA attacks are 'highly effective at pushing fairness metrics towards ideal values' and 'successfully confound SHAP-based explanations, leaving effectively zero residual attribution for protected features' is presented as an empirical result, yet the manuscript supplies no datasets, models, quantitative metrics, tables, figures, or experimental protocol. This absence makes the primary contribution impossible to evaluate, and the claim is load-bearing for the paper's thesis.

Authors: We agree that the current manuscript does not contain the required experimental section, datasets, models, metrics, tables, figures, or protocol. The abstract's empirical claims are therefore unsupported in the submitted version. This constitutes a substantive omission. In the revised manuscript we will add a complete experimental evaluation section that specifies the datasets, models, fairness and SHAP metrics (with before/after values), quantitative results, tables, figures, and a reproducible experimental protocol.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The manuscript introduces TIRA attacks (PMiS and PRSMP) as empirical methods and reports their observed effects on fairness metrics and SHAP attributions. The provided text contains no equations, derivations, fitted parameters, or self-citations that could reduce any claim to its own inputs by construction. The central claims are presented as experimental outcomes rather than mathematical predictions or uniqueness theorems, so no load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the high-level description of the attack procedures.

pith-pipeline@v0.9.1-grok · 5706 in / 968 out tokens · 30723 ms · 2026-06-26T09:05:58.846796+00:00

Reference graph

Works this paper leans on

16 extracted references · 2 canonical work pages

[1] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, R. Chatila, and F. Herrera. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 2020. URL https://arxiv.org/abs/1910.10045

[2] R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, S. Nagar, K. N. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh, K. R. Varshney, and Y. Zhang. AI fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development, 2019. UR...

[3] A. Das and P. Rad. Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv preprint, 2020. URL https://arxiv.org/abs/2006.11371

[4] B. Dimanov, U. Bhatt, M. Jamnik, and A. Weller. You shouldn’t trust me: Learning models which conceal unfairness from multiple explanation methods. In ECAI, 2020. URL https://ebooks.iospress.nl/pdf/doi/10.3233/FAIA200380

[5] S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019. URL https://arxiv.org/abs/1802.04422

[6] H. Hofmann. Statlog (German credit data). UCI Machine Learning Repository, 1994. DOI: https://doi.org/10.24432/C5NC77

[7] M. Islam, R. Ferdousi, S. Rahman, and H. Y. Bushra. Likelihood prediction of diabetes at early stage using data mining techniques. In Computer Vision and Machine Intelligence in Medical Image Analysis: International Symposium, ISCMM 2019, 2020. URL https://link.springer.com/chapter/10.1007/978-981-15-2428-2_10

[8] G. Laberge, U. Aïvodji, S. Hara, M. Marchand, and F. Khomh. Fool SHAP with stealthily biased sampling. In ICLR, 2023. URL https://arxiv.org/abs/2205.15419

[9] S. Lundberg and S. I. Lee. A unified approach to interpreting model predictions. In NeurIPS, 2017. URL https://arxiv.org/abs/1705.07874

[10] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In KDD, 2016. URL https://arxiv.org/abs/1602.04938

[11] C. Rudin. Stop explaining black box machine learning models for high-stakes decisions and use interpretable models instead. Nature Machine Intelligence, 2019. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC9122117/pdf/nihms-1058031.pdf

[12] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. IJCV, 2019. URL https://arxiv.org/abs/1610.02391

[13] D. Slack, S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES), 2020. URL https://arxiv.org/abs/1911.02508

[14] T. Speicher, H. Heidari, N. Grgic-Hlaca, K. P. Gummadi, A. Singla, A. Weller, and M. B. Zafar. A unified approach to quantifying algorithmic unfairness: Measuring individual & group unfairness via inequality indices. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Proceedings, 2018. URL https://arxiv.org/abs/1807.00787

[15] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In ICML, 2017. URL https://arxiv.org/abs/1703.01365
[16] J. Yuan and A. Dasgupta. Fooling SHAP with output shuffling attacks. In AAAI, 2024. URL https://arxiv.org/abs/2408.06509
