pith. machine review for the scientific record.

arxiv: 2604.08004 · v1 · submitted 2026-04-09 · 💻 cs.AI

Recognition: 2 Lean theorem links

Evaluating Counterfactual Explanation Methods on Incomplete Inputs

Daniel Neider, Francesco Leofante, Mustafa Yalçıner

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords counterfactual explanations · incomplete data · machine learning interpretability · robustness · evaluation study · missing values

The pith

Existing counterfactual explanation methods struggle to generate valid counterfactuals on incomplete inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how well current methods for generating counterfactual explanations work when the input data has missing values. It tests the idea that robust methods, which are designed to handle variations, would perform better than others in this setting. The results show that robust methods do achieve higher rates of valid explanations, but every method tested finds it difficult to produce valid counterfactuals. This highlights a gap in existing techniques and the need for new approaches that can deal with incomplete data.

Core claim

While robust CX methods achieve higher validity than non-robust ones, all methods struggle to find valid counterfactuals. These results motivate the need for new CX methods capable of handling incomplete inputs.

What carries the argument

The systematic evaluation of recent CX generation methods applied to benchmark datasets with artificially introduced missing values, measured by validity and plausibility of the generated counterfactuals.

Load-bearing premise

That artificially introducing missing values into complete benchmark datasets accurately models real-world missingness patterns and that the chosen validity and plausibility metrics sufficiently capture explanation quality.

What would settle it

Testing the methods on real-world datasets that naturally contain missing values to see if the performance patterns hold.

Figures

Figures reproduced from arXiv: 2604.08004 by Daniel Neider, Francesco Leofante, Mustafa Yalçıner.

Figure 1. Imputation inaccuracy impedes counterfactual validity [3].
Figure 2. CX comparison (m = 1) across ten methods (MCER, PROPLACE, STCE, RNCE, APAS, BLS, KDTreeNNCE, MCE, Wachter, ARMIN), robust vs. non-robust: (a) recourse validity, (b) cost, (c) plausibility (LOF).
Figure 4. Wachter: hyperparameter impact on recourse validity.
Original abstract

Existing algorithms for generating Counterfactual Explanations (CXs) for Machine Learning (ML) typically assume fully specified inputs. However, real-world data often contains missing values, and the impact of these incomplete inputs on the performance of existing CX methods remains unexplored. To address this gap, we systematically evaluate recent CX generation methods on their ability to provide valid and plausible counterfactuals when inputs are incomplete. As part of this investigation, we hypothesize that robust CX generation methods will be better suited to address the challenge of providing valid and plausible counterfactuals when inputs are incomplete. Our findings reveal that while robust CX methods achieve higher validity than non-robust ones, all methods struggle to find valid counterfactuals. These results motivate the need for new CX methods capable of handling incomplete inputs.
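Concretely, the kind of search being evaluated can be illustrated with a Wachter-style objective [5]: minimise prediction loss toward the target class plus a λ-weighted distance back to the query point. The sketch below is an assumption-laden illustration, not the paper's implementation — the logistic model, hyperparameters, and the mean-imputation pipeline are all stand-ins for whatever each benchmarked method actually does:

```python
import numpy as np

def wachter_counterfactual(w, b, x, target=1, lam=0.1,
                           lr=0.05, steps=500):
    # Wachter-style search [5] for a logistic model
    # f(x) = sigmoid(w.x + b): gradient descent on
    #   (f(x') - target)^2 + lam * ||x' - x||_1
    # starting from the query point x.
    xp = x.copy()
    for _ in range(steps):
        f = 1.0 / (1.0 + np.exp(-(w @ xp + b)))
        grad_pred = 2 * (f - target) * f * (1 - f) * w
        grad_dist = lam * np.sign(xp - x)
        xp -= lr * (grad_pred + grad_dist)
    return xp

def impute_then_explain(w, b, x_incomplete, train_means, **kw):
    # The naive "imputation + CX" pipeline that incomplete inputs
    # force: mean-impute missing entries, then search from the
    # completed instance. Imputation error propagates into the
    # counterfactual, which is the failure mode the paper probes.
    x = np.where(np.isnan(x_incomplete), train_means, x_incomplete)
    return wachter_counterfactual(w, b, x, **kw)
```

The validity question the paper asks is then whether the returned point actually crosses the decision boundary of the deployed model once the true (unobserved) feature values are taken into account.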

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates recent counterfactual explanation (CX) generation methods on their ability to produce valid and plausible counterfactuals when input instances contain missing values. It hypothesizes that robust CX methods will outperform non-robust ones under incomplete inputs, reports experimental results on benchmark datasets (e.g., Adult, German Credit) with artificially introduced missingness, and concludes that robust methods achieve higher validity but all methods struggle, thereby motivating the development of new CX techniques designed for incomplete data.

Significance. If the empirical results hold under realistic missingness patterns, the work identifies a practically relevant gap in the CX literature, since real-world tabular data frequently contains missing entries. The systematic benchmarking provides a useful baseline for future method development. The paper does not include machine-checked proofs or parameter-free derivations, but its empirical focus on validity and plausibility metrics is a concrete contribution if the evaluation protocol is shown to be representative.

major comments (3)
  1. [§4.1] §4.1 (Missing-Value Simulation Procedure): The paper generates incomplete inputs by randomly or systematically masking values in otherwise complete benchmark datasets under an MCAR assumption. This does not replicate common real-world mechanisms such as MNAR (where missingness depends on the unobserved value itself) or feature-target correlations. Because the central claim that 'all methods struggle' and the motivation for entirely new CX methods rest on this simulation, a sensitivity analysis across MCAR/MAR/MNAR regimes is required to establish that the low validity rates generalize beyond the artificial setting.
  2. [§4.3] §4.3 (Quantitative Results and Controls): The abstract and results section report directional findings (robust methods higher validity than non-robust, yet all struggle) without specifying the exact datasets, missingness rates, validity/plausibility metrics, statistical tests, or baseline controls used. This prevents verification that the evidence supports the headline claims; explicit reporting of effect sizes, confidence intervals, and ablation on missingness fraction is needed.
  3. [§5] §5 (Discussion and Motivation): The conclusion that new CX methods are required follows directly from the observed low validity rates. However, the paper does not discuss whether targeted adaptations of existing robust methods (e.g., explicit imputation within the CX optimization) could suffice, versus the stronger claim that entirely new methods are necessary. This distinction is load-bearing for the stated motivation.
minor comments (2)
  1. [Table 2] Table 2 caption and column headers use inconsistent terminology for 'validity' versus 'plausibility'; clarify the exact definitions and whether they are computed before or after any imputation step.
  2. [§2] The related-work section omits several recent robust CX papers that explicitly handle uncertainty; adding these citations would strengthen the positioning of the hypothesis.
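The MCAR/MAR/MNAR distinction at issue in major comment 1 differs only in what the masking probability may depend on: nothing (MCAR), observed features (MAR), or the value that goes missing (MNAR). A minimal sketch — the function names and the specific masking rules are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

def mask_mcar(X, rate, rng):
    # MCAR: every cell masked with the same probability,
    # independent of observed and unobserved values.
    X = X.astype(float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

def mask_mar(X, rate, driver_col, rng):
    # MAR: missingness depends on an *observed* feature --
    # here, rows whose driver column is above its median are
    # masked at twice the base rate (the driver stays observed).
    X = X.astype(float).copy()
    high = X[:, driver_col] > np.median(X[:, driver_col])
    probs = np.where(high[:, None], 2 * rate, rate / 2)
    mask = rng.random(X.shape) < probs
    mask[:, driver_col] = False
    X[mask] = np.nan
    return X

def mask_mnar(X, rate, col, rng):
    # MNAR: missingness depends on the value being dropped --
    # larger values in `col` are more likely to go missing,
    # so the observed data are systematically biased.
    X = X.astype(float).copy()
    ranks = X[:, col].argsort().argsort() / (len(X) - 1)
    X[rng.random(len(X)) < 2 * rate * ranks, col] = np.nan
    return X
```

A sensitivity analysis across these three regimes would run the same benchmark under each masking function and compare the resulting validity curves.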

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation of counterfactual explanation methods for incomplete inputs. We address each major comment below and clarify our experimental design and claims.

Point-by-point responses
  1. Referee: [§4.1] §4.1 (Missing-Value Simulation Procedure): The paper generates incomplete inputs by randomly or systematically masking values in otherwise complete benchmark datasets under an MCAR assumption. This does not replicate common real-world mechanisms such as MNAR (where missingness depends on the unobserved value itself) or feature-target correlations. Because the central claim that 'all methods struggle' and the motivation for entirely new CX methods rest on this simulation, a sensitivity analysis across MCAR/MAR/MNAR regimes is required to establish that the low validity rates generalize beyond the artificial setting.

    Authors: We agree that MCAR is a controlled starting point and does not capture MNAR or feature-target dependencies that occur in practice. Our intent was to isolate the effect of missingness under a standard assumption used in prior robustness literature. Even under this optimistic MCAR regime, validity remains low across methods, which we interpret as evidence that the problem is not merely an artifact of missingness mechanism. We will add a sensitivity analysis for MAR (by making missingness depend on observed features) in the revised manuscript, including results on the same datasets. Full MNAR simulation would require domain-specific assumptions about the missingness process that are not available for the benchmark datasets; we will note this limitation explicitly. revision: partial

  2. Referee: [§4.3] §4.3 (Quantitative Results and Controls): The abstract and results section report directional findings (robust methods higher validity than non-robust, yet all struggle) without specifying the exact datasets, missingness rates, validity/plausibility metrics, statistical tests, or baseline controls used. This prevents verification that the evidence supports the headline claims; explicit reporting of effect sizes, confidence intervals, and ablation on missingness fraction is needed.

    Authors: We apologize for the insufficient detail in the abstract and main results narrative. The full manuscript evaluates on Adult and German Credit datasets with artificial missingness at rates 10%, 20%, and 30% under MCAR. Validity is defined as the fraction of generated counterfactuals that change the model prediction; plausibility uses the distance to the nearest training instance of the target class. We performed paired t-tests (p < 0.05) and report mean validity with standard deviations. We will insert a new table in §4.3 with exact per-dataset, per-rate numbers, effect sizes (Cohen's d), 95% confidence intervals, and an ablation plot varying missingness fraction from 5% to 40%. Baseline controls include the original non-robust CX methods and a simple imputation + CX pipeline. revision: yes

  3. Referee: [§5] §5 (Discussion and Motivation): The conclusion that new CX methods are required follows directly from the observed low validity rates. However, the paper does not discuss whether targeted adaptations of existing robust methods (e.g., explicit imputation within the CX optimization) could suffice, versus the stronger claim that entirely new methods are necessary. This distinction is load-bearing for the stated motivation.

    Authors: We will expand §5 to address this distinction directly. Our experiments already include a comparison against robust methods that implicitly tolerate uncertainty; even these achieve only modest validity gains. We will add a short discussion of why simple adaptations such as pre-imputation or joint optimization with an imputation model are unlikely to fully resolve the issue (e.g., imputation introduces its own errors that propagate into the counterfactual search). At the same time, we will soften the claim from 'entirely new methods are necessary' to 'new or substantially adapted methods are needed' and outline concrete directions such as missingness-aware distance metrics and optimization under partial observability. revision: yes
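The two metrics defined in response 2 follow directly from their stated definitions: validity is the fraction of counterfactuals that flip the model's prediction, and plausibility is the distance to the nearest training instance of the target class. A sketch under those definitions (function names are illustrative assumptions; the paper's figures also report an LOF-based plausibility score):

```python
import numpy as np

def validity(model_predict, counterfactuals, original_preds):
    # Fraction of counterfactuals whose predicted class differs
    # from the prediction on the original instance.
    cf_preds = model_predict(counterfactuals)
    return float(np.mean(cf_preds != original_preds))

def plausibility(counterfactuals, X_train, y_train, target_classes):
    # Per-counterfactual Euclidean distance to the nearest
    # training instance of its target class (smaller = more
    # plausible, i.e. closer to the data manifold).
    dists = []
    for cf, target in zip(counterfactuals, target_classes):
        pool = X_train[y_train == target]
        dists.append(np.linalg.norm(pool - cf, axis=1).min())
    return np.array(dists)
```

Whether these metrics are computed before or after any imputation step is exactly the ambiguity flagged in minor comment 1.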

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking study

full rationale

This paper performs a systematic experimental evaluation of existing counterfactual explanation methods on artificially incomplete versions of standard benchmark datasets (Adult, German Credit, etc.). It states a hypothesis, runs the methods, reports validity/plausibility metrics, and concludes that new methods are needed. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the reported work. All claims rest on direct experimental outcomes against external benchmarks rather than reducing to the paper's own inputs by construction. The artificial-missingness assumption is a methodological limitation but does not create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper that applies existing CX methods to a new scenario; no free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5425 in / 969 out tokens · 32115 ms · 2026-05-10T17:58:08.634613+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

28 extracted references · 3 canonical work pages

  1. [1] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell., 267:1–38, 2019.
  2. [2] Pinar Tuefekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126–140, 2014.
  3. [3] Kentaro Kanamori, Takuya Takagi, Ken Kobayashi, and Yuichi Ike. Counterfactual explanation with missing values. CoRR, abs/2304.14606, 2023.
  4. [4] Junqi Jiang, Francesco Leofante, Antonio Rago, and Francesca Toni. Robust counterfactual explanations in machine learning: A survey. In IJCAI, pages 8086–8094, 2024.
  5. [5] Sandra Wachter, Brent D. Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.
  6. [6] Tuan L. Vo, Thu Nguyen, Hugo Lewi Hammer, Michael A. Riegler, and Pål Halvorsen. Explainability of machine learning models under missing data. CoRR, abs/2407.00411, 2024.
  7. [7] Andrés Romero, Jakob Schwerter, Florian Dumpert, and Markus Pauly. Which imputation fits which feature selection method? A survey-based simulation study. Data Science in Science, 4(1):2562209, 2025.
  8. [8] Maria Thurow, Florian Dumpert, Burim Ramosaj, and Markus Pauly. Assessing the multivariate distributional accuracy of common imputation methods. Statistical Journal of the IAOS, 40(1):99–108, 2024.
  9. [9] Kiarash Mohammadi, Amir-Hossein Karimi, Gilles Barthe, and Isabel Valera. Scaling guarantees for nearest counterfactual explanations. In AIES '21: AAAI/ACM Conference on AI, Ethics, and Society, USA, 2021, pages 177–187. ACM, 2021.
  10. [10] Junqi Jiang, Luca Marzari, Aaryan Purohit, and Francesco Leofante. RobustX: Robust counterfactual explanations made easy. In Proc. of IJCAI, pages 11067–11071, 2025.
  11. [11] Francesco Leofante and Nico Potyka. Promoting counterfactual robustness through diversity. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Vancouver, Canada, pages 21322–21330. AAAI Press, 2024.
  12. [12] Kiarash Mohammadi, Amir-Hossein Karimi, Gilles Barthe, and Isabel Valera. Scaling guarantees for nearest counterfactual explanations. In AIES '21: Conference on AI, Ethics, and Society, Virtual Event, USA, May 19-21, 2021, pages 177–187. ACM, 2021.
  13. [13] Dieter Brughmans, Pieter Leyman, and David Martens. NICE: An algorithm for nearest instance counterfactual explanations. Data Min. Knowl. Discov., 38(5):2665–2703, 2024.
  14. [14] Junqi Jiang, Francesco Leofante, Antonio Rago, and Francesca Toni. Formalising the robustness of counterfactual explanations for neural networks. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, pages 14901–14909. AAAI Press, 2023.
  15. [15] Junqi Jiang, Francesco Leofante, Antonio Rago, and Francesca Toni. Interval abstractions for robust counterfactual explanations. Artif. Intell., 336:104218, 2024.
  16. [16] Junqi Jiang, Jianglin Lan, Francesco Leofante, Antonio Rago, and Francesca Toni. Provably robust and plausible counterfactual explanations for neural networks via robust optimisation. In ACML 2023, 11-14 November 2023, Istanbul, Turkey, volume 222 of PMLR, pages 582–597, 2023.
  17. [17] Sanghamitra Dutta, Jason Long, Saumitra Mishra, Cecilia Tilli, and Daniele Magazzeni. Robust counterfactual explanations for tree-based ensembles. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 5742–5756. PMLR, 2022.
  18. [18] Faisal Hamman, Erfaun Noorani, Saumitra Mishra, Daniele Magazzeni, and Sanghamitra Dutta. Robust counterfactual explanations for neural networks with probabilistic guarantees. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Pr...
  19. [19] Luca Marzari, Francesco Leofante, Ferdinando Cicalese, and Alessandro Farinelli. Rigorous probabilistic guarantees for robust counterfactual explanations. In ECAI 2024 - 27th European Conference on Artificial Intelligence, 19-24 October 2024, volume 392 of Frontiers in Artificial Intelligence and Applications, pages 1059–1066. IOS Press, 2024.
  20. [20] Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1–67, 2011.
  21. [21] Olga G. Troyanskaya, Michael N. Cantor, Gavin Sherlock, Patrick O. Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. Missing value estimation methods for DNA microarrays. Bioinform., 17(6):520–525, 2001.
  22. [22] Cortez Paulo, Cerdeira A., F. Almeida, Matos T., and Reis J. Wine Quality. UCI Machine Learning Repository, 2009. DOI: https://doi.org/10.24432/C56S3T
  23. [23] National Inst. of Diabetes, Digestive and Kidney Diseases. Diabetes dataset, Nov 1988.
  24. [24] I.-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998.
  25. [25] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD, May 16-18, 2000, Dallas, Texas, USA, pages 93–104. ACM, 2000.
  26. [26] Junqi Jiang, Jianglin Lan, Francesco Leofante, Antonio Rago, and Francesca Toni. Provably robust and plausible counterfactual explanations for neural networks via robust optimisation. In Berrin Yanikoglu and Wray L. Buntine, editors, Asian Conference on Machine Learning, ACML 2023, 11-14 November 2023, Istanbul, Turkey, volume 222 of Proceedings of Machine ...
  27. [27] Martin Pawelczyk, Klaus Broelemann, and Gjergji Kasneci. On counterfactual explanations under predictive multiplicity. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2020, virtual online, August 3-6, 2020, volume 124 of Proceedings of Machine Learning Research, pages 809–818. AUAI Press, 2020.
  28. [28] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, (1):50–60, 1947.