Evaluating Counterfactual Explanation Methods on Incomplete Inputs
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-10 17:58 UTC · model grok-4.3
The pith
Existing counterfactual explanation methods struggle to generate valid counterfactuals on incomplete inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
While robust CX methods achieve higher validity than non-robust ones, all methods struggle to find valid counterfactuals. These results motivate the need for new CX methods capable of handling incomplete inputs.
What carries the argument
The systematic evaluation of recent CX generation methods applied to benchmark datasets with artificially introduced missing values, measured by validity and plausibility of the generated counterfactuals.
Load-bearing premise
That artificially introducing missing values into complete benchmark datasets accurately models real-world missingness patterns and that the chosen validity and plausibility metrics sufficiently capture explanation quality.
What would settle it
Testing the methods on real-world datasets that naturally contain missing values to see if the performance patterns hold.
Figures
read the original abstract
Existing algorithms for generating Counterfactual Explanations (CXs) for Machine Learning (ML) typically assume fully specified inputs. However, real-world data often contains missing values, and the impact of these incomplete inputs on the performance of existing CX methods remains unexplored. To address this gap, we systematically evaluate recent CX generation methods on their ability to provide valid and plausible counterfactuals when inputs are incomplete. As part of this investigation, we hypothesize that robust CX generation methods will be better suited to address the challenge of providing valid and plausible counterfactuals when inputs are incomplete. Our findings reveal that while robust CX methods achieve higher validity than non-robust ones, all methods struggle to find valid counterfactuals. These results motivate the need for new CX methods capable of handling incomplete inputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates recent counterfactual explanation (CX) generation methods on their ability to produce valid and plausible counterfactuals when input instances contain missing values. It hypothesizes that robust CX methods will outperform non-robust ones under incomplete inputs, reports experimental results on benchmark datasets (e.g., Adult, German Credit) with artificially introduced missingness, and concludes that robust methods achieve higher validity but all methods struggle, thereby motivating the development of new CX techniques designed for incomplete data.
Significance. If the empirical results hold under realistic missingness patterns, the work identifies a practically relevant gap in the CX literature, since real-world tabular data frequently contains missing entries. The systematic benchmarking provides a useful baseline for future method development. The paper does not include machine-checked proofs or parameter-free derivations, but its empirical focus on validity and plausibility metrics is a concrete contribution if the evaluation protocol is shown to be representative.
major comments (3)
- [§4.1] Missing-Value Simulation Procedure: The paper generates incomplete inputs by randomly or systematically masking values in otherwise complete benchmark datasets under an MCAR assumption. This does not replicate common real-world mechanisms such as MNAR (where missingness depends on the unobserved value itself) or feature-target correlations. Because the central claim that 'all methods struggle' and the motivation for entirely new CX methods rest on this simulation, a sensitivity analysis across MCAR/MAR/MNAR regimes is required to establish that the low validity rates generalize beyond the artificial setting.
- [§4.3] Quantitative Results and Controls: The abstract and results section report directional findings (robust methods higher validity than non-robust, yet all struggle) without specifying the exact datasets, missingness rates, validity/plausibility metrics, statistical tests, or baseline controls used. This prevents verification that the evidence supports the headline claims; explicit reporting of effect sizes, confidence intervals, and ablation on missingness fraction is needed.
- [§5] Discussion and Motivation: The conclusion that new CX methods are required follows directly from the observed low validity rates. However, the paper does not discuss whether targeted adaptations of existing robust methods (e.g., explicit imputation within the CX optimization) could suffice, versus the stronger claim that entirely new methods are necessary. This distinction is load-bearing for the stated motivation.
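The effect-size and confidence-interval reporting requested in the second comment can be sketched minimally. The function names and the normal-approximation interval below are illustrative assumptions, not the paper's actual protocol:

```python
import math

def cohens_d_paired(a, b):
    """Paired Cohen's d: mean of per-run validity differences over their SD.
    `a` and `b` are per-run validity scores for two CX methods."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return mean / sd

def mean_ci95(xs):
    """Normal-approximation 95% CI for a mean (a t-quantile would be
    more appropriate for small n)."""
    n = len(xs)
    m = sum(xs) / n
    se = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1)) / math.sqrt(n)
    return (m - 1.96 * se, m + 1.96 * se)
```

Reporting both numbers per dataset and per missingness rate would let readers judge whether the robust-vs-non-robust gap is practically meaningful, not just directionally consistent.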
minor comments (2)
- [Table 2] Table 2 caption and column headers use inconsistent terminology for 'validity' versus 'plausibility'; clarify the exact definitions and whether they are computed before or after any imputation step.
- [§2] The related-work section omits several recent robust CX papers that explicitly handle uncertainty; adding these citations would strengthen the positioning of the hypothesis.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation of counterfactual explanation methods for incomplete inputs. We address each major comment below and clarify our experimental design and claims.
read point-by-point responses
- Referee: [§4.1] Missing-Value Simulation Procedure: The paper generates incomplete inputs by randomly or systematically masking values in otherwise complete benchmark datasets under an MCAR assumption. This does not replicate common real-world mechanisms such as MNAR (where missingness depends on the unobserved value itself) or feature-target correlations. Because the central claim that 'all methods struggle' and the motivation for entirely new CX methods rest on this simulation, a sensitivity analysis across MCAR/MAR/MNAR regimes is required to establish that the low validity rates generalize beyond the artificial setting.
Authors: We agree that MCAR is a controlled starting point and does not capture MNAR or feature-target dependencies that occur in practice. Our intent was to isolate the effect of missingness under a standard assumption used in prior robustness literature. Even under this optimistic MCAR regime, validity remains low across methods, which we interpret as evidence that the problem is not merely an artifact of missingness mechanism. We will add a sensitivity analysis for MAR (by making missingness depend on observed features) in the revised manuscript, including results on the same datasets. Full MNAR simulation would require domain-specific assumptions about the missingness process that are not available for the benchmark datasets; we will note this limitation explicitly. revision: partial
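The two masking regimes at issue can be sketched as follows, assuming NumPy and a toy complete matrix. Making MAR missingness depend on the sign of one observed column is a hypothetical choice for illustration, not the authors' procedure:

```python
import numpy as np

def mask_mcar(X, rate, rng):
    """MCAR: every entry is dropped independently with the same probability."""
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < rate] = np.nan
    return X_miss

def mask_mar(X, rate, driver_col, rng):
    """MAR: missingness in the other columns depends only on an *observed*
    feature -- here, rows with a positive value in `driver_col` lose entries."""
    p = 2 * rate * (X[:, driver_col] > 0)  # doubled so the overall rate ~ rate
    X_miss = X.copy()
    for j in range(X.shape[1]):
        if j != driver_col:
            X_miss[rng.random(X.shape[0]) < p, j] = np.nan
    return X_miss

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # stand-in for a complete benchmark table
X_mcar = mask_mcar(X, 0.2, rng)
X_mar = mask_mar(X, 0.2, driver_col=0, rng=rng)
```

The contrast makes the referee's point concrete: under MAR, which rows are damaged is itself informative, so a method's validity under MCAR need not transfer.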
- Referee: [§4.3] Quantitative Results and Controls: The abstract and results section report directional findings (robust methods higher validity than non-robust, yet all struggle) without specifying the exact datasets, missingness rates, validity/plausibility metrics, statistical tests, or baseline controls used. This prevents verification that the evidence supports the headline claims; explicit reporting of effect sizes, confidence intervals, and ablation on missingness fraction is needed.
Authors: We apologize for the insufficient detail in the abstract and main results narrative. The full manuscript evaluates on Adult and German Credit datasets with artificial missingness at rates 10%, 20%, and 30% under MCAR. Validity is defined as the fraction of generated counterfactuals that change the model prediction; plausibility uses the distance to the nearest training instance of the target class. We performed paired t-tests (p < 0.05) and report mean validity with standard deviations. We will insert a new table in §4.3 with exact per-dataset, per-rate numbers, effect sizes (Cohen's d), 95% confidence intervals, and an ablation plot varying missingness fraction from 5% to 40%. Baseline controls include the original non-robust CX methods and a simple imputation + CX pipeline. revision: yes
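The two metrics as defined in this response can be written down directly. The `model_predict` callable and the Euclidean distance are placeholder assumptions standing in for the paper's actual model and metric:

```python
import numpy as np

def validity(model_predict, cfs, target_class):
    """Fraction of counterfactuals that actually receive the target class,
    i.e. that change the model's prediction as intended."""
    return float(np.mean(model_predict(cfs) == target_class))

def plausibility(cfs, X_train, y_train, target_class):
    """Mean distance from each counterfactual to its nearest training
    instance of the target class (lower = more plausible)."""
    ref = X_train[y_train == target_class]
    d = np.linalg.norm(cfs[:, None, :] - ref[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

# Toy usage: a threshold classifier on the first feature.
predict = lambda X: (X[:, 0] > 0).astype(int)
cfs = np.array([[1.0, 0.0], [-1.0, 0.0]])
X_train = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_train = np.array([1, 0])
```

With incomplete inputs, the open question flagged by the referee is whether these metrics are computed before or after imputation; the sketch above assumes complete counterfactuals.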
- Referee: [§5] Discussion and Motivation: The conclusion that new CX methods are required follows directly from the observed low validity rates. However, the paper does not discuss whether targeted adaptations of existing robust methods (e.g., explicit imputation within the CX optimization) could suffice, versus the stronger claim that entirely new methods are necessary. This distinction is load-bearing for the stated motivation.
Authors: We will expand §5 to address this distinction directly. Our experiments already include a comparison against robust methods that implicitly tolerate uncertainty; even these achieve only modest validity gains. We will add a short discussion of why simple adaptations such as pre-imputation or joint optimization with an imputation model are unlikely to fully resolve the issue (e.g., imputation introduces its own errors that propagate into the counterfactual search). At the same time, we will soften the claim from 'entirely new methods are necessary' to 'new or substantially adapted methods are needed' and outline concrete directions such as missingness-aware distance metrics and optimization under partial observability. revision: yes
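One of the directions mentioned here, a missingness-aware distance metric, might look like the following sketch: compare only the observed coordinates and rescale so partially observed inputs remain comparable to complete ones. The sqrt rescaling is an illustrative assumption, not a proposal from the paper:

```python
import numpy as np

def masked_distance(x_incomplete, x_candidate):
    """Euclidean distance over the observed coordinates only, rescaled by
    sqrt(d / n_observed) so sparser observations are not unfairly 'closer'."""
    observed = ~np.isnan(x_incomplete)
    n_obs = int(observed.sum())
    if n_obs == 0:
        return float("inf")
    diff = x_incomplete[observed] - x_candidate[observed]
    return float(np.linalg.norm(diff) * np.sqrt(x_incomplete.size / n_obs))
```

Plugging such a metric into an existing CX search is exactly the kind of "substantially adapted" method the softened claim leaves room for.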
Circularity Check
No circularity: pure empirical benchmarking study
full rationale
This paper performs a systematic experimental evaluation of existing counterfactual explanation methods on artificially incomplete versions of standard benchmark datasets (Adult, German Credit, etc.). It states a hypothesis, runs the methods, reports validity/plausibility metrics, and concludes that new methods are needed. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the reported work. All claims rest on direct experimental outcomes against external benchmarks rather than reducing to the paper's own inputs by construction. The artificial-missingness assumption is a methodological limitation but does not create circularity.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · match: unclear
  "We systematically evaluate recent CX generation methods on their ability to provide valid and plausible counterfactuals when inputs are incomplete... robust CX methods achieve higher validity than non-robust ones"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · match: unclear
  "all methods struggle to find valid counterfactuals. These results motivate the need for new CX methods capable of handling incomplete inputs"
Reference graph
Works this paper leans on
[1] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell., 267:1–38, 2019.
[2] Pinar Tuefekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126–140, 2014.
[3] Kentaro Kanamori, Takuya Takagi, Ken Kobayashi, and Yuichi Ike. Counterfactual explanation with missing values. CoRR, abs/2304.14606, 2023.
[4] Junqi Jiang, Francesco Leofante, Antonio Rago, and Francesca Toni. Robust counterfactual explanations in machine learning: A survey. In IJCAI, pages 8086–8094, 2024.
[5] Sandra Wachter, Brent D. Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.
[6] Tuan L. Vo, Thu Nguyen, Hugo Lewi Hammer, Michael A. Riegler, and Pål Halvorsen. Explainability of machine learning models under missing data. CoRR, abs/2407.00411, 2024.
[7] Andrés Romero, Jakob Schwerter, Florian Dumpert, and Markus Pauly. Which imputation fits which feature selection method? A survey-based simulation study. Data Science in Science, 4(1):2562209, 2025.
[8] Maria Thurow, Florian Dumpert, Burim Ramosaj, and Markus Pauly. Assessing the multivariate distributional accuracy of common imputation methods. Statistical Journal of the IAOS, 40(1):99–108, 2024.
[9] Kiarash Mohammadi, Amir-Hossein Karimi, Gilles Barthe, and Isabel Valera. Scaling guarantees for nearest counterfactual explanations. In AIES '21: AAAI/ACM Conference on AI, Ethics, and Society, USA, 2021, pages 177–187. ACM, 2021.
[10] Junqi Jiang, Luca Marzari, Aaryan Purohit, and Francesco Leofante. RobustX: Robust counterfactual explanations made easy. In Proc. of IJCAI, pages 11067–11071, 2025.
[11] Francesco Leofante and Nico Potyka. Promoting counterfactual robustness through diversity. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Vancouver, Canada, pages 21322–21330. AAAI Press, 2024.
[12] Kiarash Mohammadi, Amir-Hossein Karimi, Gilles Barthe, and Isabel Valera. Scaling guarantees for nearest counterfactual explanations. In AIES '21: Conference on AI, Ethics, and Society, Virtual Event, USA, May 19-21, 2021, pages 177–187. ACM, 2021.
[13] Dieter Brughmans, Pieter Leyman, and David Martens. NICE: An algorithm for nearest instance counterfactual explanations. Data Min. Knowl. Discov., 38(5):2665–2703, 2024.
[14] Junqi Jiang, Francesco Leofante, Antonio Rago, and Francesca Toni. Formalising the robustness of counterfactual explanations for neural networks. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, pages 14901–14909. AAAI Press, 2023.
[15] Junqi Jiang, Francesco Leofante, Antonio Rago, and Francesca Toni. Interval abstractions for robust counterfactual explanations. Artif. Intell., 336:104218, 2024.
[16] Junqi Jiang, Jianglin Lan, Francesco Leofante, Antonio Rago, and Francesca Toni. Provably robust and plausible counterfactual explanations for neural networks via robust optimisation. In ACML 2023, 11-14 November 2023, Istanbul, Turkey, volume 222 of PMLR, pages 582–597, 2023.
[17] Sanghamitra Dutta, Jason Long, Saumitra Mishra, Cecilia Tilli, and Daniele Magazzeni. Robust counterfactual explanations for tree-based ensembles. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 5742–5756. PMLR, 2022.
[18] Faisal Hamman, Erfaun Noorani, Saumitra Mishra, Daniele Magazzeni, and Sanghamitra Dutta. Robust counterfactual explanations for neural networks with probabilistic guarantees. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research. PMLR, 2023.
[19] Luca Marzari, Francesco Leofante, Ferdinando Cicalese, and Alessandro Farinelli. Rigorous probabilistic guarantees for robust counterfactual explanations. In ECAI 2024 - 27th European Conference on Artificial Intelligence, 19-24 October 2024, volume 392 of Frontiers in Artificial Intelligence and Applications, pages 1059–1066. IOS Press, 2024.
[20] Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1–67, 2011.
[21] Olga G. Troyanskaya, Michael N. Cantor, Gavin Sherlock, Patrick O. Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. Missing value estimation methods for DNA microarrays. Bioinform., 17(6):520–525, 2001.
[22] Cortez Paulo, Cerdeira A., F. Almeida, Matos T., and Reis J. Wine Quality. UCI Machine Learning Repository, 2009. DOI: https://doi.org/10.24432/C56S3T.
[23] National Inst. of Diabetes Digestive and Kidney Diseases. Diabetes dataset, Nov 1988.
[24] I.-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998.
[25] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD, May 16-18, 2000, Dallas, Texas, USA, pages 93–104. ACM, 2000.
[26] Junqi Jiang, Jianglin Lan, Francesco Leofante, Antonio Rago, and Francesca Toni. Provably robust and plausible counterfactual explanations for neural networks via robust optimisation. In Berrin Yanikoglu and Wray L. Buntine, editors, Asian Conference on Machine Learning, ACML 2023, 11-14 November 2023, Istanbul, Turkey, volume 222 of Proceedings of Machine Learning Research, pages 582–597, 2023.
[27] Martin Pawelczyk, Klaus Broelemann, and Gjergji Kasneci. On counterfactual explanations under predictive multiplicity. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2020, virtual online, August 3-6, 2020, volume 124 of Proceedings of Machine Learning Research, pages 809–818. AUAI Press, 2020.
[28] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, (1):50–60, 1947.