pith. sign in

arxiv: 2605.29411 · v1 · pith:E55OYR2Knew · submitted 2026-05-28 · 💻 cs.LG · cs.AI· stat.ME· stat.ML

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

Pith reviewed 2026-06-29 08:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.MEstat.ML
keywords Markov boundarytabular predictioncausal discoveryfeature selectionsynthetic benchmarksstructural causal modelsconditional independenceregressors
0
0 comments X

The pith

Restricting tabular regressors to the true Markov boundary improves accuracy, especially in large sparse tables, but causal discovery rarely recovers a useful boundary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the Markov boundary—the smallest feature set that makes a target variable independent of all others—can serve as an effective feature mask for tabular prediction. On a benchmark of 3450 synthetic tasks spanning six structural causal model families and feature counts up to 1000, models trained only on the oracle boundary outperform those using the full feature set, with the gap widening as dimensionality and sparsity increase. Pipelines that first apply causal discovery to recover a boundary and then train on the resulting mask fail to beat full-feature baselines. The shortfall traces to three mismatches: discovery algorithms optimize for structural correctness rather than predictive utility, false negatives and false positives carry sharply different costs for downstream error, and many feature sets besides the exact boundary already outperform the complete table.

Core claim

Under standard graphical assumptions the Markov boundary is the minimal feature set rendering the target conditionally independent of all remaining variables. On the SCM3K benchmark of 3450 tasks, training six regressors on the oracle boundary substantially improves prediction over the full feature set, and the improvement grows with feature count. Boundaries recovered by existing causal discovery methods produce masks that rarely improve upon the full feature set because discovery optimizes structural recovery rather than prediction, false negatives and positives incur asymmetric predictive costs, and the exact boundary is only one of many feature sets that beat all features.

What carries the argument

The Markov boundary: the smallest set of features that renders the target conditionally independent of all other features under standard graphical assumptions.

If this is right

  • Restricting regressors to the oracle Markov boundary improves prediction accuracy over full-feature training on high-dimensional tabular data.
  • The accuracy gain from the oracle boundary grows as the number of features increases and the table becomes sparser.
  • Existing causal discovery estimators exhaust compute before reaching the high-dimensional regime where boundary use would help most.
  • False negatives in boundary recovery carry higher predictive cost than false positives.
  • Many feature sets besides the exact Markov boundary already outperform training on all columns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Feature selection for prediction may need algorithms that optimize directly for downstream model error rather than graph recovery metrics.
  • Tabular models could incorporate internal mechanisms to detect and ignore redundant features without recovering an explicit causal boundary.
  • The existence of many sufficient feature sets suggests room for approximate or heuristic selection methods tuned to a specific regressor.
  • Hybrid causal-predictive pipelines might close the gap between oracle and recovered boundaries on real tabular tasks.

Load-bearing premise

Standard graphical assumptions hold such that the Markov boundary is the minimal set of features rendering the target conditionally independent of all others.

What would settle it

On SCM3K tasks with 500 or more features, compare test error of regressors trained on boundaries recovered by standard causal discovery algorithms against the same regressors trained on the full feature set; if the recovered-mask models do not underperform the full set, the claim that discovery pipelines fail is falsified.

Figures

Figures reproduced from arXiv: 2605.29411 by Abhinav Gorantla, Huan Liu, K. Sel\c{c}uk Candan, Shu Wan.

Figure 1
Figure 1. Figure 1: The scalability–minimality–sufficiency triangle [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Oracle MB gap by downstream regressor. The boundary helps most for regressors that pay a high finite-sample cost [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Regressor-conditioned attribution of the MB gap. The gap is explained primarily by redundancy, feature dimension, [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Causal boundary estimators are least available [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: False negatives and false positives have different downstream costs. The TabPFN panel shown here follows the same [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Layered blankets versus target proximity as mask families. The top row expands by accumulated Markov-boundary [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Test RMSE of two mask-expansion strategies for [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prediction gain maps for Ridge and TabPFN. The axes are mask precision and recall against the oracle boundary; the [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Why the exact boundary can be too exclusive. The exact boundary is the minimal sufficient set, but layer-2, mixed [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
read the original abstract

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that on the SCM3K synthetic benchmark (3,450 tasks from six SCM families, feature counts 40-1000), restricting regressors to the oracle Markov boundary often improves prediction substantially, with gains increasing in larger and sparser regimes. However, the pipeline of recovering the boundary via causal discovery and training on the recovered mask fails to deliver comparable gains. The authors trace the failure to three factors: discovery optimizing structural recovery rather than prediction, asymmetric predictive costs of false negatives versus positives, and the existence of multiple feature sets that outperform the full set.

Significance. If the results hold, this provides a controlled large-scale empirical demonstration of the gap between the theoretical utility of Markov boundaries for conditional independence and their practical value when recovered by existing causal discovery methods. The synthetic benchmark with known ground-truth SCMs is a clear strength, enabling direct oracle-vs-recovered comparisons and falsifiable observations about when boundaries help prediction. This can inform development of prediction-aligned rather than structure-aligned feature selection.

major comments (1)
  1. [Abstract] Abstract: the central claim that 'the improvement grows as the feature space becomes larger and sparser' is load-bearing for the headline result, yet the abstract provides no reference to a specific table, figure, or statistical test (e.g., trend across dimension bins or p-values) that would allow verification of the scaling behavior.
minor comments (2)
  1. [Abstract] Abstract: the six SCM families and six regressors are referenced but not named; naming them would improve immediate readability without requiring the reader to reach the methods section.
  2. [Abstract] Abstract: the sentence beginning 'Existing estimators exhaust the compute budget...' would be clearer if it quantified the budget (e.g., wall-clock or memory limits) or the dimension threshold at which the boundary becomes beneficial.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation, the recognition of the benchmark's controlled nature, and the recommendation of minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'the improvement grows as the feature space becomes larger and sparser' is load-bearing for the headline result, yet the abstract provides no reference to a specific table, figure, or statistical test (e.g., trend across dimension bins or p-values) that would allow verification of the scaling behavior.

    Authors: We agree that the abstract claim would benefit from an explicit pointer to the supporting evidence for verifiability. The scaling behavior (larger gains in higher-dimensional and sparser regimes) is shown via binned results in Figure 5 (mean R2 improvement vs. feature count bins) and Table 4 (stratified by sparsity), with standard errors across the six SCM families. In the revised manuscript we will append a brief parenthetical '(see Figure 5)' to the relevant abstract sentence. This is a minor textual change that preserves abstract length while addressing the concern. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

This paper presents an empirical evaluation on the external SCM3K synthetic benchmark rather than any derivation chain. The headline result (oracle Markov boundaries improve regressors while recovered ones do not) is measured directly on data generated from known SCMs, where the invoked graphical assumptions hold by construction. No equations, fitted parameters, or self-citations are used to derive the central claims; the three explanatory factors are presented as post-hoc observations. The study is therefore self-contained against external benchmarks with no load-bearing reductions to inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that standard graphical models apply to the generated SCM families and that the benchmark captures relevant properties of tabular prediction tasks.

axioms (1)
  • domain assumption Standard graphical assumptions hold such that the Markov boundary is the minimal set rendering the target conditionally independent of all others
    Invoked in the first sentence of the abstract to define the object of study.

pith-pipeline@v0.9.1-grok · 5800 in / 1238 out tokens · 36486 ms · 2026-06-29T08:34:30.709262+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 10 canonical work pages

  1. [1]

    Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D Koutsoukos. 2010. Local causal and Markov blanket in- duction for causal discovery and feature selection for classification part I: Algo- rithms and empirical evaluation.Journal of Machine Learning Research11 (2010), 171–234

  2. [2]

    Constantin F Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. 2003. HI- TON: a novel Markov Blanket algorithm for optimal variable selection. InAMIA annual symposium proceedings, Vol. 2003. 21

  3. [3]

    Cresswell, and Rahul Krishnan

    Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Junwei Ma, Bingru Li, Jesse C. Cresswell, and Rahul Krishnan. 2026. CausalPFN: Amortized Causal Effect Estimation via In-Context Learning. InThe Thirty-ninth Annual Confer- ence on Neural Information Processing Systems. https://openreview.net/forum? id=RblaNJGx8C

  4. [4]

    Philippe Brouillard, Chandler Squires, Jonas Wahl, Konrad K"ording, Karen Sachs, Alexandre Drouin, and Dhanya Sridhar. 2025. The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications. In Proceedings of the Fourth Conference on Causal Learning and Reasoning (Proceed- ings of Machine Learning Research, Vol. 275), Biwei Hu...

  5. [5]

    Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting Sys- tem. InProceedings of the 22nd ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining(San Francisco, California, USA)(KDD ’16). As- sociation for Computing Machinery, New York, NY, USA, 785–794. doi:10.1145/ 2939672.2939785

  6. [6]

    David Maxwell Chickering. 2003. Optimal structure identification with greedy search.J. Mach. Learn. Res.3, null (March 2003), 507–554. doi:10.1162/ 153244303321897717

  7. [7]

    Paul Erdős and Alfréd Rényi. 1960. On the evolution of random graphs.Publica- tions of the Mathematical Institute of the Hungarian Academy of Sciences5 (1960), 17–61

  8. [8]

    2024.causalAssembly: Generating Realistic Pro- duction Data for Benchmarking Causal Discovery

    Konstantin Göbler, Tobias Windisch, Mathias Drton, Tim Pychynski, Martin Roth, and Steffen Sonntag. 2024.causalAssembly: Generating Realistic Pro- duction Data for Benchmarking Causal Discovery. InProceedings of the Third Conference on Causal Learning and Reasoning (Proceedings of Machine Learning Research, Vol. 236), Francesco Locatello and Vanessa Didel...

  9. [9]

    https://proceedings.mlr.press/v236/gobler24a.html

  10. [10]

    The Elements of Statistical Learning: Data Mining, Inference, and Prediction

    Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009.The Elements of Statistical Learning: Data Mining, Inference, and Prediction(2 ed.). Springer, New York, NY. doi:10.1007/978-0-387-84858-7

  11. [11]

    Hoerl and Robert W

    Arthur E. Hoerl and Robert W. Kennard. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems.Technometrics12, 1 (1970), 55–

  12. [12]

    Ridge Regression: Appli- cations to Nonorthogonal Problems

    arXiv:https://doi.org/10.1080/00401706.1970.10488634 doi:10.1080/00401706. 1970.10488634

  13. [13]

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter

  14. [14]

    InThe Eleventh International Conference on Learning Representations

    TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. InThe Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=cp5PvcI6w8_

  15. [15]

    Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feed- forward networks are universal approximators.Neural Networks2, 5 (1989), 359–366. doi:10.1016/0893-6080(89)90020-8

  16. [16]

    2009.Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning

    Daphne Koller and Nir Friedman. 2009.Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press

  17. [17]

    Laird and James H

    Nan M. Laird and James H. Ware. 1982. Random-Effects Models for Longitudinal Data.Biometrics38, 4 (1982), 963–974. http://www.jstor.org/stable/2529876

  18. [18]

    Yuchen Ma, Dennis Frauen, Emil Javurek, and Stefan Feuerriegel. 2026. Foun- dation Models for Causal Inference via Prior-Data Fitted Networks. InThe Four- teenth International Conference on Learning Representations. https://openreview. net/forum?id=d2L1ndOKjq

  19. [19]

    Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Manolis Kellis, James J Collins, et al. 2012. Wisdom of crowds for robust gene network inference.Nature methods9, 8 (2012), 796–804

  20. [20]

    Prill, Thomas Schaffter, Claudio Mat- tiussi, Dario Floreano, and Gustavo Stolovitzky

    Daniel Marbach, Robert J. Prill, Thomas Schaffter, Claudio Mat- tiussi, Dario Floreano, and Gustavo Stolovitzky. 2010. Revealing strengths and weaknesses of methods for gene network inference.Pro- ceedings of the National Academy of Sciences107, 14 (2010), 6286–

  21. [21]

    arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.0913357107 doi:10.1073/pnas.0913357107

  22. [22]

    Dimitris Margaritis and Sebastian Thrun. 1999. Bayesian Network Induction via Local Neighborhoods. InAdvances in Neural Informa- tion Processing Systems, S. Solla, T. Leen, and K. Müller (Eds.), Vol. 12. MIT Press. https://proceedings.neurips.cc/paper_files/paper/1999/file/ 5d79099fcdf499f12b79770834c0164a-Paper.pdf

  23. [23]

    Christopher Meek. 1995. Causal inference and causal explanation with back- ground knowledge.Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence(1995), 403–410

  24. [24]

    Wang Miao, Zhi Geng, and Eric J Tchetgen Tchetgen. 2018. Identifying causal effects with proxy variables of an unmeasured confounder.Biometrika 105, 4 (12 2018), 987–993. arXiv:https://academic.oup.com/biomet/article- pdf/105/4/987/27121264/asy038.pdf doi:10.1093/biomet/asy038

  25. [25]

    Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. 2022. Transformers Can Do Bayesian Inference. InInterna- tional Conference on Learning Representations. https://openreview.net/forum? id=KSugKcbNf9

  26. [26]

    Samuel Müller, Arik Reuter, Noah Hollmann, David Rügamer, and Frank Hutter

  27. [27]

    InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol

    Position: The Future of Bayesian Prediction Is Prior-Fitted. InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, Vancouver, Canada. https://proceedings.mlr. press/v267/muller25d.html

  28. [28]

    Rainer Opgen-Rhein and Korbinian Strimmer. 2007. From correlation to causa- tion networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data.BMC systems biology1, 1 (2007), 37

  29. [29]

    1988.Probabilistic Reasoning in Intelligent Systems: Networks of Plau- sible Inference

    Judea Pearl. 1988.Probabilistic Reasoning in Intelligent Systems: Networks of Plau- sible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA

  30. [30]

    2009.Causality: Models, Reasoning and Inference(2nd ed.)

    Judea Pearl. 2009.Causality: Models, Reasoning and Inference(2nd ed.). Cam- bridge University Press, USA

  31. [31]

    2017.Elements of Causal Inference: Foundations and Learning Algorithms

    Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2017.Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA

  32. [32]

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. 2025. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. In Proceedings of the 42nd International Conference on Machine Learning (Proceed- ings of Machine Learning Research, Vol. 267). PMLR, Vancouver, Canada, 50817– 50847. https://proceedings.mlr.press/v267/qu25d.html

  33. [33]

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan

  34. [34]

    TabICLv2: A better, faster, scalable, and open tabular foundation model.arXiv:2602.11139, 2026

    TabICLv2: A better, faster, scalable, and open tabular foundation model. arXiv:2602.11139 [cs.LG] https://arxiv.org/abs/2602.11139

  35. [35]

    Alexander Reisach, Christof Seiler, and Sebastian Weichwald. 2021. Beware of the simulated dag! causal discovery benchmarks may be easy to game.Advances in Neural Information Processing Systems34 (2021), 27772–27784

  36. [36]

    Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, and Bernhard Schölkopf. 2025. Do-PFN: In-Context Learning for Causal Effect Esti- mation. arXiv:2506.06039 [cs.LG] https://arxiv.org/abs/2506.06039

  37. [37]

    Marco Scutari. 2010. Learning Bayesian networks with the bnlearn R package. Journal of statistical software35 (2010), 1–22

  38. [38]

    2000.Causation, Prediction, and Search(2 ed.)

    Peter Spirtes, Clark Glymour, and Richard Scheines. 2000.Causation, Prediction, and Search(2 ed.). MIT Press, Cambridge, MA

  39. [39]

    Robert Tibshirani. 1996. Regression Shrinkage and Selection Via the Lasso.Jour- nal of the Royal Statistical Society: Series B (Methodological)58, 1 (01 1996), 267–

  40. [40]

    doi:10.1111/j.2517-6161.1996.tb02080.x

  41. [41]

    Aliferis, and Alexander Statnikov

    Ioannis Tsamardinos, Constantin F. Aliferis, and Alexander Statnikov. 2003. Al- gorithms for Large Scale Markov Blanket Discovery. InProceedings of the Six- teenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2003). AAAI Press, Menlo Park, CA, 376–381. https://aaai.org/papers/ flairs-2003-073/

  42. [42]

    Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. 2006. The max-min hill-climbing Bayesian network structure learning algorithm.Machine Learning65, 1 (2006), 31–78

  43. [43]

    Tim Van den Bulcke, Koenraad Van Leemput, Bart Naudts, Piet van Remortel, Hongwu Ma, Alain Verschoren, Bart De Moor, and Kathleen Marchal. 2006. Syn- TReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms.BMC bioinformatics7, 1 (2006), 43

  44. [44]

    Thomas Verma and Judea Pearl. 1990. Equivalence and synthesis of causal mod- els.Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence (1990), 220–227

  45. [45]

    Wooldridge

    Jefrey M. Wooldridge. 2010.Econometric Analysis of Cross Section and Panel Data. The MIT Press. http://www.jstor.org/stable/j.ctt5hhcfr

  46. [46]

    Liyuan Xu, Heishiro Kanagawa, and Arthur Gretton. 2021. Deep Proxy Causal Learning and its Application to Confounded Bandit Policy Evaluation. InAd- vances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran As- sociates, Inc., 26264–26275. https://proceedings.neurips...

  47. [47]

    Naiyu Yin, Hanjing Wang, Yue Yu, Tian Gao, Amit Dhurandhar, and Qiang Ji

  48. [48]

    InEuropean Conference on Computer Vision

    Integrating Markov blanket discovery into causal representation learning for domain generalization. InEuropean Conference on Computer Vision. Springer, 271–288

  49. [49]

    Kui Yu, Lin Liu, Jiuyong Li, Weiping Ding, and Thuc Duy Le. 2020. Causality- based feature selection: Methods and evaluations.Comput. Surveys53, 5 (2020), 1–36