pith. sign in

arxiv: 2607.01417 · v1 · pith:DUOPV6NTnew · submitted 2026-07-01 · 💻 cs.LG · stat.ML

Conditional Inference Trees and Forests for Feature Selection

Pith reviewed 2026-07-03 21:14 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords conditional inference forestsfeature rankingfeature selectionpermutation testsclassificationregressiontree ensembles
0
0 comments X

The pith

Conditional inference forests rank 4th and 3rd among feature selectors in classification and regression benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests conditional inference trees and forests as tools for ranking the most useful input features before training a final predictor. It runs real-data experiments on dozens of datasets, measures how runtime settings affect speed and accuracy, and checks recovery of known important features in synthetic cases. The central finding is that these forests deliver competitive top-k rankings while using permutation tests to avoid favoring features with more possible split points. A reader would care because biased feature selection can waste computation and degrade model performance on new data.

Core claim

Conditional inference forests, which test each feature with a permutation p-value before choosing any split threshold, produce feature rankings that place fourth among seventeen classification methods on twenty-two datasets and third among eighteen regression methods on eight datasets, supporting their use for top-k feature selection.

What carries the argument

Bonferroni-corrected Monte Carlo permutation p-values computed at each node to decide which feature receives a split, independent of the number of thresholds tested.

If this is right

  • CIF can be applied directly as a feature-ranking step before training any downstream classifier or regressor.
  • Turning off adaptive stopping increases fitting time by a factor of four to eight with negligible change in final scores.
  • Exact threshold search multiplies runtime by up to ten with little accuracy gain.
  • In sparse high-dimensional settings, the random feature sampling inside the forest can exclude informative variables from many candidate splits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same permutation-test logic could be grafted onto other tree algorithms to reduce split bias without switching to full conditional inference.
  • Runtime gains from adaptive stopping suggest that similar early-stopping rules might help permutation-based methods in other domains such as causal discovery.
  • The observed risk that informative features are sampled out of many trees points to a need for stratified or importance-weighted sampling inside the forest.

Load-bearing premise

The collection of real and synthetic datasets used is representative of the problems where top-k feature ranking matters.

What would settle it

A new collection of thirty datasets in which CIF rankings produce downstream prediction scores that fall below the median of the other sixteen or seventeen methods.

Figures

Figures reproduced from arXiv: 2607.01417 by Justin Downes, Newel Hirst, Robert Milletich, Steve Goley.

Figure 1
Figure 1. Figure 1: Classification mean rank by number of selected features. Rows show all 17 feature selection methods in the classification benchmark, averaged across downstream models at each value of k. Lower ranks are better. The columns for k = 5, 10, 25, 50, 100 use 22, 21, 15, 15, and 14 datasets, respectively. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Regression mean rank by number of selected features. Rows show all 18 feature selection methods in the regression benchmark, averaged across downstream models at each value of k. Lower ranks are better. The columns for k = 5, 10, 25, 50, 100 use 8, 8, 7, 7, and 6 datasets, respectively. 5.2 Runtime ablations The runtime ablations identify which benchmark settings have the largest effect on fitting time and… view at source ↗
Figure 3
Figure 3. Figure 3: Where CIF’s classification scores peak as k increases, by downstream learner. (A) shows, for each learner, the percentage of 15 datasets where the first evaluated k attaining the best observed CIF score falls below 100, at 100, between 100 and p, or at k=p. (B) shows the mean balanced accuracy change from k=100 to k=p. Synthetic feature recovery. Across synthetic classification datasets, 34% of the feature… view at source ↗
Figure 4
Figure 4. Figure 4: Synthetic top-k recovery curves. Columns compare single trees, forests, and boosted trees; rows show classification and regression. Each value is the percentage of selected features that are truly informative. Feature sampling in forests. A sparse simulation asks whether forest feature sampling makes splits less likely to use informative features in classification and regression settings [PITH_FULL_IMAGE:… view at source ↗
Figure 5
Figure 5. Figure 5: Tree-level feature use in one sparse classification forest design (n=250, p=1,000, two informative features, 1,000 trees). Counts are the number of trees using each feature in at least one split. CIF samples a subset of features; CIF-all evaluates all nonconstant features. Each subplot has its own y-axis range because counts differ across methods. 7 Discussion In these experiments, CIF is best supported as… view at source ↗
Figure 6
Figure 6. Figure 6: Classification CIF margins against ctree, cforest, and CIT by downstream learner and value of k. Each plotted value is the mean balanced-accuracy difference, CIF minus comparator, for one downstream learner and one value of k; positive values favor CIF. Dataset counts by k are 22, 21, 15, 15, and 14 for ctree and cforest, and 23, 22, 16, 16, and 15 for CIT. 35 [PITH_FULL_IMAGE:figures/full_fig_p035_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Regression CIF margins against ctree, cforest, and CIT by downstream learner and value of k. Each plotted value is the mean R2 difference, CIF minus comparator, for one downstream learner and one value of k; positive values favor CIF. Dataset counts by k are 8, 8, 7, 7, and 6 for each comparison. Large positive plotted values can reflect negative mean R2 for the comparator. Numeric labels show the unclippe… view at source ↗
Figure 8
Figure 8. Figure 8: Classification forest feature sampling by feature dimension p with 1,000 trees per fit. Points are means over ninformative ∈ {1, 2, 5, 10}; capped bars span the minimum-to-maximum values across those four sparse designs. (A) shows % of splits using informative features; (B) shows the number of distinct uninformative features used [PITH_FULL_IMAGE:figures/full_fig_p038_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Regression forest feature sampling by feature dimension p with 1,000 trees per fit. Points are means over ninformative ∈ {1, 2, 5, 10}; capped bars span the minimum-to-maximum values across those four sparse designs. (A) shows % of splits using informative features; (B) shows the number of distinct uninformative features used. 38 [PITH_FULL_IMAGE:figures/full_fig_p038_9.png] view at source ↗
read the original abstract

Conditional inference trees (CIT) and conditional inference forests (CIF) reduce split-selection bias by testing features before choosing split thresholds, but repeated permutation tests and threshold searches can make these methods computationally expensive. We study CIT and CIF as top-$k$ feature-ranking methods for downstream prediction using real-data benchmarks, runtime ablations, and synthetic feature-recovery experiments. At a fixed node, if the features and permutation budget do not depend on the node responses, Bonferroni-corrected $+1$ Monte Carlo permutation $p$-values control nodewise rejection under the complete permutation null. CIF ranks 4th among 17 classification methods on 22 datasets and 3rd among 18 regression methods on 8 datasets. With Bonferroni correction held fixed, the CIF runtime ablations indicate that adaptive stopping and the number of thresholds searched have the largest measured effect on runtime: turning off adaptive stopping and using exact threshold search increase fitting time by 4.0--8.4$\times$ and 1.9--10.8$\times$, respectively, while downstream score changes are at most 0.011. Sparse high-$p$ simulations indicate that forest feature sampling can leave informative features out of many split decisions. Overall, the results support CIF as a top-$k$ feature-ranking method in the evaluated downstream prediction benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper studies conditional inference trees (CIT) and forests (CIF) as top-k feature-ranking methods for downstream prediction tasks. It provides a condition under which Bonferroni-corrected +1 Monte Carlo permutation p-values control nodewise rejection under the complete permutation null, reports that CIF ranks 4th among 17 classification methods on 22 datasets and 3rd among 18 regression methods on 8 datasets, presents runtime ablations showing that adaptive stopping and threshold search have the largest effects on runtime (with downstream score changes ≤0.011), and includes synthetic feature-recovery experiments indicating that forest feature sampling can omit informative features.

Significance. If the p-value control holds and the empirical rankings are robust, the work would provide a theoretically grounded, bias-reduced alternative to standard tree-based feature selection with competitive downstream performance and quantified runtime trade-offs. The ablations and synthetic results add practical insight into implementation choices.

major comments (1)
  1. [Abstract and runtime ablations] Abstract (p-value control statement) and runtime ablations: The paper states that at a fixed node, Bonferroni-corrected +1 Monte Carlo permutation p-values control nodewise rejection under the complete permutation null only "if the features and permutation budget do not depend on the node responses." The runtime ablations report that turning off adaptive stopping increases fitting time by 4.0–8.4× and identify it as having the largest measured effect, implying the main reported CIF results use adaptive stopping. This makes the effective permutation budget depend on observed responses, directly violating the stated precondition and removing the claimed frequentist guarantee for the nodewise p-values that justify reduced split-selection bias and reliable top-k ranking.
minor comments (2)
  1. [Abstract] The abstract reports CIF ranking 4th/17 on classification and 3rd/18 on regression but does not specify whether these are mean ranks, median ranks, or win counts, nor the exact metric (e.g., AUC, accuracy) used for downstream evaluation.
  2. [Abstract] The synthetic experiments mention "sparse high-p simulations" but the abstract does not report the exact dimensions (p, n, sparsity level) or recovery metric, making it hard to assess how representative they are of the real-data regime.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for pointing out the potential inconsistency between the stated condition for p-value control and the use of adaptive stopping in the reported experiments. We respond to the major comment below.

read point-by-point responses
  1. Referee: [Abstract and runtime ablations] Abstract (p-value control statement) and runtime ablations: The paper states that at a fixed node, Bonferroni-corrected +1 Monte Carlo permutation p-values control nodewise rejection under the complete permutation null only "if the features and permutation budget do not depend on the node responses." The runtime ablations report that turning off adaptive stopping increases fitting time by 4.0–8.4× and identify it as having the largest measured effect, implying the main reported CIF results use adaptive stopping. This makes the effective permutation budget depend on observed responses, directly violating the stated precondition and removing the claimed frequentist guarantee for the nodewise p-values that justify reduced split-selection bias and reliable top-k ranking.

    Authors: We agree with the referee that the main CIF results utilize adaptive stopping, which causes the permutation budget to depend on the observed responses at each node, thereby violating the precondition for the nodewise p-value control result. The theoretical guarantee applies specifically to the case of fixed permutation budget independent of responses. We will revise the abstract to clarify the scope of the p-value control statement and add a note in the methods or discussion section explaining that adaptive stopping is employed for computational efficiency in the empirical evaluations, while the frequentist control is guaranteed only without it. The benchmark rankings and runtime ablation results are empirical observations and remain unchanged. This revision addresses the concern without altering the core findings. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks

full rationale

The paper presents no derivation chain that reduces to its inputs by construction. Its strongest claims are direct empirical rankings of CIF against 17/18 other methods on 22/8 datasets plus runtime ablations and synthetic recovery experiments. The single conditional statement on p-value control is an if-then claim about a fixed-node permutation test and is not invoked to derive the reported rankings or to rename any fitted quantity as a prediction. No self-citations, ansatzes, or uniqueness theorems appear as load-bearing steps. The work is therefore self-contained against the external benchmarks it reports.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; full text would be required to audit these elements.

pith-pipeline@v0.9.1-grok · 5767 in / 1014 out tokens · 35757 ms · 2026-07-03T21:14:37.854933+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 24 canonical work pages · 1 internal anchor

  1. [1]

    Generalized random forests.The Annals of Statistics, 47(2):1148–1178, 2019

    Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests.The Annals of Statistics, 47(2):1148–1178, 2019. doi: 10.1214/18-AOS1709

  2. [2]

    Sequential Monte Carlo p-values.Biometrika, 78(2):301–304,

    Julian Besag and Peter Clifford. Sequential Monte Carlo p-values.Biometrika, 78(2):301–304,

  3. [3]

    doi: 10.1093/biomet/78.2.301

  4. [4]

    Random forests.Machine Learning, 45(1):5–32, 2001

    Leo Breiman. Random forests.Machine Learning, 45(1):5–32, 2001. doi: 10.1023/A: 1010933404324. 17

  5. [5]

    Friedman, Charles J

    Leo Breiman, Jerome H. Friedman, Charles J. Stone, and Richard A. Olshen.Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984. ISBN 9780412048418

  6. [6]

    A scalable tree boosting system

    Tianqi Chen and Carlos Guestrin. A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016. doi: 10.1145/2939672.2939785

  7. [7]

    Conditional permutation importance revisited.BMC Bioinformatics, 21(1):307, 2020

    Dries Debeer and Carolin Strobl. Conditional permutation importance revisited.BMC Bioinformatics, 21(1):307, 2020. doi: 10.1186/s12859-020-03622-2

  8. [8]

    Exploring classification strategies with the CoEPrA 2006 contest.Bioinformatics, 26(5):603–609, 2010

    Ozgur Demir-Kavuk, Henning Riedesel, and Ernst-Walter Knapp. Exploring classification strategies with the CoEPrA 2006 contest.Bioinformatics, 26(5):603–609, 2010. doi: 10.1093/ bioinformatics/btq021

  9. [9]

    Michael D. Ernst. Permutation methods: A basis for exact inference.Statistical Science, 19(4): 676–685, 2004. doi: 10.1214/088342304000000396

  10. [10]

    Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk.Journal of the American Statistical Association, 104(488):1504–1511, 2009

    Axel Gandy. Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk.Journal of the American Statistical Association, 104(488):1504–1511, 2009. doi: 10.1198/ jasa.2009.tm08368

  11. [11]

    Variable selection using random forests.Pattern Recognition Letters, 31(14):2225–2236, 2010

    Robin Genuer, Jean-Michel Poggi, and Christine Tuleau-Malot. Variable selection using random forests.Pattern Recognition Letters, 31(14):2225–2236, 2010. doi: 10.1016/j.patrec.2010.03.014

  12. [12]

    Extremely randomized trees.Machine Learning, 63(1):3–42, 2006

    Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees.Machine Learning, 63(1):3–42, 2006. doi: 10.1007/s10994-006-6226-1

  13. [13]

    Gene selection for cancer classification using support vector machines.Machine Learning, 46(1-3):389–422, 2002

    Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines.Machine Learning, 46(1-3):389–422, 2002. doi: 10.1023/A:1012487302797

  14. [14]

    Result analysis of the NIPS 2003 feature selection challenge

    Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. InAdvances in Neural Information Processing Systems, volume 17, 2004. URL https://proceedings.neurips.cc/paper_files/paper/2004/hash/ 5e751896e527c862bf67251a474b3819-Abstract.html

  15. [15]

    Exact testing with random permutations.TEST, 27(4): 811–825, 2018

    Jesse Hemerik and Jelle Goeman. Exact testing with random permutations.TEST, 27(4): 811–825, 2018. doi: 10.1007/s11749-017-0571-1

  16. [16]

    partykit: A modular toolkit for recursive partytioning in R.Journal of Machine Learning Research, 16(118):3905–3909, 2015

    Torsten Hothorn and Achim Zeileis. partykit: A modular toolkit for recursive partytioning in R.Journal of Machine Learning Research, 16(118):3905–3909, 2015. URL https://jmlr. org/papers/v16/hothorn15a.html

  17. [17]

    van der Laan

    Torsten Hothorn, Peter B¨ uhlmann, Sandrine Dudoit, Annette Molinaro, and Mark J. van der Laan. Survival ensembles.Biostatistics, 7(3):355–373, 2006. doi: 10.1093/biostatistics/kxj011

  18. [18]

    van de Wiel, and Achim Zeileis

    Torsten Hothorn, Kurt Hornik, Mark A. van de Wiel, and Achim Zeileis. A lego system for conditional inference.The American Statistician, 60(3):257–263, 2006. doi: 10.1198/ 000313006X118430

  19. [19]

    Unbiased recursive partitioning: A condi- tional inference framework.Journal of Computational and Graphical Statistics, 15(3):651–674,

    Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A condi- tional inference framework.Journal of Computational and Graphical Statistics, 15(3):651–674,

  20. [20]

    doi: 10.1198/106186006X133933. 18

  21. [21]

    URL https://CRAN.R-project.org/package=partykit

    Torsten Hothorn, Heidi Seibold, and Achim Zeileis.partykit: A Toolkit for Recursive Par- tytioning, 2026. URL https://CRAN.R-project.org/package=partykit. R package version 1.2-27

  22. [22]

    A highly efficient gradient boosting decision tree

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. A highly efficient gradient boosting decision tree. InAdvances in Neural Information Processing Systems, volume 30, pages 3146–3154, 2017. URL https://papers. nips.cc/paper/6907-a-highly-efficient-gradient-boosting-decision-tree

  23. [23]

    UCI machine learning repository

    Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. UCI machine learning repository. https://archive.ics.uci.edu, n.d. URL https://archive.ics.uci.edu/. Accessed 2026- 05-04

  24. [24]

    Classification trees with unbiased multiway splits.Journal of the American Statistical Association, 96(454):589–604, 2001

    Hyunjoong Kim and Wei-Yin Loh. Classification trees with unbiased multiway splits.Journal of the American Statistical Association, 96(454):589–604, 2001. doi: 10.1198/016214501753168271

  25. [25]

    Kursa and Witold R

    Miron B. Kursa and Witold R. Rudnicki. Feature selection with the Boruta package.Journal of Statistical Software, 36(11):1–13, 2010. doi: 10.18637/jss.v036.i11

  26. [26]

    Trevino, Jiliang Tang, and Huan Liu

    Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, and Huan Liu. Feature selection: A data perspective.ACM Computing Surveys, 50(6):94:1–94:45,

  27. [27]

    doi: 10.1145/3136625

  28. [28]

    Regression trees with unbiased variable selection and interaction detection.Sta- tistica Sinica, 12(2):361–386, 2002

    Wei-Yin Loh. Regression trees with unbiased variable selection and interaction detection.Sta- tistica Sinica, 12(2):361–386, 2002. URL https://www3.stat.sinica.edu.tw/statistica/ j12n2/j12n21/j12n21.htm

  29. [29]

    Split selection methods for classification trees.Statistica Sinica, 7(4):815–840, 1997

    Wei-Yin Loh and Yu-Shan Shih. Split selection methods for classification trees.Statistica Sinica, 7(4):815–840, 1997. URL https://www3.stat.sinica.edu.tw/statistica/j7n4/ j7n41/j7n41.htm

  30. [30]

    The Randomized Dependence Coefficient

    David Lopez-Paz, Philipp Hennig, and Bernhard Sch¨ olkopf. The randomized dependence coefficient, 2013. arXiv:1304.7717

  31. [31]

    B. V. North, David Curtis, and Pak C. Sham. A note on the calculation of empirical P values from Monte Carlo procedures.American Journal of Human Genetics, 71(2):439–441, 2002. doi: 10.1086/341527

  32. [32]

    Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12(85):2825–2830, 2011

    Fabian Pedregosa, Ga¨ el Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake VanderPlas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and ´Edouard Duchesnay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Resea...

  33. [33]

    Belinda Phipson and Gordon K. Smyth. Permutation P-values should never be zero: Calculating exact P-values when permutations are randomly drawn.Statistical Applications in Genetics and Molecular Biology, 9(1):1–12, 2010. doi: 10.2202/1544-6115.1585

  34. [34]

    Unbiased boosting with categorical features

    Liudmila Ostroumova Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Doro- gush, and Andrey Gulin. Unbiased boosting with categorical features. InAdvances in Neural Information Processing Systems, volume 31, pages 6639–6649, 2018. URL https://papers. nips.cc/paper/7898-catboost-unbiased-boosting-with-categorical-features. 19

  35. [35]

    A note on split selection bias in classification trees.Computational Statistics & Data Analysis, 45(3):457–466, 2004

    Yu-San Shih. A note on split selection bias in classification trees.Computational Statistics & Data Analysis, 45(3):457–466, 2004. doi: 10.1016/S0167-9473(03)00064-1

  36. [36]

    The asymptotic theory of permutation statistics

    Helmut Strasser and Christian Weber. The asymptotic theory of permutation statistics. Mathematical Methods of Statistics, 8(2):220–250, 1999

  37. [37]

    Unbiased split selection for classification trees based on the Gini index.Computational Statistics & Data Analysis, 52(1): 483–501, 2007

    Carolin Strobl, Anne-Laure Boulesteix, and Thomas Augustin. Unbiased split selection for classification trees based on the Gini index.Computational Statistics & Data Analysis, 52(1): 483–501, 2007. doi: 10.1016/j.csda.2006.12.030

  38. [38]

    Bias in random forest variable importance measures: Illustrations, sources and a solution.BMC Bioinformatics, 8(1):25, 2007

    Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution.BMC Bioinformatics, 8(1):25, 2007. doi: 10.1186/1471-2105-8-25

  39. [39]

    Conditional variable importance for random forests.BMC Bioinformatics, 9(1):307, 2008

    Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests.BMC Bioinformatics, 9(1):307, 2008. doi: 10.1186/1471-2105-9-307

  40. [40]

    Sz´ ekely, Maria L

    G´ abor J. Sz´ ekely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances.The Annals of Statistics, 35(6):2769–2794, 2007. doi: 10.1214/ 009053607000000505

  41. [41]

    van Rijn, Bernd Bischl, and Luis Torgo

    Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URLhttps://doi.org/10.1145/2641190.2641198

  42. [42]

    Estimation and inference of heterogeneous treatment effects using random forests.Journal of the American Statistical Association, 113(523):1228–1242,

    Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests.Journal of the American Statistical Association, 113(523):1228–1242,

  43. [43]

    Variational inference: A review for statisticians,

    doi: 10.1080/01621459.2017.1319839. 20 Appendix A Setup and assumptions A.1 Fixed-node setup Let (Xi, Yi)n i=1 be the training data. Fix a node t with sample index set It and size nt := |It|. Write Xt := (Xi)i∈It and Yt := (Yi)i∈It for the data restricted to that node. Stage A tests features Ft with mt :=|F t|. The fixed-node theorem analyzes the exhausti...