Conditional Inference Trees and Forests for Feature Selection

Justin Downes; Newel Hirst; Robert Milletich; Steve Goley

arXiv: 2607.01417 · v1 · pith:DUOPV6NTnew · submitted 2026-07-01 · cs.LG · stat.ML


Pith reviewed 2026-07-03 21:14 UTC · model grok-4.3


keywords: conditional inference forests, feature ranking, feature selection, permutation tests, classification, regression, tree ensembles


The pith

Conditional inference forests rank 4th of 17 feature selectors in classification benchmarks and 3rd of 18 in regression benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests conditional inference trees and forests as tools for ranking the most useful input features before training a final predictor. It runs real-data experiments on dozens of datasets, measures how runtime settings affect speed and accuracy, and checks recovery of known important features in synthetic cases. The central finding is that these forests deliver competitive top-k rankings while using permutation tests to avoid favoring features with more possible split points. A reader would care because biased feature selection can waste computation and degrade model performance on new data.

Core claim

Conditional inference forests, which test each feature with a permutation p-value before choosing any split threshold, produce feature rankings that place fourth among seventeen classification methods on twenty-two datasets and third among eighteen regression methods on eight datasets, supporting their use for top-k feature selection.

What carries the argument

Bonferroni-corrected Monte Carlo permutation p-values computed at each node to decide which feature receives a split, independent of the number of thresholds tested.
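The mechanism named above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the test statistic (absolute Pearson correlation), the function names `plus_one_pvalue` and `select_feature`, and the default budget of 199 permutations are all assumptions chosen for brevity. The point it shows is that the feature is chosen from p-values alone, before any split threshold is searched.

```python
import numpy as np

def plus_one_pvalue(x, y, n_perm, rng):
    """+1 Monte Carlo permutation p-value for association between x and y.

    The statistic (absolute Pearson correlation) is an illustrative
    assumption; x and y must be nonconstant.
    """
    def stat(a, b):
        return abs(np.corrcoef(a, b)[0, 1])

    observed = stat(x, y)
    # Count permutations whose statistic is at least as extreme as observed.
    exceed = sum(stat(x, rng.permutation(y)) >= observed for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

def select_feature(X, y, n_perm=199, alpha=0.05, seed=0):
    """Pick the feature to split on at one node, or None if nothing is significant.

    The permutation budget n_perm is fixed in advance, so it does not
    depend on the responses y.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    pvals = np.array([plus_one_pvalue(X[:, j], y, n_perm, rng) for j in range(m)])
    adjusted = np.minimum(1.0, m * pvals)  # Bonferroni correction over m features
    best = int(np.argmin(adjusted))
    return (best if adjusted[best] <= alpha else None), adjusted
```

Because the number of candidate thresholds never enters the p-value, a feature with many distinct values gets no advantage at the selection step.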

If this is right

CIF can be applied directly as a feature-ranking step before training any downstream classifier or regressor.
Turning off adaptive stopping increases fitting time by a factor of 4.0 to 8.4, while downstream scores change by at most 0.011.
Using exact threshold search increases fitting time by a factor of 1.9 to 10.8, again with downstream score changes of at most 0.011.
In sparse high-dimensional settings, the random feature sampling inside the forest can exclude informative variables from many candidate splits.
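The last point follows from simple counting. As a rough sketch, suppose each split decision draws `mtry` candidate features uniformly without replacement from `p` features. The value `mtry = floor(sqrt(p))` is a common forest default and is assumed here; this page does not state the paper's setting, nor whether sampling happens per node or per tree. The helper name is hypothetical.

```python
from math import comb, isqrt

def prob_informative_in_candidates(p, n_informative, mtry):
    """P(at least one informative feature is among mtry features
    drawn uniformly without replacement from p features)."""
    return 1.0 - comb(p - n_informative, mtry) / comb(p, mtry)

p, n_informative = 1000, 2
mtry = isqrt(p)  # 31; an assumed default, not a value taken from the paper
print(round(prob_informative_in_candidates(p, n_informative, mtry), 4))
```

Under these assumptions, with p = 1,000 and two informative features, only about 6% of candidate sets contain an informative feature at all, so most split decisions are made among noise features.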

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same permutation-test logic could be grafted onto other tree algorithms to reduce split bias without switching to full conditional inference.
Runtime gains from adaptive stopping suggest that similar early-stopping rules might help permutation-based methods in other domains such as causal discovery.
The observed risk that informative features are sampled out of many trees points to a need for stratified or importance-weighted sampling inside the forest.
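The importance-weighted sampling idea in the last extension could look roughly like the following. This is a hypothetical sketch, not something the paper proposes or evaluates: `sample_features` and its `weights` argument are illustrative names, and how the weights would be obtained (for example from a cheap univariate screen) is left open.

```python
import numpy as np

def sample_features(p, mtry, weights=None, rng=None):
    """Draw mtry distinct candidate features out of p.

    With weights=None this is the usual uniform forest sampling; otherwise
    features are drawn with probability proportional to their weight.
    """
    rng = rng if rng is not None else np.random.default_rng()
    probs = None
    if weights is not None:
        w = np.asarray(weights, dtype=float)
        probs = w / w.sum()  # normalize to a probability vector
    return rng.choice(p, size=mtry, replace=False, p=probs)
```

Any such weighting reintroduces a data-dependent choice of the tested feature set, so it would need to be checked against the fixed-node condition quoted in the abstract.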

Load-bearing premise

The collection of real and synthetic datasets used is representative of the problems where top-k feature ranking matters.

What would settle it

A new collection of thirty datasets in which CIF rankings produce downstream prediction scores that fall below the median of the other sixteen or seventeen methods.

Figures

Figures reproduced from arXiv: 2607.01417 by Justin Downes, Newel Hirst, Robert Milletich, Steve Goley.

**Figure 1.** Classification mean rank by number of selected features. Rows show all 17 feature selection methods in the classification benchmark, averaged across downstream models at each value of k. Lower ranks are better. The columns for k = 5, 10, 25, 50, 100 use 22, 21, 15, 15, and 14 datasets, respectively.

**Figure 2.** Regression mean rank by number of selected features. Rows show all 18 feature selection methods in the regression benchmark, averaged across downstream models at each value of k. Lower ranks are better. The columns for k = 5, 10, 25, 50, 100 use 8, 8, 7, 7, and 6 datasets, respectively.

**Figure 3.** Where CIF’s classification scores peak as k increases, by downstream learner. (A) shows, for each learner, the percentage of 15 datasets where the first evaluated k attaining the best observed CIF score falls below 100, at 100, between 100 and p, or at k=p. (B) shows the mean balanced accuracy change from k=100 to k=p.

**Figure 4.** Synthetic top-k recovery curves. Columns compare single trees, forests, and boosted trees; rows show classification and regression. Each value is the percentage of selected features that are truly informative.

**Figure 5.** Tree-level feature use in one sparse classification forest design (n=250, p=1,000, two informative features, 1,000 trees). Counts are the number of trees using each feature in at least one split. CIF samples a subset of features; CIF-all evaluates all nonconstant features. Each subplot has its own y-axis range because counts differ across methods.

**Figure 6.** Classification CIF margins against ctree, cforest, and CIT by downstream learner and value of k. Each plotted value is the mean balanced-accuracy difference, CIF minus comparator, for one downstream learner and one value of k; positive values favor CIF. Dataset counts by k are 22, 21, 15, 15, and 14 for ctree and cforest, and 23, 22, 16, 16, and 15 for CIT.

**Figure 7.** Regression CIF margins against ctree, cforest, and CIT by downstream learner and value of k. Each plotted value is the mean R² difference, CIF minus comparator, for one downstream learner and one value of k; positive values favor CIF. Dataset counts by k are 8, 8, 7, 7, and 6 for each comparison. Large positive plotted values can reflect negative mean R² for the comparator. Numeric labels show the unclippe…

**Figure 8.** Classification forest feature sampling by feature dimension p with 1,000 trees per fit. Points are means over n_informative ∈ {1, 2, 5, 10}; capped bars span the minimum-to-maximum values across those four sparse designs. (A) shows % of splits using informative features; (B) shows the number of distinct uninformative features used.

**Figure 9.** Regression forest feature sampling by feature dimension p with 1,000 trees per fit. Points are means over n_informative ∈ {1, 2, 5, 10}; capped bars span the minimum-to-maximum values across those four sparse designs. (A) shows % of splits using informative features; (B) shows the number of distinct uninformative features used.
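The recovery metric described in the Figure 4 caption, the percentage of selected features that are truly informative, can be written as a short helper. The function name and argument layout are assumptions for illustration; `ranking` is taken to be a list of feature indices ordered from most to least important.

```python
def topk_recovery(ranking, informative, k):
    """Percentage of the top-k ranked features that are truly informative.

    ranking: feature indices, best first.
    informative: set of indices of the truly informative features.
    """
    top = ranking[:k]
    hits = sum(1 for feature in top if feature in informative)
    return 100.0 * hits / len(top)

# Toy example: features 3, 1, and 9 are informative.
ranking = [3, 7, 1, 0, 5]
print(topk_recovery(ranking, {3, 1, 9}, k=2))  # 50.0
```

Note that this is a precision-style measure: it says nothing about informative features that never reach the top k.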

Original abstract

Conditional inference trees (CIT) and conditional inference forests (CIF) reduce split-selection bias by testing features before choosing split thresholds, but repeated permutation tests and threshold searches can make these methods computationally expensive. We study CIT and CIF as top-$k$ feature-ranking methods for downstream prediction using real-data benchmarks, runtime ablations, and synthetic feature-recovery experiments. At a fixed node, if the features and permutation budget do not depend on the node responses, Bonferroni-corrected $+1$ Monte Carlo permutation $p$-values control nodewise rejection under the complete permutation null. CIF ranks 4th among 17 classification methods on 22 datasets and 3rd among 18 regression methods on 8 datasets. With Bonferroni correction held fixed, the CIF runtime ablations indicate that adaptive stopping and the number of thresholds searched have the largest measured effect on runtime: turning off adaptive stopping and using exact threshold search increase fitting time by 4.0--8.4$\times$ and 1.9--10.8$\times$, respectively, while downstream score changes are at most 0.011. Sparse high-$p$ simulations indicate that forest feature sampling can leave informative features out of many split decisions. Overall, the results support CIF as a top-$k$ feature-ranking method in the evaluated downstream prediction benchmarks.
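The nodewise control statement in the abstract can be unpacked with a standard union-bound argument. The notation here is assumed for illustration and is not taken from the paper: $m$ is the number of features tested at the node, $B$ is a permutation budget fixed before looking at the node responses, $T_j$ is the observed statistic for feature $j$, and $T_j^{(b)}$ is its value under the $b$-th random permutation of the responses.

$$
\hat p_j = \frac{1 + \sum_{b=1}^{B} \mathbf{1}\{T_j^{(b)} \ge T_j\}}{B + 1}, \qquad j = 1, \dots, m.
$$

Under the complete permutation null, each $+1$ p-value satisfies $\Pr(\hat p_j \le u) \le u$ for every $u \in [0, 1]$. Rejecting at the node only when $\min_j m\,\hat p_j \le \alpha$ then gives

$$
\Pr\Big(\min_{j} m\,\hat p_j \le \alpha\Big) \;\le\; \sum_{j=1}^{m} \Pr\Big(\hat p_j \le \frac{\alpha}{m}\Big) \;\le\; m \cdot \frac{\alpha}{m} \;=\; \alpha.
$$

The second inequality relies on $m$ and $B$ being chosen without reference to the responses, which is the precondition the abstract states. This is the textbook argument and may differ from the paper's own proof.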

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CIF ranks well in the reported benchmarks, but the adaptive stopping used in the main runs likely breaks the p-value control the paper relies on to justify reduced split bias.


The paper takes established conditional inference trees and forests and tests them specifically as top-k feature rankers on real data, with added runtime ablations and synthetic feature-recovery checks. The main empirical result is that CIF places 4th out of 17 methods on 22 classification datasets and 3rd out of 18 on 8 regression ones, with downstream prediction scores that are competitive.

The runtime section is the clearest addition: turning off adaptive stopping increases fitting time by 4.0–8.4×, and using exact threshold search increases it by 1.9–10.8×, while downstream scores change by at most 0.011. The synthetic experiments also flag that forest feature sampling can leave informative variables out of many split decisions. These are straightforward, useful measurements.

The soft spot is the p-value claim. The paper states that Bonferroni-corrected +1 Monte Carlo p-values control nodewise rejection only when features and permutation budget do not depend on node responses. The ablations identify adaptive stopping as the dominant runtime saver, which means the main reported CIF results use a procedure that makes the budget depend on the observed responses. That directly violates the stated precondition, so the frequentist guarantee the authors invoke for reduced split-selection bias does not apply to the experiments they actually ran.
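To make the dependence concrete, here is a minimal sketch of one common early-stopping rule, in the style of the sequential Monte Carlo p-values of Besag and Clifford: stop as soon as a fixed number of permuted statistics reach the observed one. The paper's actual stopping rule is not described on this page, so the rule, the statistic (absolute Pearson correlation), and the parameter names `max_perm` and `stop_after` are all assumptions.

```python
import numpy as np

def sequential_pvalue(x, y, max_perm=999, stop_after=10, seed=0):
    """Permutation p-value with early stopping.

    Returns (p_value, permutations_used). Sampling stops once `stop_after`
    permuted statistics are at least as large as the observed one, so the
    number of permutations used depends on the responses y.
    """
    rng = np.random.default_rng(seed)

    def stat(a, b):
        return abs(np.corrcoef(a, b)[0, 1])

    observed = stat(x, y)
    exceed = 0
    for used in range(1, max_perm + 1):
        if stat(x, rng.permutation(y)) >= observed:
            exceed += 1
            if exceed == stop_after:
                return exceed / used, used  # clearly non-significant: stop early
    # Budget exhausted: fall back to the usual +1 estimate.
    return (exceed + 1) / (max_perm + 1), max_perm
```

A feature unrelated to the response typically stops after a handful of permutations, while a strongly associated feature consumes the full budget. That is where the runtime saving comes from, and it is also why the budget is no longer independent of the node responses.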

The raw benchmark rankings still stand as direct comparisons, but the mechanistic story tying those rankings to the method’s bias-reduction property is undercut. The paper is aimed at practitioners who already use tree-based feature ranking and want runtime numbers plus a check on whether the permutation approach holds up in practice. It is coherent enough on its own terms to go to referees, though any review would need to focus on whether the p-value issue affects the ranking claims or just the interpretation.

Referee Report

1 major / 2 minor

Summary. The paper studies conditional inference trees (CIT) and forests (CIF) as top-k feature-ranking methods for downstream prediction tasks. It provides a condition under which Bonferroni-corrected +1 Monte Carlo permutation p-values control nodewise rejection under the complete permutation null, reports that CIF ranks 4th among 17 classification methods on 22 datasets and 3rd among 18 regression methods on 8 datasets, presents runtime ablations showing that adaptive stopping and threshold search have the largest effects on runtime (with downstream score changes ≤0.011), and includes synthetic feature-recovery experiments indicating that forest feature sampling can omit informative features.

Significance. If the p-value control holds and the empirical rankings are robust, the work would provide a theoretically grounded, bias-reduced alternative to standard tree-based feature selection with competitive downstream performance and quantified runtime trade-offs. The ablations and synthetic results add practical insight into implementation choices.

major comments (1)

[Abstract and runtime ablations] Abstract (p-value control statement) and runtime ablations: The paper states that at a fixed node, Bonferroni-corrected +1 Monte Carlo permutation p-values control nodewise rejection under the complete permutation null only "if the features and permutation budget do not depend on the node responses." The runtime ablations report that turning off adaptive stopping increases fitting time by 4.0–8.4× and identify it as having the largest measured effect, implying the main reported CIF results use adaptive stopping. This makes the effective permutation budget depend on observed responses, directly violating the stated precondition and removing the claimed frequentist guarantee for the nodewise p-values that justify reduced split-selection bias and reliable top-k ranking.

minor comments (2)

[Abstract] The abstract reports CIF ranking 4th/17 on classification and 3rd/18 on regression but does not specify whether these are mean ranks, median ranks, or win counts, nor the exact metric (e.g., AUC, accuracy) used for downstream evaluation.
[Abstract] The synthetic experiments mention "sparse high-p simulations" but the abstract does not report the exact dimensions (p, n, sparsity level) or recovery metric, making it hard to assess how representative they are of the real-data regime.
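The first minor comment matters because mean ranks, median ranks, and win counts can order methods differently. The sketch below computes all three from a table of downstream scores. The layout (rows are datasets, columns are methods, higher is better) and the function names are assumptions for illustration; ties are broken by column order rather than averaged.

```python
import numpy as np

def rank_methods(scores):
    """Per-dataset ranks (1 = best) for a (n_datasets, n_methods) score array."""
    order = np.argsort(-scores, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, scores.shape[1] + 1)
    return ranks

def summarize(scores):
    """Three ways to aggregate per-dataset ranks into one number per method."""
    ranks = rank_methods(scores)
    return {
        "mean_rank": ranks.mean(axis=0),
        "median_rank": np.median(ranks, axis=0),
        "wins": (ranks == 1).sum(axis=0),
    }

# Toy scores for methods A, B, C on three datasets.
scores = np.array([[0.9, 0.8, 0.7],
                   [0.9, 0.8, 0.7],
                   [0.1, 0.8, 0.7]])
print(summarize(scores))
```

In this toy table, methods A and B tie on mean rank, yet A leads on median rank and on wins, which is why a headline such as "ranks 4th" needs the aggregation rule stated alongside it.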

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for pointing out the potential inconsistency between the stated condition for p-value control and the use of adaptive stopping in the reported experiments. We respond to the major comment below.


Referee: [Abstract and runtime ablations] Abstract (p-value control statement) and runtime ablations: The paper states that at a fixed node, Bonferroni-corrected +1 Monte Carlo permutation p-values control nodewise rejection under the complete permutation null only "if the features and permutation budget do not depend on the node responses." The runtime ablations report that turning off adaptive stopping increases fitting time by 4.0–8.4× and identify it as having the largest measured effect, implying the main reported CIF results use adaptive stopping. This makes the effective permutation budget depend on observed responses, directly violating the stated precondition and removing the claimed frequentist guarantee for the nodewise p-values that justify reduced split-selection bias and reliable top-k ranking.

Authors: We agree with the referee that the main CIF results use adaptive stopping, which makes the permutation budget depend on the observed responses at each node and so violates the precondition for the nodewise p-value control result. The theoretical guarantee applies only when the permutation budget is fixed independently of the responses. We will revise the abstract to clarify the scope of the p-value control statement and add a note in the methods or discussion section explaining that adaptive stopping is employed for computational efficiency in the empirical evaluations, while the frequentist control is guaranteed only without it. The benchmark rankings and runtime ablation results are empirical observations and remain unchanged. This revision addresses the concern without altering the core findings.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks


The paper presents no derivation chain that reduces to its inputs by construction. Its strongest claims are direct empirical rankings of CIF among 17 classification methods on 22 datasets and 18 regression methods on 8 datasets, plus runtime ablations and synthetic recovery experiments. The single conditional statement on p-value control is an if-then claim about a fixed-node permutation test and is not invoked to derive the reported rankings or to rename any fitted quantity as a prediction. No self-citations, ansatzes, or uniqueness theorems appear as load-bearing steps. The work is therefore self-contained against the external benchmarks it reports.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; full text would be required to audit these elements.



[1] [1]

Generalized random forests.The Annals of Statistics, 47(2):1148–1178, 2019

Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests.The Annals of Statistics, 47(2):1148–1178, 2019. doi: 10.1214/18-AOS1709

work page doi:10.1214/18-aos1709 2019

[2] [2]

Sequential Monte Carlo p-values.Biometrika, 78(2):301–304,

Julian Besag and Peter Clifford. Sequential Monte Carlo p-values.Biometrika, 78(2):301–304,

[3] [3]

doi: 10.1093/biomet/78.2.301

work page doi:10.1093/biomet/78.2.301

[4] [4]

Random forests.Machine Learning, 45(1):5–32, 2001

Leo Breiman. Random forests.Machine Learning, 45(1):5–32, 2001. doi: 10.1023/A: 1010933404324. 17

work page doi:10.1023/a: 2001

[5] [5]

Friedman, Charles J

Leo Breiman, Jerome H. Friedman, Charles J. Stone, and Richard A. Olshen.Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984. ISBN 9780412048418

1984

[6] [6]

A scalable tree boosting system

Tianqi Chen and Carlos Guestrin. A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016. doi: 10.1145/2939672.2939785

work page doi:10.1145/2939672.2939785 2016

[7] [7]

Conditional permutation importance revisited.BMC Bioinformatics, 21(1):307, 2020

Dries Debeer and Carolin Strobl. Conditional permutation importance revisited.BMC Bioinformatics, 21(1):307, 2020. doi: 10.1186/s12859-020-03622-2

work page doi:10.1186/s12859-020-03622-2 2020

[8] [8]

Exploring classification strategies with the CoEPrA 2006 contest.Bioinformatics, 26(5):603–609, 2010

Ozgur Demir-Kavuk, Henning Riedesel, and Ernst-Walter Knapp. Exploring classification strategies with the CoEPrA 2006 contest.Bioinformatics, 26(5):603–609, 2010. doi: 10.1093/ bioinformatics/btq021

2006

[9] [9]

Michael D. Ernst. Permutation methods: A basis for exact inference.Statistical Science, 19(4): 676–685, 2004. doi: 10.1214/088342304000000396

work page doi:10.1214/088342304000000396 2004

[10] [10]

Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk.Journal of the American Statistical Association, 104(488):1504–1511, 2009

Axel Gandy. Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk.Journal of the American Statistical Association, 104(488):1504–1511, 2009. doi: 10.1198/ jasa.2009.tm08368

2009

[11] [11]

Variable selection using random forests.Pattern Recognition Letters, 31(14):2225–2236, 2010

Robin Genuer, Jean-Michel Poggi, and Christine Tuleau-Malot. Variable selection using random forests.Pattern Recognition Letters, 31(14):2225–2236, 2010. doi: 10.1016/j.patrec.2010.03.014

work page doi:10.1016/j.patrec.2010.03.014 2010

[12] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006. doi: 10.1007/s10994-006-6226-1

[13] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002. doi: 10.1023/A:1012487302797

[14] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, volume 17, 2004. URL https://proceedings.neurips.cc/paper_files/paper/2004/hash/5e751896e527c862bf67251a474b3819-Abstract.html

[15] Jesse Hemerik and Jelle Goeman. Exact testing with random permutations. TEST, 27(4):811–825, 2018. doi: 10.1007/s11749-017-0571-1

[16] Torsten Hothorn and Achim Zeileis. partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16(118):3905–3909, 2015. URL https://jmlr.org/papers/v16/hothorn15a.html

[17] Torsten Hothorn, Peter Bühlmann, Sandrine Dudoit, Annette Molinaro, and Mark J. van der Laan. Survival ensembles. Biostatistics, 7(3):355–373, 2006. doi: 10.1093/biostatistics/kxj011

[18] Torsten Hothorn, Kurt Hornik, Mark A. van de Wiel, and Achim Zeileis. A lego system for conditional inference. The American Statistician, 60(3):257–263, 2006. doi: 10.1198/000313006X118430

[19] Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006. doi: 10.1198/106186006X133933

[21] Torsten Hothorn, Heidi Seibold, and Achim Zeileis. partykit: A Toolkit for Recursive Partytioning, 2026. URL https://CRAN.R-project.org/package=partykit. R package version 1.2-27

[22] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30, pages 3146–3154, 2017. URL https://papers.nips.cc/paper/6907-a-highly-efficient-gradient-boosting-decision-tree

[23] Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. UCI machine learning repository. https://archive.ics.uci.edu, n.d. URL https://archive.ics.uci.edu/. Accessed 2026-05-04

[24] Hyunjoong Kim and Wei-Yin Loh. Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96(454):589–604, 2001. doi: 10.1198/016214501753168271

[25] Miron B. Kursa and Witold R. Rudnicki. Feature selection with the Boruta package. Journal of Statistical Software, 36(11):1–13, 2010. doi: 10.18637/jss.v036.i11

[26] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, and Huan Liu. Feature selection: A data perspective. ACM Computing Surveys, 50(6):94:1–94:45, 2017. doi: 10.1145/3136625

[28] Wei-Yin Loh. Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12(2):361–386, 2002. URL https://www3.stat.sinica.edu.tw/statistica/j12n2/j12n21/j12n21.htm

[29] Wei-Yin Loh and Yu-Shan Shih. Split selection methods for classification trees. Statistica Sinica, 7(4):815–840, 1997. URL https://www3.stat.sinica.edu.tw/statistica/j7n4/j7n41/j7n41.htm

[30] David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf. The randomized dependence coefficient, 2013. arXiv:1304.7717

[31] B. V. North, David Curtis, and Pak C. Sham. A note on the calculation of empirical P values from Monte Carlo procedures. American Journal of Human Genetics, 71(2):439–441, 2002. doi: 10.1086/341527

[32] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake VanderPlas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85):2825–2830, 2011

[33] Belinda Phipson and Gordon K. Smyth. Permutation P-values should never be zero: Calculating exact P-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, 9(1):1–12, 2010. doi: 10.2202/1544-6115.1585

[34] Liudmila Ostroumova Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, volume 31, pages 6639–6649, 2018. URL https://papers.nips.cc/paper/7898-catboost-unbiased-boosting-with-categorical-features

[35] Yu-Shan Shih. A note on split selection bias in classification trees. Computational Statistics & Data Analysis, 45(3):457–466, 2004. doi: 10.1016/S0167-9473(03)00064-1

[36] Helmut Strasser and Christian Weber. The asymptotic theory of permutation statistics. Mathematical Methods of Statistics, 8(2):220–250, 1999

[37] Carolin Strobl, Anne-Laure Boulesteix, and Thomas Augustin. Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483–501, 2007. doi: 10.1016/j.csda.2006.12.030

[38] Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25, 2007. doi: 10.1186/1471-2105-8-25

[39] Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307, 2008. doi: 10.1186/1471-2105-9-307

[40] Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007. doi: 10.1214/009053607000000505

[41] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL https://doi.org/10.1145/2641190.2641198

[42] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018. doi: 10.1080/01621459.2017.1319839

Appendix A Setup and assumptions

A.1 Fixed-node setup

Let $(X_i, Y_i)_{i=1}^{n}$ be the training data. Fix a node $t$ with sample index set $I_t$ and size $n_t := |I_t|$. Write $X_t := (X_i)_{i \in I_t}$ and $Y_t := (Y_i)_{i \in I_t}$ for the data restricted to that node. Stage A tests features $F_t$ with $m_t := |F_t|$. The fixed-node theorem analyzes the exhausti...