pith. sign in

arxiv: 2502.16758 · v3 · submitted 2025-02-24 · 🧮 math.ST · math.PR· stat.TH

Stabilizing the Splits through Minimax Decision Trees

Pith reviewed 2026-05-23 02:57 UTC · model grok-4.3

classification 🧮 math.ST math.PRstat.TH
keywords MinimaxSplitdecision treesoracle inequalitiesCARTregressionclassificationend-cut preferencenon-atomicity
0
0 comments X

The pith

MinimaxSplit decision trees select splits by minimizing worst-case child risk rather than average risk.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MinimaxSplit decision trees as a robust alternative to CART that addresses the end-cut preference phenomenon by choosing splits to minimize the maximum child impurity instead of the average. For regression the criterion minimizes the maximum within-child squared error; for classification it minimizes the maximum child entropy. Oracle inequalities are proved for both tasks under mild marginal non-atomicity conditions on the covariates, bounding the tree's global excess risk directly by local worst-case impurities. These bounds deliver faster convergence rates than standard CART. The analysis covers a cyclic coordinate-cycling variant and a random-dimension forest extension, with reported empirical gains on heterogeneous data such as EEG amplitude regression and spatial image denoising.

Core claim

MinimaxSplit decision trees select splits by minimizing the maximum within-child risk (squared error for regression, entropy for classification) rather than the average risk. Oracle inequalities are established that cover both regression and classification under mild marginal non-atomicity conditions on the covariate distributions. The bounds control the tree's global excess risk by local worst-case impurities and yield fast convergence rates compared to CART. The main method studied is the cyclic variant that deterministically cycles through coordinates, and the analysis extends to random-dimension forests that subsample coordinates per node.

What carries the argument

The minimax split criterion that selects the split minimizing the maximum child risk (squared error or entropy) instead of the average risk.

If this is right

  • Oracle inequalities hold simultaneously for squared-error regression and entropy-based classification.
  • Global excess risk is bounded above by local worst-case child impurities.
  • Convergence rates are faster than the corresponding rates for CART trees.
  • The cyclic variant cycles deterministically through coordinates at each node.
  • Random-dimension forest variants subsample coordinates at each node while retaining the minimax split rule.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same worst-case control might extend to other splitting criteria or to boosted tree ensembles.
  • Structured spatial or temporal data may benefit from the deterministic cycling in the cyclic variant.
  • High-dimensional settings with many irrelevant coordinates could see further gains from the random-dimension extension.

Load-bearing premise

The mild marginal non-atomicity conditions on the covariate distributions are required to establish the oracle inequalities.

What would settle it

A concrete counterexample showing that the oracle inequality fails on a covariate distribution containing atoms would falsify the central claim.

Figures

Figures reproduced from arXiv: 2502.16758 by Hengrui Luo, Zhenyuan Zhang.

Figure 1
Figure 1. Figure 1: The comparison of the scikit-learn DecisionTreeRegressor, our MinimaxSplit tree, and Vari [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The comparison of the VarianceSplit, MinimaxSplit and cyclic MinimaxSplit trees. Here we use [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Visualizing the nested sequence of partitions [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of the mean squared errors for the MinimaxSplit and VarianceSplit regimes, in the [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Empirical distributions of sample size n = 500 over 1000 replicates of the smaller-child proportion min(nL, nR)/n when splitting pure-noise data generated by standard normal (left), t(3) (middle), and t(1) (right) distributions. We also use dashed lines to denote the threshold 0.05 [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Average leaf size (mean number of samples per leaf) versus maximum tree depth for four splitting [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Fixed-horizon EEG amplitude regression with trees. The top row summarizes performance across [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparative analysis of various random forest (depth [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of decision tree regression methods (depth [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Plots of the log-MSE, log2 (E[(U − Mk) 2 ]) versus the approximation depth k for four different methods, where the density of the law U is given: (a) by f(u) ∝ 1[0,0.9](u) + (1 + 104 (u − 0.9))1[0.9,1](u) and (b) by f(u) ∝ u 101[0,1](u). The right panel of (b) shows the ratios E[(U − Mk+1) 2 ]/E[(U − Mk) 2 ] for the four methods as a function of k. (i) Define uvar = arg min u∈I [PITH_FULL_IMAGE:figures/f… view at source ↗
Figure 11
Figure 11. Figure 11: Showcasing the possibilities of the splits for levels [PITH_FULL_IMAGE:figures/full_fig_p046_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Comparison of node-level split selection in two random forest variants. Left: a random-dimension [PITH_FULL_IMAGE:figures/full_fig_p056_12.png] view at source ↗
read the original abstract

By revisiting the end-cut preference (ECP) phenomenon associated with a single CART (Breiman et al. (1984)), we introduce MinimaxSplit decision trees, a robust alternative to CART that selects splits by minimizing the worst-case child risk rather than the average risk. For regression, we minimize the maximum within-child squared error; for classification, we minimize the maximum child entropy, yielding a C4.5-compatible criterion. We also study a cyclic variant that deterministically cycles coordinates, leading to our main method of cyclic MinimaxSplit decision trees. We prove oracle inequalities that cover both regression and classification, under mild marginal non-atomicity conditions. The bounds control the tree's global excess risk by local worst-case impurities and yield fast convergence rates compared to CART. We extend the analysis to a random-dimension forest variant that subsamples coordinates per node. Empirically, (cyclic) MinimaxSplit trees and their forests improve over baselines on structured heterogeneous data such as EEG amplitude regression over fixed time horizons and image denoising, framed as non-parametric regression on spatial coordinates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces MinimaxSplit decision trees that select splits by minimizing the worst-case child impurity (maximum within-child squared error for regression; maximum child entropy for classification) rather than the average impurity used in CART, to address the end-cut preference phenomenon. A cyclic variant that deterministically cycles through coordinates is presented as the main method, along with a random-dimension forest extension. Oracle inequalities are claimed for both regression and classification tasks under mild marginal non-atomicity conditions on the covariates; these bounds relate the tree's global excess risk to local worst-case impurities and are asserted to yield faster convergence rates than CART. Empirical results on EEG amplitude regression and image denoising (framed as spatial regression) are reported to show improvements over baselines.

Significance. If the oracle inequalities hold with the stated rates under the invoked conditions, the work would supply a theoretically grounded splitting criterion that stabilizes tree construction while retaining interpretability, with potential applicability to heterogeneous structured data. The combination of minimax criterion, cyclic coordinate handling, and forest extension, together with the claimed oracle bounds, would distinguish it from standard CART analyses.

major comments (2)
  1. [Abstract] Abstract (oracle inequalities paragraph): the claimed oracle inequalities for regression and classification rest on 'mild marginal non-atomicity conditions' whose precise statement (e.g., absolute continuity of marginals, absence of atoms of positive probability, or uniform density lower bounds) is not supplied. Without this, it is impossible to verify necessity, check satisfaction on the EEG/image examples, or confirm that the proof technique does not collapse when the conditions fail; this is load-bearing for the central theoretical claim.
  2. [Abstract] Abstract (bounds paragraph): the statement that the bounds 'control the tree's global excess risk by local worst-case impurities' requires an explicit inequality (with constants and remainder terms) to assess whether the control is tight enough to deliver the asserted faster rates relative to CART; the current phrasing leaves the quantitative relationship between local minimax impurity and global excess risk unstated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. The points raised concern the level of detail in the abstract regarding the non-atomicity conditions and the quantitative form of the oracle inequalities. We address each comment below and will revise the abstract accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract (oracle inequalities paragraph): the claimed oracle inequalities for regression and classification rest on 'mild marginal non-atomicity conditions' whose precise statement (e.g., absolute continuity of marginals, absence of atoms of positive probability, or uniform density lower bounds) is not supplied. Without this, it is impossible to verify necessity, check satisfaction on the EEG/image examples, or confirm that the proof technique does not collapse when the conditions fail; this is load-bearing for the central theoretical claim.

    Authors: We agree that the abstract would benefit from a concise statement of the conditions. Section 2.2 of the manuscript defines them explicitly as: the marginal distribution of each covariate is non-atomic, i.e., P(X_j = x) = 0 for every coordinate j and every x. This is weaker than absolute continuity or density lower bounds and is the minimal assumption used in the proofs to guarantee that splits exist with positive probability mass on both children. The EEG amplitude and image coordinate data consist of continuous measurements and therefore satisfy the condition; we will add a sentence in the empirical section confirming this. We will revise the abstract to read: 'under the mild assumption that the marginal distributions of the covariates are non-atomic.' revision: yes

  2. Referee: [Abstract] Abstract (bounds paragraph): the statement that the bounds 'control the tree's global excess risk by local worst-case impurities' requires an explicit inequality (with constants and remainder terms) to assess whether the control is tight enough to deliver the asserted faster rates relative to CART; the current phrasing leaves the quantitative relationship between local minimax impurity and global excess risk unstated.

    Authors: The explicit inequalities appear in Theorems 3.1 (regression) and 4.2 (classification). For regression they take the form R(hat f) - R(f*) <= C * (sum over leaves of the minimax child impurity) + o(1) as n -> infinity, where C depends only on tree depth; an analogous bound holds for classification with entropy. The non-atomicity ensures the o(1) term vanishes at the stated rate, which is faster than the corresponding CART bounds because the minimax criterion directly controls the worst-case child impurity rather than an average. We will revise the abstract to include a high-level version of this relation, e.g., 'with global excess risk bounded by a multiple of the sum of local minimax impurities plus a vanishing remainder.' revision: yes

Circularity Check

0 steps flagged

No significant circularity; oracle inequalities derived independently from new minimax criterion

full rationale

The paper defines the MinimaxSplit criterion (max within-child squared error for regression; max child entropy for classification) independently of the target excess-risk quantity. Oracle inequalities are proved from this definition under external marginal non-atomicity assumptions, without any reduction of the bounds to fitted parameters, self-citations, or renaming of known results. The derivation chain is self-contained against external benchmarks and does not invoke load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The theoretical results rest on the non-atomicity assumption for the marginal distributions; no free parameters or new postulated entities are described in the abstract.

axioms (1)
  • domain assumption mild marginal non-atomicity conditions
    Invoked to ensure the oracle inequalities control global excess risk via local worst-case impurities.

pith-pipeline@v0.9.0 · 5717 in / 1253 out tokens · 40678 ms · 2026-05-23T02:57:49.074461+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The maximum-entropy median-martingale

    math.PR 2026-05 unverdicted novelty 6.0

    The maximum-entropy median-martingale walk on [0,1] has the arcsine distribution as its stationary distribution, with a proof connecting it to classical arcsine laws for Brownian motion.

  2. The maximum-entropy median-martingale

    math.PR 2026-05 unverdicted novelty 5.0

    The maximum-entropy median-martingale on [0,1] has the arcsine distribution as its stationary distribution, with a proof linking it to two classical arcsine laws for Brownian motion, plus a generalization of martingal...

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 1 Pith paper

  1. [1]

    On the pointwise behavior of recursive parti- tioning and its implications for heterogeneous causal effect estimation.arXiv preprint arXiv:2211.10805,

    Matias D Cattaneo, Jason M Klusowski, and Peter M Tian. On the pointwise behavior of recursive parti- tioning and its implications for heterogeneous causal effect estimation.arXiv preprint arXiv:2211.10805,

  2. [2]

    arXiv:2306.00361, 1–46 (2022)

    PhD thesis. Hengrui Luo and Matthew T Pratola. Sharded Bayesian additive regression trees.arXiv preprint arXiv:2306.00361,

  3. [3]

    Efficient decision trees for tensor regressions.arXiv preprint arXiv:2408.01926,

    Hengrui Luo, Akira Horiguchi, and Li Ma. Efficient decision trees for tensor regressions.arXiv preprint arXiv:2408.01926,

  4. [4]

    These problems are of their own interest (see the end of Appendix A.1) and have implications in the regression tree setting

    A On partition-based martingale approximations In this section, we develop the framework of partition-based martingale approximations and show that a number of constructions feature exponential convergence rates. These problems are of their own interest (see the end of Appendix A.1) and have implications in the regression tree setting. The main takeaway i...

  5. [5]

    sizes” ofU1 {U∈I} on the two sets. The “size

    UnlessUis a constant, there are different partition-based martingale approximations depending on the distribution ofU. Our goal in this section is to identify a few partition-based martingale approximations that efficiently approximateU, where the construction algorithm is universal. The efficiency criterion is given by the decay of the MSEE[(U−M k)2]. Th...

  6. [6]

    Note that in our setting,Yis a sum of martingale differences, but whose variance is bounded, and hence one cannot apply the martingale CLT or its convergence rates

    and Markovian walks (Grama et al., 2018). Note that in our setting,Yis a sum of martingale differences, but whose variance is bounded, and hence one cannot apply the martingale CLT or its convergence rates. In the next two sections, we discuss two types of results:uniformrates for boundedU(Appendix A.2) and non-uniformrates (Appendix A.3) where the asympt...

  7. [7]

    TakeM >sup{x, y}

    Proof.Let suppM 1 ={x, y}wherex < y. TakeM >sup{x, y}. Letτ + M be the first hitting time to {x:x⩾M},τ − M be the first hitting time to{x:x⩽−M}, andτ M = min{τ + M , τ − M }. It is straightforward to prove (see the proof of Lemma 5.6 of Zhang et al. (2024)) that for somer∈(0,1) andC >0, E[(U−M k)2]⩽E[1 {τM <∞}U2] +CM 2rk. Since the supports of{M k}k⩾0 cor...

  8. [8]

    (2024), and our proof of Theorem 19 follows a similar route

    The case of the Simons martingale has been established by Zhang et al. (2024), and our proof of Theorem 19 follows a similar route. Since the 31 asymptotic constant is allowed to depend on the law ofU, the optimal non-uniform ratermight be smaller than the uniform ones (in Theorem 16). Recall from Appendix A.2 that the rates obtained in Theorem 16 are asy...

  9. [9]

    It follows from (36) that (35) holds for someu ∗ ∈[u 1, u2]

    Therefore, the mapu7→P(U⩽u) is constant on [u 1, u2] and for eachu∈[u 1, u2], (E[U|U⩽u],E[U| U > u]) = (m a(pvar), mb(pvar)). It follows from (36) that (35) holds for someu ∗ ∈[u 1, u2]. Using the definition (34) andp var =P(U⩽u var), we haveu var = max u∈[u1,u2] u=u 2, which implies that P(U⩽u ∗) =P(U⩽u var). Therefore, in both cases, the desired claim i...

  10. [10]

    This shows (44) and thus proves (43)

    It follows that E h ( √ δ Y− 1 +δ√ δ X)2 i ⩾E h√ δ Y− 1 +δ√ δ X i2 = (1 +δ) 2 δ E[X]2 ⩾ (1 +δ) 2 δ (inf suppX) 2 ⩾ (1 +δ) δ (sup suppX) 2 ⩾ 1 +δ δ E[X2]. This shows (44) and thus proves (43). Proof of Theorem 2.We divide the proof into three steps. Step I: reducing to an inequality involvingVar(Y).We claim that it suffices to prove E[(Y−M k)2]⩽inf g∈G (1 ...

  11. [11]

    By settingM= 2UwhereU= max|Y i|, the extra term of 8 3 2kM2 N e− k 6d + d n1/3 ⩽C 2kU2 N + 2kU2d N n1/3 is carried until (B.34) of Klusowski and Tian (2024)

    Again we follow the arguments in the proof of Theorem 4.3 for CART of Klusowski and Tian (2024). By settingM= 2UwhereU= max|Y i|, the extra term of 8 3 2kM2 N e− k 6d + d n1/3 ⩽C 2kU2 N + 2kU2d N n1/3 is carried until (B.34) of Klusowski and Tian (2024). The term 2 kU2/Nis absorbed by the term U42k log(N d)/Ntherein. The other term 2 kU2d/(N n1/3) is carr...

  12. [12]

    Here, we use the notationξ n =O P(αn) if the sequence{ξ n/αn}is tight

    It follows from standard arguments in van der Vaart and Wellner (2013) that the class{f1 R}R∈R is a Donsker class, so withm R =E P∗[f(X)|X∈R], we have sup R∈R |EPn[f(X)|X∈R]−m R|=O P(1/√n). Here, we use the notationξ n =O P(αn) if the sequence{ξ n/αn}is tight. Letg R(x) = (f(x)−m R)2 forR∈ R. The class {gR1 R}R∈R is also Donsker. ForR∈ R, letV R denote th...

  13. [13]

    Define R↑ a,b;u = [a, b]×[u,1], R ↓ a,b;u = [a, b]×[−1, u], u∈[−1,1]; R← a,b;u = [a, u]×[−1,1], R → a,b;u = [u, b]×[−1,1], u∈[a, b]

    Consider a rectangleR a,b = [a, b]×[−1,1]∈ Rwhere−1⩽a < b⩽1. Define R↑ a,b;u = [a, b]×[u,1], R ↓ a,b;u = [a, b]×[−1, u], u∈[−1,1]; R← a,b;u = [a, u]×[−1,1], R → a,b;u = [u, b]×[−1,1], u∈[a, b]. Also denote bym(p) =E[(M−1)1 {M⩾1} /M] whereM law ∼Bin(n, p). A standard computation (by first conditioning on the number of samples in the rectangle) yields that ...

  14. [14]

    It follows thata 0 = 2 and for eachk,a k+1 ⩾a k/2−C/ √n

    of the rectangles at levelk. It follows thata 0 = 2 and for eachk,a k+1 ⩾a k/2−C/ √n. Solving this recursion yieldsa k ⩾2 1−k −2C/ √n for allk. In particular,a K ⩾2 1−K −2C/ √n⩾C/n 1/6. In summary, we have shown that ifn⩾C 064K, with high probability, the MinimaxSplit dimension is always along the first dimension for the firstKlevels of the tree. As a con...