pith. machine review for the scientific record.

arxiv: 2604.20394 · v1 · submitted 2026-04-22 · 💻 cs.DS

Recognition: unknown

Nearly Optimal Bounds for Computing Decision Tree Splits in Data Streams

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:37 UTC · model grok-4.3

classification 💻 cs.DS
keywords data streams · decision trees · regression · classification · Gini impurity · space lower bounds · reservoir sampling · Count-Min sketches

The pith

Streaming algorithms approximate optimal decision tree splits in one pass with space matching the lower bounds for both regression and classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to approximate the best split for a decision tree node when the data arrives as a stream that can be read only once. For regression with numeric labels up to M, a one-pass algorithm uses space roughly M squared over epsilon to guarantee an additive error of epsilon in split quality, and a matching lower bound shows that much space is necessary. For categorical classification with the Gini measure, the space drops to roughly one over epsilon, again with a matching lower bound. These nearly tight results improve on earlier algorithms that needed two passes or more space. The work matters for building accurate tree models on very large or continuously arriving datasets where memory is limited.
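
The one-pass constraint is the whole game: each record is seen once, and the algorithm must commit its memory before the next arrives. As a minimal illustration of the kind of primitive such algorithms build on (not the paper's algorithm itself), reservoir sampling keeps a uniform random sample of an unbounded stream in fixed space:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of size k from a one-pass stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randrange(i + 1)     # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=100)
print(len(sample))  # always exactly 100, regardless of stream length
```

The function above is textbook Algorithm R; the paper combines such samples with sketches rather than using sampling alone.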

Core claim

For regression with labels in the range {0,1,…,M}, we give a one-pass algorithm using Õ(M²/ε) space that outputs a split within additive ε error of the optimal split, improving upon the two-pass algorithm of Pham et al. We also provide a matching Ω(M²/ε) lower bound showing that this space is necessary. For classification, we obtain a one-pass algorithm using Õ(1/ε) space for approximating the optimal Gini split, complementing this with matching Ω(1/ε) lower bounds for both Gini impurity and misclassification error. The algorithms exploit the Lipschitz property of the loss functions and use reservoir sampling along with Count-Min sketches with range queries; the lower bounds follow from careful reductions from the INDEX problem.

What carries the argument

Reservoir sampling combined with Count-Min sketches supporting range queries, which together exploit the Lipschitz property of the split quality loss functions to bound approximation error from sketch inaccuracies.
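
A hedged sketch of what "Count-Min sketches with range queries" typically means: one Count-Min sketch per dyadic level, so any interval decomposes into O(log U) sketched nodes. All class names and parameters below are illustrative of the generic primitive; the paper's exact construction and parameter choices may differ:

```python
import random

class CountMin:
    """Count-Min sketch: point-frequency estimates that never underestimate."""
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, x):
        for r, s in enumerate(self.salts):
            yield r, hash((s, x)) % self.width

    def add(self, x, c=1):
        for r, col in self._cells(x):
            self.table[r][col] += c

    def query(self, x):
        return min(self.table[r][col] for r, col in self._cells(x))

class DyadicCountMin:
    """Range-sum estimates over [0, 2**levels) via one Count-Min per dyadic level."""
    def __init__(self, levels, width=256, depth=4):
        self.levels = levels
        self.sketches = [CountMin(width, depth, seed=l) for l in range(levels + 1)]

    def add(self, x, c=1):
        for l in range(self.levels + 1):
            self.sketches[l].add(x >> l, c)   # x's ancestor node at level l

    def range_query(self, a, b):
        # Greedy dyadic decomposition of [a, b], one query per covering node.
        total, level = 0, 0
        while a <= b:
            if a % 2 == 1:
                total += self.sketches[level].query(a); a += 1
            if b % 2 == 0:
                total += self.sketches[level].query(b); b -= 1
            a //= 2; b //= 2; level += 1
        return total

d = DyadicCountMin(levels=10)       # universe [0, 1024)
for x in range(1000):
    d.add(x)
print(d.range_query(100, 199))      # true answer is 100; never underestimated
```

Because Count-Min errs only upward, a split-quality estimate built from such range sums has one-sided error controlled by the sketch width, which is where the Lipschitz argument takes over.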

Load-bearing premise

The split quality measures such as squared error or Gini impurity satisfy a Lipschitz condition so that accurate estimates of counts and sums yield accurate estimates of split quality.
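
This premise is easy to sanity-check for the binary Gini case: G(p) = 2p(1 - p) has |G'(p)| = |2 - 4p| ≤ 2 on [0, 1], so an error of δ in the estimated class proportion shifts the impurity by at most 2δ. A small numerical check (illustrative, not taken from the paper):

```python
def gini(p):
    # Binary Gini impurity of a node with positive-class proportion p
    return 2.0 * p * (1.0 - p)

def worst_change(delta, steps=1000):
    # Largest |gini(p + delta) - gini(p)| over a grid of p in [0, 1]
    ps = [i / steps for i in range(steps + 1)]
    return max(abs(gini(min(p + delta, 1.0)) - gini(p)) for p in ps)

for delta in (0.05, 0.01, 0.001):
    assert worst_change(delta) <= 2 * delta + 1e-12  # Lipschitz constant 2
print("Gini is 2-Lipschitz on the checked grid")
```

The same one-line derivative bound is what lets sketch error ε in counts translate into O(ε) error in split quality.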

What would settle it

An adversarial stream derived from an INDEX instance of size proportional to M²/ε would force any ε-approximate split finder to use at least that much space, as failing to recover the hidden index would produce a split worse than ε-optimal.
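
The template behind such INDEX arguments can be sketched in code: Alice encodes her bit vector as a stream, runs the one-pass algorithm on it, and sends only the algorithm's memory state; if Bob can decode an arbitrary bit from that state, the state must carry Ω(n) bits. The toy "algorithm" below deliberately remembers everything so the decoding step is concrete; the paper's actual construction instead maps bits to split qualities (this is an assumed illustration, not its reduction):

```python
def alice_stream(bits):
    # Alice emits item 2j + bits[j]: the parity at position j encodes bit j.
    return [2 * j + b for j, b in enumerate(bits)]

def run_one_pass(stream, state=None):
    # Stand-in one-pass algorithm whose memory state is the set of seen items.
    state = set() if state is None else state
    for item in stream:
        state.add(item)
    return state

def bob_recovers(state, i):
    # Bob decodes bit i from the forwarded memory state alone.
    return 1 if (2 * i + 1) in state else 0

bits = [1, 0, 1, 1, 0, 0, 1, 0]
state = run_one_pass(alice_stream(bits))   # Alice's single message to Bob
assert all(bob_recovers(state, i) == b for i, b in enumerate(bits))
```

One-way INDEX needs Ω(n) communication, so instantiating n at roughly M²/ε, as described above, forces the claimed space for any ε-approximate split finder.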

read the original abstract

We establish nearly optimal upper and lower bounds for approximating decision tree splits in data streams. For regression with labels in the range $\{0,1,\ldots,M\}$, we give a one-pass algorithm using $\tilde{O}(M^2/\epsilon)$ space that outputs a split within additive $\epsilon$ error of the optimal split, improving upon the two-pass algorithm of Pham et al. (ISIT 2025). Furthermore, we provide a matching one-pass lower bound showing that $\Omega(M^2/\epsilon)$ space is indeed necessary. For classification, we also obtain a one-pass algorithm using $\tilde{O}(1/\epsilon)$ space for approximating the optimal Gini split, improving upon the previous $\tilde{O}(1/\epsilon^2)$-space algorithm. We complement these results with matching space lower bounds: $\Omega(1/\epsilon)$ for Gini impurity and $\Omega(1/\epsilon)$ for misclassification (which matches the upper bound obtained by sampling). Our algorithms exploit the Lipschitz property of the loss functions and use reservoir sampling along with Count-Min sketches with range queries. Our lower bounds follow from careful reductions from the INDEX problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims nearly optimal one-pass streaming algorithms and matching lower bounds for approximating optimal decision tree splits. For regression with labels in {0, ..., M}, it gives an Õ(M²/ε)-space algorithm achieving additive ε error to the optimal split (improving on the prior two-pass result), with a matching Ω(M²/ε) lower bound. For classification it gives an Õ(1/ε)-space algorithm for Gini impurity approximation with matching Ω(1/ε) lower bounds for both Gini and misclassification error. The algorithms rely on reservoir sampling combined with Count-Min sketches (with range queries) and exploit the Lipschitz property of the loss functions; lower bounds are obtained via reductions from the INDEX problem.

Significance. If the stated bounds and analyses hold, the results are significant: they close the gap to nearly tight space complexity for core streaming primitives used in decision-tree construction, using only standard sketching primitives and a direct INDEX reduction. The matching upper/lower bounds give a clean characterization of the space needed once the Lipschitz constant is fixed, which is a useful contribution to the data-streams literature.

minor comments (3)
  1. The introduction should explicitly define the Õ notation and state the precise dependence on the Lipschitz constant of the loss function, as this constant appears in the error-propagation argument that justifies the claimed space bounds.
  2. A short comparison table (or paragraph) contrasting the new one-pass bounds against the two-pass algorithm of Pham et al. and any other prior streaming results would improve readability.
  3. In the lower-bound sections, the concrete parameter settings used in the INDEX reduction (how M and ε map to the communication complexity) should be written out explicitly rather than left implicit.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading of the manuscript and the positive recommendation to accept. The referee's summary accurately reflects the main results on one-pass algorithms and matching lower bounds for approximating decision tree splits.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's upper bounds are constructed from standard reservoir sampling plus Count-Min sketches with range queries, using the Lipschitz property only to propagate estimation errors into split-quality error; this is a parameter-free analysis once the Lipschitz constant is fixed. Lower bounds are obtained by explicit reductions from the external INDEX problem in communication complexity. No equation or claim reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; the central results rest on independent primitives whose correctness is external to the target bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the standard single-pass streaming model, the Lipschitz continuity of the split-quality loss functions, and the correctness of reservoir sampling plus Count-Min sketches with range queries. No free parameters are fitted; the space bounds are expressed directly in terms of M and ε.

axioms (2)
  • domain assumption The split-quality loss functions are Lipschitz continuous.
    Invoked to bound the error introduced by approximate counts from sketches and sampling.
  • standard math The data arrives in a single pass in the standard streaming model.
    Standard assumption for all streaming algorithms in the paper.

pith-pipeline@v0.9.0 · 5500 in / 1319 out tokens · 35518 ms · 2026-05-09T23:37:28.070543+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 4 canonical work pages

  1. Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters.
  2. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 1997. doi:10.1006/jcss.1997.1504
  3. Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 2005. doi:10.1016/j.jalgor.2003.12.001
  4. Yael Ben-Haim. A Streaming Parallel Decision Tree Algorithm. 2010. doi:10.5555/1756006.1756034
  5. Pedro M. Domingos and Geoff Hulten. Mining high-speed data streams.
  6. J. Ross Quinlan.
  7. Leo Breiman, J. H. Friedman, Richard A. Olshen, and C. J. Stone. Classification and Regression Trees.
  8. Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System.
  9. Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting algorithms as gradient descent.
  10. Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 2002.
  11. Tin Kam Ho.
  12. Sepehr Assadi, Nikolai Karpov, and Qin Zhang.
  13. Ilan Kremer, Noam Nisan, and Dana Ron. On randomized one-round communication complexity. Computational Complexity, 1999.
  14. Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree.
  15. Liudmila Ostroumova Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. NeurIPS.
  16. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. 2009.
  17. Tim Roughgarden. Communication complexity (for algorithm designers). Foundations and Trends in Theoretical Computer Science, 2016.
  18. Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett.
  19. Sreerama K. Murthy. Data Min. Knowl. Discov.
  20. Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data?
  21. Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need.
  22. W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Commun. ACM.
  23. Richard Cole.
  24. Joseph JáJá. An Introduction to Parallel Algorithms.
  25. Guy E. Blelloch.
  26. Wikipedia contributors. Segmented scan. Wikipedia, The Free Encyclopedia, 2024.
  27. Wikipedia contributors. Prefix sum. Wikipedia, The Free Encyclopedia, 2024.
  28. Michael T. Goodrich, Nodari Sitchinava, and Qin Zhang.
  29. Mohsen Ghaffari. Massively parallel algorithms. Lecture notes, http://people.csail.mit.edu/ghaffari/MPA19/Notes/MPA.pdf
  30. Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce.
  31. Dug Hun Hong, Sungho Lee, and Kyung Tae Kim.
  32. Jie Cheng, Usama M. Fayyad, Keki B. Irani, and Zhaogang Qian.
  33. Amit Chakrabarti, Subhash Khot, and Xiaodong Sun.
  34. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity.
  35. Usama M. Fayyad and Keki B. Irani. Mach. Learn.
  36. James Dougherty, Ron Kohavi, and Mehran Sahami.
  37. Efficient decision tree construction on streaming data. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  38. Mining time-changing data streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  39. The CART decision tree for mining data streams. Information Sciences, 2014.
  40. Finding Decision Tree Splits in Streaming and Massively Parallel Models. arXiv preprint arXiv:2403.19867.
  41. New ensemble methods for evolving data streams. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  42. Extremely fast decision tree. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  43. Extremely fast decision tree mining for evolving data streams. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  44. Extremely fast Hoeffding adaptive tree. 2022 IEEE International Conference on Data Mining (ICDM), 2022.
  45. Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory, 2006.
  46. Decision tree design from a communication theory standpoint. IEEE Transactions on Information Theory, 1988.
  47. Application of information theory to the construction of efficient decision trees. IEEE Transactions on Information Theory, 1982.
  48. K-Means+ID3: A novel method for supervised anomaly detection by cascading K-Means clustering and ID3 decision tree learning methods. IEEE Transactions on Knowledge and Data Engineering, 2007.
  49. BOAT—optimistic decision tree construction. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data.
  50. Decision trees. Data Mining and Knowledge Discovery Handbook, 2005.
  51. Albert Bifet, Jiajin Zhang, Wei Fan, Cheng He, Jianfeng Zhang, Jianfeng Qian, Geoff Holmes, and Bernhard Pfahringer.
  52. Heitor Murilo Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, et al. Adaptive random forests for evolving data stream classification.
  53. Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Geoffrey Bolmier, Raphael Sourty, Robin Vaysse, Adil Zouitine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, and Albert Bifet. Journal of Machine Learning Research.
  54. Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han.
  55. Alberto Cano and Bartosz Krawczyk. Mach. Learn.
  56. Paula Raissa Silva et al. Fed-VFDT: Federated Very Fast Decision Trees with Coordinated Splitting Over Data Streams.
  57. Huy Pham, Hoang Ta, and Hoa T. Vu.
  58. Farid M. Ablayev. Theor. Comput. Sci.
  59. Nikolaj Tatti.
  60. Daniel Nowak Assis, Jean Paul Barddal, et al. Adaptive Options for Decision Trees in Evolving Data Stream Classification.
  61. Nobuyuki Otsu. A threshold selection method from gray-level histograms. Automatica.
  62. A survey of methods for time series change point detection. Knowledge and Information Systems, 2017.