pith. machine review for the scientific record.

arxiv: 2604.20394 · v1 · submitted 2026-04-22 · 💻 cs.DS

Recognition: unknown

Nearly Optimal Bounds for Computing Decision Tree Splits in Data Streams

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:37 UTC · model grok-4.3

classification 💻 cs.DS
keywords data streams · decision trees · regression · classification · Gini impurity · space lower bounds · reservoir sampling · Count-Min sketches

The pith

Streaming algorithms approximate optimal decision tree splits in one pass with space matching the lower bounds for both regression and classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to approximate the best split for a decision tree node when the data arrives as a stream that can be read only once. For regression with numeric labels up to M, a one-pass algorithm uses space roughly M squared over epsilon to guarantee an additive error of epsilon in split quality, and a matching lower bound shows that much space is necessary. For categorical classification with the Gini measure, the space drops to roughly one over epsilon, again with a matching lower bound. These nearly tight results improve on earlier algorithms that needed two passes or more space. The work matters for building accurate tree models on very large or continuously arriving datasets where memory is limited.
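
The one-pass constraint is the whole game: each record is seen once, and the algorithm must commit its memory before the next arrives. As a minimal illustration of the kind of primitive such algorithms build on (not the paper's algorithm itself), reservoir sampling keeps a uniform random sample of an unbounded stream in fixed space:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of size k from a one-pass stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randrange(i + 1)     # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=100)
print(len(sample))  # always exactly 100, regardless of stream length
```

The function above is textbook Algorithm R; the paper combines such samples with sketches rather than using sampling alone.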

Core claim

For regression with labels in the range {0,1,…,M}, we give a one-pass algorithm using Õ(M²/ε) space that outputs a split within additive ε error of the optimal split, improving upon the two-pass algorithm of Pham et al. We also provide a matching Ω(M²/ε) lower bound showing that this space is necessary. For classification, we obtain a one-pass algorithm using Õ(1/ε) space for approximating the optimal Gini split, complementing this with matching Ω(1/ε) lower bounds for both Gini impurity and misclassification error. The algorithms exploit the Lipschitz property of the loss functions and use reservoir sampling along with Count-Min sketches with range queries; the lower bounds follow from careful reductions from the INDEX problem.

What carries the argument

Reservoir sampling combined with Count-Min sketches supporting range queries, which together exploit the Lipschitz property of the split quality loss functions to bound approximation error from sketch inaccuracies.
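
A hedged sketch of what "Count-Min sketches with range queries" typically means: one Count-Min sketch per dyadic level, so any interval decomposes into O(log U) sketched nodes. All class names and parameters below are illustrative of the generic primitive; the paper's exact construction and parameter choices may differ:

```python
import random

class CountMin:
    """Count-Min sketch: point-frequency estimates that never underestimate."""
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, x):
        for r, s in enumerate(self.salts):
            yield r, hash((s, x)) % self.width

    def add(self, x, c=1):
        for r, col in self._cells(x):
            self.table[r][col] += c

    def query(self, x):
        return min(self.table[r][col] for r, col in self._cells(x))

class DyadicCountMin:
    """Range-sum estimates over [0, 2**levels) via one Count-Min per dyadic level."""
    def __init__(self, levels, width=256, depth=4):
        self.levels = levels
        self.sketches = [CountMin(width, depth, seed=l) for l in range(levels + 1)]

    def add(self, x, c=1):
        for l in range(self.levels + 1):
            self.sketches[l].add(x >> l, c)   # x's ancestor node at level l

    def range_query(self, a, b):
        # Greedy dyadic decomposition of [a, b], one query per covering node.
        total, level = 0, 0
        while a <= b:
            if a % 2 == 1:
                total += self.sketches[level].query(a); a += 1
            if b % 2 == 0:
                total += self.sketches[level].query(b); b -= 1
            a //= 2; b //= 2; level += 1
        return total

d = DyadicCountMin(levels=10)       # universe [0, 1024)
for x in range(1000):
    d.add(x)
print(d.range_query(100, 199))      # true answer is 100; never underestimated
```

Because Count-Min errs only upward, a split-quality estimate built from such range sums has one-sided error controlled by the sketch width, which is where the Lipschitz argument takes over.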

Load-bearing premise

The split quality measures such as squared error or Gini impurity satisfy a Lipschitz condition so that accurate estimates of counts and sums yield accurate estimates of split quality.
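
This premise is easy to sanity-check for the binary Gini case: G(p) = 2p(1 - p) has |G'(p)| = |2 - 4p| ≤ 2 on [0, 1], so an error of δ in the estimated class proportion shifts the impurity by at most 2δ. A small numerical check (illustrative, not taken from the paper):

```python
def gini(p):
    # Binary Gini impurity of a node with positive-class proportion p
    return 2.0 * p * (1.0 - p)

def worst_change(delta, steps=1000):
    # Largest |gini(p + delta) - gini(p)| over a grid of p in [0, 1]
    ps = [i / steps for i in range(steps + 1)]
    return max(abs(gini(min(p + delta, 1.0)) - gini(p)) for p in ps)

for delta in (0.05, 0.01, 0.001):
    assert worst_change(delta) <= 2 * delta + 1e-12  # Lipschitz constant 2
print("Gini is 2-Lipschitz on the checked grid")
```

The same one-line derivative bound is what lets sketch error ε in counts translate into O(ε) error in split quality.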

What would settle it

An adversarial stream derived from an INDEX instance of size proportional to M²/ε would force any ε-approximate split finder to use at least that much space, as failing to recover the hidden index would produce a split worse than ε-optimal.
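
The template behind such INDEX arguments can be sketched in code: Alice encodes her bit vector as a stream, runs the one-pass algorithm on it, and sends only the algorithm's memory state; if Bob can decode an arbitrary bit from that state, the state must carry Ω(n) bits. The toy "algorithm" below deliberately remembers everything so the decoding step is concrete; the paper's actual construction instead maps bits to split qualities (this is an assumed illustration, not its reduction):

```python
def alice_stream(bits):
    # Alice emits item 2j + bits[j]: the parity at position j encodes bit j.
    return [2 * j + b for j, b in enumerate(bits)]

def run_one_pass(stream, state=None):
    # Stand-in one-pass algorithm whose memory state is the set of seen items.
    state = set() if state is None else state
    for item in stream:
        state.add(item)
    return state

def bob_recovers(state, i):
    # Bob decodes bit i from the forwarded memory state alone.
    return 1 if (2 * i + 1) in state else 0

bits = [1, 0, 1, 1, 0, 0, 1, 0]
state = run_one_pass(alice_stream(bits))   # Alice's single message to Bob
assert all(bob_recovers(state, i) == b for i, b in enumerate(bits))
```

One-way INDEX needs Ω(n) communication, so instantiating n at roughly M²/ε, as described above, forces the claimed space for any ε-approximate split finder.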

read the original abstract

We establish nearly optimal upper and lower bounds for approximating decision tree splits in data streams. For regression with labels in the range $\{0,1,\ldots,M\}$, we give a one-pass algorithm using $\tilde{O}(M^2/\epsilon)$ space that outputs a split within additive $\epsilon$ error of the optimal split, improving upon the two-pass algorithm of Pham et al. (ISIT 2025). Furthermore, we provide a matching one-pass lower bound showing that $\Omega(M^2/\epsilon)$ space is indeed necessary. For classification, we also obtain a one-pass algorithm using $\tilde{O}(1/\epsilon)$ space for approximating the optimal Gini split, improving upon the previous $\tilde{O}(1/\epsilon^2)$-space algorithm. We complement these results with matching space lower bounds: $\Omega(1/\epsilon)$ for Gini impurity and $\Omega(1/\epsilon)$ for misclassification (which matches the upper bound obtained by sampling). Our algorithms exploit the Lipschitz property of the loss functions and use reservoir sampling along with Count-Min sketches with range queries. Our lower bounds follow from careful reductions from the INDEX problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims nearly optimal one-pass streaming algorithms and matching lower bounds for approximating optimal decision tree splits. For regression with labels in {0, ..., M}, it gives an Õ(M²/ε)-space algorithm achieving additive ε error to the optimal split (improving on the prior two-pass result), with a matching Ω(M²/ε) lower bound. For classification it gives an Õ(1/ε)-space algorithm for Gini impurity approximation with matching Ω(1/ε) lower bounds for both Gini and misclassification error. The algorithms rely on reservoir sampling combined with Count-Min sketches (with range queries) and exploit the Lipschitz property of the loss functions; lower bounds are obtained via reductions from the INDEX problem.

Significance. If the stated bounds and analyses hold, the results are significant: they close the gap to nearly tight space complexity for core streaming primitives used in decision-tree construction, using only standard sketching primitives and a direct INDEX reduction. The matching upper/lower bounds give a clean characterization of the space needed once the Lipschitz constant is fixed, which is a useful contribution to the data-streams literature.

minor comments (3)
  1. The introduction should explicitly define the Õ notation and state the precise dependence on the Lipschitz constant of the loss function, as this constant appears in the error-propagation argument that justifies the claimed space bounds.
  2. A short comparison table (or paragraph) contrasting the new one-pass bounds against the two-pass algorithm of Pham et al. and any other prior streaming results would improve readability.
  3. In the lower-bound sections, the concrete parameter settings used in the INDEX reduction (how M and ε map to the communication complexity) should be written out explicitly rather than left implicit.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading of the manuscript and the positive recommendation to accept. The referee's summary accurately reflects the main results on one-pass algorithms and matching lower bounds for approximating decision tree splits.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's upper bounds are constructed from standard reservoir sampling plus Count-Min sketches with range queries, using the Lipschitz property only to propagate estimation errors into split-quality error; this is a parameter-free analysis once the Lipschitz constant is fixed. Lower bounds are obtained by explicit reductions from the external INDEX problem in communication complexity. No equation or claim reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; the central results rest on independent primitives whose correctness is external to the target bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the standard single-pass streaming model, the Lipschitz continuity of the split-quality loss functions, and the correctness of reservoir sampling plus Count-Min sketches with range queries. No free parameters are fitted; the space bounds are expressed directly in terms of M and ε.

axioms (2)
  • domain assumption The split-quality loss functions are Lipschitz continuous.
    Invoked to bound the error introduced by approximate counts from sketches and sampling.
  • standard math The data arrives in a single pass in the standard streaming model.
    Standard assumption for all streaming algorithms in the paper.

pith-pipeline@v0.9.0 · 5500 in / 1319 out tokens · 35518 ms · 2026-05-09T23:37:28.070543+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 4 canonical work pages

  1. Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters.
  2. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 1997. doi:10.1006/jcss.1997.1504
  3. Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 2005. doi:10.1016/j.jalgor.2003.12.001
  4. Yael Ben-Haim. A Streaming Parallel Decision Tree Algorithm. 2010. doi:10.5555/1756006.1756034
  5. Pedro M. Domingos and Geoff Hulten. Mining high-speed data streams.
  6. J. Ross Quinlan.
  7. Leo Breiman, J. H. Friedman, Richard A. Olshen, and C. J. Stone. Classification and Regression Trees.
  8. Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System.
  9. Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting algorithms as gradient descent.
  10. Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 2002.
  11. Tin Kam Ho.
  12. Sepehr Assadi, Nikolai Karpov, and Qin Zhang.
  13. Ilan Kremer, Noam Nisan, and Dana Ron. On randomized one-round communication complexity. Computational Complexity, 1999.
  14. Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree.
  15. Liudmila Ostroumova Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. NeurIPS.
  16. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. 2009.
  17. Tim Roughgarden. Communication complexity (for algorithm designers). Foundations and Trends in Theoretical Computer Science, 2016.
  18. Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett.
  19. Sreerama K. Murthy. Data Min. Knowl. Discov.
  20. Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data?
  21. Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need.
  22. W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Commun. ACM.
  23. Richard Cole.
  24. Joseph JáJá. An Introduction to Parallel Algorithms.
  25. Guy E. Blelloch.
  26. Wikipedia contributors. Segmented scan. Wikipedia, The Free Encyclopedia, 2024.
  27. Wikipedia contributors. Prefix sum. Wikipedia, The Free Encyclopedia, 2024.
  28. Michael T. Goodrich, Nodari Sitchinava, and Qin Zhang.
  29. Mohsen Ghaffari. Massively parallel algorithms. Lecture notes, http://people.csail.mit.edu/ghaffari/MPA19/Notes/MPA.pdf
  30. Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce.
  31. Dug Hun Hong, Sungho Lee, and Kyung Tae Kim.
  32. Jie Cheng, Usama M. Fayyad, Keki B. Irani, and Zhaogang Qian.
  33. Amit Chakrabarti, Subhash Khot, and Xiaodong Sun.
  34. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity.
  35. Usama M. Fayyad and Keki B. Irani. Mach. Learn.
  36. James Dougherty, Ron Kohavi, and Mehran Sahami.
  37. Efficient decision tree construction on streaming data. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  38. Mining time-changing data streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  39. The CART decision tree for mining data streams. Information Sciences, 2014.
  40. Finding Decision Tree Splits in Streaming and Massively Parallel Models. arXiv preprint arXiv:2403.19867.
  41. New ensemble methods for evolving data streams. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  42. Extremely fast decision tree. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  43. Extremely fast decision tree mining for evolving data streams. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  44. Extremely fast Hoeffding adaptive tree. 2022 IEEE International Conference on Data Mining (ICDM), 2022.
  45. Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory, 2006.
  46. Decision tree design from a communication theory standpoint. IEEE Transactions on Information Theory, 1988.
  47. Application of information theory to the construction of efficient decision trees. IEEE Transactions on Information Theory, 1982.
  48. K-Means+ID3: A novel method for supervised anomaly detection by cascading K-Means clustering and ID3 decision tree learning methods. IEEE Transactions on Knowledge and Data Engineering, 2007.
  49. BOAT—optimistic decision tree construction. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data.
  50. Decision trees. Data Mining and Knowledge Discovery Handbook, 2005.
  51. Albert Bifet, Jiajin Zhang, Wei Fan, Cheng He, Jianfeng Zhang, Jianfeng Qian, Geoff Holmes, and Bernhard Pfahringer.
  52. Heitor Murilo Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, et al. Adaptive random forests for evolving data stream classification.
  53. Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Geoffrey Bolmier, Raphael Sourty, Robin Vaysse, Adil Zouitine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, and Albert Bifet. Journal of Machine Learning Research.
  54. Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han.
  55. Alberto Cano and Bartosz Krawczyk. Mach. Learn.
  56. Paula Raissa Silva et al. Fed-VFDT: Federated Very Fast Decision Trees with Coordinated Splitting Over Data Streams.
  57. Huy Pham, Hoang Ta, and Hoa T. Vu.
  58. Farid M. Ablayev. Theor. Comput. Sci.
  59. Nikolaj Tatti.
  60. Daniel Nowak Assis, Jean Paul Barddal, et al. Adaptive Options for Decision Trees in Evolving Data Stream Classification.
  61. Nobuyuki Otsu. A threshold selection method from gray-level histograms. Automatica.
  62. A survey of methods for time series change point detection. Knowledge and Information Systems, 2017.