pith. machine review for the scientific record.

arxiv: 2604.21042 · v1 · submitted 2026-04-22 · 💻 cs.LG

Recognition: unknown

Interpretable Quantile Regression by Optimal Decision Trees

Gaël Aglin, Siegfried Nijssen, Valentin Lemaire

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords quantile regression · decision trees · interpretable machine learning · conditional distribution · optimal trees · distribution estimation

The pith

A method learns sets of optimal quantile regression trees to predict complete conditional distributions interpretably and efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a novel approach for learning multiple optimal quantile regression trees at once. It enables the estimation of the full conditional distribution of a target variable without making any assumptions about its form. The resulting models are decision trees, making them interpretable for users who need to understand the predictions. Importantly, this is achieved while preserving the computational efficiency of learning just one tree.

Core claim

The paper claims that by learning a set of optimal quantile regression trees, one can obtain predictions for the entire conditional distribution of the target variable. These predictions require no prior assumptions on the distribution, are provided through interpretable tree structures, and the learning process does not sacrifice efficiency relative to training a single tree.

What carries the argument

The set of optimal quantile regression trees, learned jointly to predict multiple quantiles while preserving single-tree efficiency.

If this is right

  • Predictions cover any desired quantile level of the target variable.
  • The full distribution can be modeled without parametric assumptions on its shape.
  • Interpretability is maintained through the decision tree format for each quantile.
  • Efficiency remains comparable to single tree learning with no added overhead.
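
To make the mechanism concrete: a regression tree predicts a quantile by storing, in each leaf, the constant that minimizes the pinball (quantile) loss over the training targets reaching that leaf, which is an empirical quantile of those targets. The following Python sketch illustrates that standard fact only; it is not the paper's QDL8.5 algorithm, and the leaf data are synthetic.

    import numpy as np

    def pinball_loss(y, pred, tau):
        """Mean pinball (quantile) loss at level tau."""
        d = y - pred
        return np.mean(np.maximum(tau * d, (tau - 1) * d))

    rng = np.random.default_rng(0)
    y_leaf = rng.lognormal(size=200)  # targets falling in one hypothetical leaf

    for tau in (0.1, 0.5, 0.9):
        q = np.quantile(y_leaf, tau)  # empirical tau-quantile of the leaf's targets
        m = y_leaf.mean()             # squared-error leaf value, for contrast
        print(f"tau={tau}: loss at quantile={pinball_loss(y_leaf, q, tau):.3f}, "
              f"loss at mean={pinball_loss(y_leaf, m, tau):.3f}")

At each level the quantile-valued leaf should attain a pinball loss no worse than the mean-valued leaf, which is what lets one tree per quantile level recover the conditional distribution pointwise.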

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This joint optimization approach could support more reliable uncertainty estimates in applications where the full response distribution matters.
  • Similar techniques might extend to other interpretable models that require distribution-level outputs beyond point estimates.
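
A minimal sketch of how distribution-level outputs could follow from such models (illustrative only; the quantile predictions below are made-up numbers standing in for the outputs of a set of quantile trees at one input point): a grid of predicted quantiles can be read as a piecewise-linear estimate of the conditional quantile function, from which prediction intervals and an approximate CDF follow.

    import numpy as np

    # Hypothetical outputs of nine quantile trees evaluated at a single input x.
    taus = np.linspace(0.1, 0.9, 9)
    q_hat = np.array([2.1, 2.6, 3.0, 3.3, 3.7, 4.2, 4.9, 5.8, 7.4])

    # 80% central prediction interval straight from the 0.1 and 0.9 quantiles.
    lo, hi = q_hat[0], q_hat[-1]

    # Approximate conditional CDF by inverting the piecewise-linear quantile
    # function; values outside the predicted range are clamped to 0 or 1.
    def cdf_estimate(y):
        return np.interp(y, q_hat, taus, left=0.0, right=1.0)

    print(f"80% interval: [{lo}, {hi}]")
    print("P(Y <= 4.0 | x) ~", round(float(cdf_estimate(4.0)), 3))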

Load-bearing premise

The trees learned by the method are truly optimal and the learning process incurs no additional computational cost compared to single tree training.

What would settle it

A demonstration on benchmark datasets that the learned trees are suboptimal for quantile prediction, or that training time exceeds that of a single tree, would disprove the optimality and efficiency claims.

Figures

Figures reproduced from arXiv: 2604.21042 by Gaël Aglin, Siegfried Nijssen, Valentin Lemaire.

Figure 1
Figure 1: Running times for naive and efficient versions of QDL8.5. Both axes are in log scale. Efficiency: This work showed an algorithm modification that enables the DL8.5 algorithm to learn many trees at once, each optimizing a loss function with a different quantile parameter, while only exploring the tree space once. In this experiment, we measured, for different numbers of trees to learn, the execution time o…
Figure 2
Figure 2: Similarity matrix of trees learned by QDL8.5 for the Air Quality dataset.
read the original abstract

The field of machine learning is subject to an increasing interest in models that are not only accurate but also interpretable and robust, thus allowing their end users to understand and trust AI systems. This paper presents a novel method for learning a set of optimal quantile regression trees. The advantages of this method are that (1) it provides predictions about the complete conditional distribution of a target variable without prior assumptions on this distribution; (2) it provides predictions that are interpretable; (3) it learns a set of optimal quantile regression trees without compromising algorithmic efficiency compared to learning a single tree.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a method for learning a set of optimal quantile regression trees. It claims that this approach (1) predicts the full conditional distribution of a target variable without prior distributional assumptions, (2) yields interpretable predictions, and (3) achieves optimality for the set of trees at the same algorithmic efficiency as training a single tree.

Significance. If the optimality and efficiency claims hold with rigorous justification, the work would contribute a practical, assumption-light tool for interpretable distributional regression. This could be valuable in domains needing full quantile coverage (e.g., risk modeling) while preserving decision-tree transparency, provided the method scales and avoids hidden parameter tuning.

major comments (2)
  1. [Abstract] The central efficiency claim, that a set of optimal quantile trees can be learned without compromising complexity relative to a single tree, is stated without any algorithm description, complexity bound, or reference to the underlying optimization procedure (e.g., dynamic programming, MIP, or branch-and-bound). Standard optimal decision-tree search is NP-hard; extending it to multiple quantiles while preserving asymptotic cost requires an explicit argument that is absent here.
  2. [Abstract] The assertion of 'optimal' trees for the full conditional distribution lacks any statement of the loss function, optimality criterion (global vs. local), or proof sketch showing that joint optimization over quantiles does not introduce additional parameters or relaxations that undermine the optimality guarantee.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief parenthetical clarifying whether 'optimal' refers to exact global optimality or a provably bounded approximation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on the points raised.

read point-by-point responses
  1. Referee: [Abstract] The central efficiency claim, that a set of optimal quantile trees can be learned without compromising complexity relative to a single tree, is stated without any algorithm description, complexity bound, or reference to the underlying optimization procedure (e.g., dynamic programming, MIP, or branch-and-bound). Standard optimal decision-tree search is NP-hard; extending it to multiple quantiles while preserving asymptotic cost requires an explicit argument that is absent here.

    Authors: The abstract is deliberately concise. The full manuscript (Section 3) presents the dynamic programming recursion that jointly optimizes the set of quantile trees by sharing subproblem solutions across quantiles, preserving the same asymptotic complexity as a single optimal tree. A formal complexity bound and reference to the optimization procedure appear in the methods and appendix. We will revise the abstract to include a brief reference to this procedure and the efficiency result. revision: yes
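
A hedged illustration of the kind of subproblem sharing the rebuttal describes (not the actual DL8.5/QDL8.5 recursion; the helper name and interface below are hypothetical): once the search has enumerated a candidate leaf, i.e. a subset of training examples, the optimal leaf value and pinball loss for every requested quantile level can be computed from that one subset, so the combinatorial tree space does not need to be explored once per quantile.

    import numpy as np

    def leaf_solutions(y_leaf, taus):
        """Hypothetical helper: optimal constant prediction and pinball loss for
        each quantile level, for one candidate leaf, after a single sort of the
        leaf's targets."""
        y = np.sort(np.asarray(y_leaf, dtype=float))
        out = {}
        for tau in taus:
            v = np.quantile(y, tau)  # optimal constant prediction at level tau
            d = y - v
            out[tau] = (float(v), float(np.mean(np.maximum(tau * d, (tau - 1) * d))))
        return out

    # One candidate leaf, evaluated once, yields solutions for three quantile levels.
    print(leaf_solutions([1.0, 2.0, 2.5, 4.0, 10.0], taus=(0.1, 0.5, 0.9)))

The point is only that the extra work per quantile is local to each leaf; whether the paper's recursion achieves the same asymptotic cost as single-tree learning is exactly what the referee asks to see spelled out.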

  2. Referee: [Abstract] The assertion of 'optimal' trees for the full conditional distribution lacks any statement of the loss function, optimality criterion (global vs. local), or proof sketch showing that joint optimization over quantiles does not introduce additional parameters or relaxations that undermine the optimality guarantee.

    Authors: Optimality is defined globally with respect to the pinball loss at each quantile level. The manuscript defines the loss, states the global optimality criterion for the joint set of trees, and provides a proof (in the appendix) that the dynamic program yields exact optimality without relaxations or extra parameters. We will add a concise statement to the abstract clarifying the loss function and global optimality criterion. revision: yes
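
For reference, the pinball loss named in the response is standard; written out in LaTeX, together with the per-level objective under the assumption (common in optimal-tree methods, not confirmed by the abstract) that candidate trees are drawn from a depth-bounded class:

    % Pinball (quantile) loss at level tau, and the per-level objective over
    % an assumed class T_d of decision trees of depth at most d.
    \ell_\tau(y, \hat{y}) = \max\bigl(\tau\,(y - \hat{y}),\ (\tau - 1)\,(y - \hat{y})\bigr),
    \qquad
    T^\star_\tau \in \arg\min_{T \in \mathcal{T}_d} \sum_{i=1}^{n} \ell_\tau\bigl(y_i, T(x_i)\bigr).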

Circularity Check

0 steps flagged

No circularity: claims rest on unspecified method with no equations or derivation chain shown

full rationale

The abstract states three advantages including learning 'a set of optimal quantile regression trees without compromising algorithmic efficiency compared to learning a single tree' but supplies neither equations, an algorithm, complexity analysis, nor any derivation steps. No self-definitional relations, fitted inputs renamed as predictions, or self-citation chains appear in the provided text. Without load-bearing mathematical steps that reduce to inputs by construction, the circularity score is 0. The central claim is unsupported rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No details available from the abstract to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5389 in / 899 out tokens · 22856 ms · 2026-05-10T00:38:35.689601+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  1. [1] Aglin, G., Nijssen, S., Schaus, P.: Learning optimal decision trees using caching branch-and-bound search. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 3146–3153 (2020). https://doi.org/10.1609/aaai.v34i04.5711

  2. [2] Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324

  3. [3] Cousins, C., Riondato, M.: Cadet: interpretable parametric conditional density estimation with decision trees and forests. Machine Learning 108(8), 1613–1634 (2019). https://doi.org/10.1007/s10994-019-05820-3

  4. [4] Demirović, E., Lukina, A., Hebrard, E., Chan, J., Bailey, J., Leckie, C., Ramamohanarao, K., Stuckey, P.J.: Murtree: Optimal decision trees via dynamic programming and search. J. Mach. Learn. Res. 23(1) (2022). https://doi.org/10.5555/3586589.3586615

  5. [5] Du, M., Liu, N., Hu, X.: Techniques for interpretable machine learning. Communications of the ACM 63(1), 68–77 (2019). https://doi.org/10.1145/3359786

  6. [6] Fong, R.C., Vedaldi, A.: Interpretable explanations of black boxes by meaningful perturbation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437 (2017). https://doi.org/10.1109/ICCV.2017.371

  7. [7] Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining explanations: An overview of interpretability of machine learning. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89 (2018). https://doi.org/10.1109/DSAA.2018.00018

  8. [8] Koenker, R., Hallock, K.F.: Quantile regression. Journal of Economic Perspectives 15(4), 143–156 (2001). https://doi.org/10.1257/jep.15.4.143

  9. [9] Letham, B., Rudin, C., McCormick, T.H., Madigan, D.: Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics 9(3) (2015). https://doi.org/10.1214/15-aoas848

  10. [10] Meinshausen, N.: Quantile regression forests. J. Mach. Learn. Res. 7, 983–999 (2006). https://doi.org/10.5555/1248547.1248582

  11. [11] Nijssen, S., Fromont, É.: Mining optimal decision trees from itemset lattices. In: Knowledge Discovery and Data Mining (2007). https://doi.org/10.1145/1281192.1281250

  12. [12] John, O.: Robustness of quantile regression to outliers. American Journal of Applied Mathematics and Statistics 3(2), 86–88 (2015). https://doi.org/10.12691/ajams-3-2-8

  13. [13] Quinlan, J.R., et al.: Learning with continuous classes. In: 5th Australian Joint Conference on Artificial Intelligence, vol. 92, pp. 343–348. World Scientific (1992). https://doi.org/10.1142/9789814536271

  14. [14] Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5), 206–215 (2019). https://doi.org/10.1038/s42256-019-0048-x

  15. [15] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision 128(2), 336–359 (2019). https://doi.org/10.1007/s11263-019-01228-7

  16. [16] Terrell, G.R., Scott, D.W.: Variable kernel density estimation. The Annals of Statistics 20(3), 1236–1265 (1992). https://doi.org/10.1214/aos/1176348768

  17. [17] Wang, S., Wang, S., Wang, D.: Combined probability density model for medium term load forecasting based on quantile regression and kernel density estimation. Energy Procedia 158, 6446–6451 (2019). https://doi.org/10.1016/j.egypro.2019.01.169

  18. [18] Zhang, S., Wang, Y., Zhang, Y., Wang, D., Zhang, N.: Load probability density forecasting by transforming and combining quantile forecasts. Applied Energy 277, 115600 (2020). https://doi.org/10.1016/j.apenergy.2020.115600