pith. machine review for the scientific record.

arXiv: 2605.02593 · v1 · submitted 2026-05-04 · 💻 cs.LG

Recognition: 3 theorem links


Gradient Boosted Risk Scores

Costa Georgantas, Jonas Richiardi


Pith reviewed 2026-05-08 18:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords risk scores · gradient boosting · interpretable machine learning · tabular data · classification · survival analysis · medical decision support

The pith

Gradient boosting can be adapted to build compact point-based risk scores that match the accuracy of regression-based alternatives while using far fewer rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a gradient boosting algorithm for creating risk scores that humans can compute by hand from a small number of rules. Traditional approaches rely on linear regression to assign points to variables, but this can miss nonlinear patterns in the data. The new method uses boosting to capture those patterns while keeping the output format as simple addition of points. Tests on twelve tabular datasets for regression, classification, and time-to-event prediction show competitive accuracy alongside substantially more compact scores. In domains such as medicine and insurance, fewer rules reduce the chance of calculation errors and increase the chance that people will actually use the model.
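The "simple addition of points" format is easiest to see with a small hand-computable example. The rules, thresholds, and point values below are invented for illustration; they are not fitted scores from the paper:

```python
# Hypothetical point-based risk score: each satisfied rule adds integer
# points, and the total is the score a human would compute by hand.
# These three rules are illustrative, not taken from the paper.
RULES = [
    ("age >= 65",  lambda p: p["age"] >= 65,  2),
    ("sbp >= 140", lambda p: p["sbp"] >= 140, 1),
    ("smoker",     lambda p: p["smoker"],     3),
]

def total_points(patient):
    """Sum the points of every rule the patient satisfies."""
    return sum(points for _, applies, points in RULES if applies(patient))

# A clinician could compute this by hand: 2 + 0 + 3 = 5 points.
patient = {"age": 70, "sbp": 120, "smoker": True}
print(total_points(patient))  # -> 5
```

With only three rules there are three threshold checks and one addition per sample, which is why a smaller rule count translates directly into fewer opportunities for manual error.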

Core claim

We provide an algorithm based on gradient boosting that is capable of modeling nonlinear effects for building compact and predictive risk scores, along with a C++ implementation with Python and R bindings. Through extensive empirical evaluation on twelve tabular datasets spanning regression, classification, and time-to-event tasks, we show that our method achieves competitive predictive performance while producing substantially more compact scores than regression-based alternatives, with 60% fewer rules for classification tasks and 16% fewer rules for time-to-event tasks on average, compared to AutoScore.

What carries the argument

Gradient boosting algorithm adapted to output a limited set of human-computable point-based rules

If this is right

  • Risk scores can incorporate nonlinear relationships without increasing the number of rules a human must evaluate.
  • The approach extends to regression and survival tasks while preserving the compactness advantage.
  • An open C++ implementation with language bindings makes the method immediately usable in production pipelines.
  • Fewer rules directly lower the cognitive load when clinicians or analysts compute scores by hand.
  • Competitive performance on diverse tabular data suggests the method can replace regression-based risk scores in many existing workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boosting adaptation might be applied to other constrained model formats that require human-readable outputs.
  • The observed compactness gains could compound when the method is paired with automated feature selection.
  • Deployment in regulated settings may benefit from the reduced rule count because it simplifies audits and user training.
  • Similar empirical comparisons on larger or streaming data could test whether the compactness benefit scales.

Load-bearing premise

Gradient boosting can be constrained to produce a small number of point rules while still modeling the nonlinear effects present in the data.
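One way to read this premise operationally is a boosting loop that spends a fixed rule budget, emitting one integer-point threshold rule per round. This is a minimal sketch under our own assumptions (depth-1 threshold rules, squared loss, rounding to integer points), not the authors' GBRS implementation:

```python
# Illustrative sketch of boosting under a rule budget: each round fits one
# rule "if x[j] >= t, add p points", with p rounded to an integer so the
# final score stays hand-computable. Not the paper's algorithm.

def fit_stump(X, residuals):
    """Find the (feature, threshold, points) rule that best fits residuals."""
    best = None
    n = len(X)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            hit = {i for i in range(n) if X[i][j] >= t}
            if not hit or len(hit) == n:
                continue
            p = round(sum(residuals[i] for i in hit) / len(hit))
            if p == 0:
                continue  # a zero-point rule adds length without signal
            sse = sum((residuals[i] - (p if i in hit else 0)) ** 2
                      for i in range(n))
            if best is None or sse < best[0]:
                best = (sse, j, t, p)
    return best

def fit_score(X, y, max_rules):
    """Boost until the rule budget is spent: one integer-point rule per round."""
    base = round(sum(y) / len(y))
    pred = [base] * len(y)
    rules = []
    for _ in range(max_rules):
        res = [yi - pi for yi, pi in zip(y, pred)]
        best = fit_stump(X, res)
        if best is None:
            break
        _, j, t, p = best
        rules.append((j, t, p))
        pred = [pi + (p if X[i][j] >= t else 0) for i, pi in enumerate(pred)]
    return base, rules

def score(base, rules, x):
    """Hand-computable evaluation: base points plus points of satisfied rules."""
    return base + sum(p for j, t, p in rules if x[j] >= t)
```

Because each iteration emits exactly one rule, the rule count is capped by construction; whether nonlinear structure survives the integer rounding at small budgets is exactly what the paper's benchmarks are testing.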

What would settle it

A collection of new tabular datasets on which the gradient boosted risk scores require at least as many rules as AutoScore to reach the same level of predictive performance.

Figures

Figures reproduced from arXiv: 2605.02593 by Costa Georgantas, Jonas Richiardi.

Figure 1
Figure 1. Size by number of rules (or stumps) of scores generated by AutoScore and GBRS on the benchmarked datasets. view at source ↗
read the original abstract

Risk scores are an interpretable and actionable class of machine learning models with applications in medicine, insurance, and risk management. Unlike most computational methods, risk scores are designed to be computed by a human by attributing points to a data sample based on a limited set of criteria. The most common approaches for generating risk scores use linear regressions to estimate the effect of selected variables. We propose a simple and effective approach towards building compact and predictive risk scores. We provide an algorithm based on gradient boosting that is capable of modeling nonlinear effects, along with a C++ implementation with Python and R bindings. Through extensive empirical evaluation on twelve tabular datasets spanning regression, classification, and time-to-event tasks, we show that our method achieves competitive predictive performance while producing substantially more compact scores than regression-based alternatives, with 60% fewer rules for classification tasks and 16% fewer rules for time-to-event tasks on average, compared to AutoScore.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an adaptation of gradient boosting to generate compact, human-computable risk scores consisting of a limited set of point-based rules. It claims this approach can model nonlinear effects while achieving competitive predictive performance on regression, classification, and time-to-event tasks across 12 tabular datasets, with substantially greater compactness (60% fewer rules on classification tasks and 16% fewer on time-to-event tasks) than regression-based methods such as AutoScore. A C++ implementation with Python and R bindings is provided.

Significance. If the adaptation successfully preserves nonlinear modeling capacity without post-hoc simplification that degrades performance, the work could meaningfully advance interpretable risk modeling in medicine and insurance by offering a direct boosting-based alternative to linear regression for point scores. The multi-task empirical evaluation and open implementation with language bindings strengthen the contribution for reproducibility and practical adoption.

major comments (2)
  1. [Method description (following abstract)] The manuscript provides no equation, pseudocode, or explicit description of how the gradient boosting update is modified to enforce a bounded point-based risk-score format (e.g., additive point assignment per feature with a hard rule limit). This is load-bearing for the central claim because the headline compactness gains and the assertion that nonlinear effects are retained both depend on the precise structure of this adaptation; without it, the comparison to AutoScore cannot be assessed for fairness.
  2. [Experimental evaluation section] The empirical protocol for enforcing the rule limit during boosting and for the AutoScore baseline is unspecified (e.g., how the number of boosting rounds interacts with the rule limit hyperparameter, or whether post-training pruning is applied). This undermines the reported 60%/16% reductions because the compactness metric may not be computed under identical constraints.
minor comments (2)
  1. [Abstract] The abstract states results on 'twelve tabular datasets' but does not list the datasets or their characteristics; adding a table or reference to the data section would improve clarity.
  2. [Introduction/Method] Notation for the risk-score output (points per rule) is introduced without a formal definition or example computation; a small worked example would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses
  1. Referee: [Method description (following abstract)] The manuscript provides no equation, pseudocode, or explicit description of how the gradient boosting update is modified to enforce a bounded point-based risk-score format (e.g., additive point assignment per feature with a hard rule limit). This is load-bearing for the central claim because the headline compactness gains and the assertion that nonlinear effects are retained both depend on the precise structure of this adaptation; without it, the comparison to AutoScore cannot be assessed for fairness.

    Authors: We agree that the adaptation requires a more explicit mathematical and algorithmic description to support the central claims. The full manuscript outlines the approach in Section 3 by adapting standard gradient boosting to use piecewise-constant weak learners that assign additive integer points per feature bin, with the total number of rules controlled by the number of boosting iterations. However, we acknowledge the current presentation lacks the requested equations and pseudocode. In the revision we will add the modified boosting update rule (including the constrained loss and stopping criterion), the explicit form of the risk score as a sum of point contributions, and a new Algorithm 1 that shows the full procedure. This will make clear how nonlinear effects are captured while enforcing the bounded point-based format. revision: yes

  2. Referee: [Experimental evaluation section] The empirical protocol for enforcing the rule limit during boosting and for the AutoScore baseline is unspecified (e.g., how the number of boosting rounds interacts with the rule limit hyperparameter, or whether post-training pruning is applied). This undermines the reported 60%/16% reductions because the compactness metric may not be computed under identical constraints.

    Authors: We agree that the experimental protocol must be stated unambiguously to allow assessment of the compactness results. In our implementation the rule limit is enforced directly by setting the number of boosting rounds equal to the target number of rules (one rule per iteration, with no post-training pruning). For the AutoScore baseline we used the authors' recommended procedure and selected the number of rules to match the same target compactness level used for our method. We will expand the experimental section (and the supplementary material) with an explicit description of these choices, the hyperparameter settings, and the exact procedure for computing the number of rules under identical constraints for both methods. revision: yes
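The responses above describe weak learners that are piecewise constant over feature bins, each bin carrying an additive integer point value. A minimal sketch of how such a score would be evaluated, under our reading of that description (the feature names, bin edges, and point values are invented):

```python
# Sketch of the score format described in the responses (our reading, not
# the authors' code): each learner is piecewise constant over one feature's
# bins, assigning an integer point value per bin; the score is the sum of
# the selected bin values across learners.
import bisect

# Hypothetical learners: (feature name, bin edges, points per bin).
# Edges split the feature axis; len(points) == len(edges) + 1.
LEARNERS = [
    ("age", [40, 65], [0, 1, 3]),   # <40: 0, 40-64: 1, >=65: 3
    ("bmi", [25, 30], [0, 1, 2]),   # <25: 0, 25-29: 1, >=30: 2
]

def score(sample):
    """Add up the integer points of the bin each feature value falls in."""
    total = 0
    for feature, edges, points in LEARNERS:
        # bisect_right maps a value to its bin index: with edges [40, 65],
        # age 30 -> 0, age 50 -> 1, age 70 -> 2.
        total += points[bisect.bisect_right(edges, sample[feature])]
    return total

print(score({"age": 70, "bmi": 27}))  # -> 3 + 1 = 4
```

Under this reading, the number of rules a human evaluates equals the number of learners (one bin lookup each), which is consistent with the rebuttal's claim that the rule limit equals the number of boosting rounds.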

Circularity Check

0 steps flagged

No circularity: empirical algorithm validated externally

full rationale

The manuscript proposes a gradient-boosting-based algorithm for constructing compact risk scores and supports its claims solely through empirical performance comparisons against AutoScore and other baselines across twelve independent tabular datasets. No derivation chain, uniqueness theorem, or fitted-parameter prediction is presented that reduces by construction to quantities defined inside the method itself. The central results (competitive accuracy plus compactness gains) are external measurements, not tautological restatements of the algorithm's inputs or self-citations. The paper's validation therefore rests on external benchmarks rather than internal circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach assumes risk scores must remain simple enough for manual calculation and that boosting iterations can be constrained to produce such scores; no new physical entities are introduced.

free parameters (1)
  • number of boosting rounds and rule limit
    Hyperparameters that control compactness and are likely tuned on validation data to achieve the reported rule reductions.
axioms (1)
  • domain assumption: risk scores are required to be computed by humans using a limited set of criteria
    Stated in the opening of the abstract as the defining property of risk scores.

pith-pipeline@v0.9.0 · 5445 in / 1342 out tokens · 52175 ms · 2026-05-08T18:36:29.107430+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

31 extracted references · 10 canonical work pages

  1. [1]

    Thomas, L. C. A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting 16, 149–172. ISSN: 0169-2070. https://www.sciencedirect.com/science/article/pii/S0169207000000340 (Apr. 2000)

  2. [2]

    Pehlivanlı, D., Alp, E. A. & Katanalp, B. Introducing the overall risk scoring as an early warning system. Expert Systems with Applications 246, 123232. ISSN: 0957-4174. https://www.sciencedirect.com/science/article/pii/S0957417424000976 (July 2024)

  3. [3]

    Kentucky Pretrial Risk Assessment Instrument Validation — Office of Justice Programs. https://www.ojp.gov/ncjrs/virtual-library/abstracts/kentucky-pretrial-risk-assessment-instrument-validation (2025)

  4. [4]

    Kovalchuk, O. et al. A Scoring Model for Support Decision Making in Criminal Justice. in 2022 12th International Conference on Advanced Computer Information Technologies (ACIT), ISSN: 2770-5226 (Sept. 2022), 116–120. https://ieeexplore.ieee.org/document/9913182 (2025)

  5. [5]

    Antman, E. M. et al. The TIMI risk score for unstable angina/non-ST elevation MI: A method for prognostication and therapeutic decision making. JAMA 284, 835–842. ISSN: 0098-7484 (Aug. 2000)

  6. [6]

    London, A. J. Artificial Intelligence and Black-Box Medical Decisions: Accuracy versus Explainability. Hastings Center Report 49, 15–21. ISSN: 0093-0334, 1552-146X. https://onlinelibrary.wiley.com/doi/10.1002/hast.973 (Jan. 2019)

  7. [7]

    D’Agostino, R. B. et al. General Cardiovascular Risk Profile for Use in Primary Care: The Framingham Heart Study. Circulation 117, 743–753. ISSN: 0009-7322, 1524-4539. https://www.ahajournals.org/doi/10.1161/CIRCULATIONAHA.107.699579 (Feb. 2008)

  8. [8]

    Smith, M. E. B. et al. Early warning system scores for clinical deterioration in hospitalized patients: a systematic review. Annals of the American Thoracic Society 11, 1454–1465. ISSN: 2325-6621 (Nov. 2014)

  9. [9]

    Visseren, F. L. J. et al. 2021 ESC Guidelines on cardiovascular disease prevention in clinical practice. European Heart Journal 42, 3227–3337. ISSN: 1522-9645 (Sept. 2021)

  10. [10]

    Ustun, B. & Rudin, C. Learning Optimized Risk Scores. Journal of Machine Learning Research 20, 1–75. ISSN: 1533-7928. http://jmlr.org/papers/v20/18-615.html (2019)

  11. [11]

    Xie, F., Chakraborty, B., Ong, M. E. H., Goldstein, B. A. & Liu, N. AutoScore: A Machine Learning–Based Automatic Clinical Score Generator and Its Application to Mortality Prediction Using Electronic Health Records. JMIR Medical Informatics 8, e21798. https://medinform.jmir.org/2020/10/e21798 (Oct. 2020)

  12. [12]

    Friedman, J. H. & Popescu, B. E. Predictive learning via rule ensembles. arXiv:0811.1679 (Nov. 2008). http://arxiv.org/abs/0811.1679

  13. [13]

    Nori, H., Jenkins, S., Koch, P. & Caruana, R. InterpretML: A Unified Framework for Machine Learning Interpretability. arXiv:1909.09223 (Sept. 2019). http://arxiv.org/abs/1909.09223

  14. [14]

    Bühlmann, P. & Hothorn, T. Boosting Algorithms: Regularization, Prediction and Model Fitting. Statistical Science 22, 477–505. ISSN: 0883-4237, 2168-8745. https://projecteuclid.org/journals/statistical-science/volume-22/issue-4/Boosting-Algorithms-Regularization-Prediction-and-Model-Fitting/10.1214/07-STS242.full (Nov. 2007)

  15. [15]

    Guennebaud, G., Jacob, B. et al. Eigen v3 (2010). https://eigen.tuxfamily.org

  16. [16]

    Dagum, L. & Menon, R. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science and Engineering 5, 46–55. ISSN: 1558-190X. https://ieeexplore.ieee.org/document/660313 (Jan. 1998)

  17. [17]

    Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, San Francisco, CA, USA, Aug. 2016), 785–794. ISBN: 9781450342322. https://dl.acm.org/doi/10.1145/2939672.2939785

  18. [18]

    Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? arXiv:2207.08815 (July 2022). http://arxiv.org/abs/2207.08815

  19. [19]

    Housing Prices Dataset. https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

  20. [20]

    Warwick Nash, T. S. Abalone (1994). https://archive.ics.uci.edu/dataset/1

    Diabetes Dataset. https://www.kaggle.com/datasets/mathchi/diabetes-data-set

  21. [21]

    Cardiovascular Disease dataset. https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

  22. [22]

    Stefan Aeberhard, M. F. Wine (1992). https://archive.ics.uci.edu/dataset/109

    Insurance Data. https://www.kaggle.com/datasets/moneystore/agencyperformance

  23. [23]

    Home Equity Line of Credit (HELOC). https://www.kaggle.com/datasets/averkiyoliabev/home-equity-line-of-creditheloc (2025)

  24. [24]

    Sudlow, C. et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Medicine 12, e1001779. ISSN: 1549-1676. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001779 (Mar. 2015)

  25. [25]

    Dolezalova, N. et al. Development of an accessible 10-year Digital CArdioVAscular (DiCAVA) risk assessment: a UK Biobank study. European Heart Journal. Digital Health 2, 528–538. ISSN: 2634-3916 (Sept. 2021)

  26. [26]

    Lundberg, S. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874 (Nov. 2017). http://arxiv.org/abs/1705.07874

  27. [27]

    Kamimura, D. et al. Cigarette smoking and incident heart failure: Insights from the Jackson Heart Study. Circulation 137, 2572–2582. ISSN: 0009-7322. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6085757/ (June 2018)

  28. [28]

    Aune, D., Schlesinger, S., Norat, T. & Riboli, E. Tobacco smoking and the risk of heart failure: A systematic review and meta-analysis of prospective studies. European Journal of Preventive Cardiology 26, 279–288. ISSN: 2047-4881 (Feb. 2019)

  29. [29]

    Shams, P., Goyal, A. & Makaryus, A. N. in StatPearls (StatPearls Publishing, Treasure Island (FL), 2025). http://www.ncbi.nlm.nih.gov/books/NBK459131/ (2025)

  30. [30]

    Clinical Classifications Software Refined (CCSR) for ICD-10-CM Diagnoses. https://hcup-us.ahrq.gov/toolssoftware/ccsr/dxccsr.jsp (2023)
