pith. machine review for the scientific record.

arXiv: 2605.02593 · v1 · submitted 2026-05-04 · 💻 cs.LG

Recognition: 3 theorem links


Gradient Boosted Risk Scores

Costa Georgantas, Jonas Richiardi


Pith reviewed 2026-05-08 18:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords risk scores · gradient boosting · interpretable machine learning · tabular data · classification · survival analysis · medical decision support

The pith

Gradient boosting can be adapted to build compact point-based risk scores that match the accuracy of regression-based alternatives while using far fewer rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a gradient boosting algorithm for creating risk scores that humans can compute by hand from a small number of rules. Traditional approaches rely on linear regression to assign points to variables, but this can miss nonlinear patterns in the data. The new method uses boosting to capture those patterns while keeping the output format as simple addition of points. Tests on twelve tabular datasets for regression, classification, and time-to-event prediction show competitive accuracy alongside substantially more compact scores. In domains such as medicine and insurance, fewer rules reduce the chance of calculation errors and increase the chance that people will actually use the model.
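The "simple addition of points" format is easiest to see with a small hand-computable example. The rules, thresholds, and point values below are invented for illustration; they are not fitted scores from the paper:

```python
# Hypothetical point-based risk score: each satisfied rule adds integer
# points, and the total is the score a human would compute by hand.
# These three rules are illustrative, not taken from the paper.
RULES = [
    ("age >= 65",  lambda p: p["age"] >= 65,  2),
    ("sbp >= 140", lambda p: p["sbp"] >= 140, 1),
    ("smoker",     lambda p: p["smoker"],     3),
]

def total_points(patient):
    """Sum the points of every rule the patient satisfies."""
    return sum(points for _, applies, points in RULES if applies(patient))

# A clinician could compute this by hand: 2 + 0 + 3 = 5 points.
patient = {"age": 70, "sbp": 120, "smoker": True}
print(total_points(patient))  # -> 5
```

With only three rules there are three threshold checks and one addition per sample, which is why a smaller rule count translates directly into fewer opportunities for manual error.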

Core claim

We provide an algorithm based on gradient boosting that is capable of modeling nonlinear effects for building compact and predictive risk scores, along with a C++ implementation with Python and R bindings. Through extensive empirical evaluation on twelve tabular datasets spanning regression, classification, and time-to-event tasks, we show that our method achieves competitive predictive performance while producing substantially more compact scores than regression-based alternatives, with 60% fewer rules for classification tasks and 16% fewer rules for time-to-event tasks on average, compared to AutoScore.

What carries the argument

Gradient boosting algorithm adapted to output a limited set of human-computable point-based rules

If this is right

  • Risk scores can incorporate nonlinear relationships without increasing the number of rules a human must evaluate.
  • The approach extends to regression and survival tasks while preserving the compactness advantage.
  • An open C++ implementation with language bindings makes the method immediately usable in production pipelines.
  • Fewer rules directly lower the cognitive load when clinicians or analysts compute scores by hand.
  • Competitive performance on diverse tabular data suggests the method can replace regression-based risk scores in many existing workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boosting adaptation might be applied to other constrained model formats that require human-readable outputs.
  • The observed compactness gains could compound when the method is paired with automated feature selection.
  • Deployment in regulated settings may benefit from the reduced rule count because it simplifies audits and user training.
  • Similar empirical comparisons on larger or streaming data could test whether the compactness benefit scales.

Load-bearing premise

Gradient boosting can be constrained to produce a small number of point rules while still modeling the nonlinear effects present in the data.
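One way to read this premise operationally is a boosting loop that spends a fixed rule budget, emitting one integer-point threshold rule per round. This is a minimal sketch under our own assumptions (depth-1 threshold rules, squared loss, rounding to integer points), not the authors' GBRS implementation:

```python
# Illustrative sketch of boosting under a rule budget: each round fits one
# rule "if x[j] >= t, add p points", with p rounded to an integer so the
# final score stays hand-computable. Not the paper's algorithm.

def fit_stump(X, residuals):
    """Find the (feature, threshold, points) rule that best fits residuals."""
    best = None
    n = len(X)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            hit = {i for i in range(n) if X[i][j] >= t}
            if not hit or len(hit) == n:
                continue
            p = round(sum(residuals[i] for i in hit) / len(hit))
            if p == 0:
                continue  # a zero-point rule adds length without signal
            sse = sum((residuals[i] - (p if i in hit else 0)) ** 2
                      for i in range(n))
            if best is None or sse < best[0]:
                best = (sse, j, t, p)
    return best

def fit_score(X, y, max_rules):
    """Boost until the rule budget is spent: one integer-point rule per round."""
    base = round(sum(y) / len(y))
    pred = [base] * len(y)
    rules = []
    for _ in range(max_rules):
        res = [yi - pi for yi, pi in zip(y, pred)]
        best = fit_stump(X, res)
        if best is None:
            break
        _, j, t, p = best
        rules.append((j, t, p))
        pred = [pi + (p if X[i][j] >= t else 0) for i, pi in enumerate(pred)]
    return base, rules

def score(base, rules, x):
    """Hand-computable evaluation: base points plus points of satisfied rules."""
    return base + sum(p for j, t, p in rules if x[j] >= t)
```

Because each iteration emits exactly one rule, the rule count is capped by construction; whether nonlinear structure survives the integer rounding at small budgets is exactly what the paper's benchmarks are testing.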

What would settle it

A collection of new tabular datasets on which the gradient boosted risk scores require at least as many rules as AutoScore to reach the same level of predictive performance.

Figures

Figures reproduced from arXiv: 2605.02593 by Costa Georgantas, Jonas Richiardi.

Figure 1
Figure 1. Size by number of rules (or stumps) of scores generated by AutoScore and GBRS on the benchmarked datasets. view at source ↗
read the original abstract

Risk scores are an interpretable and actionable class of machine learning models with applications in medicine, insurance, and risk management. Unlike most computational methods, risk scores are designed to be computed by a human by attributing points to a data sample based on a limited set of criteria. The most common approaches for generating risk scores use linear regressions to estimate the effect of selected variables. We propose a simple and effective approach towards building compact and predictive risk scores. We provide an algorithm based on gradient boosting that is capable of modeling nonlinear effects, along with a C++ implementation with Python and R bindings. Through extensive empirical evaluation on twelve tabular datasets spanning regression, classification, and time-to-event tasks, we show that our method achieves competitive predictive performance while producing substantially more compact scores than regression-based alternatives, with 60% fewer rules for classification tasks and 16% fewer rules for time-to-event tasks on average, compared to AutoScore.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an adaptation of gradient boosting to generate compact, human-computable risk scores consisting of a limited set of point-based rules. It claims this approach can model nonlinear effects while achieving competitive predictive performance on regression, classification, and time-to-event tasks across 12 tabular datasets, with substantially greater compactness (60% fewer rules on classification tasks and 16% fewer on time-to-event tasks) than regression-based methods such as AutoScore. A C++ implementation with Python and R bindings is provided.

Significance. If the adaptation successfully preserves nonlinear modeling capacity without post-hoc simplification that degrades performance, the work could meaningfully advance interpretable risk modeling in medicine and insurance by offering a direct boosting-based alternative to linear regression for point scores. The multi-task empirical evaluation and open implementation with language bindings strengthen the contribution for reproducibility and practical adoption.

major comments (2)
  1. [Method description (following abstract)] The manuscript provides no equation, pseudocode, or explicit description of how the gradient boosting update is modified to enforce a bounded point-based risk-score format (e.g., additive point assignment per feature with a hard rule limit). This is load-bearing for the central claim because the headline compactness gains and the assertion that nonlinear effects are retained both depend on the precise structure of this adaptation; without it, the comparison to AutoScore cannot be assessed for fairness.
  2. [Experimental evaluation section] The empirical protocol for enforcing the rule limit during boosting and for the AutoScore baseline is unspecified (e.g., how the number of boosting rounds interacts with the rule limit hyperparameter, or whether post-training pruning is applied). This undermines the reported 60%/16% reductions because the compactness metric may not be computed under identical constraints.
minor comments (2)
  1. [Abstract] The abstract states results on 'twelve tabular datasets' but does not list the datasets or their characteristics; adding a table or reference to the data section would improve clarity.
  2. [Introduction/Method] Notation for the risk-score output (points per rule) is introduced without a formal definition or example computation; a small worked example would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses
  1. Referee: [Method description (following abstract)] The manuscript provides no equation, pseudocode, or explicit description of how the gradient boosting update is modified to enforce a bounded point-based risk-score format (e.g., additive point assignment per feature with a hard rule limit). This is load-bearing for the central claim because the headline compactness gains and the assertion that nonlinear effects are retained both depend on the precise structure of this adaptation; without it, the comparison to AutoScore cannot be assessed for fairness.

    Authors: We agree that the adaptation requires a more explicit mathematical and algorithmic description to support the central claims. The full manuscript outlines the approach in Section 3 by adapting standard gradient boosting to use piecewise-constant weak learners that assign additive integer points per feature bin, with the total number of rules controlled by the number of boosting iterations. However, we acknowledge the current presentation lacks the requested equations and pseudocode. In the revision we will add the modified boosting update rule (including the constrained loss and stopping criterion), the explicit form of the risk score as a sum of point contributions, and a new Algorithm 1 that shows the full procedure. This will make clear how nonlinear effects are captured while enforcing the bounded point-based format. revision: yes

  2. Referee: [Experimental evaluation section] The empirical protocol for enforcing the rule limit during boosting and for the AutoScore baseline is unspecified (e.g., how the number of boosting rounds interacts with the rule limit hyperparameter, or whether post-training pruning is applied). This undermines the reported 60%/16% reductions because the compactness metric may not be computed under identical constraints.

    Authors: We agree that the experimental protocol must be stated unambiguously to allow assessment of the compactness results. In our implementation the rule limit is enforced directly by setting the number of boosting rounds equal to the target number of rules (one rule per iteration, with no post-training pruning). For the AutoScore baseline we used the authors' recommended procedure and selected the number of rules to match the same target compactness level used for our method. We will expand the experimental section (and the supplementary material) with an explicit description of these choices, the hyperparameter settings, and the exact procedure for computing the number of rules under identical constraints for both methods. revision: yes
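The responses above describe weak learners that are piecewise constant over feature bins, each bin carrying an additive integer point value. A minimal sketch of how such a score would be evaluated, under our reading of that description (the feature names, bin edges, and point values are invented):

```python
# Sketch of the score format described in the responses (our reading, not
# the authors' code): each learner is piecewise constant over one feature's
# bins, assigning an integer point value per bin; the score is the sum of
# the selected bin values across learners.
import bisect

# Hypothetical learners: (feature name, bin edges, points per bin).
# Edges split the feature axis; len(points) == len(edges) + 1.
LEARNERS = [
    ("age", [40, 65], [0, 1, 3]),   # <40: 0, 40-64: 1, >=65: 3
    ("bmi", [25, 30], [0, 1, 2]),   # <25: 0, 25-29: 1, >=30: 2
]

def score(sample):
    """Add up the integer points of the bin each feature value falls in."""
    total = 0
    for feature, edges, points in LEARNERS:
        # bisect_right maps a value to its bin index: with edges [40, 65],
        # age 30 -> 0, age 50 -> 1, age 70 -> 2.
        total += points[bisect.bisect_right(edges, sample[feature])]
    return total

print(score({"age": 70, "bmi": 27}))  # -> 3 + 1 = 4
```

Under this reading, the number of rules a human evaluates equals the number of learners (one bin lookup each), which is consistent with the rebuttal's claim that the rule limit equals the number of boosting rounds.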

Circularity Check

0 steps flagged

No circularity: empirical algorithm validated externally

full rationale

The manuscript proposes a gradient-boosting-based algorithm for constructing compact risk scores and supports its claims solely through empirical performance comparisons against AutoScore and other baselines across twelve independent tabular datasets. No derivation chain, uniqueness theorem, or fitted-parameter prediction is presented that reduces by construction to quantities defined inside the method itself. The central results (competitive accuracy plus compactness gains) are external measurements, not tautological restatements of the algorithm's inputs or self-citations. The paper's validation therefore rests on external benchmarks rather than internal circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach assumes risk scores must remain simple enough for manual calculation and that boosting iterations can be constrained to produce such scores; no new physical entities are introduced.

free parameters (1)
  • number of boosting rounds and rule limit
    Hyperparameters that control compactness and are likely tuned on validation data to achieve the reported rule reductions.
axioms (1)
  • domain assumption: risk scores are required to be computed by humans using a limited set of criteria
    Stated in the opening of the abstract as the defining property of risk scores.

pith-pipeline@v0.9.0 · 5445 in / 1342 out tokens · 52175 ms · 2026-05-08T18:36:29.107430+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

31 extracted references · 10 canonical work pages

  1. [1]

    Thomas, L. C. A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting 16, 149–172. ISSN: 0169-2070. https://www.sciencedirect.com/science/article/pii/S0169207000000340 (Apr. 2000)

  2. [2]

    Pehlivanlı, D., Alp, E. A. & Katanalp, B. Introducing the overall risk scoring as an early warning system. Expert Systems with Applications 246, 123232. ISSN: 0957-4174. https://www.sciencedirect.com/science/article/pii/S0957417424000976 (July 2024)

  3. [3]

    Kentucky Pretrial Risk Assessment Instrument Validation — Office of Justice Programs. https://www.ojp.gov/ncjrs/virtual-library/abstracts/kentucky-pretrial-risk-assessment-instrument-validation (2025)

  4. [4]

    Kovalchuk, O. et al. A Scoring Model for Support Decision Making in Criminal Justice. in 2022 12th International Conference on Advanced Computer Information Technologies (ACIT), ISSN: 2770-5226 (Sept. 2022), 116–120. https://ieeexplore.ieee.org/document/9913182 (2025)

  5. [5]

    Antman, E. M. et al. The TIMI risk score for unstable angina/non-ST elevation MI: A method for prognostication and therapeutic decision making. JAMA 284, 835–842. ISSN: 0098-7484 (Aug. 2000)

  6. [6]

    London, A. J. Artificial Intelligence and Black-Box Medical Decisions: Accuracy versus Explainability. Hastings Center Report 49, 15–21. ISSN: 0093-0334, 1552-146X. https://onlinelibrary.wiley.com/doi/10.1002/hast.973 (Jan. 2019)

  7. [7]

    D’Agostino, R. B. et al. General Cardiovascular Risk Profile for Use in Primary Care: The Framingham Heart Study. Circulation 117, 743–753. ISSN: 0009-7322, 1524-4539. https://www.ahajournals.org/doi/10.1161/CIRCULATIONAHA.107.699579 (Feb. 2008)

  8. [8]

    Smith, M. E. B. et al. Early warning system scores for clinical deterioration in hospitalized patients: a systematic review. Annals of the American Thoracic Society 11, 1454–1465. ISSN: 2325-6621 (Nov. 2014)

  9. [9]

    Visseren, F. L. J. et al. 2021 ESC Guidelines on cardiovascular disease prevention in clinical practice. European Heart Journal 42, 3227–3337. ISSN: 1522-9645 (Sept. 2021)

  10. [10]

    Ustun, B. & Rudin, C. Learning Optimized Risk Scores. Journal of Machine Learning Research 20, 1–75. ISSN: 1533-7928. http://jmlr.org/papers/v20/18-615.html (2019)

  11. [11]

    Xie, F., Chakraborty, B., Ong, M. E. H., Goldstein, B. A. & Liu, N. AutoScore: A Machine Learning–Based Automatic Clinical Score Generator and Its Application to Mortality Prediction Using Electronic Health Records. JMIR Medical Informatics 8, e21798. https://medinform.jmir.org/2020/10/e21798 (Oct. 2020)

  12. [12]

    Friedman, J. H. & Popescu, B. E. Predictive learning via rule ensembles. arXiv:0811.1679 (Nov. 2008). http://arxiv.org/abs/0811.1679

  13. [13]

    Nori, H., Jenkins, S., Koch, P. & Caruana, R. InterpretML: A Unified Framework for Machine Learning Interpretability. arXiv:1909.09223 (Sept. 2019). http://arxiv.org/abs/1909.09223

  14. [14]

    Bühlmann, P. & Hothorn, T. Boosting Algorithms: Regularization, Prediction and Model Fitting. Statistical Science 22, 477–505. ISSN: 0883-4237, 2168-8745. https://projecteuclid.org/journals/statistical-science/volume-22/issue-4/Boosting-Algorithms-Regularization-Prediction-and-Model-Fitting/10.1214/07-STS242.full (Nov. 2007)

  15. [15]

    Guennebaud, G., Jacob, B. et al. Eigen v3 (2010). https://eigen.tuxfamily.org

  16. [16]

    Dagum, L. & Menon, R. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science and Engineering 5, 46–55. ISSN: 1558-190X. https://ieeexplore.ieee.org/document/660313 (Jan. 1998)

  17. [17]

    Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, San Francisco, CA, USA, Aug. 2016), 785–794. ISBN: 9781450342322. https://dl.acm.org/doi/10.1145/2939672.2939785

  18. [18]

    Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? arXiv:2207.08815 (July 2022). http://arxiv.org/abs/2207.08815

  19. [19]

    Housing Prices Dataset. https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

  20. [20]

    Warwick Nash, T. S. Abalone (1994). https://archive.ics.uci.edu/dataset/1

    Diabetes Dataset. https://www.kaggle.com/datasets/mathchi/diabetes-data-set

  21. [21]

    Cardiovascular Disease dataset. https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

  22. [22]

    Stefan Aeberhard, M. F. Wine (1992). https://archive.ics.uci.edu/dataset/109

    Insurance Data. https://www.kaggle.com/datasets/moneystore/agencyperformance

  23. [23]

    Home Equity Line of Credit (HELOC). https://www.kaggle.com/datasets/averkiyoliabev/home-equity-line-of-creditheloc (2025)

  24. [24]

    Sudlow, C. et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Medicine 12, e1001779. ISSN: 1549-1676. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001779 (Mar. 2015)

  25. [25]

    Dolezalova, N. et al. Development of an accessible 10-year Digital CArdioVAscular (DiCAVA) risk assessment: a UK Biobank study. European Heart Journal. Digital Health 2, 528–538. ISSN: 2634-3916 (Sept. 2021)

  26. [26]

    Lundberg, S. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874 (Nov. 2017). http://arxiv.org/abs/1705.07874

  27. [27]

    Kamimura, D. et al. Cigarette smoking and incident heart failure: Insights from the Jackson Heart Study. Circulation 137, 2572–2582. ISSN: 0009-7322. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6085757/ (June 2018)

  28. [28]

    Aune, D., Schlesinger, S., Norat, T. & Riboli, E. Tobacco smoking and the risk of heart failure: A systematic review and meta-analysis of prospective studies. European Journal of Preventive Cardiology 26, 279–288. ISSN: 2047-4881 (Feb. 2019)

  29. [29]

    Shams, P., Goyal, A. & Makaryus, A. N. in StatPearls (StatPearls Publishing, Treasure Island (FL), 2025). http://www.ncbi.nlm.nih.gov/books/NBK459131/ (2025)

  30. [30]

    Clinical Classifications Software Refined (CCSR) for ICD-10-CM Diagnoses. https://hcup-us.ahrq.gov/toolssoftware/ccsr/dxccsr.jsp (2023)
