CatBoost: unbiased boosting with categorical features

Liudmila Prokhorenkova , Gleb Gusev , Aleksandr Vorobev , Anna Veronika Dorogush , Andrey Gulin


classification cs.LG

keywords boosting, catboost, algorithm, algorithmic, algorithms, categorical, features, gradient


This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets. Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. In this paper, we provide a detailed analysis of this problem and demonstrate that proposed algorithms solve it effectively, leading to excellent empirical results.
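
The permutation-driven idea in the abstract can be illustrated with a minimal sketch of an ordered target statistic for a single categorical column. This is an illustration under stated assumptions, not CatBoost's implementation: the function name `ordered_target_statistics`, the `prior` and `prior_weight` parameters, and the exact smoothing formula are chosen here for the example. Each row is encoded using only the targets of rows that precede it in a random permutation, so a row's own label never enters its own encoding.

```python
import random
from collections import defaultdict


def ordered_target_statistics(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode a categorical column from 'earlier' rows only (hypothetical helper).

    Rows are visited in a random permutation. The encoding of a row is a
    smoothed mean of the targets of previously visited rows that share its
    category, so the row's own target is never used for its own value.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # the random permutation of row indices

    target_sums = defaultdict(float)  # running sum of targets per category
    counts = defaultdict(int)         # running number of rows per category
    encoded = [0.0] * n

    for idx in order:
        cat = categories[idx]
        # Smoothed mean over the rows seen so far; falls back to `prior`
        # when the category has not been seen yet.
        encoded[idx] = (target_sums[cat] + prior_weight * prior) / (
            counts[cat] + prior_weight
        )
        # Only now does this row's target become visible to later rows.
        target_sums[cat] += targets[idx]
        counts[cat] += 1

    return encoded


cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
prior = sum(ys) / len(ys)
print(ordered_target_statistics(cats, ys, prior))
```

A plain (non-ordered) target mean would instead use every row of the category, including the row being encoded, which is how the label can leak into the feature.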



Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
cs.LG 2026-05 unverdicted novelty 7.0

TFM-Retouche is an architecture-agnostic input-space residual adapter that improves tabular foundation model accuracy on 51 datasets by learning input corrections through the frozen backbone, with an identity guard to...

WOODELF-HD: Efficient Background SHAP for High-Depth Decision Trees
cs.LG 2026-04 conditional novelty 7.0

WoodelfHD reduces Background SHAP preprocessing for decision trees from 3^D to 2^D complexity, enabling exact computation on depths up to 21 with reported speedups of 33x to 162x.

TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
cs.LG 2026-05 unverdicted novelty 6.0

TFM-Retouche is an input-space residual adapter that lifts TabICLv2 performance by 56 Elo points on 51 tabular datasets while remaining architecture-agnostic and computationally light.

MuViS: Multimodal Virtual Sensing Benchmark
eess.SP 2026-03 unverdicted novelty 6.0

MuViS is a new unified benchmark showing that neither gradient-boosted trees nor deep neural networks hold a universal advantage in multimodal virtual sensing.

RelAgent: LLM Agents as Data Scientists for Relational Learning
cs.LG 2026-05 unverdicted novelty 5.0

RelAgent uses an LLM agent to autonomously generate SQL feature programs paired with classical models for interpretable relational learning predictions that execute efficiently on standard databases.

Accelerating the Design of Resorbable Magnesium Alloys: A Machine Learning Approach to Property Prediction
cond-mat.mtrl-sci 2026-04 conditional novelty 4.0

CatBoost and other ensemble ML models achieve R² scores of 0.95, 0.916, and 0.903 on yield strength, ultimate tensile strength, and elongation for resorbable Mg alloys, with SHAP analysis highlighting processing condi...