pith. sign in

arxiv: 2604.07143 · v1 · submitted 2026-04-08 · 💻 cs.LG · stat.AP· stat.ML

Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees

Pith reviewed 2026-05-10 18:08 UTC · model grok-4.3

classification 💻 cs.LG stat.APstat.ML
keywords clusteringdivisive clusteringminimum spanning treemutual reachabilityrobust clusteringdensity-based clusteringHDBSCAN
0
0 comments X

The pith

Lumbermark detects clusters of varying sizes, densities, and shapes by iteratively chopping protruding segments from a mutual reachability minimum spanning tree.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lumbermark as a divisive clustering algorithm that begins with the mutual reachability minimum spanning tree of a dataset and repeatedly severs large limbs at their narrow connecting segments. This process is designed to recover groups that differ in size, density, and geometry while limiting the distorting effect of scattered low-density points. Mutual reachability distances are used to smooth local variations, so that noise between clusters or outliers at edges exert less pull on the cuts. The resulting partitions allow direct user control over cluster sizes, offering a practical alternative when standard density-based methods leave the number or scale of groups unspecified. Practitioners would value such a tool for obtaining reliable groupings from irregular real data without extensive retuning.

Core claim

Lumbermark is a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. It works by iteratively chopping off large limbs connected by protruding segments of a dataset's mutual reachability minimum spanning tree. The use of mutual reachability distances smoothens the data distribution and decreases the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm produces partitions with user-specified sizes and serves as an alternative to HDBSCAN.

What carries the argument

The mutual reachability minimum spanning tree, which the algorithm repeatedly severs at protruding segments to isolate clusters while reducing the effect of density variations and noise.

If this is right

  • The method separates clusters even when they differ substantially in size, density, and shape.
  • Low-density noise and peripheral outliers exert reduced influence on the final partition.
  • Users obtain direct control over the sizes of the clusters in the output.
  • A fast implementation exists that can serve as a drop-in alternative when HDBSCAN output sizes are not controllable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tree-chopping view may connect to other graph-based clustering techniques that operate on minimum spanning trees.
  • The user-specified size control suggests utility in applications where balanced or constrained group sizes are required.
  • Benchmark performance indicates the approach could extend to domains with high-dimensional or irregularly shaped data groups.

Load-bearing premise

Mutual reachability distances smooth the data distribution enough for the chopping process to separate clusters accurately despite the presence of noise and outliers.

What would settle it

Running the algorithm on benchmark datasets with known ground-truth partitions and user-specified sizes, then checking whether the recovered clusters match the true structure in cases with added noise bridges or density gradients.

Figures

Figures reproduced from arXiv: 2604.07143 by Marek Gagolewski.

Figure 1
Figure 1. Figure 1: Three copies of an example dataset with four clusters: two horizontal segments and two Gaussian blobs connected by a point chain. Additionally, its Euclidean minimum spanning tree (left) as well as two 25-mutual reachability minimum spanning trees (centre, right) are depicted. Edge colour intensities correspond to the underlying distances with the darkest edge being the longest in each case. By increasing … view at source ↗
Figure 2
Figure 2. Figure 2: Proportion of datasets with AR ≥ 0.95 and < 0.8 and the average AR as a function of min_cluster_factor 𝑓 and smoothing parameter 𝑀 in the case of the Lumbermark algorithm on 47 datasets. Datasets with imbalanced and balanced reference cluster counts are studied separately. M. Gagolewski: Preprint; Last updated on 9 April 2026 Page 11 of 13 [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Proportion of datasets with AR ≥ 0.95 and < 0.8 and the average AR as a function of the Gini index threshold 𝐺 and smoothing parameter 𝑀 in the case of the Genie algorithm on 47 datasets. Datasets with imbalanced and balanced reference cluster counts are studied separately. M. Gagolewski: Preprint; Last updated on 9 April 2026 Page 12 of 13 [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
read the original abstract

We introduce Lumbermark, a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. Lumbermark iteratively chops off large limbs connected by protruding segments of a dataset's mutual reachability minimum spanning tree. The use of mutual reachability distances smoothens the data distribution and decreases the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm can be viewed as an alternative to HDBSCAN that produces partitions with user-specified sizes. A fast, easy-to-use implementation of the new method is available in the open-source 'lumbermark' package for Python and R. We show that Lumbermark performs well on benchmark data and hope it will prove useful to data scientists and practitioners across different fields.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Lumbermark, a divisive clustering algorithm that iteratively chops large limbs from the mutual reachability minimum spanning tree (MR-MST) of a dataset to identify clusters of varying sizes, densities, and shapes. It claims that mutual reachability distances smooth the distribution and reduce the effect of noise/outliers, positioning the method as a user-specified-size alternative to HDBSCAN. An open-source Python/R implementation is provided, with claims of good performance on benchmarks.

Significance. If the robustness claims hold, Lumbermark would offer a practical, interpretable alternative for noisy clustering tasks where cluster-size control is desired, extending MST-based ideas from HDBSCAN with a chopping heuristic. The open-source package supports reproducibility and adoption by practitioners.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method description): The central robustness claim—that mutual reachability distances 'smoothens the data distribution and decreases the influence of low-density objects' enabling accurate limb separation—lacks any theoretical analysis, bounds, or controlled failure-mode tests (e.g., for bridge noise, density gradients, or high dimensions), leaving the key assumption unverified despite being load-bearing for the 'resistant' guarantee.
  2. [§4] §4 (Experiments): Benchmark results are asserted without reported details on datasets, baselines (including HDBSCAN variants), metrics, or ablation tests isolating the mutual-reachability smoothing effect, so the performance claims cannot be assessed for support.
minor comments (2)
  1. [§3] The manuscript would benefit from pseudocode or a clear algorithmic listing of the chopping procedure (limb identification and cut criteria) to aid implementation and understanding.
  2. [§4] Figure captions and axis labels in the experimental plots should explicitly state the evaluation metrics and number of runs for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions to improve clarity and support for the claims.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Method description): The central robustness claim—that mutual reachability distances 'smoothens the data distribution and decreases the influence of low-density objects' enabling accurate limb separation—lacks any theoretical analysis, bounds, or controlled failure-mode tests (e.g., for bridge noise, density gradients, or high dimensions), leaving the key assumption unverified despite being load-bearing for the 'resistant' guarantee.

    Authors: We agree that the manuscript provides no formal theoretical analysis or bounds on the smoothing effect of mutual reachability distances. This property is drawn from the established use in HDBSCAN, where it reduces outlier influence via core-distance augmentation. In revision, we will expand §3 with additional intuition and controlled synthetic experiments (e.g., datasets with bridge noise, varying density gradients, and moderate dimensions) to demonstrate limb separation behavior empirically. Deriving general theoretical guarantees for the full heuristic procedure is beyond the current scope and noted as future work. revision: partial

  2. Referee: [§4] §4 (Experiments): Benchmark results are asserted without reported details on datasets, baselines (including HDBSCAN variants), metrics, or ablation tests isolating the mutual-reachability smoothing effect, so the performance claims cannot be assessed for support.

    Authors: The experimental section uses standard clustering benchmarks and compares against HDBSCAN, but we acknowledge the details may not have been presented with sufficient explicitness. In the revised manuscript, we will expand §4 to include: (i) a table of all datasets with sizes and characteristics, (ii) precise baseline configurations (HDBSCAN with multiple min_cluster_size settings), (iii) the full set of metrics (adjusted Rand index, normalized mutual information, etc.), and (iv) an ablation comparing results with and without mutual reachability distances. The open-source package already enables full reproduction of the reported runs. revision: yes

Circularity Check

0 steps flagged

No circularity; algorithm is a direct procedural description built on external MST concepts

full rationale

The paper presents Lumbermark as an iterative chopping procedure on a mutual reachability MST, stating its properties (smoothing low-density points) as a direct consequence of the established mutual reachability distance definition from prior literature. No equations, parameters, or steps are defined in terms of the target outputs; no predictions are fitted then renamed; no uniqueness theorems or ansatzes are smuggled via self-citation; the central claim does not reduce to its inputs by construction. The derivation chain is self-contained as a new algorithmic recipe rather than a tautological re-expression of fitted data or prior self-references.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim relies on the effectiveness of the chopping procedure on the mutual reachability MST, which is a new application of these concepts. No new entities are invented. Assessment is limited to abstract description.

free parameters (1)
  • user-specified cluster sizes
    The algorithm produces partitions with user-specified sizes, making the target cluster size a key input parameter.
axioms (1)
  • domain assumption Mutual reachability distances smoothen the data distribution and decrease the influence of low-density objects
    Directly stated in the abstract as a property of the approach.

pith-pipeline@v0.9.0 · 5425 in / 1381 out tokens · 63881 ms · 2026-05-10T18:08:25.938019+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

  1. [1]

    Asystematicreviewofmachinelearningmethodsinsoftware testing

    Ajorloo,S.,Jamarani,A.,Kashfi,M.,HaghiKashani,M.,Najafizadeh,A.,2024. Asystematicreviewofmachinelearningmethodsinsoftware testing. Applied Soft Computing 162, 111805. DOI:10.1016/j.asoc.2024.111805

  2. [2]

    Breunig, Hans-Peter Kriegel, and Joerg Sander

    Ankerst, M., Breunig, M., Kriegel, H.P., Sander, J., 1999. OPTICS: Ordering points to identify the clustering structure, in: Proc. ACM Intl. Conf. Management of Data (SIGMOD), pp. 49–60. DOI:10.1145/304182.304187

  3. [3]

    Communications of the ACM18(9), 509–517 (1975).https://doi.org/10.1145/ 361002.361007

    Bentley, J., 1975. Multidimensional binary search trees used for associative searching. Communications of the ACM 18, 509–517. DOI:10. 1145/361002.361007

  4. [4]

    Foundations of Data Science

    Blum, A., Hopcroft, J., Kannan, R., 2020. Foundations of Data Science. Cambridge University Press. URL:https://www.cs.cornell. edu/jeh/book.pdf

  5. [5]

    O jistém problému minimálním

    Borůvka, O., 1926. O jistém problému minimálním. Práce Moravské Přírodovědecké Společnosti v Brně 3, 37–58

  6. [6]

    In: Advances in Knowledge Discovery and Data Mining

    Campello, R.J., Moulavi, D., Sander, J., 2013. Density-based clustering based on hierarchical density estimates. LNAI 7819, 160–172. DOI:10.1007/978-3-642-37456-2_14

  7. [7]

    ACM Trans- actions on Knowledge Discovery from Data10(1), 1–51 (2015) https://doi.org/ 10.1145/2733381

    Campello,R.J.,Moulavi,D.,Zimek,A.,Sander,J.,2015.Hierarchicaldensityestimatesfordataclustering,visualization,andoutlierdetection. ACM Transactions on Knowledge Discovery from Data 10, 5:1–5:51. DOI:10.1145/2733381

  8. [8]

    Rates of convergence for the cluster tree, in: Advances in Neural Information Processing Systems, pp

    Chaudhuri, K., Dasgupta, S., 2010. Rates of convergence for the cluster tree, in: Advances in Neural Information Processing Systems, pp. 343–351

  9. [9]

    Cluster expansions:T-walks, labeled pose=ts and matrix calculations

    Dogan, A., Birant, D., 2021. Machinelearninganddatamininginmanufacturing. ExpertSystemswithApplications166. DOI:10.1016/j. eswa.2020.114060

  10. [10]

    A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proc

    Ester, M., Kriegel, H.P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proc. KDD’96, pp. 226–231

  11. [11]

    Florek,K.,Łukaszewicz,J.,Perkal,J.,Steinhaus,H.,Zubrzycki,S.,1951.Surlaliaisonetladivisiondespointsd’unensemblefini.Colloquium Mathematicum 2, 282–285

  12. [12]

    Iterative shrinking method for clustering problems

    Fränti, P., Virmajoki, O., 2006. Iterative shrinking method for clustering problems. Pattern Recognition 39, 761–765

  13. [13]

    Data-driven estimation of tbm performance in soft soils using density-based spatial clustering and random forest

    Fu, X., Feng, L., Zhang, L., 2022. Data-driven estimation of tbm performance in soft soils using density-based spatial clustering and random forest. Applied Soft Computing 120, 108686. DOI:10.1016/j.asoc.2022.108686

  14. [14]

    A framework for benchmarking clustering algorithms

    Gagolewski, M., 2022. A framework for benchmarking clustering algorithms. SoftwareX 20, 101270. URL:https:// clustering-benchmarks.gagolewski.com/, DOI:10.1016/j.softx.2022.101270. M. Gagolewski:Preprint; Last updated on 9 April 2026Page 9 of 13 Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees

  15. [15]

    Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm

    Gagolewski, M., Bartoszuk, M., Cena, A., 2016. Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences 363, 8–23

  16. [16]

    Clustering with minimum spanning trees: How good can it be? Journal of Classification 42, 90–112

    Gagolewski, M., Cena, A., Bartoszuk, M., Brzozowski, L., 2025. Clustering with minimum spanning trees: How good can it be? Journal of Classification 42, 90–112. DOI:10.1007/s00357-024-09483-1

  17. [17]

    Minimum spanning trees and single linkage cluster analysis

    Gower, J., Ross, G., 1969. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics) 18, 54–64

  18. [18]

    Kernel-based fuzzy clustering: A comparative experimental study

    Graves, D., Pedrycz, W., 2010. Kernel-based fuzzy clustering: A comparative experimental study. Fuzzy Sets and Systems 161, 522–543

  19. [19]

    What are the true clusters? Pattern Recognition Letters 64, 53–62

    Hennig, C., 2015. What are the true clusters? Pattern Recognition Letters 64, 53–62. DOI:10.1016/j.patrec.2015.04.009

  20. [20]

    Comparing partitions

    Hubert, L., Arabie, P., 1985. Comparing partitions. Journal of Classification 2, 193–218

  21. [21]

    Cluster analysis: A modern statistical review

    Jaeger, A., Banks, D., 2023. Cluster analysis: A modern statistical review. Wiley Interdisciplinary Reviews: Computational Statistics 15, e1597. DOI:10.1002/wics.1597

  22. [22]

    Data clustering: A user’s dilemma

    Jain, A., Law, M., 2005. Data clustering: A user’s dilemma. Lecture Notes in Computer Science 3776, 1–10

  23. [23]

    CHAMELEON: Hierarchical clustering using dynamic modeling

    Karypis, G., Han, E., Kumar, V., 1999. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer 32, 68–75. DOI:10. 1109/2.781637

  24. [24]

    Ontheshortestspanningsubtreeofagraphandthetravelingsalesmanproblem

    Kruskal,J.,1956. Ontheshortestspanningsubtreeofagraphandthetravelingsalesmanproblem. ProceedingsoftheAmericanMathematical Society 7, 48–50

  25. [25]

    A probability theory of cluster analysis

    Ling, R.F., 1973. A probability theory of cluster analysis. Journal of the American Statistical Association 68, 159–164

  26. [26]

    A multi-stage hierarchical clustering algorithm based on centroid of tree and cut edge constraint

    Ma, Y., Lin, H., Wang, Y., Huang, H., He, X., 2021. A multi-stage hierarchical clustering algorithm based on centroid of tree and cut edge constraint. Information Sciences 557, 194–219. DOI:10.1016/j.ins.2020.12.016

  27. [27]

    It’sokaytobeskinny,ifyourfriendsarefat,in: The4thCGCWorkshoponComputationalGeometry

    Maneewongvatana,S.,Mount,D.,1999. It’sokaytobeskinny,ifyourfriendsarefat,in: The4thCGCWorkshoponComputationalGeometry

  28. [28]

    Accelerated hierarchical density clustering, in: IEEE International Conference on Data Mining Workshops (ICDMW), pp

    McInnes, L., Healy, J., 2017. Accelerated hierarchical density clustering, in: IEEE International Conference on Data Mining Workshops (ICDMW), pp. 33–42. DOI:10.1109/ICDMW.2017.12. see the extended version athttps://arxiv.org/pdf/1705.07321

  29. [29]

    URL https://doi.org/10.21105/joss.00205

    McInnes, L., Healy, J., Astels, S., 2017. hdbscan: Hierarchical density based clustering. The Journal of Open Source Software 2, 205. DOI:10.21105/joss.00205

  30. [30]

    Information theoretic clustering using minimum spanning trees, in: Proc

    Müller, A., Nowozin, S., Lampert, C., 2012. Information theoretic clustering using minimum spanning trees, in: Proc. German Conference on Pattern Recognition.https://github.com/amueller/information-theoretic-mst

  31. [31]

    Parallel algorithms for hierarchical clustering

    Olson, C.F., 1995. Parallel algorithms for hierarchical clustering. Parallel Computing 21, 1313–1325

  32. [32]

    Scikit-learn: Machine learning in Python

    Pedregosa, F., et al., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830

  33. [33]

    Chothia and A.M

    Prim,R.C.,1957. Shortestconnectionnetworksandsomegeneralizations. BellSystemTechnicalJournal36,1389–1401. DOI:10.1002/j. 1538-7305.1957.tb01515.x

  34. [34]

    The ArborX library: Version 2.0

    Prokopenko, A., Arndt, D., Lebrun-Grandié, D., Turcksin, B., 2025. The ArborX library: Version 2.0. ACM Transactions on Mathematical Software 51, 30. DOI:10.1145/3772288

  35. [35]

    Asingle-treealgorithmtocomputetheEuclideanminimumspanningtreeonGPUs

    Prokopenko,A.,Sao,P.,Lebrun-Grandié,D.,2023. Asingle-treealgorithmtocomputetheEuclideanminimumspanningtreeonGPUs. Proc. ICPP’22 , 14DOI:10.1145/3545008.3546185

  36. [36]

    Optimizing search strategies in K-d trees, in: 5th WSES/IEEE Conf

    Sample, N., Haines, M., Arnold, M., Purcell, T., 2001. Optimizing search strategies in K-d trees, in: 5th WSES/IEEE Conf. on Circuits, Systems, Communications & Computers (CSCC’01)

  37. [37]

    Clustering benchmark datasets exploiting the fundamental clustering problems

    Thrun, M., Ultsch, A., 2020. Clustering benchmark datasets exploiting the fundamental clustering problems. Data in Brief 30, 105501. DOI:10.1016/j.dib.2020.105501

  38. [38]

    A white paper on good research practices in benchmarking: The case of cluster analysis

    van Mechelen, I., Boulesteix, A.L., Dangl, R., et al., 2023. A white paper on good research practices in benchmarking: The case of cluster analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery , e1511DOI:10.1002/widm.1511

  39. [39]

    ResTune: Resource oriented tuning boosted by meta-learning for cloud databases

    Wang,Y.,Yu,S.,Gu,Y.,Shun,J.,2021. FastparallelalgorithmsforEuclideanminimumspanningtreeandhierarchicalspatialclustering,in: Proc. SIGMOD’21, pp. 1982–1995. DOI:10.1145/3448016.3457296

  40. [40]

    Expert Systems with Applications238, 122229 (2024) https://doi

    Wei,Z.,Gao,Y.,Zhang,X.,Li,X.,Han,Z.,2024. Adaptivemarinetrafficbehaviourpatternrecognitionbasedonmultidimensionaldynamic time warping and dbscan algorithm. Expert Systems with Applications 238. URL:https://www.scopus.com/inward/record.uri? eid=2-s2.0-85174700827&doi=10.1016%2fj.eswa.2023.122229&partnerID=40&md5=5cdf395548b9461bc74e462f9df4332d, DOI:10.101...

  41. [41]

    Modern algorithms of cluster analysis

    Wierzchoń, S., Kłopotek, M., 2018. Modern algorithms of cluster analysis. Springer

  42. [42]

    Integrating semi-supervised learning with density tree structures for scalable clustering

    Yang, H., Feng, Y., Cai, J., Wang, J., Li, Y., Zhao, X., Xun, Y., 2026. Integrating semi-supervised learning with density tree structures for scalable clustering. Applied Soft Computing 186, 114147. DOI:10.1016/j.asoc.2025.114147

  43. [43]

    Densitypeakclusteringalgorithmbasedonboundaryeliminationandbackboneconstruction

    Zhao,Z., Chen,S.,Hu, C.,2025. Densitypeakclusteringalgorithmbasedonboundaryeliminationandbackboneconstruction. AppliedSoft Computing 185, 113880. DOI:10.1016/j.asoc.2025.113880. M. Gagolewski:Preprint; Last updated on 9 April 2026Page 10 of 13 Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees 0.05 0.1 0.15 0.2 0.25...