pith. machine review for the scientific record. sign in

arxiv: 2604.15889 · v1 · submitted 2026-04-17 · 📊 stat.CO

Recognition: unknown

Markov embedding of ranked unlabelled evolutionary trees and its applications

Asger Hobolth, Elisabeth Sommer James, Lars N{\o}rvang Andersen, Lasse Thorup Fallesen, Simon Pauli

Pith reviewed 2026-05-10 07:30 UTC · model grok-4.3

classification 📊 stat.CO
keywords Markov chain embeddingranked unlabelled treesFréchet meansphase-type distributionstree balance indicesF-matricescoalescent modelsneutrality tests
0
0 comments X

The pith

A Markov chain embedding of ranked unlabelled trees reduces the state space enough to compute all Fréchet means exactly and to obtain arbitrary-order moments of F-matrices under any bifurcating coalescent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ranked unlabelled bifurcating trees can be represented by a Markov chain whose states are far fewer than the trees themselves. This representation is faithful for the quantities that matter in phylogenetics: the embedding supports an algorithm that returns every Fréchet mean of a sample and turns the problem of finding joint distributions of tree balance indices into a standard phase-type calculation. The same construction extends earlier results on the first two moments of F-matrices to moments of any order, and it supplies three new tests for whether a sample of trees is consistent with neutral coalescence. A reader cares because the number of ranked unlabelled trees grows faster than exponentially, so any method that avoids full enumeration opens the door to exact rather than simulated summaries on datasets with more than a handful of leaves.

Core claim

The central claim is that a Markov chain embedding of ranked unlabelled trees preserves topology, ranking and the probability measure of any time-homogeneous bifurcating coalescent, thereby allowing Fréchet means to be recovered from the reduced chain, joint laws of balance indices to be read off from absorption probabilities of a discrete phase-type process, and all moments of the F-matrix to be obtained in closed form for arbitrary order.

What carries the argument

The Markov chain embedding, which compresses each ranked unlabelled tree into a sequence of states that record only the necessary coalescence information while keeping the original topology and ranking recoverable.

If this is right

  • All Fréchet means of any sample become computable by standard algorithms on the reduced chain rather than by exhaustive search or mixed-integer programming.
  • The joint distribution of any finite collection of tree balance indices is given explicitly by the absorption probabilities of the associated phase-type process.
  • Moments of every order of every entry of the F-matrix follow in closed form for every time-homogeneous bifurcating coalescent model.
  • Three new tests for neutrality can be evaluated exactly and exhibit higher power than earlier moment-based tests on simulated data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because the embedding is defined for any time-homogeneous bifurcating coalescent, the same machinery could be applied to compare observed trees against any fixed neutral model without resimulating the full tree space.
  • The reduced state space may make it feasible to compute other summary statistics that previously required Monte Carlo, provided those statistics can be expressed as functions of the embedded process.
  • The phase-type representation separates the combinatorial structure from the waiting-time distribution, so extensions to models with time-inhomogeneous or selection-driven rates would require only a change in the intensity matrix while keeping the same embedding.

Load-bearing premise

The embedding must map trees to states in such a way that every topological, ranking and distributional property needed for Fréchet means and phase-type results is exactly preserved when the results are mapped back to the original trees.

What would settle it

For any small number of leaves, enumerate all ranked unlabelled trees, compute the Fréchet mean of a sample directly, and compare it with the mean obtained after running the embedded Markov chain algorithm; a mismatch on even one sample falsifies exact transfer of the means.

Figures

Figures reproduced from arXiv: 2604.15889 by Asger Hobolth, Elisabeth Sommer James, Lars N{\o}rvang Andersen, Lasse Thorup Fallesen, Simon Pauli.

Figure 1
Figure 1. Figure 1: A: All ranked tree shapes with n = 5 leaves. B: The corresponding F-matrix representations. C: The transition diagram of the Markov embedding, where arrows from column 2 to column 1 are colored according to the corresponding tree in A. D: Interpretation of F3 and its corresponding tree in the embedding. coalescing two lineages present in the current state. We do so by considering the 11-leaved tree along w… view at source ↗
Figure 2
Figure 2. Figure 2: Coalescent interpretation of F-matrix columns and transitions between them. A: A ranked and unlabelled tree with 11 leaves. B: The corresponding F-matrix, where the third and fourth column are colored to match the interpretation below. C: Interpretation of x6 = (0, 0, 0, 5, 4, 4, 3, 2, 1, 0)T. D: Interpretation of x7 = (0, 0, 4, 3, 3, 3, 2, 1, 1, 0)T. 4 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The size of the state space for an increasing number of leaves. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: A: Table containing all non-absorbing states for n = 6, along the quantities used in the proof of Theorem 3.1 B: The cost matrix of the ViTreebi algorithm with n = 6 lineages. We find that the cheapest cost is 874/900 obtained by the two Fréchet paths p¯1 = (1, 2, 4, 6, 10) (red), p¯2 = (1, 2, 4, 7, 11) (blue). C: The corresponding Fréchet mean trees drawn with colors matching the arrows. 7 [PITH_FULL_IMA… view at source ↗
Figure 5
Figure 5. Figure 5: Results from the ViTreebi algorithm. A: average running times based on 1000 runs for increasing number of leaves n. Computations were performed on a laptop with an Intel® i5 processor. B: we demon￾strate ViTreebi’s ability to find all Fréchet means. C: the two Fréchet means with n = 25. D: the total number of Kingman Fréchet means for each n between 4 and 25. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Demonstrating that E and S capture tree balance. A: An imbalanced tree and its corresponding F-matrix. B: A balanced tree along with its corresponding F-matrix. For the left tree we find E(F1) = 42 and S(F1) = 69. For the right tree E(F2) = 32 and S(F2) = 43. Entries contributing to E are highlighted in red, entries contributing to S are highlighted in blue, and entries contributing to both are highlighted… view at source ↗
Figure 7
Figure 7. Figure 7: Sackin A and Colless B index against S for all 7936 ranked unlabelled trees with n = 10 leaves. Each dot is colored according to the total probability of trees corresponding to the given (S, index) pair under the Kingman model. Compared to Sackin and Colless, S spans a larger range of values, making it more sensitive to differences in ranked tree shapes. While all indices manage to capture the caterpillar … view at source ↗
Figure 8
Figure 8. Figure 8: External branch length (left) and the sum of non-fixed entries (right) for [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: First and second order moments of E and S in the ranked Kingman coalescent for an increasing number of lineages. 5.3 The joint distribution of all non-fixed F-matrix entries Consider the ranked coalescent {Xt}t=0,··· ,n−1 for a general number of lineages n. The non-fixed entries Fij are those below the second diagonal, {Fij : 1 ≤ j ≤ n − 3, j + 2 ≤ i ≤ n − 1}. For each non-fixed position (i, j), we define … view at source ↗
Figure 10
Figure 10. Figure 10: Estimated power of all tests including Hotelling’s T-squared from [19]. The power is estimated [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: State space and sub-transition matrix of the ranked BCP in the case [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: A: The two ranked unlabelled trees with n = 4 leaves.B: The transition diagram in the ranked coalescent. C: The transition diagram in the ranked BCP. We note that the two transition diagrams coincide. The paths in the ranked BCP provide a more coarse description of tree topology compared to the complete description by the paths in the ranked coalescent. As a consequence, some measures of tree balance, suc… view at source ↗
Figure 13
Figure 13. Figure 13: Comparing the sizes of the two state spaces. The state space of the ranked coalescent grows at a [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Distribution of external branch length with [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Histogram and quantile plots based on 1000 test statistics simulated from the ranked Kingman coalescent. The null distributions are superimposed. References [1] Bladt, M. and Nielsen, B.F. Matrix-exponential distibutions in applied probability. Springer, 2017. https://doi.org/10.1007/978-1-4939-7049-0. [2] Blum, M.G.B and François, O. Which Random Processes Describe the Tree of Life? A Large-Scale Study o… view at source ↗
read the original abstract

Rooted bifurcating trees are mathematical objects used to model evolutionary relationships and arise naturally in both coalescent theory and phylogenetics. Recent numerical representations of tree topologies, known as F-matrices, allow for summarizing a sample of trees via Fr\'echet means and provide new measures of tree balance. However, the number of ranked unlabelled trees grows super-exponentially with the number of leaves. This makes computation intensive and current methods rely on mixed integer programming and simulation-based methods. Moreover, F-matrices are difficult to interpret, and their distribution is only described in terms of first- and second-order moments under neutral branching. In this paper, we introduce a Markov chain embedding of ranked and unlabelled trees that drastically decreases the size of the state space. Leveraging this embedding, we develop an algorithm that efficiently computes all Fr\'echet means and use discrete phase-type theory to obtain the joint distribution of tree balance indices. We also use discrete phase-type theory to generalize previous results regarding moments of F-matrices to arbitrary order for any time homogeneous and bifurcating coalescent model. Using this framework, we construct three tests for neutrality and demonstrate their improved power compared to previous methods on simulated data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a Markov chain embedding of ranked unlabelled bifurcating trees that reduces the state space size. It uses this embedding to develop an algorithm for computing all Fréchet means of F-matrices, applies discrete phase-type theory to obtain the joint distribution of tree balance indices, generalizes moments of F-matrices to arbitrary order under any time-homogeneous bifurcating coalescent, and constructs three neutrality tests that show improved power on simulated data.

Significance. If the embedding is shown to be faithful (i.e., a structure-preserving lumpable aggregation that exactly reproduces the original distributions and statistics), the work would enable exact, scalable computations of Fréchet means and higher-order moments in a space whose size grows super-exponentially, replacing simulation or MIP approaches. The phase-type framework for balance indices and arbitrary-order F-matrix moments, together with the new neutrality tests, would constitute a substantive methodological advance for coalescent-based phylogenetics.

major comments (3)
  1. [Section describing the Markov embedding (likely §3)] The central claim that the embedding permits exact transfer of Fréchet means, joint balance-index distributions, and arbitrary-order F-matrix moments back to the original ranked-unlabelled tree space requires an explicit proof that the reduced chain is Markovian under any time-homogeneous bifurcating coalescent and that the map preserves the relevant topological, ranking, and distributional properties. This verification is load-bearing for all three main applications and is not supplied by the abstract alone.
  2. [Section on Fréchet means algorithm] The algorithm for computing all Fréchet means via the embedding must be accompanied by a demonstration that the reduced-state Fréchet mean coincides with the original-space mean; if distinct trees with different F-matrices map to the same embedded state, the computed means will be incorrect.
  3. [Section on phase-type moments of F-matrices] The generalization of F-matrix moments to arbitrary order via phase-type theory assumes the embedding induces the correct absorption-time distributions; a concrete check (e.g., recovery of known first- and second-order moments as special cases) is needed to confirm the arbitrary-order formulae are valid.
minor comments (2)
  1. Notation for the embedded states and the transition matrix should be introduced with a small illustrative example (n=4 or n=5) to make the reduction concrete.
  2. The three neutrality tests are described only at a high level; explicit definitions of the test statistics and their null distributions under the embedded chain would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the requirements for rigorous validation of the embedding. We address each major comment below and will revise the manuscript to incorporate the requested proofs and verifications.

read point-by-point responses
  1. Referee: [Section describing the Markov embedding (likely §3)] The central claim that the embedding permits exact transfer of Fréchet means, joint balance-index distributions, and arbitrary-order F-matrix moments back to the original ranked-unlabelled tree space requires an explicit proof that the reduced chain is Markovian under any time-homogeneous bifurcating coalescent and that the map preserves the relevant topological, ranking, and distributional properties. This verification is load-bearing for all three main applications and is not supplied by the abstract alone.

    Authors: We agree that an explicit, self-contained proof is necessary. Section 3 defines the embedding via equivalence classes on ranked unlabelled trees and states that the resulting chain is Markovian under time-homogeneous bifurcating coalescents, but the lumpability argument is only sketched. In the revision we will add a dedicated subsection containing a formal proof of lumpability (using the standard criterion on transition probabilities) together with verification that the embedding map preserves the relevant topological rankings, F-matrix values, and absorption-time distributions. revision: yes

  2. Referee: [Section on Fréchet means algorithm] The algorithm for computing all Fréchet means via the embedding must be accompanied by a demonstration that the reduced-state Fréchet mean coincides with the original-space mean; if distinct trees with different F-matrices map to the same embedded state, the computed means will be incorrect.

    Authors: The embedding is constructed so that states aggregate trees sharing identical F-matrices (or identical contributions to the Fréchet objective under the chosen metric). We will augment the algorithm section with a short lemma proving that the Fréchet mean computed on the embedded chain maps back to a valid Fréchet mean in the original space, by showing that the embedding is isometric with respect to the distance used in the Fréchet definition and that no two trees with distinct F-matrices are collapsed into the same state. revision: yes

  3. Referee: [Section on phase-type moments of F-matrices] The generalization of F-matrix moments to arbitrary order via phase-type theory assumes the embedding induces the correct absorption-time distributions; a concrete check (e.g., recovery of known first- and second-order moments as special cases) is needed to confirm the arbitrary-order formulae are valid.

    Authors: We will insert a verification subsection that recovers the known first- and second-order moments of the F-matrix entries as special cases of the general phase-type moment formula and compares them numerically and algebraically with the expressions previously derived in the literature. This check will be performed both for the Kingman coalescent and for a non-Kingman time-homogeneous bifurcating model to confirm correctness of the arbitrary-order extension. revision: yes

Circularity Check

0 steps flagged

No circularity: novel embedding construction applies established phase-type and Fréchet tools without reduction to inputs

full rationale

The paper defines a new Markov chain embedding of ranked unlabelled trees as an original combinatorial construction that reduces state space while preserving topology, ranking, and distributional properties. It then applies standard discrete phase-type theory to derive joint distributions of balance indices and arbitrary-order moments of F-matrices, plus an algorithm for Fréchet means. No derivation step equates a claimed result to a fitted parameter, self-citation, or ansatz by construction; the embedding is introduced independently and the transfer of results is asserted via structure preservation rather than tautological redefinition. The central claims therefore remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The core addition is the new embedding construction. It rests on standard stochastic process tools without introducing free parameters or extra entities beyond the embedding itself. Generalization assumes time-homogeneous bifurcating coalescent models.

axioms (2)
  • domain assumption Markov chains can be embedded to represent ranked unlabelled tree topologies while preserving necessary properties
    Central premise enabling state-space reduction and subsequent computations.
  • standard math Discrete phase-type distributions apply to the embedded Markov process for deriving moments and joint distributions
    Used to generalize F-matrix moments and obtain tree balance distributions.
invented entities (1)
  • Markov chain embedding of ranked unlabelled trees no independent evidence
    purpose: Drastically reduce state space for efficient computation of means and distributions
    Key innovation introduced to overcome super-exponential growth in tree count.

pith-pipeline@v0.9.0 · 5527 in / 1398 out tokens · 53048 ms · 2026-05-10T07:30:59.157481+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 19 canonical work pages

  1. [1]

    and Nielsen, B.F

    Bladt, M. and Nielsen, B.F. Matrix-exponential distibutions in applied probability. Springer, 2017. https://doi.org/10.1007/978-1-4939-7049-0

  2. [2]

    Blum, M.G.B and François, O. Which Random Processes Describe the Tree of Life? A Large-Scale Study of Phylogenetic Tree Imbalance.Systematic Biology55.4 (2006), 685–691.https://doi.org/1 0.1080/10635150600889625

  3. [3]

    and Plazzotta, G

    Coljin, C. and Plazzotta, G. A Metric on Phylogentic Tree Shapes.Systematic Biology67.1 (2018), 113–126.https://doi.org/10.1093/sysbio/syx046

  4. [4]

    and Wiehe, T

    Disanto, F. and Wiehe, T. Exact enumeration of cherries and pitchforks in ranked trees under the coalescent model.Mathematical Biosciences242.2 (2013), 195–200.https://doi.org/10.1016/j.mb s.2013.01.010

  5. [5]

    and Suchard, M.A

    Drummond, A.J. and Suchard, M.A. Fully Bayesian tests of neutrality using genealogical summary statistics.BMC Genet9.68 (2008).https://doi.org/10.1186/1471-2156-9-68. 25

  6. [6]

    Fishcer, M. et al. Tree Balance Indices - A Comprehensive Survey. Springer International Publishing, 2023.https://doi.org/10.1007/978-3-031-39800-1

  7. [7]

    The Viterbi Algorithm.Proceedings of IEEE61.3 (1973), 268–278.https://doi.org/1 0.1109/PROC.1973.9030

    Forney, G.D. The Viterbi Algorithm.Proceedings of IEEE61.3 (1973), 268–278.https://doi.org/1 0.1109/PROC.1973.9030

  8. [8]

    Les éléments aléatoires de nature quelconque dans un espace distancié

    Fréchet, Maurice. Les éléments aléatoires de nature quelconque dans un espace distancié. fr.Annales de l’institut Henri Poincaré10.4(1948),215–310.https://www.numdam.org/item/AIHP_1948__10_4_215_0/

  9. [9]

    Gurobi Optimization, LLC.Gurobi Optimizer Reference Manual. 2024. https://www.gurobi.com

  10. [10]

    and Ramanujan, S

    Hardy, G.H. and Ramanujan, S. Asymptotic formulae in combinatory analysis.Proc Lond Math Soc Second Seriess2-17 (1918), 75–115.https://doi.org/10.1112/plms/s2-17.1.75

  11. [11]

    Multivariate phase-type theory for the site frequency spectrum.Journal of Mathematical Biology83.63 (2021).https://doi.org/10.1007/s00285-021-0 1689-w

    Hobolth, A., Bladt, M., and Andersen, L.N. Multivariate phase-type theory for the site frequency spectrum.Journal of Mathematical Biology83.63 (2021).https://doi.org/10.1007/s00285-021-0 1689-w

  12. [12]

    Phase-type distributions in population genetics.Theo- retical Population Biology127 (2019), 16–32.https://doi.org/10.1016/j.tpb.2019.02.001

    Hobolth, A., Siri-Jégousse, A., and Bladt, M. Phase-type distributions in population genetics.Theo- retical Population Biology127 (2019), 16–32.https://doi.org/10.1016/j.tpb.2019.02.001

  13. [13]

    Employing phylogenetic tree shape statistics to resolve the underlying host population structure.BMC Bioinformatics22 (2021).https://doi.org /10.1186/s12859-021-04465-1

    Kayondo, H.W., Ssekagiri, A., Nabakooza, G., et al. Employing phylogenetic tree shape statistics to resolve the underlying host population structure.BMC Bioinformatics22 (2021).https://doi.org /10.1186/s12859-021-04465-1

  14. [14]

    Distance metrics for ranked evolutionary trees.Proc

    Kim, J., Rosenberg, N.A., and Palacios, J.A. Distance metrics for ranked evolutionary trees.Proc. Nat. Acad. Sci.117.46 (2020), 28876–86.https://doi.org/10.1073/pnas.1922851117

  15. [15]

    The Coalescent.Stochastic Processes and Their Applications13 (1982), 235–248

    Kingman, J.F.C. The Coalescent.Stochastic Processes and Their Applications13 (1982), 235–248. https://doi.org/10.1016/0304-4149(82)90011-4

  16. [16]

    Ranked Tree Shapes, Nonrandom Extinction, and the Loss of Phylogenetic Diversity.Systematic Biology67.6 (2018), 1025–1040.https://doi.org/10.1093/sy sbio/syy030

    Maliet, O., Gascuel, M., and Lambert, A. Ranked Tree Shapes, Nonrandom Extinction, and the Loss of Phylogenetic Diversity.Systematic Biology67.6 (2018), 1025–1040.https://doi.org/10.1093/sy sbio/syy030

  17. [17]

    Order statistics and multivariate discrete phase-type distributions

    Navarro, A.C. “Order statistics and multivariate discrete phase-type distributions”. PhD thesis. Tech- nical University of Denmark, 2019

  18. [18]

    PhaseTypeR: an R package for phase-type distri- butions in population genetics.Journal of Open Source Software8.82 (2023), 5054.https://doi.org /10.21105/joss.05054

    Rivas-González, I., Andersen, L.N., and Hobolth, A. PhaseTypeR: an R package for phase-type distri- butions in population genetics.Journal of Open Source Software8.82 (2023), 5054.https://doi.org /10.21105/joss.05054

  19. [19]

    Statistical summaries of unlabelled evolutionary trees.Biometrika111.1 (2024), 171–193.https://doi.org/10.1093/biomet/asad025

    Samyak, R and Palacios, J.A. Statistical summaries of unlabelled evolutionary trees.Biometrika111.1 (2024), 171–193.https://doi.org/10.1093/biomet/asad025

  20. [20]

    The on-line encyclopedia of integer sequences.Notices American Mathematical Society (2003).https://doi.org/10.48550/arXiv.1805.10343

    Sloane, N.J.A. The on-line encyclopedia of integer sequences.Notices American Mathematical Society (2003).https://doi.org/10.48550/arXiv.1805.10343

  21. [21]

    Soewongsono, A.C., Diao, J., Stark, T., et al. Matrix-analytic Methods for the Evolution of Species Trees, Gene Trees, and Their Reconciliation.Methodology and Computing in Applied Probability27.9 (2025).https://doi.org/10.1007/s11009-025-10135-z

  22. [22]

    Statistical method for testing the neutral mutation hypothesis by DNA polymorphism

    Tajima, F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics123.3 (1989), 585–95.https://doi.org/10.1093/genetics/123.3.585

  23. [23]

    R Foundation for Statistical Computing

    Team, R Core.R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2024. https://www.R-project.org/

  24. [24]

    Wakeley,J.CoalescentTheory:AnIntroduction.GreenwoodVillage(Colorado):RobertsandCompany Publishers, 2008

  25. [25]

    Hidden Markov Models for Time Series - An Intro- duction Using R

    Zucchini, W., MacDonald, I.L., and Langrock, R. Hidden Markov Models for Time Series - An Intro- duction Using R. Taylor and Francis, 2016.https://doi.org/10.1201/9781420010893. 26