arxiv: 2604.23984 · v1 · submitted 2026-04-27 · 🧬 q-bio.PE · cs.CC

Polynomial-time completion of phylogenetic tree sets

Aleksandr Koshkarov, Nadia Tahiri

Pith reviewed 2026-05-07 17:24 UTC · model grok-4.3

keywords phylogenetic tree completion · polynomial-time algorithm · taxon overlap · majority-rule consensus · quadratic distance · tree sets · phylogenetics · branch lengths

The pith

A polynomial-time algorithm completes phylogenetic tree sets with partial taxon overlap while preserving distances and producing order-independent results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Phylogenetic trees from separate studies commonly cover overlapping but non-identical sets of species, so direct comparison requires either discarding taxa or completing the trees. The paper supplies an algorithm that extracts subtrees appearing frequently across the source trees, forms a weighted majority-rule consensus of those subtrees, scales their branch lengths from shared leaves, and inserts each consensus subtree into a target tree at the spot that minimizes quadratic distance error, with candidate spots limited to the target tree's existing branches. The procedure is shown to run in polynomial time, to leave distances among originally shared taxa unchanged, and to give the same final completion no matter the order in which input trees are handled. Experiments on real data sets of amphibians, mammals, sharks, and squamates indicate that the completed trees lie closer to reference trees than those produced by other completion methods, measured both by topology and by branch lengths.

Core claim

The paper demonstrates a polynomial-time algorithm for completing sets of phylogenetic trees that have partial taxon overlap. It identifies maximal completion subtrees that recur across the source trees, assembles a weighted majority-rule consensus of those subtrees, derives branch-length scaling rates from the common leaves, and inserts each consensus subtree into the target tree at the unique position that minimizes the quadratic distance error measured against source-tree information, restricting candidate positions to the original branches of the target tree. The resulting completion preserves distances among the taxa that were already present and is independent of the order in which the target trees are processed.
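To make the consensus step concrete, here is a minimal sketch of a weighted majority rule. It assumes each extracted subtree is summarized by its clusters (the leaf sets below its internal branches) and carries a non-negative weight; the function name, the input layout, and the 0.5 threshold are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def weighted_majority_clusters(subtrees, threshold=0.5):
    """Return the clusters whose summed weight exceeds `threshold` of the total.

    subtrees: list of (weight, clusters), where clusters is an iterable of
    leaf collections. This input layout is a hypothetical choice for the sketch.
    """
    total = sum(weight for weight, _ in subtrees)
    support = defaultdict(float)
    for weight, clusters in subtrees:
        # Count each cluster at most once per subtree.
        for cluster in set(map(frozenset, clusters)):
            support[cluster] += weight
    return {c for c, s in support.items() if s > threshold * total}

example = [
    (2.0, [{"a", "b"}, {"a", "b", "c"}]),
    (1.0, [{"b", "c"}, {"a", "b", "c"}]),
    (1.0, [{"a", "b"}, {"a", "b", "c"}]),
]
kept = weighted_majority_clusters(example)
```

With a strict majority threshold, any two retained clusters co-occur in at least one input subtree, so the retained set is always compatible with a single tree.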

What carries the argument

Extraction of maximal completion subtrees, construction of a weighted majority-rule consensus, and insertion of each consensus subtree at the position that minimizes quadratic distance error while restricting candidates to the original branches of the target tree.
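The insertion rule can be sketched as a one-dimensional least-squares problem on each original branch. The sketch below is a simplified stand-in under stated assumptions: the target tree is a list of weighted edges, `target` holds the desired distance from the attachment point to each shared leaf (with any pendant length already subtracted), and all leaves are weighted equally. The names `side_distances` and `best_insertion` are hypothetical, and the paper's exact error measure may differ.

```python
from collections import defaultdict

def side_distances(adj, start, blocked):
    """Path lengths from `start` to every node on its side of branch (start, blocked)."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt, weight in adj[node].items():
            if node == start and nxt == blocked:
                continue  # do not cross the branch being evaluated
            if nxt not in dist:
                dist[nxt] = dist[node] + weight
                stack.append(nxt)
    return dist

def best_insertion(edges, target):
    """Find the branch and offset that minimize the summed squared distance error.

    edges:  list of (u, v, length) for the original target tree.
    target: non-empty {leaf: desired distance from the attachment point}.
    Returns (error, branch_index, offset_from_u).
    """
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    best = None
    for index, (u, v, length) in enumerate(edges):
        from_u = side_distances(adj, u, v)
        from_v = side_distances(adj, v, u)
        # Each leaf "votes" for the offset x that would match its target exactly.
        wanted = [target[leaf] - from_u[leaf] for leaf in target if leaf in from_u]
        wanted += [length - (target[leaf] - from_v[leaf]) for leaf in target if leaf in from_v]
        # The quadratic is minimized at the mean vote, clipped to the branch.
        x = min(max(sum(wanted) / len(wanted), 0.0), length)
        error = sum((x - w) ** 2 for w in wanted)
        if best is None or error < best[0]:
            best = (error, index, x)
    return best
```

Because only existing branches are split and no existing length is altered, the distances among the original leaves are untouched by construction, which is the property the restriction is meant to protect.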

If this is right

The algorithm completes any input set of trees in polynomial time.
Distances among taxa that appear in the original trees remain exactly the same after completion.
The final completed tree set is unique and does not change if the order of the input trees is altered.
On the amphibian, mammal, shark, and squamate data sets the completed trees are closer to the reference subset trees than those produced by competing methods, both in topology and in branch lengths.
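The distance-preservation consequence is directly testable. A minimal sketch of such a check follows, assuming trees are given as weighted edge lists; the helper names and the tolerance are illustrative and are not taken from the released code.

```python
from collections import defaultdict

def leaf_distances(edges, leaves):
    """Path-length distance between every ordered pair of the given leaves."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    result = {}
    for source in leaves:
        dist = {source: 0.0}
        stack = [source]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node].items():
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        for other in leaves:
            result[(source, other)] = dist[other]
    return result

def preserves_distances(original_edges, completed_edges, original_leaves, tol=1e-9):
    """True if every distance among the original leaves is unchanged after completion."""
    before = leaf_distances(original_edges, original_leaves)
    after = leaf_distances(completed_edges, original_leaves)
    return all(abs(before[pair] - after[pair]) <= tol for pair in before)

original = [("A", "n", 1.0), ("B", "n", 2.0), ("C", "n", 1.5)]
# Leaf X attached on branch (B, n); the branch is split into 1.5 + 0.5.
completed = [("A", "n", 1.0), ("B", "m", 1.5), ("m", "n", 0.5), ("X", "m", 0.7), ("C", "n", 1.5)]
ok = preserves_distances(original, completed, ["A", "B", "C"])
```

Running the same check on completions produced from shuffled input orders would also probe the order-independence claim empirically.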

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Order independence removes the need to canonicalize input order before running the procedure, which simplifies its use inside larger automated pipelines.
If the quadratic-error placements avoid artifacts on the tested clades, the same insertion rule may transfer to other clades or to trees with greater overlap variation.
Because distances are preserved, downstream methods that treat branch lengths as evolutionary rates can be applied directly to the completed trees without additional correction steps.

Load-bearing premise

That restricting subtree insertions to the original branches of each target tree and choosing the position that minimizes quadratic distance error against the source trees will produce biologically accurate completions without creating topological artifacts or distorting distances.

What would settle it

A concrete data set of source trees for which the completed output exhibits a larger topological or branch-length distance to the reference subset trees than at least one alternative completion method, or for which distances among the originally shared taxa have changed.
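For the topological half of such a test, one common choice is the Robinson-Foulds distance, sketched here for rooted trees represented by their non-trivial clusters. This summary does not say which topological metric the paper uses, so the metric, the function name, and the cluster representation are all assumptions.

```python
def robinson_foulds(clusters_a, clusters_b):
    """Number of clusters present in exactly one of the two rooted trees."""
    a = {frozenset(c) for c in clusters_a}
    b = {frozenset(c) for c in clusters_b}
    return len(a ^ b)

# Reference tree ((a,b),(c,d)) against a candidate (((a,b),c),d).
reference = [{"a", "b"}, {"c", "d"}]
candidate = [{"a", "b"}, {"a", "b", "c"}]
distance = robinson_foulds(reference, candidate)
```

A counterexample in the sense above would be a data set on which a competing method's completion scores a strictly smaller distance to the reference than this algorithm's completion.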

Original abstract

Comparative analyses of phylogenetic trees typically require identical taxon sets, however, in practice, trees often include distinct but overlapping taxa. Pruning non-shared leaves discards phylogenetic signal, whereas tree completion can preserve both taxa and branch-length information. This work introduces a polynomial-time algorithm for set-wide completion of phylogenetic trees with partial taxon overlap. The proposed method identifies and extracts maximal completion subtrees that frequently appear across the source trees and constructs a weighted majority-rule consensus. Branch lengths are scaled using rates derived from common leaves. Each consensus subtree is inserted at the position that minimizes the quadratic distance error measured against information from the source trees, with candidate positions restricted to the original branches of the target tree. We demonstrate that the algorithm runs in polynomial time and preserves distances among the original taxa, yielding a unique completion that is order-independent with respect to the processing order of target trees. An experimental evaluation on amphibians, mammals, sharks, and squamates shows that the proposed method consistently achieves the lowest distance to the subset reference trees across subsets among all methods, in both topology and branch lengths. An open-source Python implementation of the proposed algorithm and the biological datasets utilized in this study are publicly available at: https://github.com/tahiri-lab/overlap-treeset-completion/.
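The abstract does not spell out how the scaling rate is computed. One plausible reading, shown here purely as an assumption, is the ratio of summed pairwise distances among the common leaves in the target tree to the same sum in the source tree. The function names and the dictionary layout are hypothetical.

```python
from itertools import combinations

def scaling_rate(target_dist, source_dist, common_leaves):
    """Ratio of total pairwise distance among common leaves: target over source.

    Both distance dicts are keyed by sorted leaf pairs, e.g. ("A", "B").
    This definition of the rate is an assumption made for the sketch.
    """
    pairs = list(combinations(sorted(common_leaves), 2))
    source_total = sum(source_dist[pair] for pair in pairs)
    target_total = sum(target_dist[pair] for pair in pairs)
    if source_total == 0:
        raise ValueError("source distances among common leaves sum to zero")
    return target_total / source_total

def scale_subtree(edges, rate):
    """Multiply every branch length of a subtree, given as (u, v, length), by the rate."""
    return [(u, v, length * rate) for u, v, length in edges]

target_dist = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
source_dist = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "C"): 2.0}
rate = scaling_rate(target_dist, source_dist, {"A", "B", "C"})
scaled = scale_subtree([("X", "y", 0.5)], rate)
```

Under this reading, a subtree taken from a source tree whose shared part is half as long as the target's would have its branches doubled before insertion.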

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a polynomial-time way to complete phylogenetic tree sets with partial taxon overlap using subtree extraction, consensus, and restricted quadratic insertion, with claims of uniqueness and better performance on real data.

This paper gives a workable polynomial-time way to complete phylogenetic tree sets that have only partial taxon overlap, and the experiments suggest its completions sit closer to the reference trees than those of the alternatives tested. The new part is pulling out maximal subtrees that show up often, making a weighted majority-rule consensus from them, scaling the lengths from shared leaves, and then placing each subtree on the target tree by minimizing quadratic distance error, but only on the branches already there. The authors prove that it runs in polynomial time and that the output is unique no matter the order in which the trees are processed.

On real data from amphibians, mammals, sharks, and squamates, the method achieves the lowest distance to the subset reference trees among the compared methods, in both topology and branch lengths. Having the code and data public on GitHub makes the results easy to check and the method easy to use.

The potential weak points are in the guarantees. Uniqueness and order-independence depend on the minimization step always selecting a single position without ties, and on insertions not interfering with each other's distance calculations. If multiple positions give the same quadratic error, or if sequential insertions change the effective distances, different processing orders could lead to different results. The restriction to original branches of the target tree might also cause issues if the overlap requires attachment points not already present, possibly introducing distortions despite local distance preservation.

This paper is for people working in phylogenetics who deal with trees from different sources that don't share all taxa. Readers who need to assemble or compare such trees without losing information would get practical value from the algorithm and the released code. It deserves a serious referee because it tackles a common practical issue with a claimed efficient solution and includes reproducible materials. I recommend sending it for peer review, focusing on verifying the proof of uniqueness and the experimental setup.

Referee Report

2 major / 2 minor

Summary. The paper introduces a polynomial-time algorithm for completing phylogenetic tree sets with partial taxon overlap. It extracts maximal completion subtrees appearing across source trees, constructs a weighted majority-rule consensus, scales branch lengths from common leaves, and inserts each consensus subtree into a target tree at the position on existing branches that minimizes quadratic distance error to source-tree information. The method is claimed to preserve distances among original taxa, produce a unique order-independent completion, run in polynomial time, and outperform alternatives on four biological datasets (amphibians, mammals, sharks, squamates) in both topology and branch lengths. An open-source Python implementation is provided.

Significance. If the polynomial-time guarantee, distance preservation, and uniqueness/order-independence hold, the work would enable more complete use of overlapping but incomplete phylogenetic trees without discarding signal via pruning. The public code and datasets strengthen reproducibility. The empirical superiority on real datasets suggests practical value for comparative analyses, though the theoretical claims are the primary contribution.

major comments (2)

[Method description (insertion step)] The central claim of uniqueness and order-independence (abstract) rests on the insertion procedure. Sequential insertion of multiple subtrees can cause one insertion to alter effective distances for the next; if the quadratic error surface admits ties or if the order of processing target trees affects the final topology or lengths, the uniqueness guarantee fails. The restriction of candidate positions to original branches of the target tree further risks forcing distortions when overlap patterns require new attachment points.
[Complexity analysis] The polynomial-time claim (abstract) requires explicit complexity analysis. The time for extracting maximal subtrees, building the weighted majority-rule consensus, and performing quadratic minimization over all branches for each subtree must be shown to be polynomial in the total number of taxa and trees; without this breakdown, the runtime guarantee cannot be verified.

minor comments (2)

[Experimental evaluation] The experimental section should specify how the subset reference trees were constructed and whether the same subsetting protocol was applied uniformly to all compared methods.
[Notation and definitions] Define the quadratic distance error measure precisely, including how source-tree information is aggregated when multiple source trees contribute to a given subtree.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to address the concerns regarding the insertion procedure and the complexity analysis. We respond to each major comment below.

Referee: [Method description (insertion step)] The central claim of uniqueness and order-independence (abstract) rests on the insertion procedure. Sequential insertion of multiple subtrees can cause one insertion to alter effective distances for the next; if the quadratic error surface admits ties or if the order of processing target trees affects the final topology or lengths, the uniqueness guarantee fails. The restriction of candidate positions to original branches of the target tree further risks forcing distortions when overlap patterns require new attachment points.

Authors: The quadratic distance error for each consensus subtree is computed directly against the fixed source-tree information rather than against an intermediate completed tree. Consequently, the minimization for each subtree is independent of prior insertions, and the final positions are determined on the original target tree. This ensures that the result is independent of the order in which target trees are processed. When ties occur in the quadratic error, the algorithm applies a deterministic tie-breaker (lowest branch index in a depth-first traversal of the target tree) to guarantee uniqueness. The restriction of candidate positions to the original branches of each target tree is deliberate: it is required to preserve exact distances among the taxa already present in that target tree, which is a core property stated in the manuscript. We will add a short subsection clarifying these independence and tie-breaking mechanisms. (revision: partial)
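The tie-breaking rule described in this response can be sketched as follows. The sketch assumes a rooted target tree given as a child map, visits children in sorted order so the numbering is reproducible, and treats errors within a small tolerance as tied; these choices and the function names are illustrative, since the rebuttal does not fix them.

```python
def dfs_branch_indices(children, root):
    """Number each branch (parent, child) in depth-first preorder, children sorted."""
    order = {}

    def visit(node):
        for child in sorted(children.get(node, ())):
            order[(node, child)] = len(order)
            visit(child)

    visit(root)
    return order

def pick_branch(errors, order, tol=1e-12):
    """Among branches with minimal error, return the one with the lowest DFS index."""
    best = min(errors.values())
    tied = [branch for branch, err in errors.items() if err - best <= tol]
    return min(tied, key=order.__getitem__)

children = {"r": ["n1", "C"], "n1": ["A", "B"]}
order = dfs_branch_indices(children, "r")
errors = {("n1", "A"): 0.25, ("r", "C"): 0.25, ("n1", "B"): 0.9, ("r", "n1"): 0.5}
chosen = pick_branch(errors, order)
```

Because the choice depends only on the error values and the fixed traversal of the original target tree, it does not change with the order in which candidate branches are enumerated.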
Referee: [Complexity analysis] The polynomial-time claim (abstract) requires explicit complexity analysis. The time for extracting maximal subtrees, building the weighted majority-rule consensus, and performing quadratic minimization over all branches for each subtree must be shown to be polynomial in the total number of taxa and trees; without this breakdown, the runtime guarantee cannot be verified.

Authors: We agree that an explicit breakdown is needed. In the revised manuscript we will insert a dedicated complexity paragraph showing that (i) extraction of maximal completion subtrees is O(T N^2), (ii) construction of the weighted majority-rule consensus is O(T N^2) using standard algorithms, and (iii) for each of the O(N) subtrees the quadratic minimization examines O(N) candidate branches with O(1) distance evaluations per branch, yielding an overall polynomial bound in the combined size of the input (number of taxa and number of trees). (revision: yes)
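Summing the three stated bounds gives a rough total. The arithmetic below reads T as the number of trees and N as the number of taxa, and assumes that step (iii) is repeated once for each of the T target trees; both readings are assumptions, since the response does not define the symbols or say how many target trees are processed.

```latex
\begin{aligned}
\text{(i) subtree extraction} &: O(T N^2) \\
\text{(ii) weighted consensus} &: O(T N^2) \\
\text{(iii) insertion, one target tree} &: O(N) \cdot O(N) \cdot O(1) = O(N^2) \\
\text{(iii) insertion, all } T \text{ target trees} &: O(T N^2) \\
\text{total} &: O(T N^2) + O(T N^2) + O(T N^2) = O(T N^2)
\end{aligned}
```

The O(1) cost per candidate branch is the response's own figure; it would presumably rely on distances being precomputed, a cost that the promised complexity paragraph would need to account for.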

Circularity Check

0 steps flagged

No circularity: algorithmic assembly with external validation

The paper describes a constructive polynomial-time algorithm that extracts maximal subtrees from source trees, forms a weighted majority-rule consensus, scales branch lengths from common leaves, and inserts subtrees by quadratic-error minimization restricted to existing branches. Claims of distance preservation, uniqueness, and order-independence are presented as direct consequences of this explicit procedure rather than derived predictions; the experimental section evaluates the output against independent reference trees on real biological datasets (amphibians, mammals, sharks, and squamates), providing external benchmarks. No load-bearing step reduces a claimed result to a tautological restatement of the inputs by construction, self-citation, or fitted-parameter renaming.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on standard phylogenetic tree assumptions and algorithmic techniques without introducing new postulated entities or heavily fitted parameters beyond the choice of quadratic error and majority-rule weighting.

free parameters (1)

quadratic distance error measure
Choice of quadratic form for scoring insertion positions; not fitted to data but selected as the objective.

axioms (2)

domain assumption: Phylogenetic trees admit completion by subtree insertion that preserves distances among originally shared taxa.
Invoked when claiming that the output preserves distances among original taxa.
domain assumption: Maximal completion subtrees that appear across source trees capture the relevant phylogenetic signal for consensus construction.
Basis for extracting and using those subtrees in the majority-rule step.
