arXiv: 2605.04119 · v1 · submitted 2026-05-05 · 🧬 q-bio.QM · cs.LG · q-bio.PE

Recognition: 3 theorem links

Tree-Conditioned Edit Flows for Ancestral Sequence Reconstruction

Emil Sharafutdinov, Ingemar André

Pith reviewed 2026-05-08 18:01 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.LG · q-bio.PE

keywords ancestral sequence reconstruction · edit flows · phylogenetic trees · insertions and deletions · protein evolution · bidirectional trajectories · variable-length sequences

The pith

A tree-conditioned edit-flow model reconstructs ancestral sequences from descendants via paired bidirectional edit trajectories constrained to a common state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Ancestral sequence reconstruction infers past protein sequences at internal nodes of a phylogenetic tree, but standard methods treat sites largely independently and handle insertions and deletions poorly. This work introduces a model that takes two descendant sequences plus their branch distances to a shared ancestor and generates paired edit trajectories that must meet at the same ancestral sequence. The model is trained on natural sequences containing substitutions, insertions, and deletions. On a substitution-only benchmark it falls short of the best classical method while remaining reasonably accurate; on natural homologous sequences rich in indels it localizes inferred evolutionary change more accurately than the methods it is compared against.

Core claim

The paper introduces a tree-conditioned edit-flow model for variable-length ASR. Given two descendant sequences and their branch distances to a shared ancestor, the model reconstructs the ancestor through paired bidirectional edit trajectories constrained to agree on a common ancestral state. On a benchmark of experimentally evolved sequences with only context-independent substitutions, the model does not match the accuracy of the best classical method, yet still achieves reasonable performance despite being trained on natural sequences that include insertions, deletions, and substitutions. On a benchmark of natural homologous sequences with abundant insertions and deletions, the model most accurately localizes inferred evolutionary change.

What carries the argument

Tree-conditioned edit-flow model that reconstructs an ancestor by generating paired bidirectional edit trajectories from two descendants and enforcing agreement on a single ancestral state.
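
The agreement constraint can be illustrated with a toy example. The sketch below is not the paper's model: it only shows what it means for two edit lists, one per descendant, to land on the same ancestral sequence. The edit encoding, the function name `apply_edits`, and the example sequences are assumptions made for illustration, and the sketch assumes at most one edit per position.

```python
def apply_edits(sequence, edits):
    """Apply edits to a sequence; positions refer to the original sequence.

    Each edit is ("sub", pos, residue), ("del", pos), or ("ins", pos, residue),
    where "ins" inserts the residue before position pos.
    Assumes at most one edit per position (illustrative simplification).
    """
    out = list(sequence)
    # Process from right to left so that earlier positions stay valid.
    for edit in sorted(edits, key=lambda e: e[1], reverse=True):
        kind, pos = edit[0], edit[1]
        if kind == "sub":
            out[pos] = edit[2]
        elif kind == "del":
            del out[pos]
        elif kind == "ins":
            out.insert(pos, edit[2])
        else:
            raise ValueError(f"unknown edit kind: {kind}")
    return "".join(out)


# Hypothetical descendants and hypothetical edits proposed for each of them.
descendant_a = "MKTAYL"
descendant_b = "MRTYL"
ancestor_from_a = apply_edits(descendant_a, [("del", 3)])       # drop the A
ancestor_from_b = apply_edits(descendant_b, [("sub", 1, "K")])  # R -> K

# The paired trajectories are only acceptable if both land on the same state.
agree = ancestor_from_a == ancestor_from_b
print(ancestor_from_a, ancestor_from_b, agree)
```

Here the two descendants differ in length, so a substitution-only description cannot relate them; an edit list with a deletion on one side and a substitution on the other can.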

Load-bearing premise

Paired bidirectional edit trajectories trained on natural sequences will generalize to produce accurate ancestral reconstructions on both substitution-only and indel-rich cases when forced to agree on a common state.

What would settle it

An experimental evolution dataset where true ancestral sequences and all insertion, deletion, and substitution events are known in advance; the model would be falsified if its localized changes deviate from the recorded events more than classical methods do.
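
One way such a comparison could be scored is sketched below. The paper's own localization metric is not defined in the material reviewed here, so the set-overlap scores, the `(kind, position)` event encoding, and all example events are assumptions for illustration only.

```python
def localization_scores(predicted, recorded):
    """Precision, recall, and F1 of predicted events against recorded events.

    Events are hashable labels such as ("sub", 12) or ("del", 30).
    """
    predicted, recorded = set(predicted), set(recorded)
    hits = len(predicted & recorded)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(recorded) if recorded else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical recorded history and two hypothetical reconstructions.
recorded = {("sub", 12), ("del", 30), ("ins", 41)}
edit_flow_calls = {("sub", 12), ("del", 30), ("ins", 44)}
classical_calls = {("sub", 12), ("sub", 30), ("sub", 31), ("sub", 41)}

print(localization_scores(edit_flow_calls, recorded))
print(localization_scores(classical_calls, recorded))
```

Under this toy scoring, a method that misplaces indels or explains them as substitutions loses both precision and recall, which is the kind of deviation the falsification test above would have to measure.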

Figures

Figures reproduced from arXiv: 2605.04119 by Emil Sharafutdinov, Ingemar André.

**Figure 1.** A rooted phylogenetic tree with leaves A, B, C and internal nodes D, E (root). Node D = LCA(A, B) is the deepest shared ancestor of the sibling pair (A, B); E = LCA(A, B, C) is the root. Branch lengths indicate the evolutionary distance between endpoints. Phylogenetic trees. Hypothesized evolutionary relationships among a set of biological sequences are represented on a phylogenetic tree. A roote…
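
The tree in Figure 1 can be written down as a small parent map, which makes the LCA and the two source-to-LCA distances easy to compute. This is a minimal sketch: the topology follows the caption, but the branch lengths and the names `PARENT`, `path_to_root`, and `lca_with_distances` are made up for illustration.

```python
# Hypothetical rooted tree mirroring Figure 1: leaves A, B, C; internal nodes D, E (root).
# Each entry maps child -> (parent, branch length); the lengths are made-up values.
PARENT = {
    "A": ("D", 0.10),
    "B": ("D", 0.30),
    "D": ("E", 0.20),
    "C": ("E", 0.50),
}


def path_to_root(node):
    """Return [(ancestor, cumulative distance from node)], starting with the node itself."""
    path, dist = [(node, 0.0)], 0.0
    while node in PARENT:
        node, length = PARENT[node]
        dist += length
        path.append((node, dist))
    return path


def lca_with_distances(a, b):
    """Return (lca, dist(a -> lca), dist(b -> lca)) for two nodes of the tree."""
    dist_from_a = dict(path_to_root(a))
    # Walking upward from b, the first node also above a is the deepest shared ancestor.
    for ancestor, dist_b in path_to_root(b):
        if ancestor in dist_from_a:
            return ancestor, dist_from_a[ancestor], dist_b
    raise ValueError("nodes do not share an ancestor")


print(lca_with_distances("A", "B"))  # LCA is D
print(lca_with_distances("A", "C"))  # LCA is E, the root
```

The pair of distances returned for (A, B) is the kind of input the model is described as conditioning on, alongside the two descendant sequences.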

**Figure 2.** Paired Edit-Flow Architecture of Lærad. Aligned descendants (xa, xb) are gap-stripped before tokenization, while gap positions are retained for bridge supervision and projection back to alignment coordinates. Each residue receives token, trajectory-time, and ordered branch-budget embeddings. The two budget slots correspond to the active source-to-LCA distance, dist(own), and the paired source-to-LCA distan…
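
The gap-stripping step in this caption can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function names, the `-` gap symbol, and the example row are assumptions. The point is that stripping gaps while recording each residue's column keeps enough information to project back to alignment coordinates.

```python
GAP = "-"  # assumed gap symbol


def strip_gaps(aligned):
    """Return (ungapped sequence, alignment column of each kept residue)."""
    residues, columns = [], []
    for col, symbol in enumerate(aligned):
        if symbol != GAP:
            residues.append(symbol)
            columns.append(col)
    return "".join(residues), columns


def project_to_alignment(sequence, columns, width):
    """Place each residue back at its recorded column; fill the rest with gaps."""
    row = [GAP] * width
    for residue, col in zip(sequence, columns):
        row[col] = residue
    return "".join(row)


aligned_a = "MK-TAY--L"  # hypothetical aligned descendant
seq_a, cols_a = strip_gaps(aligned_a)
restored = project_to_alignment(seq_a, cols_a, len(aligned_a))
print(seq_a, cols_a, restored)
```

The ungapped sequence is what would be tokenized, while the column list plays the role of the retained gap positions.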

**Figure 3.** (a) Pair training. Two descendants (A, B) define a stochastic bridge on t ∈ [0, 1]. Each child induces a reverse trajectory toward the other and is supervised by a bidirectional Bregman loss along the full path. The branch-distance ratio τ = da/(da + db) marks the expected LCA location, where the LCA loss becomes critical: it explicitly penalizes disagreement between the two child-conditioned hidden-state …
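
The ratio τ in this caption is simple to compute, and a short sketch makes its orientation explicit. The function name and the error handling are assumptions; only the formula τ = da/(da + db) comes from the caption.

```python
def lca_time(d_a, d_b):
    """Fraction of the A-to-B bridge at which the LCA is expected: d_a / (d_a + d_b)."""
    if d_a < 0 or d_b < 0:
        raise ValueError("branch distances must be non-negative")
    total = d_a + d_b
    if total == 0:
        raise ValueError("both distances are zero: tau is undefined")
    return d_a / total


# Hypothetical branch distances: A is close to the ancestor, B is far from it.
tau_from_a = lca_time(1.0, 3.0)  # LCA sits a quarter of the way from A to B
tau_from_b = lca_time(3.0, 1.0)  # the same point seen from B's side
print(tau_from_a, tau_from_b)
```

Viewed from either child, the two values sum to one, so both trajectories point at the same location on the bridge.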

Original abstract

Ancestral sequence reconstruction (ASR) aims to infer extinct protein sequences at internal nodes of a phylogenetic tree. Classical ASR methods are typically based on continuous-time Markov substitution models, but they treat sites largely independently and handle insertions and deletions only weakly or not at all. We introduce a tree-conditioned edit-flow model for variable-length ASR. Given two descendant sequences and their branch distances to a shared ancestor, the model reconstructs the ancestor through paired bidirectional edit trajectories constrained to agree on a common ancestral state. On a benchmark of experimentally evolved sequences with only context-independent substitutions, the model does not match the accuracy of the best classical method, yet still achieves reasonable performance despite being trained on natural sequences that include insertions, deletions, and substitutions. On a benchmark of natural homologous sequences with abundant insertions and deletions, the model most accurately localizes inferred evolutionary change.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The edit-flow model gives a new way to do variable-length ASR with indels via bidirectional trajectories, but it underperforms classical methods on substitutions and the main claim on natural sequences rests on weak validation.

The paper's core idea is a tree-conditioned edit-flow model that takes two descendant sequences plus their distances to a common ancestor and reconstructs the ancestor by generating paired bidirectional edit trajectories that are forced to land on the same ancestral state. This setup lets the model handle insertions and deletions without the site-independence assumptions of standard continuous-time Markov models. That is genuinely new within the ASR literature and addresses a real practical gap when sequences vary in length.

Referee Report

2 major / 1 minor

Summary. The paper introduces a tree-conditioned edit-flow model for ancestral sequence reconstruction (ASR). Given two descendant sequences and their branch distances to a shared ancestor, the model reconstructs the ancestor via paired bidirectional edit trajectories constrained to agree on a common ancestral state. This is positioned as an advance over classical continuous-time Markov substitution models, which treat sites independently and handle indels weakly. The model is trained on natural sequences (including indels and substitutions) and evaluated on two benchmarks: experimentally evolved sequences with only context-independent substitutions (where it achieves reasonable but sub-optimal accuracy relative to the best classical method) and natural homologous sequences with abundant indels (where it most accurately localizes inferred evolutionary change).

Significance. If the central claims hold, the work offers a novel data-driven framework for variable-length ASR that directly incorporates indels via edit flows, addressing a clear limitation of traditional substitution models. Training on natural data and the bidirectional agreement constraint are strengths that could enable better handling of complex evolutionary histories. However, the underperformance on substitution-only data with available ground truth and the reliance on relative localization against inferred changes (without independent ground truth) temper the significance; the approach may still prove useful if its generalization properties can be more rigorously established.

major comments (2)

[Abstract] The claim that the model 'most accurately localizes inferred evolutionary change' on natural homologous sequences with abundant indels rests on accuracy measured against other inference procedures; without independent ground truth for the true ancestral states, it is unclear whether the bidirectional edit trajectories recover the actual history or merely a consistent but incorrect one that agrees with the baselines.
[Abstract] On the benchmark of experimentally evolved sequences with only substitutions, the model underperforms the best classical method despite the availability of ground truth; this indicates that the edit distribution learned from natural data (which includes indels) does not automatically align with the true substitution process, raising concerns about the generalization assumption for both benchmarks.

minor comments (1)

[Abstract] Comparative performance numbers are reported without accompanying details on model architecture, training procedure, statistical tests, error bars, or precise definitions of the localization metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below in a point-by-point manner. We agree with the concerns about qualifying claims in the absence of ground truth and have revised the abstract and added discussion text to more precisely describe the evaluation and its limitations.

Referee: [Abstract] The claim that the model 'most accurately localizes inferred evolutionary change' on natural homologous sequences with abundant indels rests on accuracy measured against other inference procedures; without independent ground truth for the true ancestral states, it is unclear whether the bidirectional edit trajectories recover the actual history or merely a consistent but incorrect one that agrees with the baselines.

Authors: We agree that the lack of independent ground truth for natural sequences means we cannot definitively claim recovery of the true history. The original abstract already qualifies the result as localizing 'inferred evolutionary change' to reflect comparison against other methods. To address the possibility of consistent but incorrect agreement, we have revised the abstract to explicitly note that results are relative to baseline inferences and added a new paragraph in the Discussion section acknowledging this limitation while noting that agreement across independent methods serves as a standard proxy in indel-rich ASR settings.

Revision: yes

Referee: [Abstract] On the benchmark of experimentally evolved sequences with only substitutions, the model underperforms the best classical method despite the availability of ground truth; this indicates that the edit distribution learned from natural data (which includes indels) does not automatically align with the true substitution process, raising concerns about the generalization assumption for both benchmarks.

Authors: The referee correctly identifies that our model underperforms the strongest classical substitution model on the experimental benchmark with ground truth. We report this result transparently and do not claim superiority for pure substitution cases. The manuscript instead highlights reasonable accuracy despite training on natural data containing indels. We have revised the abstract to state this limitation more explicitly and expanded the Results and Discussion sections to discuss the implications for generalization, noting that the model's primary intended use is for variable-length sequences with mixed indels and substitutions where classical models are weaker.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; model trained on external data

The paper introduces a tree-conditioned edit-flow model trained on natural homologous sequences to perform ancestral sequence reconstruction. No equations, derivations, or self-referential definitions appear in the provided abstract or description. The central claims rely on empirical training from external data and comparisons to classical methods on benchmarks, without any load-bearing steps that reduce predictions to fitted inputs by construction or self-citation chains. This is a standard non-circular empirical modeling approach.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review prevents identification of specific free parameters or background axioms; the central claim rests on the unstated assumption that the learned edit trajectories generalize across sequence types and that the bidirectional agreement constraint recovers true ancestral states.

invented entities (1)

tree-conditioned edit-flow model · no independent evidence
purpose: To reconstruct ancestral sequences via constrained bidirectional edit trajectories
source: Introduced in the abstract as the core new modeling approach

pith-pipeline@v0.9.0 · 5448 in / 1226 out tokens · 48560 ms · 2026-05-08T18:01:30.351715+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost (Jcost, FunctionalEquation) · washburn_uniqueness_aczel · unclear
Following Edit Flows [24], the base term trains a continuous-time edit-rate field by penalizing total predicted edit mass while rewarding rates assigned to edits that move a sampled bridge state toward the target endpoint.
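
The shape of the objective described in that sentence can be sketched in a few lines. This is a toy reading, not the paper's loss: it keeps only the two ingredients named in the quote (a penalty on total predicted edit mass and a log-rate reward on edits that move the sampled state toward the target). The function name, the edit labels, and the constant `weight` are assumptions; any time-dependent weighting used in the actual objective is not specified here.

```python
import math


def edit_rate_loss(rates, helpful_edits, weight=1.0):
    """Toy per-state loss: total predicted rate minus weighted log-rate of helpful edits.

    rates: dict mapping an edit label to its predicted rate (must be > 0).
    helpful_edits: edits that move the sampled bridge state toward the target.
    weight: hypothetical weighting factor, fixed to 1.0 in this sketch.
    """
    total_mass = sum(rates.values())
    reward = sum(math.log(rates[edit]) for edit in helpful_edits)
    return total_mass - weight * reward


# Hypothetical candidate edits at one bridge state; only the substitution helps.
helpful = [("sub", 2, "K")]
focused = {("sub", 2, "K"): 0.5, ("del", 4): 0.1, ("ins", 1, "G"): 0.1}
diffuse = {("sub", 2, "K"): 0.1, ("del", 4): 0.5, ("ins", 1, "G"): 0.1}

loss_focused = edit_rate_loss(focused, helpful)
loss_diffuse = edit_rate_loss(diffuse, helpful)
print(loss_focused < loss_diffuse)
```

Both rate fields spend the same total mass, so the difference in loss comes entirely from where the mass is placed: the field that puts its rate on the target-directed edit scores lower.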

Reference graph

Works this paper leans on

41 extracted references · 37 canonical work pages · 1 internal anchor

[1] Linus Pauling, Emile Zuckerkandl, Thormod Henriksen, and Rolf Lövstad. Chemical paleogenetics. Molecular "restoration studies" of extinct forms of life. Acta Chemica Scandinavica, 17 suppl.:9–16, 1963. doi: 10.3891/acta.chem.scand.17s-0009

[2] A. G. A. Selberg, E. A. Gaucher, and D. A. Liberles. Ancestral sequence reconstruction: From chemical paleogenetics to maximum likelihood algorithms and beyond. J Mol Evol, 89(3):157–164, 2021. doi: 10.1007/s00239-021-09993-1

[3] K. Prakinee, S. Phaisan, S. Kongjaroon, and P. Chaiyen. Ancestral sequence reconstruction for designing biocatalysts and investigating their functional mechanisms. JACS Au, 4(12):4571–4591, 2024. doi: 10.1021/jacsau.4c00653

[4] M. A. Spence, J. A. Kaczmarski, J. W. Saunders, and C. J. Jackson. Ancestral sequence reconstruction for protein engineers. Curr Opin Struct Biol, 69:131–141, 2021. doi: 10.1016/j.sbi.2021.04.001

[5] J. M. Koshi and R. A. Goldstein. Probabilistic reconstruction of ancestral protein sequences. J Mol Evol, 42(2):313–20, 1996. doi: 10.1007/BF02198858

[6] Z. Yang, S. Kumar, and M. Nei. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics, 141(4):1641–50, 1995. doi: 10.1093/genetics/141.4.1641

[7] John P. Huelsenbeck and Jonathan P. Bollback. Empirical and hierarchical Bayesian estimation of ancestral states. Systematic Biology, 50(3):351–366, 2001. doi: 10.1080/10635150119871

[8] C. Norn and I. Andre. Atomistic simulation of protein evolution reveals sequence covariation and time-dependent fluctuations of site-specific substitution rates. PLoS Comput Biol, 19(3):e1010262, 2023. doi: 10.1371/journal.pcbi.1010262

[9] D. D. Pollock, G. Thiltgen, and R. A. Goldstein. Amino acid coevolution induces an evolutionary Stokes shift. Proc Natl Acad Sci U S A, 109(21):E1352–9, 2012. doi: 10.1073/pnas.1120084109

[10] Joseph Felsenstein. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Biology, 22(3):240–249, 1973. doi: 10.1093/sysbio/22.3.240

[11] S. Savino, T. Desmet, and J. Franceus. Insertions and deletions in protein evolution and engineering. Biotechnol Adv, 60:108010, 2022. doi: 10.1016/j.biotechadv.2022.108010

[12] A. Toth-Petroczy and D. S. Tawfik. Protein insertions and deletions enabled by neutral roaming in sequence space. Mol Biol Evol, 30(4):761–71, 2013. doi: 10.1093/molbev/mst003

[13] F. Hormozdiari, R. Salari, M. Hsing, A. Schonhuth, S. K. Chan, S. C. Sahinalp, and A. Cherkasov. The effect of insertions and deletions on wirings in protein-protein interaction networks: a large-scale study. J Comput Biol, 16(2):159–67, 2009. doi: 10.1089/cmb.2008.03TT

[14] Gholamhossein Jowkar, Jūlija Pečerska, Massimo Maiolo, Manuel Gil, and Maria Anisimova. ARPIP: Ancestral sequence reconstruction with insertions and deletions under the Poisson indel process. Systematic Biology, 72(2):307–318, 2022. doi: 10.1093/sysbio/syac050

[15] Matteo De Leonardis, Andrea Pagnani, and Pierre Barrat-Charlaix. Reconstruction of ancestral protein sequences using autoregressive generative models. Molecular Biology and Evolution, 42(4):msaf070, 2025. doi: 10.1093/molbev/msaf070

[16] Edo Dotan, Elya Wygoda, Asaf Schers, Iris Lyubman, Yonatan Belinkov, and Tal Pupko. Ancestral sequence reconstruction using generative models. bioRxiv, 2026. doi: 10.64898/2026.01.18.700141. URL https://www.biorxiv.org/content/early/2026/01/21/2026.01.18.700141

[17] Walter M. Fitch. Toward defining the course of evolution: Minimum change for a specific tree topology. Systematic Zoology, 20(4):406–416, 1971. URL http://www.jstor.org/stable/2412116

[18] David T. Jones, William R. Taylor, and Janet M. Thornton. The rapid generation of mutation data matrices from protein sequences. Bioinformatics, 8(3):275–282, 1992. doi: 10.1093/bioinformatics/8.3.275

[19] Simon Whelan and Nick Goldman. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular Biology and Evolution, 18(5):691–699, 2001. doi: 10.1093/oxfordjournals.molbev.a003851

[20] Si Quang Le and Olivier Gascuel. An improved general amino acid replacement matrix. Molecular Biology and Evolution, 25(7):1307–1320, 2008. doi: 10.1093/molbev/msn067

[21] Z. Yang. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology and Evolution, 10(6):1396–1401, 1993. doi: 10.1093/oxfordjournals.molbev.a040082

[22] Tal Pupko, Itsik Pe’er, Ron Shamir, and Dan Graur. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Molecular Biology and Evolution, 17(6):890–896, 2000. doi: 10.1093/oxfordjournals.molbev.a026369

[23] Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design, 2024. URL https://arxiv.org/abs/2402.04997

[24] Marton Havasi, Brian Karrer, Itai Gat, and Ricky T. Q. Chen. Edit flows: Flow matching with edit operations, 2025. URL https://arxiv.org/abs/2506.09018

[25] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. Science, 379(6637):1123–1130, 2023. doi: 10.1126/science.ade2574

[26] Nicolas Deutschmann, Constance Ferragu, Jonathan D. Ziegler, Shayan Aziznejad, and Eli Bixby. EvoFlows: Evolutionary edit-based flow-matching for protein engineering, 2026. URL https://arxiv.org/abs/2603.11703

[27] T. J. Lambert. FPbase: a community-editable fluorescent protein database. Nature Methods, 16, 2019. doi: 10.1038/s41592-019-0352-8

[28] Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, and Peer Bork. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research... 2019. doi: 10.1093/nar/gky1085

[29] Rohan Maddamsetti, Daniel T. Johnson, Stephanie J. Spielman, Katherine L. Petrie, Debora S. Marks, and Justin R. Meyer. Gain-of-function experiments with bacteriophage lambda uncover residues under diversifying selection in nature. Evolution, 72(10):2234–2243, 2018. doi: 10.1111/evo.13586

[30] Ryan N. Randall, Caelan E. Radford, Kelsey A. Roof, Divya K. Natarajan, and Eric A. Gaucher. An experimental phylogeny to benchmark ancestral sequence reconstruction. Nature Communications, 7, 2016. doi: 10.1038/ncomms12847

[31] Stéphane Guindon, Jean-François Dufayard, Vincent Lefort, Maria Anisimova, Wim Hordijk, and Olivier Gascuel. New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Systematic Biology, 59(3):307–321, 2010. doi: 10.1093/sysbio/syq010

[32] A. Oliva, S. Pulicani, V. Lefort, L. Brehelin, O. Gascuel, and S. Guindon. Accounting for ambiguity in ancestral sequence reconstruction. Bioinformatics, 35(21):4290–4297, 2019. doi: 10.1093/bioinformatics/btz249

[33] Lam-Tung Nguyen, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution, 32(1):268–274, 2015. doi: 10.1093/molbev/msu300

[34] Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 2017. doi: 10.1038/nbt.3988

[35] Morgan N. Price, Paramvir S. Dehal, and Adam P. Arkin. FastTree 2: Approximately maximum-likelihood trees for large alignments. PLOS ONE, 5(3):e9490, 2010. doi: 10.1371/journal.pone.0009490

[36] Kazutaka Katoh and Daron M. Standley. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution, 30(4):772–780, 2013. doi: 10.1093/molbev/mst010

[37] Alexey M Kozlov, Diego Darriba, Tomáš Flouri, Benoit Morel, and Alexandros Stamatakis. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics, 35(21):4453–4455, 2019. doi: 10.1093/bioinformatics/btz305

[38] Ziheng Yang. PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution, 24(8):1586–1591, 2007. doi: 10.1093/molbev/msm088

A Technical appendices and supplementary material

A.1 ASR

Classical ASR likelihood with among-site rate variation. Classical protein ASR models evolution at each aligned site as a continuous-time Markov cha...