pith. sign in

arxiv: 2606.22077 · v1 · pith:6JIW63BInew · submitted 2026-06-20 · 💻 cs.CV

Morphology-Aware Multimodal Representation Learning for Insect Phylogenetic Reconstruction

Pith reviewed 2026-06-26 12:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal learningphylogenetic reconstructioninsect morphologyvision transformercontrastive learningBayesian inferenceimage-text alignmentcontinuous traits
0
0 comments X

The pith

Multimodal alignment of insect images with morphological descriptions yields embeddings that improve agreement with reference phylogenies in Bayesian reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that aligning specimen images with curated morphological descriptions in a shared latent space produces visual embeddings that function as continuous traits for Bayesian phylogenetic reconstruction. It adapts a vision transformer via parameter-efficient fine-tuning and supervised contrastive learning to achieve the alignment, then feeds the resulting image embeddings into standard tree-building pipelines. Experiments on the Rove-Tree-11 dataset show higher topological agreement with the reference phylogeny than single-modality visual baselines. A sympathetic reader would care because the method offers a route to automate incorporation of semantic morphological knowledge without exhaustive manual trait coding. This could scale phylogenetic work to more taxa by bridging existing image collections and text descriptions.

Core claim

The central claim is that the morphology-aware multimodal alignment framework, which combines specimen images with morphological descriptions through vision transformer adaptation and image-text alignment, derives image embeddings that, when used as continuous traits, improve topological agreement with the reference phylogeny in Bayesian reconstruction compared to single-modality approaches.

What carries the argument

The morphology-aware multimodal alignment framework that performs supervised contrastive learning for image-text alignment in a shared latent space after parameter-efficient fine-tuning of a vision transformer.

If this is right

  • Image embeddings from the aligned model serve as continuous traits that capture more phylogenetic signal than visual-only features.
  • Multimodal alignment produces higher topological agreement metrics than single-modality baselines across tested visual backbones.
  • The framework enables direct use of existing morphological text alongside images without manual discretization of traits.
  • Ablation results attribute performance gains specifically to the image-text alignment step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce dependence on fully expert-coded discrete characters if morphological descriptions already exist for imaged specimens.
  • It may generalize to other organism groups provided paired image-description datasets are available.
  • Future work could test whether the continuous traits combine productively with molecular sequence data in joint reconstructions.

Load-bearing premise

Curated morphological descriptions accurately encode phylogenetic signal that image-text alignment can capture and transfer into continuous visual traits for Bayesian reconstruction.

What would settle it

Finding equal or lower topological agreement with the reference phylogeny when the multimodal-aligned embeddings replace single-modality image embeddings in Bayesian reconstruction on the Rove-Tree-11 dataset.

read the original abstract

Morphological traits provide important evidence for phylogenetic reconstruction and evolutionary relationship analysis. Recent image-based approaches have introduced deep learning, particularly convolutional models, to derive morphological features from specimen images, but these methods generally rely on single-modality visual representations and do not explicitly incorporate morphological semantics. This study proposes a morphology-aware multimodal alignment framework for insect phylogenetic reconstruction. The framework combines specimen images with curated morphological descriptions by adapting a vision transformer through parameter-efficient fine-tuning and supervised contrastive learning, followed by image-text alignment in a shared latent space. The learned image embeddings are then used as continuous traits for Bayesian phylogenetic reconstruction. On the public Rove-Tree-11 dataset, comparative and ablation experiments across multiple visual backbones and feature adaptation strategies demonstrate that multimodal alignment improves topological agreement with the reference phylogeny. The results indicate that the proposed framework can derive morphology-aware visual traits for computational phylogenetic reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a morphology-aware multimodal framework that fine-tunes a vision transformer with supervised contrastive learning to align insect specimen images and curated morphological text descriptions in a shared latent space; the resulting image embeddings are supplied as continuous traits to Bayesian phylogenetic reconstruction. On the public Rove-Tree-11 dataset, comparative and ablation experiments across visual backbones are reported to show improved topological agreement with a reference phylogeny relative to single-modality baselines.

Significance. If the central claim is substantiated, the work would demonstrate a practical route for injecting semantic morphological information into large-scale phylogenetic pipelines without manual character coding. The reliance on a public dataset and systematic ablations across backbones strengthens the potential for follow-up studies, though the absence of phylogenetic-signal diagnostics leaves open whether the reported gains reflect heritable morphology or dataset-specific correlations.

major comments (3)
  1. [Results section] Results section (and abstract): the reported improvements in topological agreement are presented without statistical significance tests, error bars, or repeated runs with different random seeds or data splits; this directly undermines the claim that multimodal alignment reliably outperforms single-modality baselines.
  2. [Methods (§3) and Results] Methods (§3) and Results: no phylogenetic signal diagnostics (Pagel’s λ, Blomberg’s K, or Mantel tests against independent character matrices) are applied to the final embeddings before they are used as continuous traits; without these, it is impossible to confirm that the contrastive alignment extracts heritable morphological variation rather than imaging or text artifacts.
  3. [Experimental setup] Experimental setup: the manuscript does not specify the exact train/validation/test splits of Rove-Tree-11, the embedding dimensionality reduction (if any) prior to the multivariate Brownian/OU model, or whether the reference phylogeny was held completely out of the contrastive training; each of these choices is load-bearing for the central claim of improved topological agreement.
minor comments (2)
  1. [Methods] Notation for the contrastive loss and the parameter-efficient fine-tuning adapters is introduced without an explicit equation or diagram, making the precise alignment objective difficult to reproduce from the text alone.
  2. [Figures] Figure captions for the ablation plots do not state the number of independent runs or the metric aggregation method (mean, median, etc.).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas to strengthen the statistical rigor, validation of phylogenetic signal, and experimental transparency in our work. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Results section] Results section (and abstract): the reported improvements in topological agreement are presented without statistical significance tests, error bars, or repeated runs with different random seeds or data splits; this directly undermines the claim that multimodal alignment reliably outperforms single-modality baselines.

    Authors: We agree that the lack of statistical tests and variability reporting limits the strength of the reliability claims. The original experiments used a single fixed seed for reproducibility. In the revision we will rerun all comparative and ablation experiments across five different random seeds, report mean topological agreement metrics with standard deviations, and include statistical significance tests (e.g., paired Wilcoxon signed-rank tests) between the multimodal approach and single-modality baselines. Updated results and p-values will appear in the Results section and be reflected in the abstract. revision: yes

  2. Referee: [Methods (§3) and Results] Methods (§3) and Results: no phylogenetic signal diagnostics (Pagel’s λ, Blomberg’s K, or Mantel tests against independent character matrices) are applied to the final embeddings before they are used as continuous traits; without these, it is impossible to confirm that the contrastive alignment extracts heritable morphological variation rather than imaging or text artifacts.

    Authors: We acknowledge the importance of these diagnostics. We will add Pagel’s λ and Blomberg’s K computed on the learned embeddings using the reference phylogeny to quantify phylogenetic signal. However, Rove-Tree-11 does not supply independent character matrices, precluding Mantel tests; we will therefore rely on the signal diagnostics together with the observed gains in topological agreement as supporting evidence. A new subsection will be added to Methods and the resulting values reported in Results. revision: partial

  3. Referee: [Experimental setup] Experimental setup: the manuscript does not specify the exact train/validation/test splits of Rove-Tree-11, the embedding dimensionality reduction (if any) prior to the multivariate Brownian/OU model, or whether the reference phylogeny was held completely out of the contrastive training; each of these choices is load-bearing for the central claim of improved topological agreement.

    Authors: We apologize for these omissions. The Rove-Tree-11 dataset was partitioned 70/15/15 for contrastive training/validation/test with no specimen overlap between splits. The reference phylogeny was held completely out of contrastive training. No dimensionality reduction was performed; the full 768-dimensional ViT embeddings were supplied directly to the multivariate Brownian motion model. These specifications will be added to the Experimental setup section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard methods and public data

full rationale

The paper's chain consists of applying off-the-shelf supervised contrastive learning (parameter-efficient fine-tuning of ViT) to align images with curated text descriptions on the public Rove-Tree-11 dataset, then supplying the resulting embeddings as continuous traits to a standard Bayesian phylogenetic pipeline. No equations, fitted parameters, or predictions are shown to reduce to the inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central empirical claim (improved topological agreement) is evaluated against an external reference phylogeny and ablations, making the derivation self-contained against public benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or invented entities; the framework implicitly rests on domain assumptions about morphological data.

axioms (1)
  • domain assumption Morphological descriptions contain phylogenetic signal that can be aligned with visual features
    Central to the proposed image-text alignment and trait extraction

pith-pipeline@v0.9.1-grok · 5699 in / 1036 out tokens · 25241 ms · 2026-06-26T12:33:43.697360+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 16 canonical work pages

  1. [1]

    The role of morphological data in phylogeny reconstruction

    J. J. Wiens, “The role of morphological data in phylogeny reconstruction.” Syst. Biol., vol. 53, no. 4, pp. 653–661, Aug. 2004, doi: 10.1080/10635150490472959

  2. [2]

    Morphological phylogenetics in the genomic age

    M. S. Y. Lee and A. Palci, “Morphological phylogenetics in the genomic age.” Curr. Biol., vol. 25, no. 19, pp. R922–R929, Oct. 2015, doi: 10.1016/j.cub.2015.07.009

  3. [3]

    Morphology should not be forgotten in the era of genomics- a phylogenetic perspective

    G. Giribet, “Morphology should not be forgotten in the era of genomics- a phylogenetic perspective.” Zool. Anz., vol. 256, pp. 96–103, May 2015, doi: 10.1016/j.jcz.2015.01.003

  4. [4]

    Felsenstein, Inferring Phylogenies

    J. Felsenstein, Inferring Phylogenies. Sunderland, MA, USA: Sinauer Associates, 2004

  5. [5]

    Learning transferable visual models from natural language supervision

    A. Radford et al., “ Learning transferable visual models from natural language supervision.” in Proc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 8748–8763

  6. [6]

    Maximum-likelihood estimation of evolutionary trees from continuous characters

    J. Felsenstein, “ Maximum-likelihood estimation of evolutionary trees from continuous characters.” Amer. J. Hum. Genet., vol. 25, no. 5, pp. 471, Sep. 1973

  7. [7]

    Phylogenies and quantitative characters

    J. Felsenstein, “ Phylogenies and quantitative characters. ” Annu. Rev. Ecol. Syst., pp. 445–471, Jan. 1988

  8. [8]

    Rove-Tree-11: The not-so-wild rover, a hierarchically structured image dataset for deep metric learning research

    R. Hunt and K. S. Pedersen, “ Rove-Tree-11: The not-so-wild rover, a hierarchically structured image dataset for deep metric learning research.” in Proc. Asian Conf. Comput. Vis. (ACCV), pp. 2967–2983, 2022

  9. [9]

    Integrating deep learning-derived morphological traits and molecular data for total-evidence phylogenetics: Lessons from digitized collections

    R. Hunt et al., “ Integrating deep learning-derived morphological traits and molecular data for total-evidence phylogenetics: Lessons from digitized collections.” Syst. Biol., vol. 74, no. 3, pp. 453–468, 2025

  10. [10]

    Deep learning on butterfly phenotypes tests evolution’s oldest mathematical model

    J. F. Hoyal Cuthill et al., “Deep learning on butterfly phenotypes tests evolution’s oldest mathematical model. ” Sci. Adv., vol. 5, no. 8, Aug. 2019, Art. no. eaaw4967, doi: 10.1126/sciadv.aaw4967

  11. [11]

    An image is worth 16 × 16 words: Transformers for image recognition at scale

    A. Dosovitskiy et al., “An image is worth 16 × 16 words: Transformers for image recognition at scale. ” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021

  12. [12]

    Emerging properties in self-supervised vision transformers

    M. Caron et al., “ Emerging properties in self-supervised vision transformers.” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 9650–9660

  13. [13]

    DINOv2: Learning robust visual features without supervision,

    M. Oquab et al., “ DINOv2: Learning robust visual features without supervision,” Trans. Mach. Learn. Res., pp. 1–32, Jan. 2024

  14. [14]

    EV A-02: A visual representation for neon genesis.Image and Vision Computing, 149: 105171, 2024

    Y. Fang et al., “ EVA-02: A visual representation for neon genesis. ” Image Vis. Comput., vol. 149, Sep. 2024, doi: 10.1016/j.imavis.2024.105171

  15. [15]

    BEiT v2: Masked image modeling with vector-quantized visual tokenizers

    Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “BEiT v2: Masked image modeling with vector-quantized visual tokenizers. ” 2022, arXiv:2208.06366

  16. [16]

    ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders

    S. Woo et al., “ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders.” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 16133–16142

  17. [17]

    BioCLIP: A vision foundation model for the tree of life

    S. Stevens et al., “BioCLIP: A vision foundation model for the tree of life.” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 19412–19424

  18. [18]

    Multi-modal molecule structure–text model for text-based retrieval and editing,

    S. Liu et al., “Multi-modal molecule structure–text model for text-based retrieval and editing,” Nat. Mach. Intell., vol. 5, no. 12, pp. 1447–1457, Dec. 2023

  19. [19]

    A molecular multimodal foundation model associating molecule graphs with natural language

    B. Su et al., “ A molecular multimodal foundation model associating molecule graphs with natural language.” 2022, arXiv:2209.05481

  20. [20]

    Extracting molecular properties from natural language with multimodal contrastive learning

    R. Lacombe et al., “ Extracting molecular properties from natural language with multimodal contrastive learning. ” 2023, arXiv:2307.12996

  21. [21]

    MMCL: A multi-modal contrastive learning framework for molecular property prediction

    M. Gao and F. Zhu, “ MMCL: A multi-modal contrastive learning framework for molecular property prediction. ” IEEE/ACM Trans. Comput. Biol. Bioinf., doi: 10.1109/TCBBIO.2026.3663206

  22. [22]

    PTPPI: A study on protein inhibitor prediction methods using multimodal feature fusion and attention mechanism

    Z. Dong et al., “PTPPI: A study on protein inhibitor prediction methods using multimodal feature fusion and attention mechanism.” IEEE/ACM Trans. Comput. Biol. Bioinf., doi: 10.1109/TCBBIO.2026.3657905

  23. [23]

    LoRA: Low-rank adaptation of large language models

    E. J. Hu et al., “LoRA: Low-rank adaptation of large language models.” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022

  24. [24]

    Supervised contrastive learning

    P. Khosla et al., “Supervised contrastive learning.” in Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 18661–18673

  25. [25]

    BioClinical ModernBERT: A state-of-the-art long- context encoder for biomedical and clinical NLP

    T. Sounack et al., “ BioClinical ModernBERT: A state-of-the-art long- context encoder for biomedical and clinical NLP. ” 2025, arXiv:2506.10896

  26. [26]

    RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language

    S. Höhna et al., “ RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language.” Syst. Biol., vol. 65, no. 4, pp. 726–736, Jul. 2016, doi: 10.1093/sysbio/syw021

  27. [27]

    Comparison of phylogenetic trees

    D. F. Robinson and L. R. Foulds, “Comparison of phylogenetic trees.” Math. Biosci., vol. 53, no. 1-2, pp. 131 – 147, Feb. 1981, doi: 10.1016/0025-5564(81)90043-2

  28. [28]

    Introducing GPT-5.2

    OpenAI, “ Introducing GPT-5.2 ” 2025. [Online]. Available: https://openai.com/index/introducing-gpt-5-2/. Accessed: Jan. 15, 2026

  29. [29]

    Siméoni et al., “DINOv3” 2025, arXiv:2508.10104

    O. Siméoni et al., “DINOv3” 2025, arXiv:2508.10104

  30. [30]

    Evidence for an ancient adaptive episode of convergent molecular evolution,

    T. A. Castoe et al., “ Evidence for an ancient adaptive episode of convergent molecular evolution,” Proc. Natl. Acad. Sci. USA, vol. 106, no. 22, pp. 8986–8991, 2009

  31. [31]

    Can phylogenetics identify C4 origins and reversals?

    P.-A. Christin, R. P. Freckleton, and C. P. Osborne, “Can phylogenetics identify C4 origins and reversals?” Trends Ecol. Evol., vol. 25, no. 7, pp. 403–409, Jul. 2010

  32. [32]

    Convergent adaptation in mitochondria of phylogenetically distant birds: Does it exist?

    V. Burskaia et al., “ Convergent adaptation in mitochondria of phylogenetically distant birds: Does it exist?” Genome Biol. Evol., vol. 13, no. 7, Jul 2021, evab113

  33. [33]

    Gene tree discordance, phylogenetic inference and the multispecies coalescent

    J. H. Degnan and N. A. Rosenberg, “ Gene tree discordance, phylogenetic inference and the multispecies coalescent. ” Trends Ecol. Evol., vol. 24, no. 6, pp. 332 – 340, Jun. 2009, doi: 10.1016/j.tree.2009.01.009

  34. [34]

    Resolving difficult phylogenetic questions: Why more sequences are not enough

    H. Philippe et al., “ Resolving difficult phylogenetic questions: Why more sequences are not enough.” PLoS Biol., vol. 9, no. 3, Mar. 2011, doi: 10.1371/journal.pbio.1000602

  35. [35]

    OrthoFinder: Phylogenetic orthology inference for comparative genomics

    D. M. Emms and S. Kelly, “ OrthoFinder: Phylogenetic orthology inference for comparative genomics.” Genome Biol., vol. 20, no. 1, pp. 238, Nov. 2019, doi: 10.1186/s13059-019-1832-y

  36. [36]

    UFBoot2: Improving the ultrafast bootstrap approximation

    D. T. Hoang et al., “ UFBoot2: Improving the ultrafast bootstrap approximation.” Mol. Biol. Evol., vol. 35, no. 2, pp. 518–522, Feb. 2018, doi: 10.1093/molbev/msx281

  37. [37]

    ModelFinder: Fast model selection for accurate phylogenetic estimates

    S. Kalyaanamoorthy et al., “ ModelFinder: Fast model selection for accurate phylogenetic estimates.” Nat. Methods, vol. 14, no. 6, pp. 587– 589, Jun. 2017, doi: 10.1038/nmeth.4285

  38. [38]

    doi: 10.1093/molbev/mst010

    K. Katoh and D. M. Standley, “ MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. ” Mol. Biol. Evol., vol. 30, no. 4, pp. 772 – 780, Jan. 2013, doi: 10.1093/molbev/mst010

  39. [39]

    IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era

    B. Q. Minh et al., “IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era.” Mol. Biol. Evol., vol. 37, no. 5, pp. 1530–1534, May 2020, doi: 10.1093/molbev/msaa015

  40. [40]

    Yang, Computational Molecular Evolution

    Z. Yang, Computational Molecular Evolution. Oxford, U.K.: Oxford Univ. Press, 2006