Structure-guided taxonomic placement of divergent RNA viruses with ViraClass

Bo Zhang; Edward C. Holmes; Qiantai Feng; Sheng Xu; Shiyang Feng; Shutong Yue; Siqi Sun; Weifeng Shi; Weiqiang Bai; Wenxuan Huang

arxiv: 2606.07301 · v1 · pith:3DIFFXJFnew · submitted 2026-06-05 · 🧬 q-bio.QM


Sheng Xu, Wenxuan Huang, Shutong Yue, Weiqiang Bai, Shiyang Feng, Xiaohan He, Bo Zhang, Qiantai Feng, Edward C. Holmes, Weifeng Shi, Siqi Sun


Pith reviewed 2026-06-27 20:09 UTC · model grok-4.3


keywords: RNA virus taxonomy, RdRp structure, ViraClass, metatranscriptomics, protein structure similarity, ICTV hierarchy, divergent viruses, hierarchical classification


The pith

RdRp protein structure retains taxonomic signal for RNA viruses at depths where sequence similarity collapses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the three-dimensional structure of the RNA-dependent RNA polymerase preserves evolutionary relationships among RNA viruses even after their amino acid sequences have diverged beyond usable similarity. This structural signal aligns with the ICTV taxonomic hierarchy, which enables a new method called ViraClass to assign viruses rank by rank from phylum to genus by comparing structures. The approach outperforms sequence-based and genome-content methods on benchmarks that withhold entire families, orders or classes from the reference set. It also assigns some unclassified RdRp sequences to existing groups and clusters the rest into compact structural sets. The work targets the gap between rapid metatranscriptomic discovery of divergent viruses and the limits of traditional sequence-driven classification.

Core claim

RdRp protein structure retains taxonomic signal at evolutionary depths where RdRp primary sequence similarity has largely collapsed, and the organization of this signal is consistent with the current ICTV hierarchy. ViraClass uses this signal for rank-by-rank assignment from phylum to genus, stopping at the deepest rank supported by confidence thresholds, and calibrates structural clustering for viruses outside existing reference space.

What carries the argument

ViraClass hierarchical framework, which performs rank-by-rank taxonomic assignment using structural similarity metrics on RdRp proteins with confidence thresholds to select the deepest reliable rank.
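The rank-by-rank logic described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `assign_ranks`, the per-rank threshold values, the placeholder taxon labels and the use of the best structural similarity per taxon as the confidence score are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

RANKS = ["phylum", "class", "order", "family", "genus"]

# Hypothetical per-rank confidence thresholds (illustrative values only).
THRESHOLDS = {"phylum": 0.50, "class": 0.55, "order": 0.60,
              "family": 0.70, "genus": 0.80}

Hit = Tuple[float, Dict[str, str]]  # (similarity to a reference, its lineage)


def assign_ranks(hits: List[Hit],
                 thresholds: Dict[str, float] = THRESHOLDS) -> List[Tuple[str, str]]:
    """Walk from phylum to genus and stop at the deepest supported rank."""
    placement: List[Tuple[str, str]] = []
    candidates = list(hits)
    for rank in RANKS:
        # Best similarity per taxon among references that agree with
        # every rank assigned so far.
        best: Dict[str, float] = {}
        for score, lineage in candidates:
            taxon = lineage.get(rank)
            if taxon is not None:
                best[taxon] = max(score, best.get(taxon, 0.0))
        if not best:
            break
        taxon, score = max(best.items(), key=lambda kv: kv[1])
        if score < thresholds[rank]:
            break  # confidence too low: do not descend further
        placement.append((rank, taxon))
        candidates = [(s, l) for s, l in candidates if l.get(rank) == taxon]
    return placement


# Placeholder lineages; the labels are not real taxa.
example_hits = [
    (0.72, {"phylum": "P1", "class": "C1", "order": "O1",
            "family": "F1", "genus": "G1"}),
    (0.58, {"phylum": "P2", "class": "C2", "order": "O2",
            "family": "F2", "genus": "G2"}),
]
example_placement = assign_ranks(example_hits)
```

With these illustrative thresholds the query is placed down to family and left unassigned at genus, because 0.72 clears the family cutoff but not the genus cutoff.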

Load-bearing premise

Structural similarity metrics and clustering thresholds produce groupings that align with the ICTV taxonomic hierarchy at deep ranks rather than reflecting convergent structural features or alignment artifacts.

What would settle it

A phylogenetic study of newly discovered divergent RNA viruses that places them into different phyla or classes than the structural clusters assigned by ViraClass would falsify the claim of retained taxonomic signal.

Original abstract

Metatranscriptomic sequencing has expanded our knowledge of the RNA virosphere far more rapidly than novel viruses can be taxonomically classified. Taxonomic assignment above the family level is particularly difficult because the RNA-dependent RNA polymerase (RdRp) is often the only gene retained across RNA viruses yet exhibits little sequence similarity among highly divergent viruses. Here we show that RdRp protein structure retains taxonomic signal at evolutionary depths where RdRp primary sequence similarity has largely collapsed, and that the organization of this signal is consistent with the current ICTV hierarchy. Based on this, we developed ViraClass, a hierarchical framework for RNA virus taxonomic placement that uses RdRp structure for rank-by-rank assignment from phylum to genus, stopping at the deepest rank supported by confidence thresholds, and calibrated structural clustering for viruses that remain outside existing reference space. Across random-split, prospective and taxonomic hold-out benchmarks, ViraClass outperforms sequence-based and genome-content baselines. The largest gains emerge at deep evolutionary distances, in benchmarks that withhold entire families, orders or classes from the reference, where sequence-based methods lose most of their signal. In challenging boundary cases such as the Flaviviridae, ViraClass's structure-based placements capture the taxonomic boundary tensions highlighted by recent phylogenetic studies. When applied to a large collection of previously unclassified RdRp sequences, ViraClass places high-confidence queries into existing phyla and organizes the remainder into compact structural groups. ViraClass therefore provides a scalable approach from large-scale virus discovery to hierarchical taxonomic interpretation, particularly at the deep evolutionary ranges that current sequence-based pipelines cannot reach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ViraClass shows RdRp structure recovers ICTV labels at depths where sequence signal collapses, with benchmark gains on hold-outs, but the shared palm fold remains a plausible alternative driver that the tests do not fully isolate.


The core takeaway is that this paper gives a practical hierarchical classifier for placing highly divergent RNA viruses using predicted RdRp structures, and the reported benchmarks show clear outperformance over sequence baselines especially when entire deep taxa are withheld.

What stands out as new is the combination of rank-by-rank assignment with explicit stopping thresholds plus a separate clustering step for sequences outside the reference set. The abstract indicates they calibrated this on reference data and tested it on random splits, prospective sets, and taxonomic hold-outs, with the biggest improvements at phylum-to-class levels. That matches a real need in metatranscriptomics where many new RdRps have no usable sequence similarity.

The work does well on the boundary cases it highlights, such as capturing tensions inside Flaviviridae that recent phylogenies have noted. If the full methods show that the structural metrics track ICTV hierarchy more closely than a simple conserved-fold baseline would predict, that would be useful evidence.

The stress-test concern, that the groupings reflect the shared palm fold rather than taxonomy, lands in part. All RdRps share the palm domain, so TM-score or equivalent measures could group sequences by that universal constraint or by systematic errors in structure prediction on divergent inputs rather than by deeper evolutionary signal. The benchmarks recover held-out ICTV labels, but without explicit controls that subtract the shared-fold contribution or compare against phylogeny-independent structural clusters, it is not yet clear how much extra taxonomic information is present. The abstract also gives no numbers on alignment accuracy or threshold calibration, which leaves the strength of the central claim provisional.

This is a methods paper aimed at virologists who need to annotate large unclassified RdRp sets from sequencing projects. Readers working on virus discovery pipelines or database curation would get direct value from the tool and the benchmark results. The underlying idea is coherent and engages the literature on ICTV limits, so it clears the bar for serious refereeing even if revisions are needed on the validation side.

I would send it to review with requests for the fold-isolation checks and more detail on how the structural clusters were validated against independent data.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ViraClass, a hierarchical structure-based framework for taxonomic placement of RNA viruses using predicted RdRp protein structures. It claims that RdRp structure retains taxonomic signal at evolutionary depths where primary sequence similarity has collapsed, that this signal is organized consistently with the ICTV hierarchy, and that ViraClass outperforms sequence-based and genome-content baselines on random-split, prospective, and taxonomic hold-out benchmarks (with largest gains at deep ranks such as withheld families/orders/classes). The method performs rank-by-rank assignment stopping at confidence thresholds and uses structural clustering for queries outside reference space; application to unclassified sequences is also reported.

Significance. If the central claim holds, the work would be significant for enabling taxonomic interpretation of divergent RNA viruses from metatranscriptomic data at ranges inaccessible to sequence methods. The multi-benchmark design (including taxonomic hold-outs) and explicit comparison to baselines provide a concrete advance. The paper does not ship machine-checked proofs or parameter-free derivations but does report reproducible benchmark protocols and falsifiable predictions via ICTV consistency.

major comments (2)

[Results (taxonomic hold-out benchmarks)] Results section on taxonomic hold-out benchmarks: the claim that structure recovers ICTV groupings at phylum/class levels requires evidence that performance exceeds what would be obtained from the conserved RdRp palm fold alone. All RdRps share this core fold, so TM-score or equivalent metrics could produce groupings driven by this universal structural constraint (or by systematic prediction/alignment biases) rather than by evolutionary signal tracking the ICTV hierarchy. No control experiment (e.g., palm-domain-only alignment baseline, structure permutation preserving the fold, or comparison to random structures with identical fold statistics) is described to isolate the additional taxonomic signal.
[Methods (confidence thresholds and alignment)] Methods on threshold calibration and structural alignment: the abstract and methods state outperformance but supply no quantitative details on structural alignment accuracy (e.g., TM-score distributions per rank), confidence threshold calibration procedure (including any cross-validation or independent test set to avoid overfitting to reference taxonomy), or validation of post-hoc structural clustering. These elements are load-bearing for interpreting the reported gains at deep ranks and the rank-stopping behavior.

minor comments (2)

[Figures] Figure legends (e.g., structural clustering diagrams): color coding and rank-specific annotations could be clarified to avoid ambiguity when readers compare placements across phylum-to-genus levels.
[Methods] Notation: the definition of the structural similarity metric used for clustering should be stated explicitly in the main text (rather than solely in supplementary methods) to facilitate direct comparison with sequence baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to strengthen the presentation of controls and methodological details.


Referee: Results section on taxonomic hold-out benchmarks: the claim that structure recovers ICTV groupings at phylum/class levels requires evidence that performance exceeds what would be obtained from the conserved RdRp palm fold alone. All RdRps share this core fold, so TM-score or equivalent metrics could produce groupings driven by this universal structural constraint (or by systematic prediction/alignment biases) rather than by evolutionary signal tracking the ICTV hierarchy. No control experiment (e.g., palm-domain-only alignment baseline, structure permutation preserving the fold, or comparison to random structures with identical fold statistics) is described to isolate the additional taxonomic signal.

Authors: We agree that isolating the contribution of the conserved palm fold versus additional structural features is important for interpreting the taxonomic signal. Our current benchmarks demonstrate that full RdRp structure-based placement outperforms sequence methods at deep ranks, but we acknowledge the absence of an explicit palm-only control. In revision we will add a palm-domain-only structural alignment baseline (using the same TM-score framework) and report its performance on the taxonomic hold-out sets to quantify any incremental signal from variable regions. revision: yes
Referee: Methods on threshold calibration and structural alignment: the abstract and methods state outperformance but supply no quantitative details on structural alignment accuracy (e.g., TM-score distributions per rank), confidence threshold calibration procedure (including any cross-validation or independent test set to avoid overfitting to reference taxonomy), or validation of post-hoc structural clustering. These elements are load-bearing for interpreting the reported gains at deep ranks and the rank-stopping behavior.

Authors: We will expand the Methods section with the requested quantitative details: (i) TM-score distributions stratified by ICTV rank for the reference alignments, (ii) the cross-validation procedure (including the independent hold-out sets) used to select confidence thresholds, and (iii) validation metrics (e.g., silhouette scores and stability under perturbation) for the post-hoc structural clustering step. These additions will be accompanied by supplementary figures. revision: yes
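For readers unfamiliar with the silhouette score named in this response, the sketch below computes it from a precomputed distance matrix. It is an illustration under stated assumptions, not the authors' validation code: the function name `silhouette_scores` is hypothetical, and treating distance as something like 1 minus a structural similarity score is an assumption.

```python
from typing import Dict, List, Sequence


def silhouette_scores(dist: Sequence[Sequence[float]],
                      labels: Sequence[int]) -> List[float]:
    """Per-item silhouette values from a symmetric distance matrix."""
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores: List[float] = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the item's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


# Toy distances (assumed to be something like 1 - structural similarity).
D = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
good = silhouette_scores(D, [0, 0, 1, 1])  # compact, well-separated groups
bad = silhouette_scores(D, [0, 1, 0, 1])   # groups that cut across the structure
```

Values near 1 indicate compact, well-separated groups. Values near 0 or below indicate that items sit closer to another group than to their own.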

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained against external benchmarks


The paper calibrates structural similarity thresholds and clustering on reference RdRp structures with known ICTV labels, then evaluates generalization on random-split, prospective, and taxonomic hold-out sets that withhold entire families/orders/classes. Performance gains at deep distances are measured against sequence baselines on the same withheld data. No equation, ansatz, or self-citation reduces the rank-by-rank assignments or new-group organization to a fitted parameter defined from the target outputs. The central claim is tested via external validation rather than being true by construction.
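A taxonomic hold-out split of the kind described here can be sketched as follows. This is a minimal illustration, not the paper's benchmark code: the record layout, the function name `taxonomic_holdout` and the placeholder taxon labels are assumptions.

```python
from typing import Dict, List, Set, Tuple

Record = Dict[str, object]  # {"id": str, "lineage": {rank: taxon}}


def taxonomic_holdout(records: List[Record], rank: str,
                      held_out: Set[str]) -> Tuple[List[Record], List[Record]]:
    """Withhold every record of the named taxa at `rank` from the reference."""
    reference: List[Record] = []
    queries: List[Record] = []
    for rec in records:
        taxon = rec["lineage"].get(rank)
        (queries if taxon in held_out else reference).append(rec)
    return reference, queries


# Placeholder records; the labels are not real taxa.
records = [
    {"id": "r1", "lineage": {"order": "O1", "family": "F1"}},
    {"id": "r2", "lineage": {"order": "O1", "family": "F2"}},
    {"id": "r3", "lineage": {"order": "O2", "family": "F3"}},
]
reference, queries = taxonomic_holdout(records, "family", {"F2"})

# No held-out family may leak into the reference set.
leak = {r["lineage"]["family"] for r in reference} & {"F2"}
```

In this toy split the held-out query `r2` has no family-level match in the reference, so it can only be scored at order and above, which is the regime where the review says sequence baselines lose most of their signal.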

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that protein structure similarity tracks the ICTV hierarchy at depths where sequence has saturated; this is treated as a domain assumption rather than derived from first principles within the work.

free parameters (1)

confidence thresholds for rank stopping
Thresholds that decide when to stop hierarchical assignment are calibrated on reference data and directly affect placement depth.
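One way such a threshold could be calibrated on held-out reference data is sketched below. The calibration procedure actually used is not described on this page, so the precision target, the function name `calibrate_threshold` and the toy validation scores are all assumptions.

```python
from typing import List, Optional, Tuple


def calibrate_threshold(scored: List[Tuple[float, bool]],
                        target_precision: float = 0.95) -> Optional[float]:
    """Lowest score cutoff whose accepted predictions meet the target precision.

    `scored` holds (confidence score, prediction was correct) pairs from a
    validation set that was not used to build the reference.
    """
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best: Optional[float] = None
    correct = 0
    for k, (score, ok) in enumerate(ordered, start=1):
        correct += ok
        # Only evaluate a cutoff once all items tied at this score are counted.
        if k < len(ordered) and ordered[k][0] == score:
            continue
        if correct / k >= target_precision:
            best = score
    return best


# Toy validation results for one rank (illustrative numbers only).
validation = [(0.9, True), (0.8, True), (0.7, True), (0.6, False), (0.5, True)]
cutoff = calibrate_threshold(validation, target_precision=0.9)
```

A lower cutoff lets more queries descend to deeper ranks at the cost of more wrong placements, which is why the ledger counts the thresholds as a free parameter.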

axioms (1)

domain assumption: RdRp structural similarity is consistent with ICTV taxonomic hierarchy at deep evolutionary distances
Invoked when claiming that structure retains taxonomic signal and when performing rank-by-rank assignment.

pith-pipeline@v0.9.1-grok · 5853 in / 1302 out tokens · 25074 ms · 2026-06-27T20:09:41.300174+00:00 · methodology


Reference graph

Works this paper leans on

2 extracted references

[1]

Jiang, J.-Z. et al. Virus classification for viral genomic fragments using PhaGCN2. Briefings in bioinformatics 24, bbac505 (2023). 19. Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nature biotechnology 42, 1303–1312 (2024). 20. Pons, J. C. et al. VPF-class: Taxonomic assignment and host prediction of uncultivated viruses b...

2023
[2]

Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic acids research 30, 1575–1584 (2002). 36. Team, I. et al. InternAgent: When agent becomes the scientist–building closed-loop system from hypothesis to verification. arXiv preprint arXiv:2505.16938 (2025). 37. Feng, S. et al. Intern...

arXiv 2002
