Multi-Scale Reversible Chaos Game Representation: A Unified Framework for Sequence Classification

Sarwan Ali; Taslim Murad

arxiv: 2604.18477 · v1 · submitted 2026-04-20 · 💻 cs.LG

Pith reviewed 2026-05-10 05:43 UTC · model grok-4.3

classification 💻 cs.LG

keywords: biological sequence classification, chaos game representation, multi-scale encoding, reversible representation, protein language models, hybrid models, geometric sequence features, DNA and protein analysis

The pith

A reversible multi-scale geometric encoding improves biological sequence classification when combined with language model features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MS-RCGR as a way to convert DNA and protein sequences into geometric forms that retain every detail of the original sequence and reveal patterns at multiple lengths. This encoding supports three analysis routes: direct use as features in standard machine learning, conversion into images for vision models, and addition to embeddings from pre-trained protein language models. Tests on synthetic datasets covering seven sequence classes show gains in each route, with the strongest results coming from the combined language-model-plus-encoding approach. The method matters for tasks that require both accurate classification and the ability to trace back which sequence parts drove a decision, since the transformation can be undone without any loss.

Core claim

MS-RCGR transforms biological sequences into multi-resolution geometric representations using rational arithmetic and hierarchical k-mer decomposition. The resulting features are scale-invariant and fully reversible, so the original sequence can be recovered exactly. The framework unifies three analysis styles: traditional machine learning on the extracted geometric features, computer vision models applied to the generated images, and hybrid models that merge the geometric features with embeddings from pre-trained language models such as ESM2 and ProtT5. Experiments on synthetic DNA and protein datasets with seven distinct classes demonstrate that MS-RCGR features raise classification scores across all three paradigms.

What carries the argument

Multi-Scale Reversible Chaos Game Representation (MS-RCGR), which applies hierarchical k-mer decomposition and rational arithmetic to produce scale-invariant geometric features from sequences while preserving full reversibility.
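
The reversibility idea can be illustrated at a single scale with exact rational arithmetic. The sketch below is not the paper's Eq. 3.1 or Eq. 3.2, which are not reproduced on this page. It assumes a hypothetical corner assignment (A, C, G, T on the four axis points of the unit circle), a start at the origin, and the classic midpoint update. The names `CORNERS`, `SIGNS_TO_SYMBOL`, `encode`, and `decode` are illustrative only.

```python
from fractions import Fraction

# Hypothetical corner assignment: four rational points on the unit circle.
CORNERS = {"A": (1, 0), "C": (0, 1), "G": (-1, 0), "T": (0, -1)}

# The signs of (x + y) and (x - y) identify the last corner that was used.
SIGNS_TO_SYMBOL = {
    (True, True): "A",
    (True, False): "C",
    (False, False): "G",
    (False, True): "T",
}


def encode(seq):
    """Return the final chaos-game point for seq as exact fractions."""
    x, y = Fraction(0), Fraction(0)
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        # Midpoint step: move halfway from the current point to the corner.
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y


def decode(point):
    """Recover the sequence from its final point by undoing midpoint steps."""
    x, y = point
    symbols = []
    while (x, y) != (0, 0):
        symbol = SIGNS_TO_SYMBOL[(x + y > 0, x - y > 0)]
        symbols.append(symbol)
        cx, cy = CORNERS[symbol]
        # Invert the midpoint step: previous = 2 * current - corner.
        x, y = 2 * x - cx, 2 * y - cy
    return "".join(reversed(symbols))
```

Under these assumptions the final point alone is enough to walk back through the whole sequence, because `Fraction` keeps every halving exact. With floating point the low-order bits would be lost once the sequence exceeds the mantissa width, and the round trip would fail.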

If this is right

MS-RCGR features raise classification accuracy when used alone in traditional machine learning pipelines.
Images generated from MS-RCGR representations can be processed directly by computer vision models.
Merging MS-RCGR features with protein language model embeddings produces higher accuracy than either source used separately.
Because the encoding is reversible, every original sequence element remains recoverable and can be examined for interpretability.
Multi-scale analysis captures both short nucleotide patterns and longer motif structures within the same representation.
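
One plausible reading of the hybrid route is plain feature concatenation. The sketch below assumes that each sequence already has a geometric feature row and a language-model embedding row, standardises each column, and joins the two blocks. The source says the features are combined but does not specify the fusion rule on this page, so concatenation and the helper names `zscore_columns` and `fuse` are assumptions.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardise each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Constant columns have zero spread; divide by 1.0 so they map to 0.0.
    spreads = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, spreads)]
        for row in rows
    ]


def fuse(geometric_rows, embedding_rows):
    """Concatenate standardised geometric features with standardised embeddings."""
    if len(geometric_rows) != len(embedding_rows):
        raise ValueError("both feature blocks need one row per sequence")
    geo = zscore_columns(geometric_rows)
    emb = zscore_columns(embedding_rows)
    return [g + e for g, e in zip(geo, emb)]
```

Standardising before joining keeps a 24-dimensional geometric block from being swamped by a much wider embedding block purely through scale. Whether the paper does this is not stated here.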

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reversible multi-scale construction could be tested on RNA sequences or full genomes to check whether the accuracy gains persist.
Because the output is both numeric features and images, the method might be combined with other model families such as graph neural networks.
The explicit reversibility property opens the possibility of using MS-RCGR inside generative models that must both classify and reconstruct sequences.
If the complementary information claim holds, similar hybrid constructions could be explored for non-biological sequences such as text or time series.

Load-bearing premise

That the geometric features produced by MS-RCGR contain information that is genuinely complementary to the patterns already captured inside pre-trained language model embeddings.

What would settle it

An experiment on real biological sequences in which the hybrid model using both MS-RCGR features and language model embeddings fails to outperform the language model embeddings alone.
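
A comparison of this kind needs a paired test over matched evaluation folds. The sketch below uses an exact one-sided sign-flip permutation test as a stand-in; the source does not say which statistical test the authors use, and the function name `paired_sign_flip` is hypothetical. The per-fold scores are supplied by the reader, one pair per fold.

```python
from itertools import product


def paired_sign_flip(hybrid_scores, baseline_scores):
    """Return (mean difference, one-sided exact p-value) for paired scores.

    The p-value is the share of sign assignments whose summed difference is
    at least as large as the observed one. The cost grows as 2**n_folds, so
    this suits a small number of folds.
    """
    if len(hybrid_scores) != len(baseline_scores) or not hybrid_scores:
        raise ValueError("need the same non-zero number of scores per model")
    diffs = [h - b for h, b in zip(hybrid_scores, baseline_scores)]
    observed = sum(diffs)
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            at_least_as_large += 1
    return observed / len(diffs), at_least_as_large / total
```

A mean difference near zero, or a large p-value, on real sequences would count against the complementarity premise. A small p-value would support it only for the benchmark tested.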

Figures

Figures reproduced from arXiv: 2604.18477 by Sarwan Ali, Taslim Murad.

**Figure 3.1.** Illustrates the scale $k = 1$ CGR trajectory for the sequence ATCGATCGTAGC, demonstrating the geometric encoding principle underlying MS-RCGR. Each nucleotide is assigned a corner point on the unit circle via rational-arithmetic projection (Eq. 3.1), and the iterative midpoint process (Eq. 3.2) produces a trajectory whose spatial distribution reflects both the compositional content and the ordering of t…

**Figure 3.** Visualises the complete 24-dimensional …

**Figure 3.3.** Normalised MS-RCGR feature vector $\Phi_{\mathrm{CGR}}(s) \in \mathbb{R}^{24}$ extracted from the example sequence ATCGATCGTAGC across scales $k \in \{1, 2, 3, 4\}$. Each group of six bars corresponds to the per-scale descriptor $\phi^{(k)}(s) = \langle p^{(k)}_{n_k}, n_k, \mathrm{Var}^{(k)}_x, \mathrm{Var}^{(k)}_y, \bar{d}^{(k)} \rangle$ (Eq. 3.4), where $p^{(k)}_{n_k}$ denotes the two-dimensional final trajectory position.

**Figure 3.2.** MS-RCGR trajectories for the example sequence ATCGATCGTAGC across scales $k \in \{1, 2, 3, 4\}$. Each trajectory is initiated at the origin (×) and progresses through $n_k = n - k + 1$ points (indicated in each panel), with edge opacity encoding temporal order. The star marker (⋆) denotes the final position $p^{(k)}_{n_k}$, which forms the first component of the per-scale descriptor $\phi^{(k)}(s)$ (Eq. 3.4). As k increases, k-…
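
The quantities named in these captions can be sketched in a few lines. The block below counts overlapping k-mers, builds a scale-1 trajectory, and computes a six-number descriptor from it. Several parts are assumptions rather than the paper's definitions: the corner assignment in `CORNERS`, the use of population variance, and reading the mean-distance term as mean Euclidean distance from the origin. How k-mers map to target points for k > 1 is not given on this page, so only the scale-1 trajectory is built.

```python
from fractions import Fraction
from math import sqrt

# Hypothetical corner assignment: four rational points on the unit circle.
CORNERS = {"A": (1, 0), "C": (0, 1), "G": (-1, 0), "T": (0, -1)}


def kmers(seq, k):
    """Overlapping k-mers of seq; there are len(seq) - k + 1 of them."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trajectory(seq):
    """Scale-1 midpoint trajectory from the origin, as exact fractions."""
    x, y = Fraction(0), Fraction(0)
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def descriptor(points):
    """Six values: final x, final y, point count, Var_x, Var_y, mean distance."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n  # population variance
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    # Assumed meaning of the distance term: mean distance from the origin.
    mean_dist = sum(sqrt(x * x + y * y) for x, y in points) / n
    final_x, final_y = points[-1]
    return [float(final_x), float(final_y), n,
            float(var_x), float(var_y), mean_dist]
```

For the 12-letter example sequence, `kmers` yields 12, 11, 10, and 9 windows at k = 1 to 4, matching $n_k = n - k + 1$. Four six-value descriptors stacked together give the 24 dimensions mentioned in the caption.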

Original abstract

Biological classification with interpretability remains a challenging task. For this, we introduce a novel encoding framework, Multi-Scale Reversible Chaos Game Representation (MS-RCGR), that transforms biological sequences into multi-resolution geometric representations with guaranteed reversibility. Unlike traditional sequence encoding methods, MS-RCGR employs rational arithmetic and hierarchical k-mer decomposition to generate scale-invariant features that preserve complete sequence information while enabling diverse analytical approaches. Our framework bridges three distinct paradigms for sequence analysis: (1) traditional machine learning using extracted geometric features, (2) computer vision models operating on CGR-generated images, and (3) hybrid approaches combining protein language model embeddings with CGR features. Through comprehensive experiments on synthetic DNA and protein datasets encompassing seven distinct sequence classes, we demonstrate that MS-RCGR features consistently enhance classification performance across all paradigms. Notably, our hybrid approach combining pre-trained language model embeddings (ESM2, ProtT5) with MS-RCGR features achieves superior performance compared to either method alone. The reversibility property of our encoding ensures no information loss during transformation, while multi-scale analysis captures patterns ranging from individual nucleotides to complex motif structures. Our results indicate that MS-RCGR provides a flexible, interpretable, and high-performing foundation for biological sequence analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MS-RCGR adds a reversible multi-scale twist to chaos game representations for sequences, but the gains are shown only on synthetic data with artificial classes.

The paper introduces Multi-Scale Reversible Chaos Game Representation as a way to encode sequences into geometric forms that stay fully reversible through rational arithmetic and hierarchical k-mer breakdown. This lets the same representation feed traditional feature-based ML, image models on the CGR plots, and hybrids with embeddings from ESM2 or ProtT5. On their synthetic DNA and protein sets with seven contrived classes, the hybrid version beats using either the language model or the geometric features alone. Reversibility is a concrete plus because it guarantees no information drop during the transform, and the multi-scale part aims to catch both short and longer patterns in one go. That unification across three analysis styles is the clearest new piece here.

The experiments appear consistent within the synthetic regime they chose, and the framework itself is cleanly motivated without obvious circularity. The main limitation is the exclusive use of synthetic sequences. Real biological data carries evolutionary noise, sequencing errors, and structural constraints that the artificial classes do not replicate, so it is unclear whether the reported complementarity between MS-RCGR and the language-model embeddings would survive outside the clean setup. No ablation on actual sequences is described, and the abstract gives no quantitative numbers or statistical details to judge effect size.

Readers working on new sequence encodings or hybrid bioinformatics pipelines could pick this up and test the reversibility property themselves. It is not ready for immediate citation in applied work until real-data results appear. The idea is coherent enough and the method is reproducible in principle, so it deserves a serious referee who can push for expanded benchmarks on genuine biological sequences and clearer comparisons to existing CGR variants.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Multi-Scale Reversible Chaos Game Representation (MS-RCGR), a framework that encodes biological sequences into multi-resolution geometric representations using rational arithmetic and hierarchical k-mer decomposition. It claims guaranteed reversibility with no information loss, scale-invariance, and the ability to bridge traditional ML on geometric features, computer vision on CGR images, and hybrid models combining pre-trained protein language model embeddings (ESM2, ProtT5) with MS-RCGR features. Experiments on synthetic DNA and protein datasets with seven artificial classes are said to show consistent performance gains across paradigms, with the hybrid approach outperforming either component alone.

Significance. If validated, the reversibility property and multi-scale decomposition could provide an interpretable, information-preserving alternative or complement to existing sequence encodings. The hybrid paradigm is a potentially useful direction for combining geometric and embedding-based features. However, the current evidence base is narrow, so the practical significance for biological sequence classification remains provisional pending broader validation.

major comments (2)

[Experimental evaluation / Results] The central claim that the hybrid ESM2/ProtT5 + MS-RCGR approach achieves superior performance rests entirely on synthetic DNA/protein datasets with seven artificially constructed classes. No experiments on real biological sequences (e.g., from UniProt, NCBI, or standard benchmarks with evolutionary/structural noise) are reported, so it is unclear whether the observed complementarity is genuine or an artifact of the synthetic construction. This directly affects the applicability asserted in the abstract and conclusion.
[Abstract] Abstract and experimental description: claims of 'consistent performance improvements' and 'superior performance' are made without any quantitative metrics, error bars, dataset sizes, cross-validation details, or statistical tests. This prevents assessment of effect size or reliability and is load-bearing for the hybrid-superiority assertion.

minor comments (1)

Ensure all tables and figures in the full manuscript are explicitly referenced in the text and include clear captions describing the synthetic class definitions and evaluation protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and presentation of our work. We address each major comment below and describe the revisions we will make.

Referee: [Experimental evaluation / Results] The central claim that the hybrid ESM2/ProtT5 + MS-RCGR approach achieves superior performance rests entirely on synthetic DNA/protein datasets with seven artificially constructed classes. No experiments on real biological sequences (e.g., from UniProt, NCBI, or standard benchmarks with evolutionary/structural noise) are reported, so it is unclear whether the observed complementarity is genuine or an artifact of the synthetic construction. This directly affects the applicability asserted in the abstract and conclusion.

Authors: We thank the referee for this observation. The synthetic datasets with seven artificially constructed classes were deliberately chosen to enable rigorous, controlled evaluation of MS-RCGR's reversibility, scale-invariance, and complementarity with language model embeddings, free from the label noise and unknown evolutionary relationships present in real data. This design directly supports the paper's focus on demonstrating the framework's theoretical and methodological properties. We agree, however, that this limits immediate claims of broad applicability. In the revised manuscript we will qualify the abstract and conclusion to state that results are shown on synthetic data, add a limitations paragraph discussing generalization to real sequences, and include a brief outline of how MS-RCGR could be applied to benchmarks such as UniProt-derived tasks.

Revision: partial

Referee: [Abstract] Abstract and experimental description: claims of 'consistent performance improvements' and 'superior performance' are made without any quantitative metrics, error bars, dataset sizes, cross-validation details, or statistical tests. This prevents assessment of effect size or reliability and is load-bearing for the hybrid-superiority assertion.

Authors: We agree that the abstract would be strengthened by including quantitative details. In the revision we will update the abstract to report key metrics (e.g., accuracy or F1-score gains of the hybrid model relative to baselines), dataset sizes, and a concise reference to the cross-validation protocol and statistical tests used. The experimental section will be expanded to present error bars from repeated runs and any significance testing performed.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; MS-RCGR defined as independent transformation with empirical validation on synthetic data

The paper introduces MS-RCGR as a novel encoding framework using rational arithmetic and hierarchical k-mer decomposition to produce reversible, multi-scale geometric representations of sequences. This definition stands on its own without reducing to fitted parameters or self-referential loops. Performance claims for the hybrid ESM2/ProtT5 + MS-RCGR approach are presented as outcomes of experiments on synthetic datasets with seven classes, not as predictions derived tautologically from the inputs. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatzes smuggled via citation are evident in the provided text. The derivation chain (framework definition to experimental results) remains self-contained and externally falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This ledger is based solely on the abstract, which does not detail any free parameters, axioms, or invented entities beyond the high-level description of the proposed MS-RCGR framework itself.
