m6A-FORM: A Foundation Model for Decoding N6-methyladenosine Biology
Pith reviewed 2026-06-27 07:33 UTC · model grok-4.3
The pith
m6A-FORM predicts m6A sites with PR-AUC of 0.635 after pretraining on MeRIP-seq peaks, improving over existing methods by at least 0.14.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
m6A-FORM is a transformer-based foundation model that uses MeRIP-seq peaks as methylation-enriched priors and is pretrained on approximately 22 million peak-derived sequences from 143 human MeRIP-seq studies. After fine-tuning with high-confidence single-nucleotide m6A annotations from m6A-Atlas v2.0 and GLORI, it achieves a PR-AUC of 0.635 and ROC-AUC of 0.988 for m6A site prediction, improving PR-AUC by at least 0.14 over existing methods while enabling substantially faster inference. Task-specific adaptation supports prediction of binding sites for 19 m6A-associated regulators and identification of YTHDF2-bound m6A sites associated with mRNA degradation. Applying the model across 67 datasets from 24 human tissues identifies 19,631 tissue-conserved sites with distinct localization, clustering, methylation, expression, RBP-interaction, and decay-associated signatures.
What carries the argument
A transformer-based foundation model pretrained on peak-derived sequences, with MeRIP-seq peaks serving as methylation-enriched priors.
If this is right
- The model achieves substantially faster inference for m6A site prediction compared to prior methods.
- Task-specific adaptation enables prediction of binding sites for 19 m6A-associated regulators.
- The model identifies YTHDF2-bound m6A sites associated with mRNA degradation.
- Application across 67 datasets from 24 human tissues yields 19,631 tissue-conserved m6A sites with distinct signatures.
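To make the last item concrete, here is a minimal sketch of one way a "tissue-conserved" site could be defined: a site counts as conserved when it is called in at least a chosen fraction of tissues, after pooling datasets from the same tissue. The paper's actual criterion is not stated in the abstract, so the function name `conserved_sites`, the `min_fraction` threshold, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

def conserved_sites(calls_by_dataset, tissue_of, min_fraction=1.0):
    """Return sites called in at least min_fraction of the tissues.

    calls_by_dataset: dict mapping dataset id -> set of site keys
    tissue_of:        dict mapping dataset id -> tissue name
    Hypothetical definition; the paper may use a different rule.
    """
    tissues = {tissue_of[d] for d in calls_by_dataset}
    site_tissues = defaultdict(set)
    for dataset, sites in calls_by_dataset.items():
        for site in sites:
            # Pool replicate datasets so each tissue counts once per site.
            site_tissues[site].add(tissue_of[dataset])
    needed = min_fraction * len(tissues)
    return sorted(s for s, ts in site_tissues.items() if len(ts) >= needed)

# Toy input: sites keyed by (chromosome, position, strand).
site_a = ("chr1", 100, "+")
site_b = ("chr2", 250, "-")
calls = {
    "d1": {site_a, site_b},
    "d2": {site_b},
    "d3": {site_a},
    "d4": {site_a},
}
tissue_of = {"d1": "brain", "d2": "brain", "d3": "liver", "d4": "heart"}

conserved = conserved_sites(calls, tissue_of, min_fraction=1.0)
print(conserved)
```

In this toy input `site_b` appears in two datasets but only one tissue, so it is not counted as conserved. Pooling by tissue first avoids rewarding tissues that simply have more datasets.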
Where Pith is reading between the lines
- The peak-based pretraining approach might extend to prediction tasks for other RNA modifications where peak data exists.
- The tissue-conserved sites could provide candidates for experiments testing effects on mRNA decay rates in specific cell types.
- Faster inference could allow scanning of larger transcriptomes or integration with other sequencing datasets for combined analyses.
Load-bearing premise
MeRIP-seq peaks serve as reliable methylation-enriched priors for pretraining and the single-nucleotide annotations from m6A-Atlas v2.0 and GLORI constitute accurate ground truth without substantial false positives or selection biases.
What would settle it
An independent validation experiment that applies an orthogonal technique, such as mass spectrometry, to held-out tissue samples and checks whether the predicted m6A sites match at the reported accuracy levels.
Original abstract
N6-methyladenosine (m6A) is the most abundant internal modification in eukaryotic mRNA. However, most existing predictors use adenosine-centered formulations that are computationally inefficient and prone to false positives. Here we present m6A-FORM, a transformer-based foundation model for RNA methylation that uses MeRIP-seq peaks as methylation-enriched priors and is pretrained on approximately 22 million peak-derived sequences from 143 human MeRIP-seq studies. After fine-tuning with high-confidence single-nucleotide m6A annotations from m6A-Atlas v2.0 and GLORI, m6A-FORM-sites achieves state-of-the-art m6A site prediction performance, with a PR-AUC of 0.635 and ROC-AUC of 0.988, improving PR-AUC by at least 0.14 over existing methods while enabling substantially faster inference. Task-specific adaptation further supports prediction of binding sites for 19 m6A-associated regulators and identification of YTHDF2-bound m6A sites associated with mRNA degradation. Applying m6A-FORM across 67 datasets from 24 human tissues identifies 19,631 tissue-conserved sites with distinct localization, clustering, methylation, expression, RBP-interaction, and decay-associated signatures.
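The abstract reports a ROC-AUC near 1 alongside a much lower PR-AUC. That gap is typical when positives are rare, as the following sketch shows on made-up data. It uses average precision as the PR-AUC estimator, which is a common choice but an assumption here, since the paper's exact calculation is not given. The labels and scores are synthetic and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive.

    Assumes distinct scores, so the ranking is unambiguous.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Synthetic, heavily imbalanced data: 2 positives among 100 candidates.
pos_scores = [0.95, 0.80]
high_neg_scores = [0.90, 0.88, 0.86, 0.84]        # four confident false positives
low_neg_scores = [i / 1000 for i in range(1, 95)]  # 94 easy negatives
scores = pos_scores + high_neg_scores + low_neg_scores
labels = [1, 1] + [0] * 98

print(f"ROC-AUC = {roc_auc(labels, scores):.3f}")
print(f"average precision = {average_precision(labels, scores):.3f}")
```

Four false positives barely move ROC-AUC because they are outnumbered by 94 easy negatives, but they push the second positive down to rank 6 and cut its precision to 2/6. This is why PR-AUC is the more informative headline metric for sparse site prediction.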
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents m6A-FORM, a transformer-based foundation model pretrained on ~22 million sequences derived from MeRIP-seq peaks across 143 human studies. After fine-tuning on single-nucleotide m6A annotations from m6A-Atlas v2.0 and GLORI, the m6A-FORM-sites variant reports state-of-the-art performance (PR-AUC 0.635, ROC-AUC 0.988) for m6A site prediction, with claimed improvement of at least 0.14 in PR-AUC over prior methods, faster inference, and downstream applications to 19 regulator binding sites and 19,631 tissue-conserved sites identified across 67 datasets from 24 tissues.
Significance. If the performance metrics are shown to be robust, the work offers a potentially useful large-scale pretrained model for m6A biology that could improve site prediction efficiency and enable tissue-level analyses. The scale of the pretraining corpus (~22M sequences) is a clear strength relative to prior adenosine-centered predictors.
major comments (3)
- [Results] Results (m6A site prediction experiments): The headline PR-AUC of 0.635 and 0.14 improvement over baselines are reported without any description of train-test split methodology, baseline re-implementations, statistical error bars, or explicit controls for data leakage between the MeRIP-seq peak pretraining corpus and the fine-tuning labels from m6A-Atlas v2.0/GLORI; this directly undermines verification of the central SOTA claim.
- [Methods] Methods (fine-tuning data curation): The model treats single-nucleotide annotations from m6A-Atlas v2.0 and GLORI as high-confidence ground truth, yet no analysis or external validation is provided for potential false-positive rates, tissue-selection biases, or sequence-context artifacts common in aggregated MeRIP/GLORI compilations; if present, these would systematically inflate both absolute metrics and the reported improvement.
- [Results] Results (tissue-conserved sites analysis): The identification of 19,631 tissue-conserved sites and their downstream signatures (localization, RBP interaction, decay) inherits the same label-quality dependency as the site-prediction task; without independent orthogonal validation (e.g., mass-spec or orthogonal sequencing), the biological conclusions rest on the same unverified ground-truth assumption.
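The first major comment asks for statistical error bars on the headline metric. One generic way to obtain them, sketched below, is a percentile bootstrap over the test set. This is an illustration only and not the authors' procedure. The helper name `bootstrap_ci`, the resampling settings, and the synthetic scores are all assumptions.

```python
import random

def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def bootstrap_ci(labels, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric on a fixed test set.

    Requires at least one positive label; resamples that contain
    no positives are redrawn because the metric is undefined there.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if sum(ys) == 0:
            continue
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Synthetic test set: 20 positives, 180 negatives, overlapping score distributions.
rng = random.Random(42)
labels = [1] * 20 + [0] * 180
scores = [rng.gauss(1.0, 1.0) for _ in range(20)] + [rng.gauss(0.0, 1.0) for _ in range(180)]

point = average_precision(labels, scores)
low, high = bootstrap_ci(labels, scores, average_precision)
print(f"AP = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

A bootstrap interval captures test-set sampling noise only. Variation across training seeds or cross-validation folds, which the authors mention in their response, would need separate training runs.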
minor comments (2)
- The abstract states performance numbers but the main text should include a dedicated table comparing all baselines with exact PR-AUC/ROC-AUC values, inference times, and parameter counts for reproducibility.
- Notation for the foundation model variants (m6A-FORM vs. m6A-FORM-sites) is introduced without an explicit definition table or diagram showing the pretraining vs. fine-tuning stages.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address each major comment point-by-point below, committing to revisions that add missing methodological details and explicit discussions of limitations.
Referee: [Results] Results (m6A site prediction experiments): The headline PR-AUC of 0.635 and 0.14 improvement over baselines are reported without any description of train-test split methodology, baseline re-implementations, statistical error bars, or explicit controls for data leakage between the MeRIP-seq peak pretraining corpus and the fine-tuning labels from m6A-Atlas v2.0/GLORI; this directly undermines verification of the central SOTA claim.
Authors: We agree that these details are essential for verifying the central claims. In the revised manuscript we will add a dedicated subsection describing the train-test split protocol (including sequence-identity filtering to prevent leakage between the ~22M pretraining sequences and the m6A-Atlas/GLORI fine-tuning labels), the exact re-implementation steps for each baseline, and statistical error bars obtained from multiple random seeds or cross-validation folds. revision: yes
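The sequence-identity filtering promised in this response could look roughly like the sketch below. It drops any evaluation sequence whose k-mer Jaccard similarity to a pretraining sequence meets a threshold. The authors do not describe their filter, so the similarity measure, the value of `k`, the threshold, and the function names are hypothetical. An all-pairs comparison like this would not scale to about 22 million sequences, where a dedicated clustering tool such as CD-HIT or MMseqs2 would be the more realistic choice.

```python
def kmers(seq, k=6):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def filter_leaky(eval_seqs, pretrain_seqs, k=6, threshold=0.8):
    """Keep evaluation sequences that are not near-duplicates of pretraining data.

    Hypothetical filter: quadratic in the number of sequences, for illustration only.
    """
    pretrain_sets = [kmers(s, k) for s in pretrain_seqs]
    kept = []
    for seq in eval_seqs:
        ks = kmers(seq, k)
        if all(jaccard(ks, p) < threshold for p in pretrain_sets):
            kept.append(seq)
    return kept

pretrain = ["ACGUACGUGGACUAGGCAUCG"]
evaluation = [
    "ACGUACGUGGACUAGGCAUCG",   # exact duplicate of a pretraining sequence
    "ACGUACGUGGACUAGGCAUCA",   # one substitution at the 3' end
    "UUUUCCCCAAAAGGGGUUUUCC",  # unrelated sequence
]

kept = filter_leaky(evaluation, pretrain)
print(kept)
```

The single-substitution sequence shares 15 of 17 distinct 6-mers with the pretraining sequence, giving a Jaccard similarity of about 0.88, so it is removed along with the exact duplicate.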
Referee: [Methods] Methods (fine-tuning data curation): The model treats single-nucleotide annotations from m6A-Atlas v2.0 and GLORI as high-confidence ground truth, yet no analysis or external validation is provided for potential false-positive rates, tissue-selection biases, or sequence-context artifacts common in aggregated MeRIP/GLORI compilations; if present, these would systematically inflate both absolute metrics and the reported improvement.
Authors: We acknowledge that the original submission did not include an explicit analysis of label quality. In revision we will insert a new paragraph in Methods that discusses known limitations of aggregated MeRIP-seq and GLORI compilations, cites supporting literature on their false-positive characteristics, and notes potential tissue biases. A full orthogonal experimental validation lies outside the scope of this computational study. revision: partial
Referee: [Results] Results (tissue-conserved sites analysis): The identification of 19,631 tissue-conserved sites and their downstream signatures (localization, RBP interaction, decay) inherits the same label-quality dependency as the site-prediction task; without independent orthogonal validation (e.g., mass-spec or orthogonal sequencing), the biological conclusions rest on the same unverified ground-truth assumption.
Authors: We agree that the conserved-site conclusions rest on the same label assumptions. The revised manuscript will explicitly state this dependency, add a limitations paragraph, and frame the reported signatures as computational observations that motivate future orthogonal experiments. The internal consistency of the signatures (e.g., expected RBP and decay associations) provides supporting context but does not replace independent validation. revision: partial
Circularity Check
No circularity: empirical ML pipeline with external labels and no derivation steps
The paper presents a transformer foundation model pretrained on MeRIP-seq peak sequences and fine-tuned on single-nucleotide annotations from the external m6A-Atlas v2.0 and GLORI resources. Reported metrics (PR-AUC 0.635, ROC-AUC 0.988) are standard supervised evaluation outcomes on held-out data rather than any claimed first-principles derivation. No equations, self-definitional loops, fitted-input-as-prediction steps, or load-bearing self-citations appear in the described pipeline. The central claims rest on empirical performance against independent annotations and do not reduce to the model's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: MeRIP-seq peaks provide reliable methylation-enriched priors for pretraining.
- Domain assumption: m6A-Atlas v2.0 and GLORI supply accurate single-nucleotide ground truth.