An Imbalanced Dataset with Multiple Feature Representations for Studying Quality Control of Next-Generation Sequencing
Pith reviewed 2026-05-13 16:58 UTC · model grok-4.3
The pith
A dataset of 37,491 NGS samples supplies two feature sets that let supervised models accurately recover expert quality labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the supplied QC-34 features and the variable-length blocklist features both contain sufficient signal for supervised machine learning algorithms to predict the binary quality labels that were derived from automated tools and expert review, thereby confirming that the representations are relevant for automated quality control of next-generation sequencing data.
What carries the argument
Two parallel feature representations for the same samples—one a fixed vector of 34 quality-control metrics and the other a variable-length vector of read counts in ENCODE blocklist regions—paired with a binary quality label.
If this is right
- Direct side-by-side comparison of the 34-metric set versus the blocklist-derived set becomes possible for the same samples.
- Varying the number of blocklist features from eight to 1,183 lets researchers measure how feature granularity affects detection performance.
- The resource supports training and benchmarking of automated quality-control classifiers that must handle the observed 3.2 percent low-quality rate.
- Because all samples come from five genomic assays, performance differences across assay types can be quantified using identical feature pipelines.
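A minimal sketch of what benchmarking against the observed 3.2 percent low-quality rate could look like. The feature matrix below is synthetic stand-in data with an injected signal (the released QC-34 or BL matrices would replace it); the class-weighting choice is one common imbalance-handling strategy, not necessarily the paper's.

```python
# Hedged sketch: training a QC classifier under a ~3.2% low-quality rate.
# Synthetic stand-in features; 34 columns mimic the QC-34 representation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)
n_samples, n_features = 5000, 34
X = rng.normal(size=(n_samples, n_features))
y = (rng.random(n_samples) < 0.032).astype(int)  # ~3.2% positives (low quality)
X[y == 1] += 1.0                                 # injected signal for illustration

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# class_weight="balanced" upweights the rare low-quality class
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)

auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
f1 = f1_score(y_te, clf.predict(X_te))
print(f"AUC={auc:.3f}  F1(low-quality)={f1:.3f}")
```

Stratifying the split keeps the rare class represented in both partitions, which matters at this imbalance level.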
Where Pith is reading between the lines
- Models trained on this data could be tested for transfer to sequencing platforms or species outside the current human and mouse collection.
- The extreme class imbalance suggests that future work should examine whether the reported accuracy holds when low-quality samples become even rarer in production runs.
- Combining the blocklist read-count features with base-level sequence models might reveal quality defects that neither representation captures alone.
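The imbalance stress test suggested above could be sketched as follows: progressively rarefy the positive (low-quality) class and watch how detection performance behaves. Data, signal strength, and rates below are synthetic stand-ins and carry no claim about the paper's actual results.

```python
# Hedged sketch: does accuracy hold as low-quality samples become rarer?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

def auc_at_rate(pos_rate, n=20000):
    """Train and score a class-weighted classifier at a given positive rate."""
    X = rng.normal(size=(n, 34))
    y = (rng.random(n) < pos_rate).astype(int)
    X[y == 1, :8] += 1.0                      # fixed hypothetical signal strength
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=3)
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# 0.032 matches the dataset's rate; the rarer rates probe production-like regimes
aucs = {rate: auc_at_rate(rate) for rate in (0.032, 0.01, 0.003)}
for rate, auc in aucs.items():
    print(f"positive rate {rate:.3f}: AUC = {auc:.3f}")
```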
Load-bearing premise
The binary quality labels assigned by existing automated tools plus domain experts constitute accurate ground truth that generalizes across experimental settings and assays.
What would settle it
Training the same supervised models on a new collection of NGS samples whose quality labels were assigned independently by multiple blinded experts and obtaining substantially lower prediction accuracy would falsify the claim.
Figures
Original abstract
Next-generation sequencing (NGS) is a key technique for studying the DNA and RNA of organisms. However, identifying quality problems in NGS data across different experimental settings remains challenging. To develop automated quality-control tools, researchers require datasets with features that capture the characteristics of quality problems. Existing NGS repositories, however, offer only a limited number of quality-related features. To address this gap, we propose a dataset derived from 37,491 NGS samples with two types of quality-related feature representations. The first type consists of 34 features derived from quality control tools (QC-34 features). The second type has a variable number of features ranging from eight to 1,183. These features were derived from read counts in problematic genomic regions identified by the ENCODE blocklist (BL features). All features describe the same human and mouse samples from five genomic assays, allowing direct comparison of feature representations. The proposed dataset includes a binary quality label, derived from automated quality control and domain experts. Among all samples, 3.2% are of low quality. Supervised machine learning algorithms accurately predicted quality labels from the features, confirming the relevance of the provided feature representations. The proposed feature representations enable researchers to study how different feature types (QC-34 vs. BL features) and granularities (varying number of BL features) affect the detection of quality problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a dataset of 37,491 NGS samples (human and mouse, five genomic assays) containing two feature representations—34 QC-derived features (QC-34) and variable-length BL features (8–1,183) extracted from ENCODE blocklist regions—together with binary quality labels (3.2% low-quality) assigned from automated QC tools and domain experts. It reports that supervised ML models accurately predict these labels from the features, thereby confirming the utility of both representations for NGS quality-control research.
Significance. Release of a large, multi-assay dataset with two distinct feature families enables systematic comparison of QC-34 versus blocklist-derived representations at varying granularities. If the labels are shown to be independent of the QC-34 metrics, the resource would meaningfully support development and benchmarking of automated quality-control pipelines.
major comments (2)
- [Abstract] The statement that supervised ML algorithms 'accurately predicted quality labels from the features, confirming the relevance' is unsupported by any reported performance metrics (accuracy, AUC, F1, etc.), train-test split details, cross-validation scheme, or imbalance-handling strategy. Without these numbers, the confirmation claim cannot be evaluated.
- [Dataset construction and label derivation] QC-34 features are explicitly 'derived from quality control tools', while the binary label is also 'derived from automated quality control and domain experts.' If label assignment uses thresholds or rules on the same or linearly related QC metrics that populate QC-34, then high ML accuracy is expected by construction and does not independently validate feature relevance. The paper does not clarify the degree of overlap or provide an ablation that isolates the BL features from this risk.
minor comments (2)
- [Abstract] The phrase 'a variable number of features ranging from eight to 1,183' should specify the exact mapping (e.g., which assays or samples receive which cardinalities) and the precise definition of the blocklist regions used.
- [Abstract] The five genomic assays and the human/mouse split should be quantified (sample counts per assay and species) to allow readers to judge balance and generalizability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for minor revision. We address each major comment below and have revised the manuscript to improve clarity and support for the claims.
Point-by-point responses
- Referee: [Abstract] The statement that supervised ML algorithms 'accurately predicted quality labels from the features, confirming the relevance' is unsupported by any reported performance metrics (accuracy, AUC, F1, etc.), train-test split details, cross-validation scheme, or imbalance-handling strategy. Without these numbers, the confirmation claim cannot be evaluated.
  Authors: We agree that the abstract would benefit from explicit quantitative support. We have revised the abstract to briefly report the key performance metrics from the ML experiments described in the main text, including AUC, F1-score for the low-quality class, the 80/20 train-test split, stratified 5-fold cross-validation, and the use of class weighting to address imbalance. These additions make the confirmation claim directly evaluable while preserving the abstract's brevity. revision: yes
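The evaluation protocol the response describes (stratified splits, 5-fold cross-validation, class weighting, AUC scoring) could be sketched roughly as below. The data are synthetic stand-ins for the released features; the metric values this produces carry no claim about the paper's reported numbers.

```python
# Hedged sketch of a stratified 5-fold CV protocol with class weighting.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 34))                  # stand-in for QC-34 features
y = (rng.random(3000) < 0.032).astype(int)       # ~3.2% low-quality labels
X[y == 1, :5] += 1.5                             # hypothetical injected signal

# Stratification preserves the rare-class proportion in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(class_weight="balanced", max_iter=1000)
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("per-fold AUC:", np.round(aucs, 3))
```

AUC is a reasonable headline metric here because, unlike plain accuracy, it is not inflated by the 96.8% majority class.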
- Referee: [Dataset construction and label derivation] QC-34 features are explicitly 'derived from quality control tools', while the binary label is also 'derived from automated quality control and domain experts.' If label assignment uses thresholds or rules on the same or linearly related QC metrics that populate QC-34, then high ML accuracy is expected by construction and does not independently validate feature relevance. The paper does not clarify the degree of overlap or provide an ablation that isolates the BL features from this risk.
  Authors: We thank the referee for highlighting this important potential circularity. The binary labels were assigned via a two-stage process: initial automated QC flags followed by independent expert review that incorporated additional criteria (e.g., visual inspection of coverage plots and experimental metadata) not reducible to the QC-34 values. To isolate the contribution of the BL features, we have added an ablation study to the revised manuscript in which models are trained using only the BL representation; these models retain substantial predictive performance. We have also expanded the methods section to explicitly describe the label derivation workflow and the limited direct overlap with QC-34 thresholds. revision: yes
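A feature-family ablation of the kind the authors describe could be sketched as follows: train the same model on each representation alone and on their concatenation, and compare. Column counts, signal placement, and the 5% positive rate are hypothetical stand-ins, not values from the paper.

```python
# Hedged sketch: ablation comparing QC-34-only, BL-only, and combined features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(2)
n = 4000
qc34 = rng.normal(size=(n, 34))    # stand-in for the QC-34 representation
bl = rng.normal(size=(n, 120))     # stand-in for one BL granularity
y = (rng.random(n) < 0.05).astype(int)
qc34[y == 1, :3] += 1.0            # hypothetical signal in each family
bl[y == 1, :10] += 0.8

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)
results = {}
for name, feats in [("QC-34 only", qc34), ("BL only", bl),
                    ("combined", np.hstack([qc34, bl]))]:
    clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                 random_state=2)
    results[name] = cross_val_score(clf, feats, y, cv=cv,
                                    scoring="roc_auc").mean()
    print(f"{name:10s} mean AUC = {results[name]:.3f}")
```

If the BL-only model retains most of the combined model's performance, that is evidence the BL features carry quality signal independent of any QC-34-derived labeling rules.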
Circularity Check
No circularity: empirical dataset release with standard ML validation
full rationale
The paper releases a dataset of NGS samples with QC-34 features (derived from quality control tools) and BL features, plus binary quality labels (from automated QC and domain experts). The central claim is the empirical observation that supervised ML algorithms accurately predict these labels from the features. No equations, parameter fitting, self-citations, or uniqueness theorems are invoked that would reduce the prediction result to the inputs by construction. The validation uses straightforward supervised learning on the released data and does not constitute a derivation chain. This is a normal, non-circular contribution for a dataset paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Quality labels produced by automated QC tools plus domain experts constitute reliable ground truth.
Reference graph
Works this paper leans on
- [1] Haley M. Amemiya, Anshul Kundaje, and Alan P. Boyle. The ENCODE blacklist: identification of problematic regions of the genome. Scientific Reports, 9(1):9354, 2019.
- [2] Heena Satam, Kandarp Joshi, Upasana Mangrolia, Sanober Waghoo, Gulnaz Zaidi, Shravani Rawool, Ritesh P. Thakare, Shahid Banday, Alok K. Mishra, Gautam Das, et al. Next-generation sequencing technology: current trends and advancements. Biology, 12(7):997, 2023.
- [3] Timothé Ménard, Alaina Barros, and Christopher Ganter. Clinical quality considerations when using next-generation sequencing (NGS) in clinical drug development. Therapeutic Innovation & Regulatory Science, 55(5):1066–1074, 2021.
- [4] Margaret A. Taub, Hector Corrada Bravo, and Rafael A. Irizarry. Overcoming bias and systematic errors in next generation sequencing data. Genome Medicine, 2:1–5, 2010.
- [5] Maximilian Sprang, Jannik Möllmann, Miguel A. Andrade-Navarro, and Jean-Fred Fontaine. Overlooked poor-quality patient samples in sequencing data impair reproducibility of published clinically relevant datasets. Genome Biology, 25(1):222, 2024.
- [6] ENCODE Project Consortium. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology, 9(4):e1001046, 2011. doi: 10.1371/journal.pbio.1001046.
- [7] Sequencing Quality Control Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology, 32(9):903–914, 2014.
- [8] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1):160018, 2016.
- [9] Jennifer L. Harrow, Rachel Drysdale, Andrew Smith, Susanna Repo, Jerry Lanfear, and Niklas Blomberg. ELIXIR: providing a sustainable infrastructure for life science data at European scale. Bioinformatics, 37(14):2506–2511, 2021.
- [10] Steffen Albrecht, Maximilian Sprang, Miguel A. Andrade-Navarro, and Jean-Fred Fontaine. seqQscorer: automated quality control of next-generation sequencing data using machine learning. Genome Biology, 22(1):75, December 2021. doi: 10.1186/s13059-021-02294-2.
- [11] Jiajin Li, Brandon Jew, Lingyu Zhan, Sungoo Hwang, Giovanni Coppola, Nelson B. Freimer, and Jae Hoon Sul. ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest. PLoS Computational Biology, 15(12), 2019.
- [12] Yunhai Luo, Benjamin C. Hitz, Idan Gabdank, Jason A. Hilton, Meenakshi S. Kagda, Bonita Lam, Zachary A. Myers, Paul Sud, Jennifer Jou, Khine Lin, Ulugbek K. Baymuradov, Keenan Graham, Casey Litton, Stuart R. Miyasato, J. Seth Strattan, Otto Jolanki, Jin-Wook Lee, Forrest Tanaka, Philip Adenekan, Emma O'Neill, and J. Michael Cherry. New developments on the …, 2020.
- [13] ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57, 2012.
- [14] Rongbin Zheng, Changxin Wan, Shenglin Mei, Qian Qin, Qiu Wu, Hanfei Sun, Chen-Hao Chen, Myles Brown, Xiaoyan Zhang, Clifford A. Meyer, and Xiaole Shirley Liu. Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Research, 47(Database issue):D729–D735, 2019.
- [15] Soumya Suvra Ghosal, Yiyou Sun, and Yixuan Li. How to overcome curse-of-dimensionality for out-of-distribution detection? In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, …
- [16] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.
- [17] Peter J. Park. ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10):669–680, 2009.
- [18] Lingyun Song and Gregory E. Crawford. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harbor Protocols, 2010(2):pdb.prot5384, 2010.
- [19] Eric L. Van Nostrand, Gabriel A. Pratt, Alexander A. Shishkin, Chelsea Gelboin-Burkhart, Mark Y. Fang, Balaji Sundararaman, Steven M. Blue, Thai B. Nguyen, Christine Surka, Keri Elkins, et al. Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nature Methods, 13(6):508–514, 2016.
- [20] Peter J. A. Cock, Christopher J. Fields, Naohisa Goto, Michael L. Heuer, and Peter M. Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 38(6):1767–1771, 2010.
- [21] Brent Ewing, LaDeana Hillier, Michael C. Wendl, and Phil Green. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research, 8(3):175–185, 1998.
- [22] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor T. Marth, Gonçalo R. Abecasis, and Richard Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, 2009.
- [23] Simon Andrews. FastQC: a quality control tool for high throughput sequence data. Available online at http://www.bioinformatics.babraham.ac.uk/projects/fastqc, 2010. Accessed: 2025-04-14.
- [24] Ben Langmead and Steven L. Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357–359, 2012.
- [25] Maximilian Sprang, Matteo Krüger, Miguel A. Andrade-Navarro, and Jean-Fred Fontaine. Statistical guidelines for quality control of next-generation sequencing techniques. Life Science Alliance, 4(11):e202101113, November 2021. doi: 10.26508/lsa.202101113.
- [27] Guangchuang Yu, Li-Gen Wang, and Qing-Yu He. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics, 31(14):2382–2383, 2015.
- [28] Lihua Julie Zhu, Claude Gazin, Nathan D. Lawson, Hervé Pagès, Simon M. Lin, David S. Lapointe, and Michael R. Green. ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data. BMC Bioinformatics, 11:237, 2010.
- [29] Rory Stark, Gordon Brown, et al. DiffBind: differential binding analysis of ChIP-seq peak data. R package version, 100(4.3):2–21, 2011.
- [30] Yong Zhang, Tao Liu, Clifford A. Meyer, Jérôme Eeckhoute, David S. Johnson, Bradley E. Bernstein, Chad Nusbaum, Richard M. Myers, Myles Brown, Wei Li, et al. Model-based analysis of ChIP-seq (MACS). Genome Biology, 9:1–9, 2008.
- [31] Steffen Albrecht, Clarissa Krämer, Philipp Röchner, Johannes U. Mayer, Franz Rothlauf, Miguel A. Andrade-Navarro, and Maximilian Sprang. Integrating the ENCODE blocklist for machine learning quality control of ChIP-seq with seqQscorer. bioRxiv, 2025. doi: 10.1101/2025.05.12.653555.
- [32] ENCODE Project Consortium. The ENCODE (Encyclopedia of DNA Elements) project. Science, 306(5696):636–640, 2004. doi: 10.1126/science.1105136.
- [33] Meenakshi S. Kagda, Bonita Lam, Casey Litton, Corinn Small, Cricket A. Sloan, Emma Spragins, Forrest Tanaka, Ian Whaling, Idan Gabdank, Ingrid Youngworth, J. Seth Strattan, Jason Hilton, Jennifer Jou, Jessica Au, Jin-Wook Lee, Kalina Andreeva, Keenan Graham, Khine Lin, Matt Simison, Otto Jolanki, Paul Sud, Pedro Assis, Philip Adenekan, Eric Douglas, Ming…, 2023.
- [34] Jennifer Jou, Idan Gabdank, Yunhai Luo, Khine Lin, Paul Sud, Zachary Myers, Jason A. Hilton, Meenakshi S. Kagda, Bonita Lam, Emma O'Neill, Philip Adenekan, Keenan Graham, Ulugbek K. Baymuradov, Stuart R. Miyasato, J. Seth Strattan, Otto Jolanki, Jin-Wook Lee, Casey Litton, Forrest Y. Tanaka, Benjamin C. Hitz, and J. Michael Cherry. The ENCODE portal as …
- [35] Cricket A. Sloan, Esther T. Chan, Jean M. Davidson, Venkat S. Malladi, J. Seth Strattan, Benjamin C. Hitz, Idan Gabdank, Aditi K. Narayanan, Marcus Ho, Brian T. Lee, Laurence D. Rowe, Timothy R. Dreszer, Greg Roe, Nikhil R. Podduturi, Forrest Tanaka, Eurie L. Hong, and J. Michael Cherry. ENCODE data at the ENCODE portal. Nucleic Acids Research, 44(Database-Is…, 2016.
- [36] The ENCODE Project Consortium, Michael P. Snyder, Thomas R. Gingeras, Jill E. Moore, Zhiping Weng, Mark B. Gerstein, Bing Ren, Ross C. Hardison, John A. Stamatoyannopoulos, Brenton R. Graveley, et al. Perspectives on ENCODE. Nature, 583(7818):693–698, July 2020. doi: 10.1038/s41586-020-2449-8.
- [38] Aaron R. Quinlan and Ira M. Hall. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842, 2010.
- [39] Gerardo Perez, Galt P. Barber, Anna Benet-Pages, Jonathan Casper, Hiram Clawson, Mark Diekhans, Clay Fischer, Jairo Navarro Gonzalez, Angie S. Hinrichs, Christopher M. Lee, et al. The UCSC Genome Browser database: 2025 update. Nucleic Acids Research, 53(D1):D1243–D1249, 2025.
- [40] Tao Liu, Jorge A. Ortiz, Len Taing, Clifford A. Meyer, Bernett Lee, Yong Zhang, Hyunjin Shin, Swee S. Wong, Jian Ma, Ying Lei, Utz J. Pape, Michael Poidinger, Yiwen Chen, Kevin Yeung, Myles Brown, Yaron Turpaz, and X. Shirley Liu. Cistrome: an integrative platform for transcriptional regulation studies. Genome Biology, 12(8):R83, August 2011.
- [41] Patrick E. McKnight and Julius Najab. Mann-Whitney U Test. John Wiley & Sons, Ltd, 2010. doi: 10.1002/9780470479216.corpsy0524.
- [42] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979. URL http://www.jstor.org/stable/4615733.
- [43] Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
- [44] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics. Springer, 2009.
- [45]
- [46] Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
- [47] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- [48] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake VanderPlas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [49] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.