PyPeakRankR: Reproducible Peak-Level Feature Extraction for Regulatory Element Ranking

Jeremy A. Miller; Nelson J. Johansen; Saroja Somasundaram; Trygve E. Bakken

arxiv: 2606.18179 · v1 · pith:DPDX2S5Qnew · submitted 2026-06-16 · 🧬 q-bio.GN

Saroja Somasundaram, Nelson J. Johansen, Trygve E. Bakken, Jeremy A. Miller

Pith reviewed 2026-06-26 21:29 UTC · model grok-4.3

classification 🧬 q-bio.GN

keywords: peak feature extraction, regulatory elements, chromatin accessibility, ATAC-seq, reproducibility, Python package, enhancer ranking, BigWig signal

The pith

PyPeakRankR extracts peak features into a reproducible TSV matrix to separate extraction from ranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PyPeakRankR extracts a collection of peak-level features from chromatin accessibility data into one standardized matrix. The features cover signal strength, sequence composition, evolutionary conservation, distributional shape, and cell-type specificity. The matrix is written as a TSV file by a deterministic extraction step that is kept separate from any ranking method, so different prioritization strategies can be compared fairly on the same extracted data.

Core claim

PyPeakRankR assembles BigWig signal summaries, GC content, PhyloP conservation scores, distribution moments (kurtosis, skewness, bimodality), and cell-type specificity rankings into a single reproducible peak-by-feature matrix stored as a TSV file, separating deterministic feature extraction from downstream ranking to enable transparent benchmarking of prioritization strategies on the same upstream data.
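
To make two of the listed feature families concrete, here is a minimal sketch of GC content and the shape moments computed from a per-base signal. It is not PyPeakRankR's code: the function names are hypothetical, the moments are plain population moments, and the bimodality value follows the form of Sarle's coefficient, which is an assumption because the abstract does not say which bimodality measure the package uses.

```python
import math


def gc_content(seq):
    """Fraction of G/C among unambiguous A/C/G/T bases (N and others ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else math.nan


def shape_moments(signal):
    """Skewness, excess kurtosis, and a Sarle-style bimodality coefficient.

    Assumption: population (biased) moments are used for simplicity; the
    package may use sample-corrected estimators instead.
    """
    n = len(signal)
    nan_result = {"skewness": math.nan, "kurtosis": math.nan, "bimodality": math.nan}
    if n < 4:
        return nan_result
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n
    if m2 == 0:
        # A perfectly flat signal has no defined shape moments.
        return nan_result
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3.0  # excess kurtosis
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    bimodality = (skew ** 2 + 1) / (kurt + correction)
    return {"skewness": skew, "kurtosis": kurt, "bimodality": bimodality}
```

In practice the signal vector would come from a BigWig file over the peak interval and the sequence from a reference FASTA; both are left out here to keep the sketch dependency-free.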

What carries the argument

The peak-by-feature TSV matrix that aggregates BigWig summaries, GC content, PhyloP scores, distribution moments, and cell-type specificity rankings for each peak.
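
A peak-by-feature TSV of this kind can be written reproducibly with the standard library alone. The sketch below assumes a hypothetical column layout (the real column names are not given in the abstract) and shows three choices that help byte-level reproducibility: a fixed column order, a fixed row order, and fixed numeric formatting.

```python
import csv
import io

# Hypothetical column layout; the package's actual header may differ.
COLUMNS = ["peak_id", "chrom", "start", "end", "gc_content", "mean_signal"]
FLOAT_COLUMNS = ("gc_content", "mean_signal")


def write_feature_matrix(rows, handle):
    """Write one row per peak with stable column order, row order, and formatting."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    # Sort by coordinates so the output does not depend on input order.
    # Chromosome names are compared as plain strings here.
    for row in sorted(rows, key=lambda r: (r["chrom"], r["start"], r["end"])):
        out = dict(row)
        for key in FLOAT_COLUMNS:
            out[key] = f"{row[key]:.6g}"
        writer.writerow(out)


def to_tsv(rows):
    """Return the matrix as a string, convenient for tests and hashing."""
    buffer = io.StringIO()
    write_feature_matrix(rows, buffer)
    return buffer.getvalue()
```

Fixing `lineterminator` avoids platform-dependent line endings, which would otherwise change the file bytes without changing any feature value.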

If this is right

Allows transparent benchmarking of prioritization strategies on the same upstream data.
Supports cross-assembly scoring via liftOver.
Processes thousands of peaks in minutes through command-line interface or Python API.
Validated in the BICCN challenge where its predecessor ranked in the top 3 of 16 methods.
Used in the Cross-species Enhancer Ranking Pipeline to identify enhancers with greater than 70 percent on-target specificity.
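
The benchmarking point at the top of this list can be sketched as follows: two ranking strategies read the same extracted matrix, so any disagreement between them comes from the ranking logic alone. The column names, the toy values, and both scoring rules are invented for illustration and are not taken from the paper.

```python
import csv
import io

# Toy matrix; column names and values are invented for illustration.
MATRIX_TSV = (
    "peak_id\tmean_signal\tphylop\tspecificity\n"
    "p1\t9.0\t0.2\t0.9\n"
    "p2\t7.0\t1.5\t0.3\n"
    "p3\t3.0\t2.0\t0.8\n"
    "p4\t1.0\t0.1\t0.1\n"
)


def load_matrix(text):
    """Parse the TSV into dicts, converting every feature column to float."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {k: (v if k == "peak_id" else float(v)) for k, v in row.items()}
        for row in reader
    ]


def rank_peaks(rows, score):
    """Order peak IDs by descending score, breaking ties by peak ID."""
    ordered = sorted(rows, key=lambda r: (-score(r), r["peak_id"]))
    return [r["peak_id"] for r in ordered]


def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k peaks shared by two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


rows = load_matrix(MATRIX_TSV)
by_signal = rank_peaks(rows, lambda r: r["mean_signal"])
by_cons_spec = rank_peaks(rows, lambda r: r["phylop"] * r["specificity"])
print(by_signal, by_cons_spec, top_k_overlap(by_signal, by_cons_spec, 2))
```

Because both strategies consume the identical file, a comparison like the top-k overlap isolates the effect of the scoring rule from any differences in upstream processing.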

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of the fixed TSV format could reduce variability when independent groups analyze the same peak sets.
The matrix could serve as direct input to machine learning models for testing new ranking algorithms.
Similar extraction logic might apply to data from other assays such as ChIP-seq without major changes.
Future work could compare this feature set against alternative combinations to test sufficiency.

Load-bearing premise

That the chosen set of features is the right one to support transparent benchmarking and that separating extraction from ranking will improve reproducibility across studies.

What would settle it

Running PyPeakRankR on the same input peaks and files twice and obtaining TSV matrices with any differing feature values.
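
A check of this kind is easy to automate. The sketch below compares two output files by SHA-256 digest and, if they differ, reports the first differing line. How the two TSVs are produced is left open, since the package's command-line options are not described here; the helper names are hypothetical.

```python
import hashlib
from itertools import zip_longest


def tsv_digest(data):
    """SHA-256 hex digest of the raw TSV bytes."""
    return hashlib.sha256(data).hexdigest()


def first_difference(text_a, text_b):
    """Return (line_number, line_a, line_b) for the first mismatch, else None.

    A missing line in the shorter file is reported as None.
    """
    pairs = zip_longest(text_a.splitlines(), text_b.splitlines())
    for number, (line_a, line_b) in enumerate(pairs, start=1):
        if line_a != line_b:
            return number, line_a, line_b
    return None


def runs_match(text_a, text_b):
    """True when two runs produced byte-identical UTF-8 output."""
    return tsv_digest(text_a.encode("utf-8")) == tsv_digest(text_b.encode("utf-8"))
```

Comparing digests tests byte-identity, which is stricter than numerical equality: a change in float formatting or row order would fail this check even if every feature value agreed.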

Original abstract

High-throughput chromatin accessibility assays such as ATAC-seq generate thousands of candidate regulatory elements (peaks), yet no standardized tool exists for assembling the diverse quantitative features needed to prioritize peaks for functional validation. Here we present PyPeakRankR, an open-source Python package that extracts peak-level features, namely BigWig signal summaries, GC content, PhyloP conservation scores, distribution moments (kurtosis, skewness, bimodality), and cell-type specificity rankings, into a single reproducible peak-by-feature matrix stored as a tab-separated values (TSV) file. PyPeakRankR separates deterministic feature extraction from downstream ranking, enabling transparent benchmarking of prioritization strategies on the same upstream data. The package provides both a command-line interface and a matching Python API, supports cross-assembly scoring via liftOver, and runs in minutes on thousands of peaks. PyPeakRankR was validated in the Brain Initiative Cell Census Network (BICCN) community challenge, where its predecessor PeakRankR ranked among the top 3 of 16 methods for cell-type-specific enhancer prediction. In a recent basal ganglia study, PyPeakRankR was used within the Cross-species Enhancer Ranking Pipeline (CERP) to identify enhancer-AAV tools achieving greater than 70% on-target specificity across cell types. PyPeakRankR is freely available under the MIT license at https://github.com/AllenInstitute/PeakRankR/tree/python-package.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PyPeakRankR is a clean Python packaging of standard peak features with CLI and liftOver support, but adds no new biology or methods.

PyPeakRankR takes the usual peak features—BigWig signal stats, GC content, PhyloP scores, kurtosis/skewness/bimodality, and cell-type specificity—and writes them out as a deterministic TSV matrix. It also supplies a CLI, Python API, and cross-assembly liftOver so the extraction step can be run the same way across studies.

The useful part is the separation of feature extraction from whatever ranking comes next. That design choice makes it easier to benchmark different prioritization approaches on identical upstream data, and the package is reported to run in minutes on thousands of peaks. The predecessor PeakRankR placed in the top three of 16 methods in the BICCN challenge, and PyPeakRankR was used within the CERP pipeline for enhancer-AAV selection, which gives some external indication that the feature set is practical.

The paper itself does not introduce new features, new derivations, or fresh benchmarks. It is a software implementation announcement that relies on prior validation of the R version. There is no error analysis or sensitivity testing reported here, so users will still need to inspect the GitHub code for edge cases in the moment calculations or specificity scoring.

This is aimed at labs that already do peak calling on ATAC-seq or similar data and want a reproducible first step before applying their own ranking logic. It is the kind of methods contribution that can reduce workflow variation, so it is worth sending to peer review even though the scientific novelty is low.

Referee Report

0 major / 3 minor

Summary. The manuscript presents PyPeakRankR, an open-source Python package that extracts a set of peak-level features, including BigWig signal summaries, GC content, PhyloP conservation scores, distribution moments (kurtosis, skewness, bimodality), and cell-type specificity rankings, from chromatin accessibility peaks into a single reproducible peak-by-feature matrix stored as a TSV file. The package provides both a command-line interface and Python API, supports liftOver for cross-assembly scoring, and is designed to separate deterministic feature extraction from downstream ranking to facilitate benchmarking. It reports that the predecessor PeakRankR ranked in the top 3 of 16 methods in the BICCN community challenge and that PyPeakRankR was used within the CERP pipeline in a basal ganglia study to identify enhancer-AAV tools achieving >70% on-target specificity.

Significance. If the implementation faithfully reproduces the listed feature calculations and runs deterministically as described, the work supplies a practical, standardized tool that addresses the absence of a common pipeline for assembling quantitative peak annotations in regulatory element prioritization. The explicit separation of feature extraction from ranking, combined with open-source release under the MIT license and dual CLI/API access, is a clear strength that directly supports reproducible and comparable downstream analyses across studies. Credit is due for the public GitHub availability and the reported external validations, even though they pertain to the predecessor package.

minor comments (3)

[Abstract] The list of features is presented clearly, but the manuscript would benefit from an explicit table or enumerated list (perhaps in a dedicated Methods or Features section) that states the precise computation for each feature, including any window sizes, normalization steps, or external data sources required (e.g., which PhyloP bigWig files).
[Abstract] The validation statements refer exclusively to the predecessor PeakRankR; the manuscript should clarify whether the Python re-implementation was subjected to any unit tests or regression checks against the original R version to confirm numerical equivalence of the extracted features.
The claim that the package 'runs in minutes on thousands of peaks' is useful but would be strengthened by a brief timing table or paragraph reporting wall-clock times, peak counts, and hardware used, even if only as a supplementary note.
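
The regression check requested in the second comment could take roughly the following form: load a reference matrix (for example, one exported from the R predecessor) and the Python output, then compare every numeric cell within a tolerance. This is a sketch under assumptions: the `peak_id` key column, the helper names, and the tolerance values are hypothetical, and all non-key columns are assumed to be numeric.

```python
import csv
import io
import math

KEY = "peak_id"  # hypothetical name of the identifier column


def read_tsv(text):
    """Index TSV rows by peak identifier."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[KEY]: row for row in reader}


def compare_matrices(ref_text, new_text, rel_tol=1e-9, abs_tol=1e-12):
    """List (peak_id, column, detail) for every cell that disagrees."""
    ref, new = read_tsv(ref_text), read_tsv(new_text)
    mismatches = []
    for peak_id in sorted(set(ref) | set(new)):
        if peak_id not in ref or peak_id not in new:
            mismatches.append((peak_id, None, "missing in one matrix"))
            continue
        for column, ref_value in ref[peak_id].items():
            if column == KEY:
                continue
            new_value = new[peak_id].get(column)
            if new_value is None:
                mismatches.append((peak_id, column, "column missing"))
            elif not math.isclose(
                float(ref_value), float(new_value), rel_tol=rel_tol, abs_tol=abs_tol
            ):
                mismatches.append((peak_id, column, f"{ref_value} vs {new_value}"))
    return mismatches
```

A tolerance-based comparison is used rather than a byte comparison because two implementations in different languages can legitimately differ in the last floating-point digits while still agreeing numerically.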

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately captures the purpose and strengths of PyPeakRankR. No major comments were provided in the report, so we have no specific points requiring point-by-point rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity

The paper describes a software package for deterministic extraction of existing peak-level features (BigWig summaries, GC content, PhyloP, distribution moments, cell-type specificity) into a TSV matrix, with separation from downstream ranking. No equations, derivations, fitted parameters, or predictions appear in the text. The central claim is purely implementational and descriptive; validation references to prior external challenges and a predecessor package are not load-bearing for any derivation. The work is self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities are involved; this is a description of a data-processing software tool.

Reference graph

Works this paper leans on

11 extracted references · 9 canonical work pages

[1] Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y., & Greenleaf, W. J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods, 10(12), 1213–1218. https://doi.org/10.1038/nmeth.2688

[2] Granja, J. M., Corces, M. R., Pierce, S. E., et al. (2021). ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nature Genetics, 53(3), 403–411. https://doi.org/10.1038/s41588-021-00790-6

[3] Harris, C. R., Millman, K. J., van der Walt, S. J., et al. (2020). Array programming with NumPy. Nature, 585(7825), 357–367. https://doi.org/10.1038/s41586-020-2649-2

[4] Johansen, N. J., Kempynck, N., Zemke, N. R., Somasundaram, S., De Winter, S., et al. (2025). Evaluating methods for the prediction of cell-type-specific enhancers in the mammalian cortex. Cell Genomics, 5(6), 100879. https://doi.org/10.1016/j.xgen.2025.100879

[5] Lu, Y., Qu, W., Shan, G., & Zhang, C. (2015). DELTA: A distal enhancer locating tool based on AdaBoost algorithm and shape features of chromatin modifications. PLoS ONE, 10(6), e0130622. https://doi.org/10.1371/journal.pone.0130622

[6] McLean, C. Y., Bristor, D., Hiller, M., et al. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature Biotechnology, 28(5), 495–501. https://doi.org/10.1038/nbt.1630

[7] Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R., & Siepel, A. (2010). Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research, 20(1), 110–121. https://doi.org/10.1101/gr.097857.109
Ramírez, F., & Diehl, S. (2020). pyBigWig: A Python extension for reading BigWig files. https://github.com/deeptools/pyBigWig
Ramírez, F., Ryan, D. ...

[8] Shirley, M. D., Ma, Z., Pedersen, B. S., & Wheelan, S. J. (2015). Efficient “Pythonic” access to FASTA files using pyfaidx. https://doi.org/10.7287/peerj.preprints.970v1
pandas development team. (2020). pandas-dev/pandas: Pandas. https://doi.org/10.5281/zenodo.3509134

[9] Virtanen, P., Gommers, R., Oliphant, T. E., et al. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17(3), 261–272. https://doi.org/10.1038/s41592-019-0686-2

[10] Wirthlin, M. E., Hunker, A. C., Somasundaram, S., et al. (2026). A cross-species enhancer-AAV toolkit for cell type-specific targeting across the basal ganglia. bioRxiv, ahead of print. https://doi.org/10.64898/2026.02.23.706695

[11] Zhang, Y., Liu, T., Meyer, C. A., et al. (2008). Model-based analysis of ChIP-seq (MACS). Genome Biology, 9(9), R137. https://doi.org/10.1186/gb-2008-9-9-r137
