PyPeakRankR: Reproducible Peak-Level Feature Extraction for Regulatory Element Ranking
Pith reviewed 2026-06-26 21:29 UTC · model grok-4.3
The pith
PyPeakRankR extracts peak features into a reproducible TSV matrix to separate extraction from ranking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PyPeakRankR assembles BigWig signal summaries, GC content, PhyloP conservation scores, distribution moments (kurtosis, skewness, bimodality), and cell-type specificity rankings into a single reproducible peak-by-feature matrix stored as a TSV file, separating deterministic feature extraction from downstream ranking to enable transparent benchmarking of prioritization strategies on the same upstream data.
What carries the argument
The peak-by-feature TSV matrix that aggregates BigWig summaries, GC content, PhyloP scores, distribution moments, and cell-type specificity rankings for each peak.
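To make the idea of a peak-by-feature matrix concrete, here is a minimal sketch using only the Python standard library. It is not PyPeakRankR's implementation: the function names, column names, and toy inputs are hypothetical, and the formulas (population moments, with Sarle's bimodality coefficient taken as (skewness² + 1) / kurtosis) are assumptions, since the summary above does not state how the package computes each feature. Real inputs would come from BigWig and FASTA files rather than inline lists.

```python
import csv
import io
import math

def gc_content(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); NaN if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else float("nan")

def moments(values):
    """Population skewness, non-excess kurtosis, and bimodality coefficient."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:  # flat signal: the standardized moments are undefined
        return float("nan"), float("nan"), float("nan")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    # Assumed definition: Sarle's coefficient without small-sample correction.
    bimodality = (skew ** 2 + 1) / kurt
    return skew, kurt, bimodality

def build_row(peak_id, seq, signal):
    """One matrix row: a peak identifier plus its (hypothetical) feature columns."""
    skew, kurt, bc = moments(signal)
    return {
        "peak_id": peak_id,
        "gc_content": gc_content(seq),
        "signal_mean": sum(signal) / len(signal),
        "signal_max": max(signal),
        "skewness": skew,
        "kurtosis": kurt,
        "bimodality": bc,
    }

def write_matrix(rows, handle):
    """Write rows as TSV with a fixed column order and fixed float formatting."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({k: (f"{v:.6g}" if isinstance(v, float) else v)
                         for k, v in row.items()})

# Toy peaks: a sequence and a per-base signal track for each.
rows = [
    build_row("chr1:100-108", "ACGTGGCC", [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 1.0, 0.0]),
    build_row("chr1:200-208", "ATATATGC", [2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0]),
]
buffer = io.StringIO()
write_matrix(rows, buffer)
matrix_tsv = buffer.getvalue()
```

Fixing the column order and the float format at write time is one simple way to keep repeated runs byte-identical, which matters for the reproducibility claim.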
If this is right
- Allows transparent benchmarking of prioritization strategies on the same upstream data.
- Supports cross-assembly scoring via liftOver.
- Processes thousands of peaks in minutes through command-line interface or Python API.
- Its predecessor, PeakRankR, ranked among the top 3 of 16 methods in the BICCN community challenge, which the authors cite as validation.
- Used in the Cross-species Enhancer Ranking Pipeline to identify enhancers with greater than 70 percent on-target specificity.
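The benchmarking point in the list above can be illustrated with a short sketch: two ranking strategies read the same fixed matrix, so any difference in output comes from the ranking step alone. The column names, the toy values, and both strategies are hypothetical and are not taken from the paper or from PyPeakRankR.

```python
import csv
import io

# Toy peak-by-feature matrix; column names are assumptions for illustration.
MATRIX_TSV = (
    "peak_id\tsignal_mean\tphylop_mean\tspecificity\n"
    "peak_a\t5.0\t0.2\t0.9\n"
    "peak_b\t9.0\t0.1\t0.3\n"
    "peak_c\t2.0\t1.5\t0.6\n"
)

def load_matrix(handle):
    """Parse the TSV into dicts, converting every feature column to float."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{k: (v if k == "peak_id" else float(v)) for k, v in row.items()}
            for row in reader]

def rank_by_signal(rows):
    """Strategy 1: order peaks by mean signal, highest first."""
    ordered = sorted(rows, key=lambda r: r["signal_mean"], reverse=True)
    return [r["peak_id"] for r in ordered]

def rank_by_mean_rank(rows, features):
    """Strategy 2: average each peak's per-feature rank (1 = best); lower wins."""
    totals = {r["peak_id"]: 0.0 for r in rows}
    for feature in features:
        ordered = sorted(rows, key=lambda r: r[feature], reverse=True)
        for position, r in enumerate(ordered, start=1):
            totals[r["peak_id"]] += position
    # Break ties on peak_id so the output order is deterministic.
    return sorted(totals, key=lambda pid: (totals[pid] / len(features), pid))

rows = load_matrix(io.StringIO(MATRIX_TSV))
by_signal = rank_by_signal(rows)
by_mean_rank = rank_by_mean_rank(rows, ["signal_mean", "phylop_mean", "specificity"])
```

On this toy input the two strategies disagree about the top peak (`peak_b` by signal alone, `peak_a` by mean rank), which is the kind of disagreement a shared upstream matrix makes easy to attribute.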
Where Pith is reading between the lines
- Widespread use of the fixed TSV format could reduce variability when independent groups analyze the same peak sets.
- The matrix could serve as direct input to machine learning models for testing new ranking algorithms.
- Similar extraction logic might apply to data from other assays such as ChIP-seq without major changes.
- Future work could compare this feature set against alternative combinations to test sufficiency.
Load-bearing premise
That the chosen set of features is the right one to support transparent benchmarking and that separating extraction from ranking will improve reproducibility across studies.
What would settle it
Running PyPeakRankR twice on the same input peaks and files and obtaining TSV matrices with any differing feature values would refute the reproducibility claim.
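A check of this kind could be sketched as follows. The `extract_features` function is a hypothetical stand-in for a deterministic extraction step (it only computes peak lengths) and does not call PyPeakRankR; in practice the two TSV strings would come from two real runs on identical inputs.

```python
import hashlib

def extract_features(peaks):
    """Hypothetical stand-in for a deterministic extraction step; returns TSV text."""
    lines = ["peak_id\tlength"]
    # Sorting the peaks makes the output independent of input order.
    for chrom, start, end in sorted(peaks):
        lines.append(f"{chrom}:{start}-{end}\t{end - start}")
    return "\n".join(lines) + "\n"

def sha256_of(text):
    """Hex digest of the UTF-8 encoded text, used as a cheap equality check."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def first_difference(tsv_a, tsv_b):
    """Return (line_number, line_a, line_b) for the first differing line, or None."""
    lines_a, lines_b = tsv_a.splitlines(), tsv_b.splitlines()
    for i in range(max(len(lines_a), len(lines_b))):
        a = lines_a[i] if i < len(lines_a) else None
        b = lines_b[i] if i < len(lines_b) else None
        if a != b:
            return i + 1, a, b
    return None

peaks = [("chr2", 500, 900), ("chr1", 100, 350)]
run_1 = extract_features(peaks)
run_2 = extract_features(list(reversed(peaks)))  # same peaks, different input order
identical = sha256_of(run_1) == sha256_of(run_2)
```

Comparing digests answers the yes/no question quickly; `first_difference` is there to locate the offending row when the digests do not match.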
Original abstract
High-throughput chromatin accessibility assays such as ATAC-seq generate thousands of candidate regulatory elements (peaks), yet no standardized tool exists for assembling the diverse quantitative features needed to prioritize peaks for functional validation. Here we present PyPeakRankR, an open-source Python package that extracts peak-level features, namely BigWig signal summaries, GC content, PhyloP conservation scores, distribution moments (kurtosis, skewness, bimodality), and cell-type specificity rankings, into a single reproducible peak by feature matrix stored as a tab-separated values (TSV) file. PyPeakRankR separates deterministic feature extraction from downstream ranking, enabling transparent benchmarking of prioritization strategies on the same upstream data. The package provides both a command-line interface and a matching Python API, supports cross-assembly scoring via liftOver, and runs in minutes on thousands of peaks. PyPeakRankR was validated in the Brain Initiative Cell Census Network (BICCN) community challenge, where its predecessor PeakRankR ranked among the top 3 of 16 methods for cell-type specific enhancer prediction. In a recent basal ganglia study, PyPeakRankR was used within the Cross-species Enhancer Ranking Pipeline (CERP) to identify enhancer-AAV tools achieving greater than 70% on-target specificity across cell types. PyPeakRankR is freely available under the MIT license at https://github.com/AllenInstitute/PeakRankR/tree/python-package.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents PyPeakRankR, an open-source Python package that extracts a set of peak-level features—including BigWig signal summaries, GC content, PhyloP conservation scores, distribution moments (kurtosis, skewness, bimodality), and cell-type specificity rankings—from chromatin accessibility peaks into a single reproducible peak-by-feature matrix stored as a TSV file. The package provides both a command-line interface and Python API, supports liftOver for cross-assembly scoring, and is designed to separate deterministic feature extraction from downstream ranking to facilitate benchmarking. It reports that the predecessor PeakRankR ranked in the top 3 of 16 methods in the BICCN community challenge and was used in the CERP pipeline for a basal ganglia study achieving >70% on-target specificity.
Significance. If the implementation faithfully reproduces the listed feature calculations and runs deterministically as described, the work supplies a practical, standardized tool that addresses the absence of a common pipeline for assembling quantitative peak annotations in regulatory element prioritization. The explicit separation of feature extraction from ranking, combined with open-source release under the MIT license and dual CLI/API access, is a clear strength that directly supports reproducible and comparable downstream analyses across studies. Credit is due for the public GitHub availability and the reported external validations, even though they pertain to the predecessor package.
minor comments (3)
- [Abstract] The list of features is presented clearly, but the manuscript would benefit from an explicit table or enumerated list (perhaps in a dedicated Methods or Features section) that states the precise computation for each feature, including any window sizes, normalization steps, and required external data sources (e.g., which PhyloP bigWig files).
- [Abstract] The validation statements refer exclusively to the predecessor PeakRankR; the manuscript should clarify whether the Python re-implementation was subjected to any unit tests or regression checks against the original R version to confirm numerical equivalence of the extracted features.
- The claim that the package 'runs in minutes on thousands of peaks' is useful but would be strengthened by a brief timing table or paragraph reporting wall-clock times, peak counts, and hardware used, even if only as a supplementary note.
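The regression check suggested in the second comment could look roughly like the sketch below, which compares two feature matrices cell by cell within a numerical tolerance. The file contents, column names, and tolerance values are illustrative assumptions; nothing here reflects tests that the authors report having run.

```python
import csv
import io
import math

def read_tsv(text):
    """Index TSV rows by their peak_id column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["peak_id"]: row for row in reader}

def values_match(a, b, rel_tol, abs_tol):
    """Compare two cells numerically when possible, else as strings."""
    if a is None or b is None:
        return a is b
    try:
        x, y = float(a), float(b)
    except ValueError:
        return a == b
    if math.isnan(x) or math.isnan(y):
        return math.isnan(x) and math.isnan(y)  # treat NaN vs NaN as agreement
    return math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)

def compare_matrices(reference_tsv, candidate_tsv, rel_tol=1e-9, abs_tol=1e-12):
    """List (peak_id, column, reference, candidate) for every disagreeing cell."""
    ref, cand = read_tsv(reference_tsv), read_tsv(candidate_tsv)
    mismatches = []
    for peak_id in sorted(set(ref) | set(cand)):
        if peak_id not in ref or peak_id not in cand:
            # Peak present in only one matrix: report presence flags instead of values.
            mismatches.append((peak_id, None, peak_id in ref, peak_id in cand))
            continue
        for column, ref_value in ref[peak_id].items():
            if column == "peak_id":
                continue
            cand_value = cand[peak_id].get(column)
            if not values_match(ref_value, cand_value, rel_tol, abs_tol):
                mismatches.append((peak_id, column, ref_value, cand_value))
    return mismatches

# Hypothetical outputs from a reference (e.g. R) run and a Python run.
REFERENCE_TSV = ("peak_id\tgc_content\tskewness\n"
                 "peak_a\t0.75\t0.1234567890\n"
                 "peak_b\t0.25\tnan\n")
CANDIDATE_TSV = ("peak_id\tgc_content\tskewness\n"
                 "peak_a\t0.7500000000001\t0.123456789\n"
                 "peak_b\t0.26\tnan\n")
mismatches = compare_matrices(REFERENCE_TSV, CANDIDATE_TSV)
```

A tolerance-based comparison is the appropriate tool across implementations, where floating-point results can legitimately differ in the last digits; a byte-level comparison is better suited to repeated runs of the same implementation.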
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately captures the purpose and strengths of PyPeakRankR. No major comments were provided in the report, so we have no specific points requiring point-by-point rebuttal or revision at this stage.
Circularity Check
No significant circularity
The paper describes a software package for deterministic extraction of existing peak-level features (BigWig summaries, GC content, PhyloP, distribution moments, cell-type specificity) into a TSV matrix, with separation from downstream ranking. No equations, derivations, fitted parameters, or predictions appear in the text. The central claim is purely implementational and descriptive; validation references to prior external challenges and a predecessor package are not load-bearing for any derivation. The work is self-contained against external benchmarks with no reduction of outputs to inputs by construction.