Endeavor: Efficient PairHMM for Detection of DNA Variants in Genome-Scale Datasets
Pith reviewed 2026-06-25 19:52 UTC · model grok-4.3
The pith
Endeavor redefines PairHMM to unlock row-level parallelism for accurate variant calling on sequences up to 100k basepairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Endeavor redefines the traditional PairHMM formulation to explore row-level fine-grained parallelism without loss in solution accuracy. Based on this, a novel and portable SIMD-based approach is derived for efficient and high-performance processing of short and long sequences in CPUs and GPUs, leveraging novel levels of parallelism and synchronization to achieve high throughput in sequences up to 100k basepairs for the first time.
What carries the argument
The redefinition of the PairHMM recurrence relations that exposes independent row computations instead of the conventional anti-diagonal wavefront.
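For orientation, the conventional PairHMM forward recurrences can be written as below. This is the standard formulation found in GATK-style implementations, not Endeavor's redefined one, which this page does not reproduce. The notation is assumed here for illustration: $i$ indexes read bases, $j$ indexes haplotype bases, $p_{i,j}$ is the emission prior, and $t_i^{XY}$ is the transition probability from state $X$ to state $Y$ at read position $i$.

```latex
\begin{aligned}
M_{i,j} &= p_{i,j}\,\bigl(t_i^{MM} M_{i-1,j-1} + t_i^{IM} I_{i-1,j-1} + t_i^{DM} D_{i-1,j-1}\bigr) \\
I_{i,j} &= t_i^{MI} M_{i-1,j} + t_i^{II} I_{i-1,j} \\
D_{i,j} &= t_i^{MD} M_{i,j-1} + t_i^{DD} D_{i,j-1}
\end{aligned}
```

In this form, $M_{i,j}$ and $I_{i,j}$ read only row $i-1$, so they are independent across $j$ within a row. $D_{i,j}$ depends on $D_{i,j-1}$ in the same row, and that in-row dependency is what conventionally forces an anti-diagonal wavefront. With $D_{i,0}=0$ it unrolls to $D_{i,j} = \sum_{k=0}^{j-1} (t_i^{DD})^{\,j-1-k}\, t_i^{MD} M_{i,k}$, a first-order linear recurrence along the row.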
If this is right
- CPUs achieve up to 2.14 times higher peak throughput than GKL.
- Real-world GATK HaplotypeCaller runs become at least twice as fast.
- GPUs deliver up to 2.05 times speedup over existing GPU PairHMM implementations.
- Sequences of 100k basepairs become practical on commodity hardware.
Where Pith is reading between the lines
- The same row-wise reformulation could be applied to other dynamic-programming bioinformatics kernels that currently rely on anti-diagonal parallelism.
- Portable SIMD code generated from the new formulation might reduce the need for separate CPU and GPU code paths in production pipelines.
- If the numerical invariance holds under reduced precision, further speedups on low-precision accelerators become possible.
Load-bearing premise
Changing the order of PairHMM operations preserves exact numerical accuracy while exposing new parallelism.
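The premise is non-trivial because floating-point addition is not associative, so reordering a sum can change the last bits of the result. The snippet below is a generic IEEE 754 illustration in Python, not a statement about Endeavor's arithmetic.

```python
import math

# Same three operands, two different association orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)                             # False
print(math.isclose(left, right, rel_tol=1e-15))  # True
```

Whether "exact" means bitwise identical or equal up to roundoff is therefore worth stating explicitly when two operation orders are compared.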
What would settle it
Running the original and redefined formulations on identical input sequences of length 50k basepairs and checking whether the computed variant probabilities differ by more than floating-point roundoff.
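A small-scale version of this check can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's method: the cell-by-cell function follows the standard GATK-style forward recurrences, and the row-wise function resolves the in-row deletion dependency with a prefix scan, which is one well-known technique and may differ from Endeavor's reformulation. All function names, the toy input generator, the constant transition values, and the simplified initial condition are hypothetical.

```python
import math
import random

def make_inputs(read_len, hap_len, seed=0):
    """Hypothetical toy inputs: per-cell emission priors and per-row transitions."""
    rng = random.Random(seed)
    read = [rng.choice("ACGT") for _ in range(read_len)]
    hap = [rng.choice("ACGT") for _ in range(hap_len)]
    err = 0.01  # assumed constant base-error probability
    prior = [[1.0 - err if r == h else err / 3.0 for h in hap] for r in read]
    # (mm, im, dm, mi, ii, md, dd): assumed gap-open 0.01, gap-extend 0.1
    trans = [(0.98, 0.9, 0.9, 0.01, 0.1, 0.01, 0.1)] * read_len
    return prior, trans

def forward_cellwise(prior, trans):
    """Reference forward pass, one cell at a time in row-major order."""
    R, H = len(prior), len(prior[0])
    M = [[0.0] * (H + 1) for _ in range(R + 1)]
    I = [[0.0] * (H + 1) for _ in range(R + 1)]
    D = [[0.0] * (H + 1) for _ in range(R + 1)]
    for j in range(H + 1):
        D[0][j] = 1.0 / H  # simplified initial condition (assumption)
    for i in range(1, R + 1):
        mm, im, dm, mi, ii, md, dd = trans[i - 1]
        for j in range(1, H + 1):
            M[i][j] = prior[i - 1][j - 1] * (
                mm * M[i - 1][j - 1] + im * I[i - 1][j - 1] + dm * D[i - 1][j - 1])
            I[i][j] = mi * M[i - 1][j] + ii * I[i - 1][j]
            D[i][j] = md * M[i][j - 1] + dd * D[i][j - 1]
    return sum(M[R][j] + I[R][j] for j in range(1, H + 1))

def scan_linear_recurrence(x, b):
    """Solve d[j] = x[j] + b * d[j-1] (d[-1] = 0) with a Hillis-Steele scan."""
    d = list(x)
    step, pw = 1, b
    while step < len(d):
        # Each element of this pass is independent of the others (SIMD-friendly).
        d = [d[j] + pw * d[j - step] if j >= step else d[j] for j in range(len(d))]
        step, pw = step * 2, pw * pw
    return d

def forward_rowwise(prior, trans):
    """Row-at-a-time forward pass; D is obtained from a scan over the row."""
    R, H = len(prior), len(prior[0])
    M_prev, I_prev, D_prev = [0.0] * (H + 1), [0.0] * (H + 1), [1.0 / H] * (H + 1)
    for i in range(1, R + 1):
        mm, im, dm, mi, ii, md, dd = trans[i - 1]
        M = [0.0] + [prior[i - 1][j - 1] * (
            mm * M_prev[j - 1] + im * I_prev[j - 1] + dm * D_prev[j - 1])
            for j in range(1, H + 1)]
        I = [0.0] + [mi * M_prev[j] + ii * I_prev[j] for j in range(1, H + 1)]
        D = [0.0] + scan_linear_recurrence([md * M[j - 1] for j in range(1, H + 1)], dd)
        M_prev, I_prev, D_prev = M, I, D
    return sum(M_prev[j] + I_prev[j] for j in range(1, H + 1))

if __name__ == "__main__":
    prior, trans = make_inputs(read_len=60, hap_len=97)
    ref = forward_cellwise(prior, trans)
    row = forward_rowwise(prior, trans)
    print(ref, row, math.isclose(ref, row, rel_tol=1e-12))
```

The scan reassociates the deletion sums, so the two results are expected to agree to within roundoff rather than bit for bit. A real test at 50k basepairs would also need to handle underflow, which this toy sketch ignores.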
Original abstract
DNA variant calling represents a key operation in bioinformatics pipelines that aims at identifying genetic variants. Given an evidenced explosion in genomic data availability, there is an urgent need for a high-performant, portable and efficient solution for variant calling, which can further improve our understanding of genomic structure and genetic basis for complex diseases. In its most common formulation, the Pair Hidden Markov Model (PairHMM) algorithm for variant calling stands as the main bottleneck in the pipeline, accounting for up to 70% of the execution time in large-scale genomic datasets. The state-of-the-art approaches for accelerating PairHMM in CPUs and GPUs do not scale to long DNA sequences and only explore very limited anti-diagonal data parallelism, which yields poor performance. In this work, Endeavor is proposed as a new parallelization strategy for PairHMM that redefines its traditional formulation to explore row-level fine-grained parallelism without loss in solution accuracy. Based on this, a novel and portable SIMD-based approach is derived for efficient and high-performance processing of short and long sequences in CPUs and GPUs, leveraging novel levels of parallelism and synchronization to achieve high throughput in sequences up to 100k basepairs for the first time. Evaluation on Intel and AMD CPUs shows that Endeavor outperforms GKL up to 2.14x in peak throughput and GATK HaplotypeCaller by at least 2x in real-world datasets, while NVIDIA and AMD GPUs achieve up to 2.05x speedups in genome-scale datasets when compared to state-of-the-art GPU-based methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Endeavor, a new parallelization strategy for the Pair Hidden Markov Model (PairHMM) algorithm used in DNA variant calling. It redefines the traditional formulation to enable row-level fine-grained parallelism without loss in solution accuracy, deriving a portable SIMD-based approach for CPUs and GPUs that achieves high throughput on sequences up to 100k basepairs. Evaluations report speedups of up to 2.14x over GKL on Intel/AMD CPUs and 2.05x over prior GPU methods on NVIDIA/AMD GPUs in genome-scale datasets.
Significance. If the row-level reformulation preserves numerical accuracy while unlocking the claimed parallelism and scalability, the work could meaningfully accelerate a key bottleneck (up to 70% of runtime) in bioinformatics pipelines for large genomic datasets. The emphasis on portability across CPU and GPU architectures and the handling of long sequences represent a practical advance over prior approaches limited to anti-diagonal parallelism.
minor comments (2)
- [Abstract] Performance numbers and accuracy-preservation claims are asserted without reference to the specific reformulation equations, error bounds, or benchmark methodology; adding a one-sentence pointer to the relevant section would improve readability.
- The manuscript would benefit from an explicit statement (perhaps in the evaluation section) of the sequence-length distribution in the real-world datasets used for the GATK HaplotypeCaller comparison.
Simulated Author's Rebuttal
We thank the referee for the positive summary and the recommendation of minor revision. The report highlights the potential significance of the row-level reformulation for PairHMM and its portability across architectures, which aligns with our goals. The report raises no major comments, so there is nothing to rebut point by point. We will incorporate the minor suggestions in the revised version.
Circularity Check
No significant circularity in derivation chain
The paper claims a row-level reformulation of PairHMM that enables fine-grained SIMD parallelism on long sequences while preserving exact numerical accuracy. No equations, fitted parameters, self-citations, or ansatzes appear in the supplied abstract or skeptic analysis that reduce any prediction or uniqueness claim to the inputs by construction. The central premise is granted as a novel reformulation, after which standard SIMD/GPU techniques are applied; the argument structure is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Venue: HPDC ’26, July 13–16, 2026, Cleveland, OH, USA