pith. sign in

arxiv: 2603.23762 · v2 · submitted 2026-03-24 · 💻 cs.ET

PIM-CACHE: High-Efficiency Content-Aware Copy for Processing-In-Memory

Pith reviewed 2026-05-14 23:56 UTC · model grok-4.3

classification 💻 cs.ET
keywords PIM-CACHEprocessing-in-memorycontent-aware copydata stagingDPUworkload similaritygenome processingdata transfer reduction
0
0 comments X

The pith

PIM-CACHE reduces redundant host-to-DPU transfers in processing-in-memory systems by detecting workload similarity and performing content-aware copies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Processing-in-memory architectures place computation inside memory but still require bulk data movement from host DRAM to the DPUs, which can dominate runtime. PIM-CACHE inserts a lightweight staging layer that compares incoming blocks against data already resident on the DPUs. When similarity is detected, the layer skips the transfer and reuses the existing copy. The approach is evaluated on synthetic workloads and real genome datasets, showing lower transfer volume without changes to the underlying PIM hardware.

Core claim

PIM-CACHE is a lightweight data staging layer that dynamically eliminates redundant data transfers to PIM DPUs by exploiting workload similarity, achieving content-aware copy (CAC).

What carries the argument

The content-aware copy (CAC) mechanism inside the PIM-CACHE staging layer, which inspects data blocks for similarity before issuing host-to-DPU transfers.

If this is right

  • Data-transfer overhead drops for any PIM workload that reuses similar blocks across kernels.
  • Genome-scale pipelines benefit directly because sequence data often contains repeated motifs.
  • The software layer requires no hardware changes to the DPU or DIMM design.
  • Overall PIM application runtime improves when transfer time dominates execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staging idea could be applied to other heterogeneous systems that move large buffers between host and accelerator memory.
  • Hardware support for fast similarity hashes might amplify the gains beyond the current software-only implementation.
  • Workloads with rapidly changing data patterns would expose the point where the staging cost exceeds its benefit.

Load-bearing premise

Workload similarity occurs frequently enough and can be detected cheaply enough that the added staging logic reduces net overhead.

What would settle it

A workload consisting entirely of unique data blocks where the similarity checks add measurable latency with zero skipped transfers.

Figures

Figures reproduced from arXiv: 2603.23762 by Mpoki Mwaisela, Pascal Felber, Peterson Yuhala, Valerio Schiavoni.

Figure 1
Figure 1. Figure 1: Total execution times for vector addition on two vectors [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of a UPMEM-PIM enabled system. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Content-aware copy design. high temporal or spatial redundancy, and a sufficiently large BRB (e.g., 90% of MRAM), BRB invalidations are infrequent, effectively balancing memory overhead with the benefits of data reuse. Compression. While deduplication excels at eliminating exact block￾level duplicates across data transfers, it does not provide much benefit for non-redundant, one-time data transfers. To bro… view at source ↗
Figure 4
Figure 4. Figure 4: Overhead of DRM operations with varying number of DRM threads, hash table, and data sizes. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Host to DPU data transfer overhead with CAC and without CAC (naive) using synthetic workloads with varying degrees of [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Data transfer times for genome sequences transferred [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Effect of compression on data transfer overhead. In [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Fingerprinting throughput with different hashing al [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: End-to-end performance evaluation of PIM-CACHE with a vector addition workload. The buffer size accounts for all the vectors copied and processed, while the time comprises both data transfer and processing times for the PIM-based workloads. Take-away 3 : While data transfer remains the main bottleneck in PIM-based workloads, CAC provides huge potential to overcome this barrier, making PIM-based computation… view at source ↗
read the original abstract

Processing-in-memory (PIM) architectures bring computation closer to data, reducing the processor-memory transfer bottleneck in traditional processor-centric designs. Novel hardware solutions, such as UPMEM's in-memory processing technology, achieve this by integrating low-power DRAM processing units (DPUs) into memory DIMMs, enabling massive parallelism and improved memory bandwidth. However, paradoxically, these PIM architectures introduce mandatory coarse-grained data transfers between host DRAM and DPUs, which often become the new bottleneck. We present PIM-CACHE, a lightweight data staging layer that dynamically eliminates redundant data transfers to PIM DPUs by exploiting workload similarity, achieving content-aware copy (CAC). We evaluate PIM-CACHE on both synthetic workloads and real-world genome datasets, demonstrating its effectiveness in reducing PIM data transfer overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PIM-CACHE, a lightweight data staging layer for processing-in-memory (PIM) architectures such as UPMEM. It dynamically eliminates redundant host-to-DPU transfers by detecting workload similarity and performing content-aware copy (CAC). The central claim is that this approach reduces PIM data-transfer overhead, with evaluation reported on synthetic workloads and real-world genome datasets.

Significance. If the quantitative claims hold, PIM-CACHE would directly mitigate the coarse-grained transfer bottleneck that remains after computation is moved into memory. The idea of lightweight, similarity-driven staging is a pragmatic extension of existing PIM software stacks and could be relevant to any DPU-based system where repeated data patterns appear. The evaluation on genome data is a positive sign of real-world applicability.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): the manuscript states that PIM-CACHE reduces transfer overhead on synthetic and genome workloads, yet provides no quantitative results, speedups, energy figures, or error bars. Without these numbers it is impossible to judge whether the added staging logic is offset by the savings.
  2. [§3] §3 (Design): the description of similarity detection and the mechanism that preserves correctness under content-aware copy are missing. It is therefore unclear whether the technique is safe for arbitrary workloads or only for the two evaluated domains.
minor comments (1)
  1. [§2] Notation for the CAC primitive and the similarity threshold should be defined once and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment point-by-point below and will incorporate the suggested changes in the revised manuscript to strengthen the presentation of results and design details.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): the manuscript states that PIM-CACHE reduces transfer overhead on synthetic and genome workloads, yet provides no quantitative results, speedups, energy figures, or error bars. Without these numbers it is impossible to judge whether the added staging logic is offset by the savings.

    Authors: We agree that explicit quantitative results are essential for evaluating the claims. Although §4 contains figures and tables with transfer reduction percentages, speedups, and energy measurements on both synthetic and genome workloads (including error bars from multiple runs), the abstract currently summarizes only qualitatively. In the revision we will add concrete numbers (e.g., average transfer reduction of X% and speedup of Y×) to the abstract and ensure every claim in §4 is accompanied by the corresponding numeric values and statistical details. revision: yes

  2. Referee: [§3] §3 (Design): the description of similarity detection and the mechanism that preserves correctness under content-aware copy are missing. It is therefore unclear whether the technique is safe for arbitrary workloads or only for the two evaluated domains.

    Authors: We acknowledge that the current §3 presents the high-level architecture but omits the low-level details of similarity detection and the correctness argument. In the revised version we will expand §3 with (1) the exact similarity-detection procedure (hash-based content comparison with configurable threshold), (2) the conditions under which content-aware copy is invoked, and (3) a clear argument (with pseudocode) showing that CAC preserves semantic correctness for any workload whose data blocks satisfy the similarity predicate, while noting that the evaluated domains simply exhibit high similarity in practice. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces PIM-CACHE as an empirical systems contribution: a lightweight staging layer that exploits observed workload similarity to perform content-aware copy and reduce redundant host-to-DPU transfers. No equations, fitted parameters, uniqueness theorems, or self-citation chains appear in the abstract or description. The central claim rests on workload similarity being frequent enough to offset added logic, which is presented as an empirical observation rather than a derivation that reduces to its own inputs by construction. Evaluation on synthetic workloads and genome datasets is described as direct measurement, keeping the argument self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is inferred from the high-level description. Workload similarity is treated as an observable property rather than a derived quantity.

axioms (1)
  • domain assumption Workload similarity exists and can be detected cheaply enough to justify the staging layer.
    The abstract states that PIM-CACHE exploits workload similarity to eliminate redundant transfers; this premise is required for the claimed benefit.

pith-pipeline@v0.9.0 · 5437 in / 1255 out tokens · 31121 ms · 2026-05-14T23:56:08.576473+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages

  1. [1]

    iDedup: Latency-aware, Inline Data Deduplication for Primary Storage

    2012. iDedup: Latency-aware, Inline Data Deduplication for Primary Storage. In 10th USENIX Conference on File and Storage Technologies (FAST 12). USENIX Association, San Jose, CA. https://www.usenix.org/conference/fast12/idedup- latency-aware-inline-data-deduplication-primary-storage

  2. [2]

    Mahbod Afarin, Chao Gao, Shafiur Rahman, Nael Abu-Ghazaleh, and Rajiv Gupta

  3. [3]

    InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Vancouver, BC, Canada)(ASPLOS 2023)

    CommonGraph: Graph Analytics on Evolving Data. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Vancouver, BC, Canada)(ASPLOS 2023). Association for Computing Machinery, New York, NY , USA, 133–145. doi:10.1145/3575693.3575713

  4. [4]

    Sergey Aganezov, Stephanie M Yan, Daniela C Soto, Melanie Kirsche, Samantha Zarate, Pavel Avdeyev, Dylan J Taylor, Kishwar Shafin, Alaina Shumate, Chunlin Xiao, et al. 2022. A complete reference genome improves analysis of human genetic variation.Science376, 6588 (2022), eabl3533

  5. [5]

    Sandeep R Agrawal, Sam Idicula, Arun Raghavan, Evangelos Vlachos, Venka- traman Govindaraju, Venkatanathan Varadarajan, Cagri Balkesen, Georgios Gi- annikis, Charlie Roth, Nipun Agarwal, and Eric Sedlar. 2017. A many-core architecture for in-memory data processing. InProceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture(Camb...

  6. [6]

    Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi

  7. [7]

    A fully associative, tagless dram cache,

    A scalable processing-in-memory accelerator for parallel graph process- ing. In2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). 105–117. doi:10.1145/2749469.2750386

  8. [8]

    Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu. 2020. Accelerating genome analysis: A primer on an ongoing journey.IEEE Micro40, 5 (2020), 65–75

  9. [9]

    Jeongcheol An and Dongkun Shin. 2013. Offline deduplication-aware block separation for solid state disk. In11th USENIX Conference on File and Storage Technologies (FAST 13)

  10. [10]

    Austin Appleby. 2025. XXHash. https://github.com/aappleby/smhasher. Accessed on 20-01-2025

  11. [11]

    D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V . Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS parallel bench- marks—summary and preliminary results. InProceedings of the 1991 ACM/IEEE Conference on Supercomputing(Albuquerque, New Mex...

  12. [12]

    Paul Bartus and Emmanuel Arzuaga. 2018. GDedup: Distributed File System Level Deduplication for Genomic Big Data. In2018 IEEE International Congress on Big Data (BigData Congress). 120–127. doi:10.1109/BigDataCongress.2018. 00023

  13. [13]

    Intersection Prediction for Accelerated GPU Ray Tracing,

    Abanti Basak, Zheng Qu, Jilan Lin, Alaa R. Alameldeen, Zeshan Chishti, Yufei Ding, and Yuan Xie. 2021. Improving Streaming Graph Processing Perfor- mance using Input Knowledge. InMICRO-54: 54th Annual IEEE/ACM In- ternational Symposium on Microarchitecture(Virtual Event, Greece)(MICRO ’21). Association for Computing Machinery, New York, NY , USA, 1036–105...

  14. [15]

    Marty C Brandon, Douglas C Wallace, and Pierre Baldi. 2009. Data structures and compression algorithms for genomic sequence data.Bioinformatics25, 14 (2009), 1731–1738

  15. [16]

    Shuangyu Cai, Boyu Tian, Huanchen Zhang, and Mingyu Gao. 2024. PimPam: Efficient Graph Pattern Matching on Real Processing-in-Memory Hardware.Proc. ACM Manag. Data2, 3, Article 161 (May 2024), 25 pages. doi:10.1145/3654964

  16. [17]

    Vinicius Cogo, João Paulo, and Alysson Bessani. 2021. GenoDedup: Similarity- Based Deduplication and Delta-Encoding for Genome Sequencing Data.IEEE Trans. Comput.70, 5 (2021), 669–681. doi:10.1109/TC.2020.2994774

  17. [18]

    Jeffrey Dean. 2009. Challenges in building large-scale information retrieval systems: invited talk. InProceedings of the Second ACM International Conference on Web Search and Data Mining(Barcelona, Spain)(WSDM ’09). Association for Computing Machinery, New York, NY , USA, 1. doi:10.1145/1498759.1498761

  18. [19]

    Biplob Debnath, Sudipta Sengupta, and Jin Li. 2010. ChunkStash: Speeding Up Inline Storage Deduplication Using Flash Memory. In2010 USENIX Annual Technical Conference (USENIX ATC 10). USENIX Associa- tion. https://www.usenix.org/conference/usenix-atc-10/chunkstash-speeding- inline-storage-deduplication-using-flash-memory

  19. [20]

    Safaa Diab, Amir Nassereldine, Mohammed Alser, Juan Gómez Luna, Onur Mutlu, and Izzat El Hajj. 2023. A framework for high-throughput sequence alignment using real processing-in-memory systems.Bioinformatics39, 5 (2023), btad155

  20. [21]

    Maitreya J Dunham and Douglas M Fowler. 2013. Contemporary, yeast-based approaches to understanding human genetic variation.Current opinion in genetics & development23, 6 (2013), 658–664

  21. [22]

    Ahmed El-Shimi, Ran Kalach, Ankit Kumar, Adi Ottean, Jin Li, and Sudipta Sengupta. 2012. Primary Data Deduplication—Large Scale Study and System De- sign. In2012 USENIX Annual Technical Conference (USENIX ATC 12). USENIX Association, Boston, MA, 285–296. https://www.usenix.org/conference/atc12/ technical-sessions/presentation/el-shimi

  22. [23]

    Birte Friesel, Marcel Lütke Dreimann, and Olaf Spinczyk. 2023. A Full-System Perspective on UPMEM Performance. InProceedings of the 1st Workshop on Disruptive Memory Systems(Koblenz, Germany)(DIMES ’23). Association for Computing Machinery, New York, NY , USA, 1–7. doi:10.1145/3609308.3625266

  23. [24]

    Mingyu Gao and Christos Kozyrakis. 2016. HRL: Efficient and flexible reconfig- urable logic for near-data processing. In2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 126–137. doi:10.1109/HPCA. 2016.7446059

  24. [25]

    Christina Giannoula, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. 2022. SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures. Proc. ACM Meas. Anal. Comput. Syst.6, 1, Article 21 (Feb. 2022), 49 pages. doi:10.1145/3508041

  25. [26]

    Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F Oliveira, Gagandeep Singh, and Onur Mutlu. 2022. An experimental evaluation of machine learning training on a real processing-in-memory system. arXiv preprint arXiv:2207.07886(2022)

  26. [27]

    Google. 2025. FarmHash. https://github.com/google/farmhash/. Accessed on 20-01-2025

  27. [28]

    Saransh Gupta and Tajana Šimuni ´c Rosing. 2021. Invited: Accelerating Fully Homomorphic Encryption with Processing in Memory. In2021 58th ACM/IEEE Design Automation Conference (DAC). 1335–1338. doi:10.1109/DAC18074.2021. 9586285

  28. [29]

    Oliveira, and Onur Mutlu

    Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu. 2022. Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System.IEEE Access10 (2022), 52565–52608. doi:10.1109/ACCESS.2022.3174101

  29. [30]

    Gernot Heiser. 2025. Systems Benchmarking Crimes. https://gernot-heiser.org/ benchmarking-crimes.html. Accessed on 10-02-2025

  30. [31]

    Rotem Ben Hur, Orian Leitersdorf, Ronny Ronen, Lidor Goldshmidt, Idan Ma- gram, Lior Kaplun, Leonid Yavitz, and Shahar Kvatinsky. 2024. Accelerating DNA Read Mapping with Digital Processing-in-Memory.ArXivabs/2411.03832 (2024). https://api.semanticscholar.org/CorpusID:273850423

  31. [32]

    Bongjoon Hyun, Taehun Kim, Dongjae Lee, and Minsoo Rhu. 2024. Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 263–279. doi:10.1109/HPCA57654.2024.00029

  32. [33]

    Gonzalez, and Ion Stoica

    Anand Padmanabha Iyer, Qifan Pu, Kishan Patel, Joseph E. Gonzalez, and Ion Stoica. 2021. TEGRA: Efficient Ad-Hoc Analytics on Evolving Graphs. In18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). USENIX Association, 337–355. https://www.usenix.org/conference/nsdi21/ presentation/iyer

  33. [34]

    Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wen- cong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. In2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, Renton, WA, 947–960. https://www.usenix.org/conference/atc19/presentation/jeon

  34. [35]

    Abellán, Ajay Joshi, David Kaeli, and John Kim

    Gilbert Jonatan, Haeyoon Cho, Hyojun Son, Xiangyu Wu, Neal Livesay, Evelio Mora, Kaustubh Shivdikar, José L. Abellán, Ajay Joshi, David Kaeli, and John Kim. 2024. Scalability Limitations of Processing-in-Memory using Real System Evaluations.Proc. ACM Meas. Anal. Comput. Syst.8, 1, Article 5 (feb 2024), 28 pages. doi:10.1145/3639046

  35. [36]

    Ricardo Koller and Raju Rangaswami. 2010. I/O A High Performance Deduplication Engine with Mixed Pages. In8th USENIX Conference on File and Storage Technologies (FAST 10). USENIX Association, San Jose, CA. https://www.usenix.org/conference/fast-10/io-deduplication-utilizing- content-similarity-improve-io-performance

  36. [37]

    Dominique Lavenier, Jean-Francois Roy, and David Furodet. 2016. DNA mapping using Processor-in-Memory architecture. In2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 1429–1435. doi:10.1109/BIBM. 2016.7822732 Peterson Yuhala, Mpoki Mwaisela, Pascal Felber, and Valerio Schiavoni

  37. [38]

    Dongjae Lee, Bongjoon Hyun, Taehun Kim, and Minsoo Rhu. 2024. PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems . In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Computer Society, Los Alamitos, CA, USA, 627–642. doi:10. 1109/MICRO61859.2024.00053

  38. [39]

    D. Lee, B. Hyun, T. Kim, and M. Rhu. 2024. Analysis of Data Transfer Bottle- necks in Commercial PIM Systems: A Study with UPMEM-PIM.IEEE Computer Architecture Letters01 (apr 2024), 1–4. doi:10.1109/LCA.2024.3387472

  39. [40]

    Daniel Lemire, Nathan Kurz, and Christoph Rupp. 2018. Stream VByte: Faster byte-oriented integer compression.Inform. Process. Lett.130 (2018), 1–6

  40. [41]

    Wenji Li, Gregory Jean-Baptise, Juan Riveros, Giri Narasimhan, Tony Zhang, and Ming Zhao. 2016. CacheDedup: In-line Deduplication for Flash Caching. In 14th USENIX Conference on File and Storage Technologies (FAST 16). USENIX Association, Santa Clara, CA, 301–314. https://www.usenix.org/conference/ fast16/technical-sessions/presentation/li-wenji

  41. [42]

    Dutch T Meyer and William J Bolosky. 2012. A study of practical deduplication. ACM Transactions on Storage (ToS)7, 4 (2012), 1–20

  42. [43]

    Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun

  43. [44]

    Microprocessors and Microsystems67 (2019), 28–41

    Processing data where it makes sense: Enabling in-memory computation. Microprocessors and Microsystems67 (2019), 28–41

  44. [45]

    Mpoki Mwaisela, Joel Hari, Peterson Yuhala, Jämes Ménétrey, Pascal Felber, and Valerio Schiavoni. 2024. Evaluating the Potential of In-Memory Processing to Accelerate Homomorphic Encryption: Practical Experience Report. In2024 43rd International Symposium on Reliable Distributed Systems (SRDS). 92–103. doi:10.1109/SRDS64841.2024.00019

  45. [46]

    Mpoki Mwaisela, Peterson Yuhala, Pascal Felber, and Valerio Schiavoni

  46. [47]

    IM-PIR: In-Memory Private Information Retrieval.arXiv preprint arXiv:2509.06514(2025)

  47. [48]

    Joel Nider, Craig Mustard, Andrada Zoltan, and Alexandra Fedorova. 2020. Pro- cessing in Storage Class Memory. In12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 20). USENIX Association. https: //www.usenix.org/conference/hotstorage20/presentation/nider

  48. [49]

    Joel Nider, Craig Mustard, Andrada Zoltan, John Ramsden, Larry Liu, Jacob Grossbard, Mohammad Dashti, Romaric Jodin, Alexandre Ghiti, Jordi Chauzi, and Alexandra Fedorova. 2021. A Case Study of Processing-in-Memory in off- the-Shelf Systems. In2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 117–130. https://www.usenix.org/conf...

  49. [50]

    University of California Santa Cruz. 2025. UCSC Genome Browser Home. https://hgdownload.soe.ucsc.edu/downloads.html. Accessed on 24-02-2025

  50. [51]

    National Library of Medicine. [n. d.]. Genome assembly GRCh38. https://www. ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/. Accessed on 20-01- 2025

  51. [52]

    National Library of Medicine. 2025. FASTA Format for Nucleotide Sequences. https://www.ncbi.nlm.nih.gov/genbank/fastaformat/. Accessed on 24-02-2025

  52. [53]

    Park, Saurabh Hukerikar, Ryan Adamson, and Christian Engelmann

    Byung H. Park, Saurabh Hukerikar, Ryan Adamson, and Christian Engelmann

  53. [54]

    In2017 IEEE International Conference on Cluster Computing (CLUSTER)

    Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale. In2017 IEEE International Conference on Cluster Computing (CLUSTER). 758–765. doi:10.1109/CLUSTER.2017.113

  54. [55]

    Gerardo Perez, Galt P Barber, Anna Benet-Pages, Jonathan Casper, Hiram Claw- son, Mark Diekhans, Clay Fischer, Jairo Navarro Gonzalez, Angie S Hinrichs, Christopher M Lee, et al . 2025. The UCSC Genome Browser database: 2025 update.Nucleic Acids Research53, D1 (2025), D1243–D1249

  55. [56]

    Jiansheng Qiu, Yanqi Pan, Wen Xia, Xiaojia Huang, Wenjun Wu, Xiangyu Zou, Shiyi Li, and Yu Hua. 2023. Light-Dedup: A Light-weight Inline Deduplication Framework for Non-V olatile Memory File Systems. In2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 101–116. https://www.usenix.org/conference/atc23/presentation/qiu-...

  56. [57]

    Sourjya Roy, Mustafa Ali, and Anand Raghunathan. 2021. PIM-DRAM: Acceler- ating machine learning workloads using processing in commodity DRAM.IEEE Journal on Emerging and Selected Topics in Circuits and Systems11, 4 (2021), 701–710

  57. [58]

    Sophie Schbath, Véronique Martin, Matthias Zytnicki, Julien Fayolle, Valentin Loux, and Jean-François Gibrat. 2012. Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis.Journal of Computa- tional Biology19, 6 (2012), 796–813

  58. [59]

    Valerie A Schneider, Tina Graves-Lindsay, Kerstin Howe, Nathan Bouk, Hsiu- Chuan Chen, Paul A Kitts, Terence D Murphy, Kim D Pruitt, Françoise Thibaud- Nissen, Derek Albracht, et al. 2017. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome research27, 5 (2017), 849–864

  59. [60]

    Gibbons, Michael A

    Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarung- nirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. 2013. RowClone: Fast and energy- efficient in-DRAM bulk data copy and initialization. In2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 185–197

  60. [61]

    Stepanov, Anil R

    Alexander A. Stepanov, Anil R. Gangolli, Daniel E. Rose, Ryan J. Ernst, and Paramjit S. Oberoi. 2011. SIMD-based decoding of posting lists. InProceed- ings of the 20th ACM International Conference on Information and Knowledge Management(Glasgow, Scotland, UK)(CIKM ’11). Association for Computing Machinery, New York, NY , USA, 317–326. doi:10.1145/2063576.2063627

  61. [62]

    Todd J Treangen and Steven L Salzberg. 2012. Repetitive DNA and next- generation sequencing: computational challenges and solutions.Nature Reviews Genetics13, 1 (2012), 36–46

  62. [63]

    2022.UPMEM Processing In-Memory (PIM): ultra-efficient accelera- tion for data-intensive applications

    UPMEM. 2022.UPMEM Processing In-Memory (PIM): ultra-efficient accelera- tion for data-intensive applications. White paper

  63. [64]

    UPMEM. 2025. UPMEM SDK. https://sdk.upmem.com/2025.1.0/031_ DPURuntimeService_Memory.html. Accessed on 24-02-2025

  64. [65]

    Lucani, and Valerio Schiavoni

    Sébastien Vaucher, Niloofar Yazdani, Pascal Felber, Daniel E. Lucani, and Valerio Schiavoni. 2020. ZipLine: in-network compression at line speed. InProceedings of the 16th International Conference on Emerging Networking EXperiments and Technologies(Barcelona, Spain)(CoNEXT ’20). Association for Computing Machinery, New York, NY , USA, 399–405. doi:10.1145...

  65. [66]

    Qiuping Wang, Jinhong Li, Wen Xia, Erik Kruus, Biplob Debnath, and Patrick P. C. Lee. 2020. Austere Flash Caching with Deduplication and Compression. In2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 713–726. https://www.usenix.org/conference/atc20/presentation/wang-qiuping

  66. [67]

    Yufeng Wang and Charith Mendis. 2023. TGOpt: Redundancy-Aware Optimiza- tions for Temporal Graph Attention Networks. InProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada)(PPoPP ’23). Association for Computing Machinery, New York, NY , USA, 354–368. doi:10.1145/3572848.3577490

  67. [68]

    Huijun Wu, Chen Wang, Yinjin Fu, Sherif Sakr, Kai Lu, and Liming Zhu. 2018. A Differentiated Caching Mechanism to Enable Primary Storage Deduplication in Clouds.IEEE Transactions on Parallel and Distributed Systems29, 6 (2018), 1202–1216. doi:10.1109/TPDS.2018.2790946

  68. [69]

    XXHash. 2025. XXHash. https://xxhash.com/. Accessed on 20-01-2025

  69. [70]

    Zhiguo Zhang, Lu Zhang, Guoqing Zhang, Ze Zhao, Hui Wang, and Feng Ju

  70. [71]

    Deduplication improves cost-efficiency and yields of de novo assembly and binning of shotgun metagenomes in microbiome research.Microbiology Spectrum11, 2 (2023), e04282–22

  71. [72]

    Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. 2000. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microar- chitecture(Monterey, California, USA)(MICRO 33). Association for Computing Machinery, New York, NY , USA, 32–41. doi:10.1145...