PIM-CACHE: High-Efficiency Content-Aware Copy for Processing-In-Memory

Mpoki Mwaisela; Pascal Felber; Peterson Yuhala; Valerio Schiavoni

arxiv: 2603.23762 · v2 · submitted 2026-03-24 · 💻 cs.ET

PIM-CACHE: High-Efficiency Content-Aware Copy for Processing-In-Memory

Peterson Yuhala , Mpoki Mwaisela , Pascal Felber , Valerio Schiavoni This is my paper

Pith reviewed 2026-05-14 23:56 UTC · model grok-4.3

classification 💻 cs.ET

keywords PIM-CACHEprocessing-in-memorycontent-aware copydata stagingDPUworkload similaritygenome processingdata transfer reduction

0 comments

The pith

PIM-CACHE reduces redundant host-to-DPU transfers in processing-in-memory systems by detecting workload similarity and performing content-aware copies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Processing-in-memory architectures place computation inside memory but still require bulk data movement from host DRAM to the DPUs, which can dominate runtime. PIM-CACHE inserts a lightweight staging layer that compares incoming blocks against data already resident on the DPUs. When similarity is detected, the layer skips the transfer and reuses the existing copy. The approach is evaluated on synthetic workloads and real genome datasets, showing lower transfer volume without changes to the underlying PIM hardware.

Core claim

PIM-CACHE is a lightweight data staging layer that dynamically eliminates redundant data transfers to PIM DPUs by exploiting workload similarity, achieving content-aware copy (CAC).

What carries the argument

The content-aware copy (CAC) mechanism inside the PIM-CACHE staging layer, which inspects data blocks for similarity before issuing host-to-DPU transfers.

If this is right

Data-transfer overhead drops for any PIM workload that reuses similar blocks across kernels.
Genome-scale pipelines benefit directly because sequence data often contains repeated motifs.
The software layer requires no hardware changes to the DPU or DIMM design.
Overall PIM application runtime improves when transfer time dominates execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staging idea could be applied to other heterogeneous systems that move large buffers between host and accelerator memory.
Hardware support for fast similarity hashes might amplify the gains beyond the current software-only implementation.
Workloads with rapidly changing data patterns would expose the point where the staging cost exceeds its benefit.

Load-bearing premise

Workload similarity occurs frequently enough and can be detected cheaply enough that the added staging logic reduces net overhead.

What would settle it

A workload consisting entirely of unique data blocks where the similarity checks add measurable latency with zero skipped transfers.

Figures

Figures reproduced from arXiv: 2603.23762 by Mpoki Mwaisela, Pascal Felber, Peterson Yuhala, Valerio Schiavoni.

**Figure 2.** Figure 2: Architecture of a UPMEM-PIM enabled system. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Content-aware copy design. high temporal or spatial redundancy, and a sufficiently large BRB (e.g., 90% of MRAM), BRB invalidations are infrequent, effectively balancing memory overhead with the benefits of data reuse. Compression. While deduplication excels at eliminating exact blocklevel duplicates across data transfers, it does not provide much benefit for non-redundant, one-time data transfers. To bro… view at source ↗

**Figure 4.** Figure 4: Overhead of DRM operations with varying number of DRM threads, hash table, and data sizes. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Host to DPU data transfer overhead with CAC and without CAC (naive) using synthetic workloads with varying degrees of [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Data transfer times for genome sequences transferred [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Effect of compression on data transfer overhead. In [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Fingerprinting throughput with different hashing al [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: End-to-end performance evaluation of PIM-CACHE with a vector addition workload. The buffer size accounts for all the vectors copied and processed, while the time comprises both data transfer and processing times for the PIM-based workloads. Take-away 3 : While data transfer remains the main bottleneck in PIM-based workloads, CAC provides huge potential to overcome this barrier, making PIM-based computation… view at source ↗

read the original abstract

Processing-in-memory (PIM) architectures bring computation closer to data, reducing the processor-memory transfer bottleneck in traditional processor-centric designs. Novel hardware solutions, such as UPMEM's in-memory processing technology, achieve this by integrating low-power DRAM processing units (DPUs) into memory DIMMs, enabling massive parallelism and improved memory bandwidth. However, paradoxically, these PIM architectures introduce mandatory coarse-grained data transfers between host DRAM and DPUs, which often become the new bottleneck. We present PIM-CACHE, a lightweight data staging layer that dynamically eliminates redundant data transfers to PIM DPUs by exploiting workload similarity, achieving content-aware copy (CAC). We evaluate PIM-CACHE on both synthetic workloads and real-world genome datasets, demonstrating its effectiveness in reducing PIM data transfer overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PIM-CACHE adds a practical content-aware staging layer to reduce redundant transfers on UPMEM DPUs by exploiting workload similarity, but the abstract leaves the actual gains and detection method unquantified.

read the letter

The key takeaway is that PIM-CACHE introduces a content-aware copy technique for UPMEM DPUs that avoids redundant data transfers by detecting similar content in workloads. This could help with the persistent transfer bottleneck in PIM setups, where even with computation near memory, moving data from host DRAM remains costly. The paper frames it as a lightweight staging layer rather than a full redesign, which keeps the scope manageable. What stands out as new is the specific staging layer that dynamically eliminates copies based on similarity, applied to both synthetic and genome data. The paper does well in identifying the practical problem and proposing a lightweight solution that builds on existing caching concepts without overcomplicating things. Focusing on genome datasets makes sense given the repetitive nature of biological sequences, and the approach stays tied to real hardware constraints. The soft spots are in the evaluation details. The abstract claims effectiveness in reducing overhead but doesn't include any quantitative results, descriptions of the similarity detection method, or how correctness is maintained when skipping transfers. This leaves the actual performance gains unverified at this level. Since it's focused on one hardware platform, broader claims would need more support, and without error bars or baseline comparisons it's hard to gauge the improvement magnitude. The central claim holds together internally, with no obvious contradictions. This work is for people in the PIM and in-memory computing community, especially those optimizing data movement for real applications like genome processing. A practitioner or researcher looking for incremental improvements on UPMEM would find it relevant. It deserves peer review because the central idea is grounded in a real hardware limitation and the approach seems internally consistent, even if more data is needed to strengthen the case.

Referee Report

2 major / 1 minor

Summary. The paper introduces PIM-CACHE, a lightweight data staging layer for processing-in-memory (PIM) architectures such as UPMEM. It dynamically eliminates redundant host-to-DPU transfers by detecting workload similarity and performing content-aware copy (CAC). The central claim is that this approach reduces PIM data-transfer overhead, with evaluation reported on synthetic workloads and real-world genome datasets.

Significance. If the quantitative claims hold, PIM-CACHE would directly mitigate the coarse-grained transfer bottleneck that remains after computation is moved into memory. The idea of lightweight, similarity-driven staging is a pragmatic extension of existing PIM software stacks and could be relevant to any DPU-based system where repeated data patterns appear. The evaluation on genome data is a positive sign of real-world applicability.

major comments (2)

[Abstract and §4] Abstract and §4 (Evaluation): the manuscript states that PIM-CACHE reduces transfer overhead on synthetic and genome workloads, yet provides no quantitative results, speedups, energy figures, or error bars. Without these numbers it is impossible to judge whether the added staging logic is offset by the savings.
[§3] §3 (Design): the description of similarity detection and the mechanism that preserves correctness under content-aware copy are missing. It is therefore unclear whether the technique is safe for arbitrary workloads or only for the two evaluated domains.

minor comments (1)

[§2] Notation for the CAC primitive and the similarity threshold should be defined once and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment point-by-point below and will incorporate the suggested changes in the revised manuscript to strengthen the presentation of results and design details.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Evaluation): the manuscript states that PIM-CACHE reduces transfer overhead on synthetic and genome workloads, yet provides no quantitative results, speedups, energy figures, or error bars. Without these numbers it is impossible to judge whether the added staging logic is offset by the savings.

Authors: We agree that explicit quantitative results are essential for evaluating the claims. Although §4 contains figures and tables with transfer reduction percentages, speedups, and energy measurements on both synthetic and genome workloads (including error bars from multiple runs), the abstract currently summarizes only qualitatively. In the revision we will add concrete numbers (e.g., average transfer reduction of X% and speedup of Y×) to the abstract and ensure every claim in §4 is accompanied by the corresponding numeric values and statistical details. revision: yes
Referee: [§3] §3 (Design): the description of similarity detection and the mechanism that preserves correctness under content-aware copy are missing. It is therefore unclear whether the technique is safe for arbitrary workloads or only for the two evaluated domains.

Authors: We acknowledge that the current §3 presents the high-level architecture but omits the low-level details of similarity detection and the correctness argument. In the revised version we will expand §3 with (1) the exact similarity-detection procedure (hash-based content comparison with configurable threshold), (2) the conditions under which content-aware copy is invoked, and (3) a clear argument (with pseudocode) showing that CAC preserves semantic correctness for any workload whose data blocks satisfy the similarity predicate, while noting that the evaluated domains simply exhibit high similarity in practice. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces PIM-CACHE as an empirical systems contribution: a lightweight staging layer that exploits observed workload similarity to perform content-aware copy and reduce redundant host-to-DPU transfers. No equations, fitted parameters, uniqueness theorems, or self-citation chains appear in the abstract or description. The central claim rests on workload similarity being frequent enough to offset added logic, which is presented as an empirical observation rather than a derivation that reduces to its own inputs by construction. Evaluation on synthetic workloads and genome datasets is described as direct measurement, keeping the argument self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is inferred from the high-level description. Workload similarity is treated as an observable property rather than a derived quantity.

axioms (1)

domain assumption Workload similarity exists and can be detected cheaply enough to justify the staging layer.
The abstract states that PIM-CACHE exploits workload similarity to eliminate redundant transfers; this premise is required for the claimed benefit.

pith-pipeline@v0.9.0 · 5437 in / 1255 out tokens · 31121 ms · 2026-05-14T23:56:08.576473+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present PIM-CACHE, a lightweight data staging layer that dynamically eliminates redundant data transfers to PIM DPUs by exploiting workload similarity, achieving content-aware copy (CAC).
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our CAC approach identifies repeating data patterns using well-known fingerprinting techniques, deduplicating the data before transferring it to DPU memory.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages

[1]

iDedup: Latency-aware, Inline Data Deduplication for Primary Storage

2012. iDedup: Latency-aware, Inline Data Deduplication for Primary Storage. In 10th USENIX Conference on File and Storage Technologies (FAST 12). USENIX Association, San Jose, CA. https://www.usenix.org/conference/fast12/idedup- latency-aware-inline-data-deduplication-primary-storage

work page 2012
[2]

Mahbod Afarin, Chao Gao, Shafiur Rahman, Nael Abu-Ghazaleh, and Rajiv Gupta

work page
[3]

InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Vancouver, BC, Canada)(ASPLOS 2023)

CommonGraph: Graph Analytics on Evolving Data. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Vancouver, BC, Canada)(ASPLOS 2023). Association for Computing Machinery, New York, NY , USA, 133–145. doi:10.1145/3575693.3575713

work page doi:10.1145/3575693.3575713 2023
[4]

Sergey Aganezov, Stephanie M Yan, Daniela C Soto, Melanie Kirsche, Samantha Zarate, Pavel Avdeyev, Dylan J Taylor, Kishwar Shafin, Alaina Shumate, Chunlin Xiao, et al. 2022. A complete reference genome improves analysis of human genetic variation.Science376, 6588 (2022), eabl3533

work page 2022
[5]

Sandeep R Agrawal, Sam Idicula, Arun Raghavan, Evangelos Vlachos, Venka- traman Govindaraju, Venkatanathan Varadarajan, Cagri Balkesen, Georgios Gi- annikis, Charlie Roth, Nipun Agarwal, and Eric Sedlar. 2017. A many-core architecture for in-memory data processing. InProceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture(Camb...

work page doi:10.1145/3123939.3123985 2017
[6]

Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi

work page
[7]

A fully associative, tagless dram cache,

A scalable processing-in-memory accelerator for parallel graph process- ing. In2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). 105–117. doi:10.1145/2749469.2750386

work page doi:10.1145/2749469.2750386
[8]

Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu. 2020. Accelerating genome analysis: A primer on an ongoing journey.IEEE Micro40, 5 (2020), 65–75

work page 2020
[9]

Jeongcheol An and Dongkun Shin. 2013. Offline deduplication-aware block separation for solid state disk. In11th USENIX Conference on File and Storage Technologies (FAST 13)

work page 2013
[10]

Austin Appleby. 2025. XXHash. https://github.com/aappleby/smhasher. Accessed on 20-01-2025

work page 2025
[11]

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V . Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS parallel bench- marks—summary and preliminary results. InProceedings of the 1991 ACM/IEEE Conference on Supercomputing(Albuquerque, New Mex...

work page doi:10.1145/125826.125925 1991
[12]

Paul Bartus and Emmanuel Arzuaga. 2018. GDedup: Distributed File System Level Deduplication for Genomic Big Data. In2018 IEEE International Congress on Big Data (BigData Congress). 120–127. doi:10.1109/BigDataCongress.2018. 00023

work page doi:10.1109/bigdatacongress.2018 2018
[13]

Intersection Prediction for Accelerated GPU Ray Tracing,

Abanti Basak, Zheng Qu, Jilan Lin, Alaa R. Alameldeen, Zeshan Chishti, Yufei Ding, and Yuan Xie. 2021. Improving Streaming Graph Processing Perfor- mance using Input Knowledge. InMICRO-54: 54th Annual IEEE/ACM In- ternational Symposium on Microarchitecture(Virtual Event, Greece)(MICRO ’21). Association for Computing Machinery, New York, NY , USA, 1036–105...

work page doi:10.1145/3466752.3480096 2021
[15]

Marty C Brandon, Douglas C Wallace, and Pierre Baldi. 2009. Data structures and compression algorithms for genomic sequence data.Bioinformatics25, 14 (2009), 1731–1738

work page 2009
[16]

Shuangyu Cai, Boyu Tian, Huanchen Zhang, and Mingyu Gao. 2024. PimPam: Efficient Graph Pattern Matching on Real Processing-in-Memory Hardware.Proc. ACM Manag. Data2, 3, Article 161 (May 2024), 25 pages. doi:10.1145/3654964

work page doi:10.1145/3654964 2024
[17]

Vinicius Cogo, João Paulo, and Alysson Bessani. 2021. GenoDedup: Similarity- Based Deduplication and Delta-Encoding for Genome Sequencing Data.IEEE Trans. Comput.70, 5 (2021), 669–681. doi:10.1109/TC.2020.2994774

work page doi:10.1109/tc.2020.2994774 2021
[18]

Jeffrey Dean. 2009. Challenges in building large-scale information retrieval systems: invited talk. InProceedings of the Second ACM International Conference on Web Search and Data Mining(Barcelona, Spain)(WSDM ’09). Association for Computing Machinery, New York, NY , USA, 1. doi:10.1145/1498759.1498761

work page doi:10.1145/1498759.1498761 2009
[19]

Biplob Debnath, Sudipta Sengupta, and Jin Li. 2010. ChunkStash: Speeding Up Inline Storage Deduplication Using Flash Memory. In2010 USENIX Annual Technical Conference (USENIX ATC 10). USENIX Associa- tion. https://www.usenix.org/conference/usenix-atc-10/chunkstash-speeding- inline-storage-deduplication-using-flash-memory

work page 2010
[20]

Safaa Diab, Amir Nassereldine, Mohammed Alser, Juan Gómez Luna, Onur Mutlu, and Izzat El Hajj. 2023. A framework for high-throughput sequence alignment using real processing-in-memory systems.Bioinformatics39, 5 (2023), btad155

work page 2023
[21]

Maitreya J Dunham and Douglas M Fowler. 2013. Contemporary, yeast-based approaches to understanding human genetic variation.Current opinion in genetics & development23, 6 (2013), 658–664

work page 2013
[22]

Ahmed El-Shimi, Ran Kalach, Ankit Kumar, Adi Ottean, Jin Li, and Sudipta Sengupta. 2012. Primary Data Deduplication—Large Scale Study and System De- sign. In2012 USENIX Annual Technical Conference (USENIX ATC 12). USENIX Association, Boston, MA, 285–296. https://www.usenix.org/conference/atc12/ technical-sessions/presentation/el-shimi

work page 2012
[23]

Birte Friesel, Marcel Lütke Dreimann, and Olaf Spinczyk. 2023. A Full-System Perspective on UPMEM Performance. InProceedings of the 1st Workshop on Disruptive Memory Systems(Koblenz, Germany)(DIMES ’23). Association for Computing Machinery, New York, NY , USA, 1–7. doi:10.1145/3609308.3625266

work page doi:10.1145/3609308.3625266 2023
[24]

Mingyu Gao and Christos Kozyrakis. 2016. HRL: Efficient and flexible reconfig- urable logic for near-data processing. In2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 126–137. doi:10.1109/HPCA. 2016.7446059

work page doi:10.1109/hpca 2016
[25]

Christina Giannoula, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. 2022. SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures. Proc. ACM Meas. Anal. Comput. Syst.6, 1, Article 21 (Feb. 2022), 49 pages. doi:10.1145/3508041

work page doi:10.1145/3508041 2022
[26]

Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F Oliveira, Gagandeep Singh, and Onur Mutlu. 2022. An experimental evaluation of machine learning training on a real processing-in-memory system. arXiv preprint arXiv:2207.07886(2022)

work page arXiv 2022
[27]

Google. 2025. FarmHash. https://github.com/google/farmhash/. Accessed on 20-01-2025

work page 2025
[28]

Saransh Gupta and Tajana Šimuni ´c Rosing. 2021. Invited: Accelerating Fully Homomorphic Encryption with Processing in Memory. In2021 58th ACM/IEEE Design Automation Conference (DAC). 1335–1338. doi:10.1109/DAC18074.2021. 9586285

work page doi:10.1109/dac18074.2021 2021
[29]

Oliveira, and Onur Mutlu

Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu. 2022. Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System.IEEE Access10 (2022), 52565–52608. doi:10.1109/ACCESS.2022.3174101

work page doi:10.1109/access.2022.3174101 2022
[30]

Gernot Heiser. 2025. Systems Benchmarking Crimes. https://gernot-heiser.org/ benchmarking-crimes.html. Accessed on 10-02-2025

work page 2025
[31]

Rotem Ben Hur, Orian Leitersdorf, Ronny Ronen, Lidor Goldshmidt, Idan Ma- gram, Lior Kaplun, Leonid Yavitz, and Shahar Kvatinsky. 2024. Accelerating DNA Read Mapping with Digital Processing-in-Memory.ArXivabs/2411.03832 (2024). https://api.semanticscholar.org/CorpusID:273850423

work page arXiv 2024
[32]

Bongjoon Hyun, Taehun Kim, Dongjae Lee, and Minsoo Rhu. 2024. Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 263–279. doi:10.1109/HPCA57654.2024.00029

work page doi:10.1109/hpca57654.2024.00029 2024
[33]

Gonzalez, and Ion Stoica

Anand Padmanabha Iyer, Qifan Pu, Kishan Patel, Joseph E. Gonzalez, and Ion Stoica. 2021. TEGRA: Efficient Ad-Hoc Analytics on Evolving Graphs. In18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). USENIX Association, 337–355. https://www.usenix.org/conference/nsdi21/ presentation/iyer

work page 2021
[34]

Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wen- cong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. In2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, Renton, WA, 947–960. https://www.usenix.org/conference/atc19/presentation/jeon

work page 2019
[35]

Abellán, Ajay Joshi, David Kaeli, and John Kim

Gilbert Jonatan, Haeyoon Cho, Hyojun Son, Xiangyu Wu, Neal Livesay, Evelio Mora, Kaustubh Shivdikar, José L. Abellán, Ajay Joshi, David Kaeli, and John Kim. 2024. Scalability Limitations of Processing-in-Memory using Real System Evaluations.Proc. ACM Meas. Anal. Comput. Syst.8, 1, Article 5 (feb 2024), 28 pages. doi:10.1145/3639046

work page doi:10.1145/3639046 2024
[36]

Ricardo Koller and Raju Rangaswami. 2010. I/O A High Performance Deduplication Engine with Mixed Pages. In8th USENIX Conference on File and Storage Technologies (FAST 10). USENIX Association, San Jose, CA. https://www.usenix.org/conference/fast-10/io-deduplication-utilizing- content-similarity-improve-io-performance

work page 2010
[37]

Dominique Lavenier, Jean-Francois Roy, and David Furodet. 2016. DNA mapping using Processor-in-Memory architecture. In2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 1429–1435. doi:10.1109/BIBM. 2016.7822732 Peterson Yuhala, Mpoki Mwaisela, Pascal Felber, and Valerio Schiavoni

work page doi:10.1109/bibm 2016
[38]

Dongjae Lee, Bongjoon Hyun, Taehun Kim, and Minsoo Rhu. 2024. PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems . In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Computer Society, Los Alamitos, CA, USA, 627–642. doi:10. 1109/MICRO61859.2024.00053

work page arXiv 2024
[39]

D. Lee, B. Hyun, T. Kim, and M. Rhu. 2024. Analysis of Data Transfer Bottle- necks in Commercial PIM Systems: A Study with UPMEM-PIM.IEEE Computer Architecture Letters01 (apr 2024), 1–4. doi:10.1109/LCA.2024.3387472

work page doi:10.1109/lca.2024.3387472 2024
[40]

Daniel Lemire, Nathan Kurz, and Christoph Rupp. 2018. Stream VByte: Faster byte-oriented integer compression.Inform. Process. Lett.130 (2018), 1–6

work page 2018
[41]

Wenji Li, Gregory Jean-Baptise, Juan Riveros, Giri Narasimhan, Tony Zhang, and Ming Zhao. 2016. CacheDedup: In-line Deduplication for Flash Caching. In 14th USENIX Conference on File and Storage Technologies (FAST 16). USENIX Association, Santa Clara, CA, 301–314. https://www.usenix.org/conference/ fast16/technical-sessions/presentation/li-wenji

work page 2016
[42]

Dutch T Meyer and William J Bolosky. 2012. A study of practical deduplication. ACM Transactions on Storage (ToS)7, 4 (2012), 1–20

work page 2012
[43]

Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun

work page
[44]

Microprocessors and Microsystems67 (2019), 28–41

Processing data where it makes sense: Enabling in-memory computation. Microprocessors and Microsystems67 (2019), 28–41

work page 2019
[45]

Mpoki Mwaisela, Joel Hari, Peterson Yuhala, Jämes Ménétrey, Pascal Felber, and Valerio Schiavoni. 2024. Evaluating the Potential of In-Memory Processing to Accelerate Homomorphic Encryption: Practical Experience Report. In2024 43rd International Symposium on Reliable Distributed Systems (SRDS). 92–103. doi:10.1109/SRDS64841.2024.00019

work page doi:10.1109/srds64841.2024.00019 2024
[46]

Mpoki Mwaisela, Peterson Yuhala, Pascal Felber, and Valerio Schiavoni

work page
[47]

IM-PIR: In-Memory Private Information Retrieval.arXiv preprint arXiv:2509.06514(2025)

work page arXiv 2025
[48]

Joel Nider, Craig Mustard, Andrada Zoltan, and Alexandra Fedorova. 2020. Pro- cessing in Storage Class Memory. In12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 20). USENIX Association. https: //www.usenix.org/conference/hotstorage20/presentation/nider

work page 2020
[49]

Joel Nider, Craig Mustard, Andrada Zoltan, John Ramsden, Larry Liu, Jacob Grossbard, Mohammad Dashti, Romaric Jodin, Alexandre Ghiti, Jordi Chauzi, and Alexandra Fedorova. 2021. A Case Study of Processing-in-Memory in off- the-Shelf Systems. In2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 117–130. https://www.usenix.org/conf...

work page 2021
[50]

University of California Santa Cruz. 2025. UCSC Genome Browser Home. https://hgdownload.soe.ucsc.edu/downloads.html. Accessed on 24-02-2025

work page 2025
[51]

National Library of Medicine. [n. d.]. Genome assembly GRCh38. https://www. ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/. Accessed on 20-01- 2025

work page 2025
[52]

National Library of Medicine. 2025. FASTA Format for Nucleotide Sequences. https://www.ncbi.nlm.nih.gov/genbank/fastaformat/. Accessed on 24-02-2025

work page 2025
[53]

Park, Saurabh Hukerikar, Ryan Adamson, and Christian Engelmann

Byung H. Park, Saurabh Hukerikar, Ryan Adamson, and Christian Engelmann

work page
[54]

In2017 IEEE International Conference on Cluster Computing (CLUSTER)

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale. In2017 IEEE International Conference on Cluster Computing (CLUSTER). 758–765. doi:10.1109/CLUSTER.2017.113

work page doi:10.1109/cluster.2017.113 2017
[55]

Gerardo Perez, Galt P Barber, Anna Benet-Pages, Jonathan Casper, Hiram Claw- son, Mark Diekhans, Clay Fischer, Jairo Navarro Gonzalez, Angie S Hinrichs, Christopher M Lee, et al . 2025. The UCSC Genome Browser database: 2025 update.Nucleic Acids Research53, D1 (2025), D1243–D1249

work page 2025
[56]

Jiansheng Qiu, Yanqi Pan, Wen Xia, Xiaojia Huang, Wenjun Wu, Xiangyu Zou, Shiyi Li, and Yu Hua. 2023. Light-Dedup: A Light-weight Inline Deduplication Framework for Non-V olatile Memory File Systems. In2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 101–116. https://www.usenix.org/conference/atc23/presentation/qiu-...

work page 2023
[57]

Sourjya Roy, Mustafa Ali, and Anand Raghunathan. 2021. PIM-DRAM: Acceler- ating machine learning workloads using processing in commodity DRAM.IEEE Journal on Emerging and Selected Topics in Circuits and Systems11, 4 (2021), 701–710

work page 2021
[58]

Sophie Schbath, Véronique Martin, Matthias Zytnicki, Julien Fayolle, Valentin Loux, and Jean-François Gibrat. 2012. Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis.Journal of Computa- tional Biology19, 6 (2012), 796–813

work page 2012
[59]

Valerie A Schneider, Tina Graves-Lindsay, Kerstin Howe, Nathan Bouk, Hsiu- Chuan Chen, Paul A Kitts, Terence D Murphy, Kim D Pruitt, Françoise Thibaud- Nissen, Derek Albracht, et al. 2017. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome research27, 5 (2017), 849–864

work page 2017
[60]

Gibbons, Michael A

Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarung- nirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. 2013. RowClone: Fast and energy- efficient in-DRAM bulk data copy and initialization. In2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 185–197

work page 2013
[61]

Stepanov, Anil R

Alexander A. Stepanov, Anil R. Gangolli, Daniel E. Rose, Ryan J. Ernst, and Paramjit S. Oberoi. 2011. SIMD-based decoding of posting lists. InProceed- ings of the 20th ACM International Conference on Information and Knowledge Management(Glasgow, Scotland, UK)(CIKM ’11). Association for Computing Machinery, New York, NY , USA, 317–326. doi:10.1145/2063576.2063627

work page doi:10.1145/2063576.2063627 2011
[62]

Todd J Treangen and Steven L Salzberg. 2012. Repetitive DNA and next- generation sequencing: computational challenges and solutions.Nature Reviews Genetics13, 1 (2012), 36–46

work page 2012
[63]

2022.UPMEM Processing In-Memory (PIM): ultra-efficient accelera- tion for data-intensive applications

UPMEM. 2022.UPMEM Processing In-Memory (PIM): ultra-efficient accelera- tion for data-intensive applications. White paper

work page 2022
[64]

UPMEM. 2025. UPMEM SDK. https://sdk.upmem.com/2025.1.0/031_ DPURuntimeService_Memory.html. Accessed on 24-02-2025

work page 2025
[65]

Lucani, and Valerio Schiavoni

Sébastien Vaucher, Niloofar Yazdani, Pascal Felber, Daniel E. Lucani, and Valerio Schiavoni. 2020. ZipLine: in-network compression at line speed. InProceedings of the 16th International Conference on Emerging Networking EXperiments and Technologies(Barcelona, Spain)(CoNEXT ’20). Association for Computing Machinery, New York, NY , USA, 399–405. doi:10.1145...

work page doi:10.1145/3386367.3431302 2020
[66]

Qiuping Wang, Jinhong Li, Wen Xia, Erik Kruus, Biplob Debnath, and Patrick P. C. Lee. 2020. Austere Flash Caching with Deduplication and Compression. In2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 713–726. https://www.usenix.org/conference/atc20/presentation/wang-qiuping

work page 2020
[67]

Yufeng Wang and Charith Mendis. 2023. TGOpt: Redundancy-Aware Optimiza- tions for Temporal Graph Attention Networks. InProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada)(PPoPP ’23). Association for Computing Machinery, New York, NY , USA, 354–368. doi:10.1145/3572848.3577490

work page doi:10.1145/3572848.3577490 2023
[68]

Huijun Wu, Chen Wang, Yinjin Fu, Sherif Sakr, Kai Lu, and Liming Zhu. 2018. A Differentiated Caching Mechanism to Enable Primary Storage Deduplication in Clouds.IEEE Transactions on Parallel and Distributed Systems29, 6 (2018), 1202–1216. doi:10.1109/TPDS.2018.2790946

work page doi:10.1109/tpds.2018.2790946 2018
[69]

XXHash. 2025. XXHash. https://xxhash.com/. Accessed on 20-01-2025

work page 2025
[70]

Zhiguo Zhang, Lu Zhang, Guoqing Zhang, Ze Zhao, Hui Wang, and Feng Ju

work page
[71]

Deduplication improves cost-efficiency and yields of de novo assembly and binning of shotgun metagenomes in microbiome research.Microbiology Spectrum11, 2 (2023), e04282–22

work page 2023
[72]

Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. 2000. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microar- chitecture(Monterey, California, USA)(MICRO 33). Association for Computing Machinery, New York, NY , USA, 32–41. doi:10.1145...

work page doi:10.1145/360128.360134 2000

[1] [1]

iDedup: Latency-aware, Inline Data Deduplication for Primary Storage

2012. iDedup: Latency-aware, Inline Data Deduplication for Primary Storage. In 10th USENIX Conference on File and Storage Technologies (FAST 12). USENIX Association, San Jose, CA. https://www.usenix.org/conference/fast12/idedup- latency-aware-inline-data-deduplication-primary-storage

work page 2012

[2] [2]

Mahbod Afarin, Chao Gao, Shafiur Rahman, Nael Abu-Ghazaleh, and Rajiv Gupta

work page

[3] [3]

InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Vancouver, BC, Canada)(ASPLOS 2023)

CommonGraph: Graph Analytics on Evolving Data. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Vancouver, BC, Canada)(ASPLOS 2023). Association for Computing Machinery, New York, NY , USA, 133–145. doi:10.1145/3575693.3575713

work page doi:10.1145/3575693.3575713 2023

[4] [4]

Sergey Aganezov, Stephanie M Yan, Daniela C Soto, Melanie Kirsche, Samantha Zarate, Pavel Avdeyev, Dylan J Taylor, Kishwar Shafin, Alaina Shumate, Chunlin Xiao, et al. 2022. A complete reference genome improves analysis of human genetic variation.Science376, 6588 (2022), eabl3533

work page 2022

[5] [5]

Sandeep R Agrawal, Sam Idicula, Arun Raghavan, Evangelos Vlachos, Venka- traman Govindaraju, Venkatanathan Varadarajan, Cagri Balkesen, Georgios Gi- annikis, Charlie Roth, Nipun Agarwal, and Eric Sedlar. 2017. A many-core architecture for in-memory data processing. InProceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture(Camb...

work page doi:10.1145/3123939.3123985 2017

[6] [6]

Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi

work page

[7] [7]

A fully associative, tagless dram cache,

A scalable processing-in-memory accelerator for parallel graph process- ing. In2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). 105–117. doi:10.1145/2749469.2750386

work page doi:10.1145/2749469.2750386

[8] [8]

Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu. 2020. Accelerating genome analysis: A primer on an ongoing journey.IEEE Micro40, 5 (2020), 65–75

work page 2020

[9] [9]

Jeongcheol An and Dongkun Shin. 2013. Offline deduplication-aware block separation for solid state disk. In11th USENIX Conference on File and Storage Technologies (FAST 13)

work page 2013

[10] [10]

Austin Appleby. 2025. XXHash. https://github.com/aappleby/smhasher. Accessed on 20-01-2025

work page 2025

[11] [11]

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V . Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS parallel bench- marks—summary and preliminary results. InProceedings of the 1991 ACM/IEEE Conference on Supercomputing(Albuquerque, New Mex...

work page doi:10.1145/125826.125925 1991

[12] [12]

Paul Bartus and Emmanuel Arzuaga. 2018. GDedup: Distributed File System Level Deduplication for Genomic Big Data. In2018 IEEE International Congress on Big Data (BigData Congress). 120–127. doi:10.1109/BigDataCongress.2018. 00023

work page doi:10.1109/bigdatacongress.2018 2018

[13] [13]

Intersection Prediction for Accelerated GPU Ray Tracing,

Abanti Basak, Zheng Qu, Jilan Lin, Alaa R. Alameldeen, Zeshan Chishti, Yufei Ding, and Yuan Xie. 2021. Improving Streaming Graph Processing Perfor- mance using Input Knowledge. InMICRO-54: 54th Annual IEEE/ACM In- ternational Symposium on Microarchitecture(Virtual Event, Greece)(MICRO ’21). Association for Computing Machinery, New York, NY , USA, 1036–105...

work page doi:10.1145/3466752.3480096 2021

[14] [15]

Marty C Brandon, Douglas C Wallace, and Pierre Baldi. 2009. Data structures and compression algorithms for genomic sequence data.Bioinformatics25, 14 (2009), 1731–1738

work page 2009

[15] [16]

Shuangyu Cai, Boyu Tian, Huanchen Zhang, and Mingyu Gao. 2024. PimPam: Efficient Graph Pattern Matching on Real Processing-in-Memory Hardware.Proc. ACM Manag. Data2, 3, Article 161 (May 2024), 25 pages. doi:10.1145/3654964

work page doi:10.1145/3654964 2024

[16] [17]

Vinicius Cogo, João Paulo, and Alysson Bessani. 2021. GenoDedup: Similarity- Based Deduplication and Delta-Encoding for Genome Sequencing Data.IEEE Trans. Comput.70, 5 (2021), 669–681. doi:10.1109/TC.2020.2994774

work page doi:10.1109/tc.2020.2994774 2021

[17] [18]

Jeffrey Dean. 2009. Challenges in building large-scale information retrieval systems: invited talk. InProceedings of the Second ACM International Conference on Web Search and Data Mining(Barcelona, Spain)(WSDM ’09). Association for Computing Machinery, New York, NY , USA, 1. doi:10.1145/1498759.1498761

work page doi:10.1145/1498759.1498761 2009

[18] [19]

Biplob Debnath, Sudipta Sengupta, and Jin Li. 2010. ChunkStash: Speeding Up Inline Storage Deduplication Using Flash Memory. In2010 USENIX Annual Technical Conference (USENIX ATC 10). USENIX Associa- tion. https://www.usenix.org/conference/usenix-atc-10/chunkstash-speeding- inline-storage-deduplication-using-flash-memory

work page 2010

[19] [20]

Safaa Diab, Amir Nassereldine, Mohammed Alser, Juan Gómez Luna, Onur Mutlu, and Izzat El Hajj. 2023. A framework for high-throughput sequence alignment using real processing-in-memory systems.Bioinformatics39, 5 (2023), btad155

work page 2023

[20] [21]

Maitreya J Dunham and Douglas M Fowler. 2013. Contemporary, yeast-based approaches to understanding human genetic variation.Current opinion in genetics & development23, 6 (2013), 658–664

work page 2013

[21] [22]

Ahmed El-Shimi, Ran Kalach, Ankit Kumar, Adi Ottean, Jin Li, and Sudipta Sengupta. 2012. Primary Data Deduplication—Large Scale Study and System De- sign. In2012 USENIX Annual Technical Conference (USENIX ATC 12). USENIX Association, Boston, MA, 285–296. https://www.usenix.org/conference/atc12/ technical-sessions/presentation/el-shimi

work page 2012

[22] [23]

Birte Friesel, Marcel Lütke Dreimann, and Olaf Spinczyk. 2023. A Full-System Perspective on UPMEM Performance. InProceedings of the 1st Workshop on Disruptive Memory Systems(Koblenz, Germany)(DIMES ’23). Association for Computing Machinery, New York, NY , USA, 1–7. doi:10.1145/3609308.3625266

work page doi:10.1145/3609308.3625266 2023

[23] [24]

Mingyu Gao and Christos Kozyrakis. 2016. HRL: Efficient and flexible reconfig- urable logic for near-data processing. In2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 126–137. doi:10.1109/HPCA. 2016.7446059

work page doi:10.1109/hpca 2016

[24] [25]

Christina Giannoula, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. 2022. SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures. Proc. ACM Meas. Anal. Comput. Syst.6, 1, Article 21 (Feb. 2022), 49 pages. doi:10.1145/3508041

work page doi:10.1145/3508041 2022

[25] [26]

Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F Oliveira, Gagandeep Singh, and Onur Mutlu. 2022. An experimental evaluation of machine learning training on a real processing-in-memory system. arXiv preprint arXiv:2207.07886(2022)

work page arXiv 2022

[26] [27]

Google. 2025. FarmHash. https://github.com/google/farmhash/. Accessed on 20-01-2025

work page 2025

[27] [28]

Saransh Gupta and Tajana Šimuni ´c Rosing. 2021. Invited: Accelerating Fully Homomorphic Encryption with Processing in Memory. In2021 58th ACM/IEEE Design Automation Conference (DAC). 1335–1338. doi:10.1109/DAC18074.2021. 9586285

work page doi:10.1109/dac18074.2021 2021

[28] [29]

Oliveira, and Onur Mutlu

Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu. 2022. Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System.IEEE Access10 (2022), 52565–52608. doi:10.1109/ACCESS.2022.3174101

work page doi:10.1109/access.2022.3174101 2022

[29] [30]

Gernot Heiser. 2025. Systems Benchmarking Crimes. https://gernot-heiser.org/ benchmarking-crimes.html. Accessed on 10-02-2025

work page 2025

[30] [31]

Rotem Ben Hur, Orian Leitersdorf, Ronny Ronen, Lidor Goldshmidt, Idan Ma- gram, Lior Kaplun, Leonid Yavitz, and Shahar Kvatinsky. 2024. Accelerating DNA Read Mapping with Digital Processing-in-Memory.ArXivabs/2411.03832 (2024). https://api.semanticscholar.org/CorpusID:273850423

work page arXiv 2024

[31] [32]

Bongjoon Hyun, Taehun Kim, Dongjae Lee, and Minsoo Rhu. 2024. Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 263–279. doi:10.1109/HPCA57654.2024.00029

work page doi:10.1109/hpca57654.2024.00029 2024

[32] [33]

Gonzalez, and Ion Stoica

Anand Padmanabha Iyer, Qifan Pu, Kishan Patel, Joseph E. Gonzalez, and Ion Stoica. 2021. TEGRA: Efficient Ad-Hoc Analytics on Evolving Graphs. In18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). USENIX Association, 337–355. https://www.usenix.org/conference/nsdi21/ presentation/iyer

work page 2021

[33] [34]

Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wen- cong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. In2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, Renton, WA, 947–960. https://www.usenix.org/conference/atc19/presentation/jeon

work page 2019

[34] [35]

Abellán, Ajay Joshi, David Kaeli, and John Kim

Gilbert Jonatan, Haeyoon Cho, Hyojun Son, Xiangyu Wu, Neal Livesay, Evelio Mora, Kaustubh Shivdikar, José L. Abellán, Ajay Joshi, David Kaeli, and John Kim. 2024. Scalability Limitations of Processing-in-Memory using Real System Evaluations.Proc. ACM Meas. Anal. Comput. Syst.8, 1, Article 5 (feb 2024), 28 pages. doi:10.1145/3639046

work page doi:10.1145/3639046 2024

[35] [36]

Ricardo Koller and Raju Rangaswami. 2010. I/O A High Performance Deduplication Engine with Mixed Pages. In8th USENIX Conference on File and Storage Technologies (FAST 10). USENIX Association, San Jose, CA. https://www.usenix.org/conference/fast-10/io-deduplication-utilizing- content-similarity-improve-io-performance

work page 2010

[36] [37]

Dominique Lavenier, Jean-Francois Roy, and David Furodet. 2016. DNA mapping using Processor-in-Memory architecture. In2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 1429–1435. doi:10.1109/BIBM. 2016.7822732 Peterson Yuhala, Mpoki Mwaisela, Pascal Felber, and Valerio Schiavoni

work page doi:10.1109/bibm 2016

[37] [38]

Dongjae Lee, Bongjoon Hyun, Taehun Kim, and Minsoo Rhu. 2024. PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems . In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Computer Society, Los Alamitos, CA, USA, 627–642. doi:10. 1109/MICRO61859.2024.00053

work page arXiv 2024

[38] [39]

D. Lee, B. Hyun, T. Kim, and M. Rhu. 2024. Analysis of Data Transfer Bottle- necks in Commercial PIM Systems: A Study with UPMEM-PIM.IEEE Computer Architecture Letters01 (apr 2024), 1–4. doi:10.1109/LCA.2024.3387472

work page doi:10.1109/lca.2024.3387472 2024

[39] [40]

Daniel Lemire, Nathan Kurz, and Christoph Rupp. 2018. Stream VByte: Faster byte-oriented integer compression.Inform. Process. Lett.130 (2018), 1–6

work page 2018

[40] [41]

Wenji Li, Gregory Jean-Baptise, Juan Riveros, Giri Narasimhan, Tony Zhang, and Ming Zhao. 2016. CacheDedup: In-line Deduplication for Flash Caching. In 14th USENIX Conference on File and Storage Technologies (FAST 16). USENIX Association, Santa Clara, CA, 301–314. https://www.usenix.org/conference/ fast16/technical-sessions/presentation/li-wenji

work page 2016

[41] [42]

Dutch T Meyer and William J Bolosky. 2012. A study of practical deduplication. ACM Transactions on Storage (ToS)7, 4 (2012), 1–20

work page 2012

[42] [43]

Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun

work page

[43] [44]

Microprocessors and Microsystems67 (2019), 28–41

Processing data where it makes sense: Enabling in-memory computation. Microprocessors and Microsystems67 (2019), 28–41

work page 2019

[44] [45]

Mpoki Mwaisela, Joel Hari, Peterson Yuhala, Jämes Ménétrey, Pascal Felber, and Valerio Schiavoni. 2024. Evaluating the Potential of In-Memory Processing to Accelerate Homomorphic Encryption: Practical Experience Report. In2024 43rd International Symposium on Reliable Distributed Systems (SRDS). 92–103. doi:10.1109/SRDS64841.2024.00019

work page doi:10.1109/srds64841.2024.00019 2024

[45] [46]

Mpoki Mwaisela, Peterson Yuhala, Pascal Felber, and Valerio Schiavoni

work page

[46] [47]

IM-PIR: In-Memory Private Information Retrieval.arXiv preprint arXiv:2509.06514(2025)

work page arXiv 2025

[47] [48]

Joel Nider, Craig Mustard, Andrada Zoltan, and Alexandra Fedorova. 2020. Pro- cessing in Storage Class Memory. In12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 20). USENIX Association. https: //www.usenix.org/conference/hotstorage20/presentation/nider

work page 2020

[48] [49]

Joel Nider, Craig Mustard, Andrada Zoltan, John Ramsden, Larry Liu, Jacob Grossbard, Mohammad Dashti, Romaric Jodin, Alexandre Ghiti, Jordi Chauzi, and Alexandra Fedorova. 2021. A Case Study of Processing-in-Memory in off- the-Shelf Systems. In2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 117–130. https://www.usenix.org/conf...

work page 2021

[49] [50]

University of California Santa Cruz. 2025. UCSC Genome Browser Home. https://hgdownload.soe.ucsc.edu/downloads.html. Accessed on 24-02-2025

work page 2025

[50] [51]

National Library of Medicine. [n. d.]. Genome assembly GRCh38. https://www. ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/. Accessed on 20-01- 2025

work page 2025

[51] [52]

National Library of Medicine. 2025. FASTA Format for Nucleotide Sequences. https://www.ncbi.nlm.nih.gov/genbank/fastaformat/. Accessed on 24-02-2025

work page 2025

[52] [53]

Park, Saurabh Hukerikar, Ryan Adamson, and Christian Engelmann

Byung H. Park, Saurabh Hukerikar, Ryan Adamson, and Christian Engelmann

work page

[53] [54]

In2017 IEEE International Conference on Cluster Computing (CLUSTER)

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale. In2017 IEEE International Conference on Cluster Computing (CLUSTER). 758–765. doi:10.1109/CLUSTER.2017.113

work page doi:10.1109/cluster.2017.113 2017

[54] [55]

Gerardo Perez, Galt P Barber, Anna Benet-Pages, Jonathan Casper, Hiram Claw- son, Mark Diekhans, Clay Fischer, Jairo Navarro Gonzalez, Angie S Hinrichs, Christopher M Lee, et al . 2025. The UCSC Genome Browser database: 2025 update.Nucleic Acids Research53, D1 (2025), D1243–D1249

work page 2025

[55] [56]

Jiansheng Qiu, Yanqi Pan, Wen Xia, Xiaojia Huang, Wenjun Wu, Xiangyu Zou, Shiyi Li, and Yu Hua. 2023. Light-Dedup: A Light-weight Inline Deduplication Framework for Non-V olatile Memory File Systems. In2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 101–116. https://www.usenix.org/conference/atc23/presentation/qiu-...

work page 2023

[56] [57]

Sourjya Roy, Mustafa Ali, and Anand Raghunathan. 2021. PIM-DRAM: Acceler- ating machine learning workloads using processing in commodity DRAM.IEEE Journal on Emerging and Selected Topics in Circuits and Systems11, 4 (2021), 701–710

work page 2021

[57] [58]

Sophie Schbath, Véronique Martin, Matthias Zytnicki, Julien Fayolle, Valentin Loux, and Jean-François Gibrat. 2012. Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis.Journal of Computa- tional Biology19, 6 (2012), 796–813

work page 2012

[58] [59]

Valerie A Schneider, Tina Graves-Lindsay, Kerstin Howe, Nathan Bouk, Hsiu- Chuan Chen, Paul A Kitts, Terence D Murphy, Kim D Pruitt, Françoise Thibaud- Nissen, Derek Albracht, et al. 2017. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome research27, 5 (2017), 849–864

work page 2017

[59] [60]

Gibbons, Michael A

Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarung- nirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. 2013. RowClone: Fast and energy- efficient in-DRAM bulk data copy and initialization. In2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 185–197

work page 2013

[60] [61]

Stepanov, Anil R

Alexander A. Stepanov, Anil R. Gangolli, Daniel E. Rose, Ryan J. Ernst, and Paramjit S. Oberoi. 2011. SIMD-based decoding of posting lists. InProceed- ings of the 20th ACM International Conference on Information and Knowledge Management(Glasgow, Scotland, UK)(CIKM ’11). Association for Computing Machinery, New York, NY , USA, 317–326. doi:10.1145/2063576.2063627

work page doi:10.1145/2063576.2063627 2011

[61] [62]

Todd J Treangen and Steven L Salzberg. 2012. Repetitive DNA and next- generation sequencing: computational challenges and solutions.Nature Reviews Genetics13, 1 (2012), 36–46

work page 2012

[62] [63]

2022.UPMEM Processing In-Memory (PIM): ultra-efficient accelera- tion for data-intensive applications

UPMEM. 2022.UPMEM Processing In-Memory (PIM): ultra-efficient accelera- tion for data-intensive applications. White paper

work page 2022

[63] [64]

UPMEM. 2025. UPMEM SDK. https://sdk.upmem.com/2025.1.0/031_ DPURuntimeService_Memory.html. Accessed on 24-02-2025

work page 2025

[64] [65]

Lucani, and Valerio Schiavoni

Sébastien Vaucher, Niloofar Yazdani, Pascal Felber, Daniel E. Lucani, and Valerio Schiavoni. 2020. ZipLine: in-network compression at line speed. InProceedings of the 16th International Conference on Emerging Networking EXperiments and Technologies(Barcelona, Spain)(CoNEXT ’20). Association for Computing Machinery, New York, NY , USA, 399–405. doi:10.1145...

work page doi:10.1145/3386367.3431302 2020

[65] [66]

Qiuping Wang, Jinhong Li, Wen Xia, Erik Kruus, Biplob Debnath, and Patrick P. C. Lee. 2020. Austere Flash Caching with Deduplication and Compression. In2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 713–726. https://www.usenix.org/conference/atc20/presentation/wang-qiuping

work page 2020

[66] [67]

Yufeng Wang and Charith Mendis. 2023. TGOpt: Redundancy-Aware Optimiza- tions for Temporal Graph Attention Networks. InProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada)(PPoPP ’23). Association for Computing Machinery, New York, NY , USA, 354–368. doi:10.1145/3572848.3577490

work page doi:10.1145/3572848.3577490 2023

[67] [68]

Huijun Wu, Chen Wang, Yinjin Fu, Sherif Sakr, Kai Lu, and Liming Zhu. 2018. A Differentiated Caching Mechanism to Enable Primary Storage Deduplication in Clouds.IEEE Transactions on Parallel and Distributed Systems29, 6 (2018), 1202–1216. doi:10.1109/TPDS.2018.2790946

work page doi:10.1109/tpds.2018.2790946 2018

[68] [69]

XXHash. 2025. XXHash. https://xxhash.com/. Accessed on 20-01-2025

work page 2025

[69] [70]

Zhiguo Zhang, Lu Zhang, Guoqing Zhang, Ze Zhao, Hui Wang, and Feng Ju

work page

[70] [71]

Deduplication improves cost-efficiency and yields of de novo assembly and binning of shotgun metagenomes in microbiome research.Microbiology Spectrum11, 2 (2023), e04282–22

work page 2023

[71] [72]

Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. 2000. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microar- chitecture(Monterey, California, USA)(MICRO 33). Association for Computing Machinery, New York, NY , USA, 32–41. doi:10.1145...

work page doi:10.1145/360128.360134 2000