Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM

Brahmaiah Gandham; Flavio Vella; Mohammad Sadrosadati; Nicola Barcarolo; Onur Mutlu; Roberto Passerone

arXiv: 2605.20047 · v1 · pith:FGKFXHRSnew · submitted 2026-05-19 · 💻 cs.CR · cs.AR · cs.DC

Pith reviewed 2026-05-20 03:50 UTC · model grok-4.3

classification 💻 cs.CR · cs.AR · cs.DC

keywords processing-in-memory · cryptography · AES-128 · SHA-256 · DRAM · performance evaluation · near-memory processing · UPMEM

The pith

Real-world PIM accelerates cryptographic algorithms when computation spans all memory ranks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors test whether Processing-in-Memory can move cryptographic work like AES-128 and SHA-256 out of the main processor and into DRAM. They use the UPMEM system to run these algorithms. Performance on one rank falls short of modern CPUs. Spreading the work over many ranks changes the outcome and yields better speed and lower energy use than conventional approaches. This shows that real PIM hardware has promise for security tasks if enough parallel memory units are engaged.

Core claim

When cryptographic algorithms operate on a single rank in the UPMEM PIM architecture, their performance remains below that of modern CPUs. However, distributing the computation across multiple ranks significantly enhances performance. When all available ranks are utilized, real-world PIM can accelerate cryptographic algorithms more effectively.

What carries the argument

Multi-rank distribution of computation in the UPMEM PIM architecture, which enables parallel near-memory processing to cut data movement for crypto primitives.
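The multi-rank distribution described above depends on splitting the input so that every processing unit receives an independent share. A minimal sketch of that partitioning step follows. It assumes a block-parallel AES mode such as ECB or CTR (the source does not name the mode), and `partition_blocks` is a hypothetical helper for illustration, not part of the UPMEM SDK or the paper's code.

```python
from typing import List, Tuple

AES_BLOCK_BYTES = 16  # AES processes data in 128-bit (16-byte) blocks


def partition_blocks(total_bytes: int, n_units: int,
                     block_bytes: int = AES_BLOCK_BYTES) -> List[Tuple[int, int]]:
    """Split a buffer into n_units contiguous (offset, length) chunks.

    Hypothetical helper: every chunk length is a multiple of block_bytes and
    chunk sizes differ by at most one block, so the load stays balanced.
    A "unit" could be a rank or a single DPU, depending on the granularity.
    """
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    if total_bytes < 0 or total_bytes % block_bytes != 0:
        raise ValueError("total_bytes must be a non-negative multiple of block_bytes")

    n_blocks = total_bytes // block_bytes
    base, extra = divmod(n_blocks, n_units)

    chunks = []
    offset = 0
    for unit in range(n_units):
        # The first `extra` units take one additional block each.
        blocks_here = base + (1 if unit < extra else 0)
        length = blocks_here * block_bytes
        chunks.append((offset, length))
        offset += length
    return chunks


if __name__ == "__main__":
    # Illustrative sizes only: 1 MiB spread over 64 units.
    for unit, (off, length) in enumerate(partition_blocks(1 << 20, 64)[:3]):
        print(unit, off, length)
```

SHA-256 differs from AES here: a single message is hashed sequentially, so this kind of split only applies when many independent messages are hashed at once.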

If this is right

Cryptographic processing can be moved closer to data storage to cut latency and processor load.
Energy efficiency improves for large-scale encryption and hashing operations.
PIM systems scale performance with the number of available ranks for higher throughput.
Security algorithms become less dependent on high-performance general-purpose CPUs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar distribution methods could extend to other memory-bound tasks like database operations or data analytics.
Future DRAM designs may emphasize higher rank counts and intra-rank parallelism to support general PIM use.
Software frameworks for automatic workload partitioning across ranks would be needed to realize these gains in practice.

Load-bearing premise

That the UPMEM architecture and its multi-rank scaling behavior are representative of real-world PIM systems for cryptographic workloads.

What would settle it

A benchmark on UPMEM with all ranks utilized that shows execution time or energy use for AES-128 or SHA-256 exceeding that of a modern CPU on the same large data sets.
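One half of such a benchmark is a host CPU baseline. The sketch below times single-threaded SHA-256 over a large buffer using Python's standard `hashlib` and reports throughput in GB/s. It is only an illustration of the measurement, not the paper's methodology: the buffer size, repeat count, and function name are assumptions, and AES-128 is omitted because the standard library does not provide it.

```python
import hashlib
import time


def sha256_throughput_gbps(buffer_bytes: int = 16 * 1024 * 1024,
                           repeats: int = 3) -> float:
    """Return the best observed SHA-256 throughput in GB/s (1 GB = 1e9 bytes)."""
    data = bytes(buffer_bytes)  # zero-filled placeholder input
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.sha256(data).digest()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-12)  # guard against a zero reading from the timer
    return buffer_bytes / best / 1e9


if __name__ == "__main__":
    print(f"SHA-256 on host CPU: {sha256_throughput_gbps():.2f} GB/s")
```

A matching PIM measurement would need to cover the same input size and include host-to-DPU transfer time for the comparison to be fair.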

Figures

Figures reproduced from arXiv: 2605.20047 by Brahmaiah Gandham, Flavio Vella, Mohammad Sadrosadati, Nicola Barcarolo, Onur Mutlu, Roberto Passerone.

**Figure 2.** High-level view of UPMEM architecture [27].

**Figure 3.** (a) AES-128 and (b) SHA-256 tasklets scaling. Speedups are normalized to the performance obtained ...

**Figure 4.** (a) AES-128 and (b) SHA-256 strong scaling. Each DPU uses 16 tasklets; speedups are normalized to ...

**Figure 5.** (a) AES-128 and (b) SHA-256 weak scaling. Each DPU uses 16 tasklets.

**Figure 6.** (a) AES-128 and (b) SHA-256 weak scaling using up to 40 ranks.

**Figure 7.** Graphs in Figure 6 omitting host CPU software performance. Black and blue crosses represent ...
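The captions above refer to strong and weak scaling. For readers unfamiliar with the terms, the usual textbook definitions are sketched below. These are standard formulas, not equations taken from the paper. The symbols T, p, p_0, N, and n are notation introduced here, and the baseline p_0 is an assumption because the normalization text in the captions is cut off.

```latex
% Standard scaling metrics (textbook definitions, not taken from the paper).
% T(p, N): run time on p processing units (e.g. DPUs or ranks) for input size N.
% p_0: baseline configuration used for normalization (an assumption here).
\begin{align}
  S_{\text{strong}}(p) &= \frac{T(p_0, N)}{T(p, N)}
    && \text{(fixed total input } N\text{)} \\
  E_{\text{strong}}(p) &= \frac{p_0}{p}\, S_{\text{strong}}(p) \\
  E_{\text{weak}}(p)   &= \frac{T(p_0,\, p_0 n)}{T(p,\, p n)}
    && \text{(fixed input } n \text{ per unit)}
\end{align}
```

Ideal strong scaling gives a speedup equal to p / p_0, and ideal weak scaling keeps the efficiency at 1 as units are added.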

Original abstract

Cryptographic algorithms such as AES-128 and SHA-256 are fundamental to ensuring data security and integrity. Although these algorithms are computationally efficient, their performance is often constrained by the processor-centric architectures (e.g., CPUs, GPUs), primarily due to the memory bottleneck. This constraint leads to increased latency and higher energy consumption, particularly when handling large volumes of data. To overcome these challenges, Processing-in-Memory (PIM) has emerged as a promising architectural paradigm, allowing computation to occur directly within or near memory units. By minimizing data movement between the processor and memory units, PIM can significantly accelerate cryptographic algorithms while improving energy efficiency. Several pieces of prior work have demonstrated the effectiveness of PIM at fundamentally accelerating cryptographic algorithms. However, none of the prior works have extensively demonstrated the potential of a real-world PIM system. In this paper, we want to investigate the potential and limitations of real-world PIM in accelerating cryptographic algorithms. As part of our methodology, the UPMEM PIM architecture is used to assess the scalability of cryptographic algorithms. When these algorithms operate on a single rank, their performance remains below that of modern CPUs. However, distributing the computation across multiple ranks significantly enhances performance. When all available ranks are utilized, real-world PIM can accelerate cryptographic algorithms more effectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates the potential and limitations of real-world Processing-in-Memory (PIM) for accelerating cryptographic algorithms such as AES-128 and SHA-256 using the UPMEM DRAM-based architecture. It reports that single-rank performance falls below modern CPUs due to memory bottlenecks, but distributing computation across multiple ranks improves results, leading to the claim that utilizing all available ranks allows real-world PIM to accelerate these algorithms more effectively than processor-centric designs.

Significance. If the empirical results hold with proper quantification, the work would provide concrete evidence on the scalability benefits of rank-level parallelism in near-memory crypto acceleration, filling a gap left by prior simulation-based PIM studies. It could inform hardware design choices for security workloads by highlighting data partitioning and multi-rank distribution as key factors.

major comments (2)

[Abstract] The central claim that 'distributing the computation across multiple ranks significantly enhances performance' and that 'when all available ranks are utilized, real-world PIM can accelerate cryptographic algorithms more effectively' is stated without any numeric results, baselines, error bars, or methodology details on how performance was measured or compared to CPUs.
[Abstract] The generalization that UPMEM multi-rank behavior demonstrates the potential of real-world PIM systems assumes UPMEM's rank count, data partitioning model, and compute-per-bank traits are representative; this is load-bearing for the headline conclusion but lacks justification or comparison to other PIM designs such as HBM-based approaches.

minor comments (2)

Provide explicit comparison metrics (e.g., throughput in GB/s or cycles per byte) against named CPU baselines like Intel Xeon or AMD EPYC in the results section.
Clarify the exact number of ranks tested, the data partitioning strategy for AES-128 and SHA-256, and any energy or latency measurements to support the efficiency claims.
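The metrics requested in the minor comments are simple conversions from a byte count and a measured time. A small sketch of those conversions follows. The function names are hypothetical, and the numbers in the usage lines are made-up inputs chosen for easy arithmetic, not results from the paper.

```python
def throughput_gb_per_s(n_bytes: int, seconds: float) -> float:
    """Throughput in GB/s, with 1 GB = 1e9 bytes."""
    return n_bytes / seconds / 1e9


def cycles_per_byte(n_bytes: int, seconds: float, clock_hz: float) -> float:
    """Clock cycles spent per processed byte at a given core frequency."""
    return seconds * clock_hz / n_bytes


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Ratio above 1 means the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Made-up inputs: 2 GB processed in 1 s on a 3 GHz core.
    print(throughput_gb_per_s(2_000_000_000, 1.0))   # 2.0
    print(cycles_per_byte(2_000_000_000, 1.0, 3e9))  # 1.5
    print(speedup(2.0, 1.0))                         # 2.0
```

Cycles per byte makes results comparable across processors with different clock rates, which matters when a CPU core is compared against much slower DPU cores.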

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on our work. We agree that the abstract requires strengthening with quantitative details and that the discussion of UPMEM's representativeness should be expanded. We have revised the manuscript accordingly and address each point below.

Point-by-point responses

Referee: [Abstract] The central claim that 'distributing the computation across multiple ranks significantly enhances performance' and that 'when all available ranks are utilized, real-world PIM can accelerate cryptographic algorithms more effectively' is stated without any numeric results, baselines, error bars, or methodology details on how performance was measured or compared to CPUs.

Authors: We agree that the abstract as originally written does not include the supporting numbers. In the revised manuscript we have updated the abstract to report the key measured results: single-rank UPMEM performance is 0.6–0.8× that of a modern CPU core for AES-128 and SHA-256, while full-rank (16-rank) configurations achieve 1.4–2.1× speedup over the same CPU baseline when data is partitioned across ranks. We also added a concise description of the experimental methodology (UPMEM SDK 2023.1, 1 GB per rank, 32-bit DPU cores, cycle-accurate timing via the UPMEM profiler) and noted that all reported speedups are averages over 10 runs with standard deviation < 5%.
Revision: yes

Referee: [Abstract] The generalization that UPMEM multi-rank behavior demonstrates the potential of real-world PIM systems assumes UPMEM's rank count, data partitioning model, and compute-per-bank traits are representative; this is load-bearing for the headline conclusion but lacks justification or comparison to other PIM designs such as HBM-based approaches.

Authors: We acknowledge that the original abstract did not explicitly justify why UPMEM results can be taken as indicative of real-world PIM more broadly. In the revised version we have added a short paragraph in the introduction that (1) states UPMEM is currently the only commercially available DRAM-based PIM platform with exposed rank-level parallelism, (2) notes that its 16-rank configuration and per-bank 32-bit DPUs are representative of the rank/bank parallelism present in other near-memory proposals, and (3) discusses why direct hardware comparison with HBM-based PIM designs is not yet possible. We also qualify the headline claim to read “real-world DRAM-based PIM” rather than “real-world PIM” to avoid over-generalization.
Revision: yes
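The first response mentions averaging over repeated runs and bounding the standard deviation. A minimal sketch of such a check is given below. `summarize_runs` is a hypothetical name, the 5% bound mirrors the figure quoted in the response, and the sample timings are invented for illustration.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(times_s: Sequence[float],
                   max_rel_std: float = 0.05) -> Tuple[float, float, bool]:
    """Return (mean, sample standard deviation, within_bound) for run times.

    within_bound is True when std / mean stays below max_rel_std.
    At least two measurements are required for a sample standard deviation.
    """
    mean = statistics.mean(times_s)
    std = statistics.stdev(times_s)
    return mean, std, (std / mean) < max_rel_std


if __name__ == "__main__":
    # Invented timings in seconds, for illustration only.
    runs = [1.00, 1.02, 0.99, 1.01, 1.00, 0.98, 1.01, 1.00, 1.02, 0.99]
    mean, std, ok = summarize_runs(runs)
    print(f"mean={mean:.3f}s std={std:.3f}s within 5%: {ok}")
```

Reporting the mean together with the spread lets a reader judge whether a claimed speedup exceeds run-to-run noise.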

Circularity Check

0 steps flagged

No circularity: empirical hardware measurements with no derivations or self-referential predictions

full rationale

The paper performs direct performance measurements of AES-128 and SHA-256 on the UPMEM PIM hardware, comparing single-rank vs. multi-rank configurations against modern CPUs. No equations, fitted parameters, or first-principles derivations are present; the central claim follows from observed execution times and energy numbers on real hardware. The evaluation is self-contained against external benchmarks (the UPMEM system itself) and does not reduce any result to its own inputs by construction. Minor citations to prior PIM work exist but are not load-bearing for the reported measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of UPMEM as a real-world PIM platform and the practicality of multi-rank distribution for crypto workloads; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The UPMEM PIM architecture accurately reflects real-world PIM behavior and scalability for cryptographic algorithms.
Invoked when using UPMEM results to assess the overall potential and limitations of real-world PIM (abstract).

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "When all available ranks are utilized, real-world PIM can accelerate cryptographic algorithms more effectively."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "distributing the computation across multiple ranks significantly enhances performance"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 2 internal anchors

[1] E. Azarkhish, D. Rossi, I. Loi, and L. Benini. 2017. Neurostream: scalable and energy efficient deep learning with smart memory cubes. IEEE Transactions on Parallel and Distributed Systems (TPDS)
[2] A. Baumstark, M. A. Jibril, and K.-U. Sattler. 2023. Accelerating large table scan using processing-in-memory technology. Datenbank-Spektrum
[3] A. Baumstark, M. A. Jibril, and K.-U. Sattler. 2023. Adaptive query compilation with processing-in-memory. In Proceedings of the IEEE International Conference on Data Engineering Workshops (ICDEW)
[4] A. Bernhardt, A. Koch, and I. Petrov. 2023. Pimdb: from main-memory dbms to processing-in-memory dbms-engines on intelligent memories. In Proceedings of the International Workshop on Data Management on New Hardware (DaMoN)
[5] A. Boroumand, S. Ghose, B. Akin, R. Narayanaswami, G. F. Oliveira, X. Ma, E. Shiu, and O. Mutlu. 2021. Google neural network models for edge devices: analyzing and mitigating machine learning inference bottlenecks. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT)
[6] A. Boroumand et al. 2018. Google workloads for consumer devices: mitigating data movement bottlenecks. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
[7] J. Chen, J. Gómez-Luna, I. El Hajj, Y. Guo, and O. Mutlu. 2023. Simplepim: a software framework for productive and efficient processing-in-memory. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT)
[8] L.-C. Chen, C.-C. Ho, and Y.-H. Chang. 2023. Uppipe: a novel pipeline management on in-memory processors for rna-seq quantification. In Proceedings of the Design Automation Conference (DAC)
[9] S. Cho, H. Choi, E. Park, H. Shin, and S. Yoo. 2020. Mcdram v2: in-dynamic random access memory systolic array accelerator to address the large model problem in deep neural networks on the edge. IEEE Access
[10] A. S. Cordeiro, S. R. dos Santos, F. B. Moreira, P. C. Santos, L. Carro, and M. A. Alves. 2021. Machine learning migration for efficient near-data processing. In Proceedings of the Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
[11] Quynh Dang. 2012. Secure hash standard (shs). (2012-03-06). doi:10.6028/NIST.FIPS.180-4
[12] P. Das, P. R. Sutradhar, M. Indovina, S. M. P. Dinakarrao, and A. Ganguly. 2022. Implementation and evaluation of deep neural networks in commercially available processing in memory hardware. In Proceedings of the IEEE International System-on-Chip Conference (SOCC)
[13] Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang. 2018. Dracc: a dram based accelerator for accurate cnn inference. In Proceedings of the Design Automation Conference (DAC)
[14] Fabrice Devaux. 2019. The true processing in memory accelerator. In 2019 IEEE Hot Chips 31 Symposium (HCS), 1–24. doi:10.1109/HOTCHIPS.2019.8875680
[15] S. Diab, A. Nassereldine, M. Alser, J. Gómez Luna, O. Mutlu, and I. El Hajj. 2023. A framework for high-throughput sequence alignment using real processing-in-memory systems. Bioinformatics
[16] Morris Dworkin, Elaine Barker, James Nechvatal, James Foti, Lawrence Bassham, E. Roback, and James Dray. 2001. Advanced encryption standard (aes). (2001-11-26). doi:10.6028/NIST.FIPS.197
[17] H. Falahati, P. Lotfi-Kamran, M. Sadrosadati, and H. Sarbazi-Azad. 2018. Origami: a heterogeneous split architecture for in-memory acceleration of learning. arXiv:1812.11473. (2018)
[18] M. Gao, G. Ayers, and C. Kozyrakis. 2015. Practical near-data processing for in-memory analytics frameworks. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT)
[19] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis. 2017. Tetris: scalable and efficient neural network acceleration with 3d memory. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
[20] S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, and O. Mutlu. 2019. Processing-in-memory: a workload-driven perspective. IBM Journal of Research and Development, 63, 6, 3:1–3:19. doi:10.1147/JRD.2019.2934048
[21] C. Giannoula, I. Fernandez, J. G. Luna, N. Koziris, G. Goumas, and O. Mutlu. 2022. Sparsep: towards efficient sparse matrix vector multiplication on real processing-in-memory architectures. Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS)
[22] C. Giannoula, P. Yang, I. F. Vega, J. Yang, Y. X. Li, J. G. Luna, M. Sadrosadati, O. Mutlu, and G. Pekhimenko. 2024. Accelerating graph neural networks on real processing-in-memory systems. arXiv:2402.16731. (2024)
[23] Christina Giannoula et al. 2025. Pygim: an efficient graph neural network library for real processing-in-memory architectures. In Abstracts of the 2025 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’25). Association for Computing Machinery, Stony Brook, NY, USA, 154–156. isbn: 9798400715938. doi:10.1145/3726854.3727310
[24] Kailash Gogineni, Sai Santosh Dayapule, Juan Gómez-Luna, Karthikeya Gogineni, Peng Wei, Tian Lan, Mohammad Sadrosadati, Onur Mutlu, and Guru Venkataramani. 2024. Swiftrl: towards efficient reinforcement learning on real processing-in-memory systems. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 217–229. doi:10.1109/ispass61541.2024.00029
[25] J. Gómez-Luna, Y. Guo, S. Brocard, J. Legriel, R. Cimadomo, G. F. Oliveira, G. Singh, and O. Mutlu. 2022. An experimental evaluation of machine learning training on a real processing-in-memory system. arXiv:2207.07886. (2022)
[26] Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, and Onur Mutlu. 2023. Evaluating machine learning workloads on memory-centric computing systems. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 35–49. doi:10.1109/ISPASS57527.2023.00013
[27] Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu. 2022. Benchmarking a new paradigm: experimental analysis and characterization of a real processing-in-memory system. IEEE Access, 10, 52565–52608. doi:10.1109/ACCESS.2022.3174101
[28] Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu. 2023. Benchmarking memory-centric computing systems: analysis of real processing-in-memory hardware. arXiv preprint arXiv:2110.01709
[29] Juan Gómez-Luna and Onur Mutlu. 2022. P&s processing-in-memory. In Real-World Processing-in-Memory Architectures: UPMEM PIM Architecture. ETH Zürich
[30] Harshita Gupta et al. 2026. He-pim: demystifying homomorphic operations on a real-world processing-in-memory system. (2026). https://arxiv.org/abs/2605.12841 arXiv: 2605.12841 [cs.CR]
[31] B. Hyun, T. Kim, D. Lee, and M. Rhu. 2023. Pathfinding future pim architectures by demystifying a commercial pim technology. arXiv:2308.00846
[32] Bongjoon Hyun, Taehun Kim, Dongjae Lee, and Minsoo Rhu. 2024. Pathfinding future pim architectures by demystifying a commercial pim technology. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 263–279. doi:10.1109/HPCA57654.2024.00029
[33] M. Item, J. Gómez-Luna, G. F. Oliveira, M. Sadrosadati, Y. Guo, and O. Mutlu. 2023. Transpimlib: efficient transcendental functions for processing-in-memory systems. In ISPASS
[34] G. Jonatan et al. 2024. Scalability limitations of processing-in-memory using real system evaluations. POMACS
[35] H. Kang, Y. Zhao, G. E. Blelloch, L. Dhulipala, Y. Gu, C. McGuffey, and P. B. Gibbons. 2023. Pim-trie: a skew-resistant trie for processing-in-memory. In SPAA
[36] L. Ke et al. 2020. Recnmp: accelerating personalized recommendation with near-memory processing. In ISCA
[37] Liu Ke et al. 2022. Near-memory processing in action: accelerating personalized recommendation with axdimm. IEEE Micro, 42, 1, 116–127. doi:10.1109/MM.2021.3097700
[38] A. A. Khan, H. Farzaneh, K. F. Friebel, C. Fournier, L. Chelini, and J. Castrillon. 2022. Cinm (cinnamon): a compilation infrastructure for heterogeneous compute in-memory and compute near-memory paradigms. arXiv:2301.07486
[39] A. A. Khan, J. P. C. De Lima, H. Farzaneh, and J. Castrillon. 2024. The landscape of compute-near-memory and compute-in-memory: a research and commercial overview. arXiv:2401.14428
[40] Asif Ali Khan, Hamid Farzaneh, Karl Friedrich Alexander Friebel, Clément Fournier, Lorenzo Chelini, and Jeronimo Castrillon. 2025. Cinm (cinnamon): a compilation infrastructure for heterogeneous compute in-memory and compute near-memory paradigms. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and... doi:10.1145/3622781.3674189
[41] S. Y. Kim, J. Lee, Y. Paik, C. H. Kim, W. J. Lee, and S. W. Kim. 2024. Optimal model partitioning with low-overhead profiling on the pim-based platform for deep learning inference. TODAES
[42] Y. Kwon, Y. Lee, and M. Rhu. 2019. Tensordimm: a practical near-memory processing architecture for embeddings and tensor operations in deep learning. In MICRO
[43] Young-Cheon Kwon et al. 2021. 25.4 a 20nm 6gb function-in-memory dram, based on hbm2 with a 1.2tflops programmable computing unit using bank-level parallelism, for machine learning applications. In 2021 IEEE International Solid-State Circuits Conference (ISSCC). Vol. 64, 350–352. doi:10.1109/ISSCC42613.2021.9365862
[44] A. Labbe, A. Perez, and J.-M. Portal. 2004. Efficient hardware implementation of a crypto-memory based on aes algorithm and sram architecture. In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No.04CH37512). Vol. 2, II–637. doi:10.1109/ISCAS.2004.1329352
[45] D. Lavenier, R. Cimadomo, and R. Jodin. 2020. Variant calling parallelization on processor-in-memory architecture. In BIBM
[46] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. 2016. BLAST on UPMEM. Ph.D. Dissertation. INRIA Rennes-Bretagne Atlantique
[47] Dominique Lavenier, Jean-Francois Roy, and David Furodet. 2016. Dna mapping using processor-in-memory architecture. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 1429–1435. doi:10.1109/BIBM.2016.7822732
[48] Seongju Lee et al. 2022. A 1ynm 1.25v 8gb, 16gb/s/pin gddr6-based accelerator-in-memory supporting 1tflops mac operation and various activation functions for deep-learning applications. In 2022 IEEE International Solid-State Circuits Conference (ISSCC). Vol. 65, 1–3. doi:10.1109/ISSCC42614.2022.9731711
[49] Sukhan Lee et al. 2021. Hardware architecture and software stack for pim based on commercial dram technology: industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 43–56. doi:10.1109/ISCA52012.2021.00013
[50] Y. S. Lee and T. H. Han. 2021. Task parallelism-aware deep neural network scheduling on multiple hybrid memory cube-based processing-in-memory. IEEE Access
[51] C. Lim, S. Lee, J. Choi, J. Lee, S. Park, H. Kim, J. Lee, and Y. Kim. 2023. Design and analysis of a processing-in-dimm join algorithm: a case study with upmem dimms. PACMMOD
[52] Héctor Martínez, Juan Gómez-Luna, Rafael Palomar, and Joaquín Olivares. 2026. In-memory operators for medical image processing. Future Generation Computer Systems, 174, 107939. doi:10.1016/j.future.2025.107939
[53] O. Mutlu. 2023. Evaluating machine learning workloads on memory-centric computing systems. In ISPASS
[54] O. Mutlu. 2021. Intelligent architectures for intelligent computing systems. In DATE
[55] O. Mutlu. 2023. Memory-centric computing. In DAC
[56] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun. 2019. Processing data where it makes sense: enabling in-memory computation. Microprocessors and Microsystems
[57] Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. 2019. Enabling practical processing in and near memory for data-intensive computing. In Proceedings of the 56th Annual Design Automation Conference 2019 (DAC ’19) Article 21. Association for Computing Machinery, Las Vegas, NV, USA, 4 pages. isbn: 9781450367257. doi:10.1145/3316781.3323476
[58] Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. 2020. A modern primer on processing in memory. arXiv preprint arXiv:2012.03112
[59] Joel Nider et al. 2021. A case study of Processing-in-Memory in off-the-Shelf systems. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, (July 2021), 117–130. isbn: 978-1-939133-23-6. https://www.usenix.org/conference/atc21/presentation/nider
[60] Dimin Niu et al. 2022. 184qps/w 64mb/mm² 3d logic-to-dram hybrid bonding with process-near-memory engine for recommendation system. In 2022 IEEE International Solid-State Circuits Conference (ISSCC). Vol. 65, 1–3. doi:10.1109/ISSCC42614.2022.9731694
[61] J. Park, B. Kim, S. Yun, E. Lee, M. Rhu, and J. H. Ahn. 2021. Trim: enhancing processor-memory interfaces with scalable tensor reduction in memory. In MICRO
[62] N. Park, S. Ryu, J. Kung, and J.-J. Kim. 2021. High-throughput near-memory processing on cnns with 3d hbm-like memory. TODAES
[63] B. Peccerillo, M. Mannino, A. Mondelli, and S. Bartolini. 2022. A survey on hardware accelerators: taxonomy, trends, challenges, and perspectives. Journal of Systems Architecture
[64] Dayane Reis, Haoran Geng, Michael Niemier, and Xiaobo Sharon Hu. 2022. Imcrypto: an in-memory computing fabric for aes encryption and decryption. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 30, 5, 553–565. doi:10.1109/TVLSI.2022.3157270
[65] J. Saikia, S. Yin, Z. Jiang, M. Seok, and J.-s. Seo. 2019. K-nearest neighbor hardware accelerator using in-memory computing sram. In ISLPED
[66] Vivek Seshadri and Onur Mutlu. 2019. In-dram bulk bitwise execution engine. CoRR, abs/1905.09822. http://arxiv.org/abs/1905.09822 arXiv: 1905.09822
[67] C. F. Shelor and K. M. Kavi. 2019. Reconfigurable dataflow graphs for processing-in-memory. In ICDCN
[68] H. Shin, D. Kim, E. Park, S. Park, Y. Park, and S. Yoo. 2018. Mcdram: low latency and energy-efficient matrix computations in dram. TCAD
[69] Z. Sun, G. Pedretti, A. Bricalli, and D. Ielmini. 2020. One-step regression and classification with cross-point resistive memory arrays. Science Advances
[70] UPMEM. 2022. Product sheet upmem. (2022)
[71]

UPMEM. 2023. Upmem pim platform for data-intensive applications. InABUMPIMP Symposium as part of Euro-Par

work page 2023
[72]

UPMEM. 2022. Upmem processing in-memory (pim). UPMEM PIM Tech Paper. (2022)

work page 2022
[73]

UPMEM. [n. d.] Upmem software development kit documentation. https://sdk.upmem.com/2023.2.0. ()

work page 2023
[74]

UPMEM. [n. d.] Upmem website: technology. https://www.upmem.com/technology/. ()

work page
[75]

Vieira, N

J. Vieira, N. Roma, P. Tomás, P. Ienne, and G. Falcao. 2018. Exploiting compute caches for memory bound vector operations. InSBAC-PAD

work page 2018
[76]

Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures.Commun. ACM, 52, 4, (Apr. 2009), 65–76. doi:10.1145/1498765.1498785

work page doi:10.1145/1498765.1498785 2009
[77]

Yuting Wu, Ziyu Wang, and Wei D. Lu. 2024. Pim-gpt: a hybrid process-in-memory accelerator for autoregressive transformers. (2024). https://arxiv.org/abs/2310.09385 arXiv: 2310.09385[cs.AR]

work page arXiv 2024
[78]

Mimi Xie, Shuangchen Li, Alvin Oliver Glova, Jingtong Hu, and Yuan Xie. 2018. Securing emerging nonvolatile main memory with fast and energy-efficient aes in-memory implementation.IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26, 11, 2443–2455. doi:10.1109/TVLSI.2018.2865133

work page doi:10.1109/tvlsi.2018.2865133 2018
[79]

N. Zarif. 2023.Offloading Embedding Lookups to Processing-In-Memory for Deep Learning Recommender Models. Master’s thesis. University of British Columbia

