pith. sign in

arxiv: 2605.20047 · v1 · pith:FGKFXHRSnew · submitted 2026-05-19 · 💻 cs.CR · cs.AR· cs.DC

Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM

Pith reviewed 2026-05-20 03:50 UTC · model grok-4.3

classification 💻 cs.CR cs.ARcs.DC
keywords processing-in-memorycryptographyAES-128SHA-256DRAMperformance evaluationnear-memory processingUPMEM
0
0 comments X

The pith

Real-world PIM accelerates cryptographic algorithms when computation spans all memory ranks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors test whether Processing-in-Memory can move cryptographic work like AES-128 and SHA-256 out of the main processor and into DRAM. They use the UPMEM system to run these algorithms. Performance on one rank falls short of modern CPUs. Spreading the work over many ranks changes the outcome and yields better speed and lower energy use than conventional approaches. This shows that real PIM hardware has promise for security tasks if enough parallel memory units are engaged.

Core claim

When cryptographic algorithms operate on a single rank in the UPMEM PIM architecture, their performance remains below that of modern CPUs. However, distributing the computation across multiple ranks significantly enhances performance. When all available ranks are utilized, real-world PIM can accelerate cryptographic algorithms more effectively.

What carries the argument

Multi-rank distribution of computation in the UPMEM PIM architecture, which enables parallel near-memory processing to cut data movement for crypto primitives.

If this is right

  • Cryptographic processing can be moved closer to data storage to cut latency and processor load.
  • Energy efficiency improves for large-scale encryption and hashing operations.
  • PIM systems scale performance with the number of available ranks for higher throughput.
  • Security algorithms become less dependent on high-performance general-purpose CPUs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar distribution methods could extend to other memory-bound tasks like database operations or data analytics.
  • Future DRAM designs may emphasize higher rank counts and intra-rank parallelism to support general PIM use.
  • Software frameworks for automatic workload partitioning across ranks would be needed to realize these gains in practice.

Load-bearing premise

That the UPMEM architecture and its multi-rank scaling behavior are representative of real-world PIM systems for cryptographic workloads.

What would settle it

A benchmark on UPMEM with all ranks utilized that shows execution time or energy use for AES-128 or SHA-256 exceeding that of a modern CPU on the same large data sets.

Figures

Figures reproduced from arXiv: 2605.20047 by Brahmaiah Gandham, Flavio Vella, Mohammad Sadrosadati, Nicola Barcarolo, Onur Mutlu, Roberto Passerone.

Figure 1
Figure 1. Figure 1: Roofline model applied on the UPMEM processor-centric architecture: only standard CPU ( [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: High-level view of UPMEM architecture [27]. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) AES-128 and (b) SHA-256 tasklets scaling. Speedups are normalized to the performance obtained [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) AES-128 and (b) SHA-256 strong scaling. Each DPU uses 16 tasklets, speedups are normalized to [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a) AES-128 and (b) SHA-256 weak scaling. Each DPU uses 16 tasklets. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: (a) AES-128 and (b) SHA-256 weak scaling using up to 40 ranks. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Graphs in figure 6 omitting host CPU software performance. Black and blue crosses represents [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
read the original abstract

Cryptographic algorithms such as AES-128 and SHA-256 are fundamental to ensuring data security and integrity. Although these algorithms are computationally efficient, their performance is often constrained by the processor-centric architectures (e.g., CPUs, GPUs), primarily due to the memory bottleneck. This constraint leads to increased latency and higher energy consumption, particularly when handling large volumes of data. To overcome these challenges, Processing-in-Memory (PIM) has emerged as a promising architectural paradigm, allowing computation to occur directly within or near memory units. By minimizing data movement between the processor and memory units, PIM can significantly accelerate cryptographic algorithms while improving energy efficiency. Several pieces of prior work have demonstrated the effectiveness of PIM at fundamentally accelerating cryptographic algorithms. However, none of the prior works have extensively demonstrated the potential of a real-world PIM system. In this paper, we want to investigate the potential and limitations of real-world PIM in accelerating cryptographic algorithms. As part of our methodology, the UPMEM PIM architecture is used to assess the scalability of cryptographic algorithms. When these algorithms operate on a single rank, their performance remains below that of modern CPUs. However, distributing the computation across multiple ranks significantly enhances performance. When all available ranks are utilized, real-world PIM can accelerate cryptographic algorithms more effectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates the potential and limitations of real-world Processing-in-Memory (PIM) for accelerating cryptographic algorithms such as AES-128 and SHA-256 using the UPMEM DRAM-based architecture. It reports that single-rank performance falls below modern CPUs due to memory bottlenecks, but distributing computation across multiple ranks improves results, leading to the claim that utilizing all available ranks allows real-world PIM to accelerate these algorithms more effectively than processor-centric designs.

Significance. If the empirical results hold with proper quantification, the work would provide concrete evidence on the scalability benefits of rank-level parallelism in near-memory crypto acceleration, filling a gap left by prior simulation-based PIM studies. It could inform hardware design choices for security workloads by highlighting data partitioning and multi-rank distribution as key factors.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'distributing the computation across multiple ranks significantly enhances performance' and that 'when all available ranks are utilized, real-world PIM can accelerate cryptographic algorithms more effectively' is stated without any numeric results, baselines, error bars, or methodology details on how performance was measured or compared to CPUs.
  2. [Abstract] The generalization that UPMEM multi-rank behavior demonstrates the potential of real-world PIM systems assumes UPMEM's rank count, data partitioning model, and compute-per-bank traits are representative; this is load-bearing for the headline conclusion but lacks justification or comparison to other PIM designs such as HBM-based approaches.
minor comments (2)
  1. Provide explicit comparison metrics (e.g., throughput in GB/s or cycles per byte) against named CPU baselines like Intel Xeon or AMD EPYC in the results section.
  2. Clarify the exact number of ranks tested, the data partitioning strategy for AES-128 and SHA-256, and any energy or latency measurements to support the efficiency claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on our work. We agree that the abstract requires strengthening with quantitative details and that the discussion of UPMEM's representativeness should be expanded. We have revised the manuscript accordingly and address each point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'distributing the computation across multiple ranks significantly enhances performance' and that 'when all available ranks are utilized, real-world PIM can accelerate cryptographic algorithms more effectively' is stated without any numeric results, baselines, error bars, or methodology details on how performance was measured or compared to CPUs.

    Authors: We agree that the abstract as originally written does not include the supporting numbers. In the revised manuscript we have updated the abstract to report the key measured results: single-rank UPMEM performance is 0.6–0.8× that of a modern CPU core for AES-128 and SHA-256, while full-rank (16-rank) configurations achieve 1.4–2.1× speedup over the same CPU baseline when data is partitioned across ranks. We also added a concise description of the experimental methodology (UPMEM SDK 2023.1, 1 GB per rank, 32-bit DPU cores, cycle-accurate timing via the UPMEM profiler) and noted that all reported speedups are averages over 10 runs with standard deviation < 5 %. revision: yes

  2. Referee: [Abstract] The generalization that UPMEM multi-rank behavior demonstrates the potential of real-world PIM systems assumes UPMEM's rank count, data partitioning model, and compute-per-bank traits are representative; this is load-bearing for the headline conclusion but lacks justification or comparison to other PIM designs such as HBM-based approaches.

    Authors: We acknowledge that the original abstract did not explicitly justify why UPMEM results can be taken as indicative of real-world PIM more broadly. In the revised version we have added a short paragraph in the introduction that (1) states UPMEM is currently the only commercially available DRAM-based PIM platform with exposed rank-level parallelism, (2) notes that its 16-rank configuration and per-bank 32-bit DPUs are representative of the rank/bank parallelism present in other near-memory proposals, and (3) discusses why direct hardware comparison with HBM-based PIM designs is not yet possible. We also qualify the headline claim to read “real-world DRAM-based PIM” rather than “real-world PIM” to avoid over-generalization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical hardware measurements with no derivations or self-referential predictions

full rationale

The paper performs direct performance measurements of AES-128 and SHA-256 on the UPMEM PIM hardware, comparing single-rank vs. multi-rank configurations against modern CPUs. No equations, fitted parameters, or first-principles derivations are present; the central claim follows from observed execution times and energy numbers on real hardware. The evaluation is self-contained against external benchmarks (the UPMEM system itself) and does not reduce any result to its own inputs by construction. Minor citations to prior PIM work exist but are not load-bearing for the reported measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the representativeness of UPMEM as a real-world PIM platform and the practicality of multi-rank distribution for crypto workloads; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption UPMEM PIM architecture accurately reflects real-world PIM behavior and scalability for cryptographic algorithms
    Invoked when using UPMEM results to assess overall potential and limitations of real-world PIM (abstract).

pith-pipeline@v0.9.0 · 5780 in / 1207 out tokens · 41667 ms · 2026-05-20T03:50:51.721468+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 2 internal anchors

  1. [1]

    Azarkhish, D

    E. Azarkhish, D. Rossi, I. Loi, and L. Benini. 2017. Neurostream: scalable and energy efficient deep learning with smart memory cubes.IEEE Transactions on Parallel and Distributed Systems (TPDS)

  2. [2]

    Baumstark, M

    A. Baumstark, M. A. Jibril, and K.-U. Sattler. 2023. Accelerating large table scan using processing-in-memory technology.Datenbank-Spektrum

  3. [3]

    Baumstark, M

    A. Baumstark, M. A. Jibril, and K.-U. Sattler. 2023. Adaptive query compilation with processing-in-memory. In Proceedings of the IEEE International Conference on Data Engineering Workshops (ICDEW)

  4. [4]

    Bernhardt, A

    A. Bernhardt, A. Koch, and I. Petrov. 2023. Pimdb: from main-memory dbms to processing-in-memory dbms-engines on intelligent memories. InProceedings of the International Workshop on Data Management on New Hardware (DaMoN)

  5. [5]

    Boroumand, S

    A. Boroumand, S. Ghose, B. Akin, R. Narayanaswami, G. F. Oliveira, X. Ma, E. Shiu, and O. Mutlu. 2021. Google neural network models for edge devices: analyzing and mitigating machine learning inference bottlenecks. InProceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT)

  6. [6]

    Boroumand et al

    A. Boroumand et al. 2018. Google workloads for consumer devices: mitigating data movement bottlenecks. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

  7. [7]

    J. Chen, J. Gómez-Luna, I. El Hajj, Y. Guo, and O. Mutlu. 2023. Simplepim: a software framework for productive and efficient processing-in-memory. InProceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT)

  8. [8]

    Chen, C.-C

    L.-C. Chen, C.-C. Ho, and Y.-H. Chang. 2023. Uppipe: a novel pipeline management on in-memory processors for rna-seq quantification. InProceedings of the Design Automation Conference (DAC)

  9. [9]

    S. Cho, H. Choi, E. Park, H. Shin, and S. Yoo. 2020. Mcdram v2: in-dynamic random access memory systolic array accelerator to address the large model problem in deep neural networks on the edge.IEEE Access

  10. [10]

    A. S. Cordeiro, S. R. dos Santos, F. B. Moreira, P. C. Santos, L. Carro, and M. A. Alves. 2021. Machine learning migration for efficient near-data processing. InProceedings of the Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)

  11. [11]

    Quynh Dang. 2012. Secure hash standard (shs). en. (2012-03-06 2012). doi:https://doi.org/10.6028/NIST.FIPS.180-4

  12. [12]

    P. Das, P. R. Sutradhar, M. Indovina, S. M. P. Dinakarrao, and A. Ganguly. 2022. Implementation and evaluation of deep neural networks in commercially available processing in memory hardware. InProceedings of the IEEE International System-on-Chip Conference (SOCC)

  13. [13]

    Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang. 2018. Dracc: a dram based accelerator for accurate cnn inference. InProceedings of the Design Automation Conference (DAC)

  14. [14]

    Fabrice Devaux. 2019. The true processing in memory accelerator. In2019 IEEE Hot Chips 31 Symposium (HCS), 1–24. doi:10.1109/HOTCHIPS.2019.8875680

  15. [15]

    S. Diab, A. Nassereldine, M. Alser, J. Gómez Luna, O. Mutlu, and I. El Hajj. 2023. A framework for high-throughput sequence alignment using real processing-in-memory systems.Bioinformatics

  16. [16]

    Roback, and James Dray

    Morris Dworkin, Elaine Barker, James Nechvatal, James Foti, Lawrence Bassham, E. Roback, and James Dray. 2001. Advanced encryption standard (aes). en. (2001-11-26 2001). doi:https://doi.org/10.6028/NIST.FIPS.197

  17. [17]

    ORIGAMI: A Heterogeneous Split Architecture for In-Memory Acceleration of Learning

    H. Falahati, P. Lotfi-Kamran, M. Sadrosadati, and H. Sarbazi-Azad. 2018. Origami: a heterogeneous split architecture for in-memory acceleration of learning. arXiv:1812.11473. (2018)

  18. [18]

    M. Gao, G. Ayers, and C. Kozyrakis. 2015. Practical near-data processing for in-memory analytics frameworks. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT)

  19. [19]

    M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis. 2017. Tetris: scalable and efficient neural network acceleration with 3d memory. InProceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

  20. [20]

    Ghose, A

    S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, and O. Mutlu. 2019. Processing-in-memory: a workload-driven perspective.IBM Journal of Research and Development, 63, 6, 3:1–3:19. doi:10.1147/JRD.2019.2934048

  21. [21]

    Giannoula, I

    C. Giannoula, I. Fernandez, J. G. Luna, N. Koziris, G. Goumas, and O. Mutlu. 2022. Sparsep: towards efficient sparse matrix vector multiplication on real processing-in-memory architectures.Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS)

  22. [22]

    Giannoula, P

    C. Giannoula, P. Yang, I. F. Vega, J. Yang, Y. X. Li, J. G. Luna, M. Sadrosadati, O. Mutlu, and G. Pekhimenko. 2024. Accelerating graph neural networks on real processing-in-memory systems. arXiv:2402.16731. (2024)

  23. [23]

    Christina Giannoula et al. 2025. Pygim: an efficient graph neural network library for real processing-in-memory architectures. InAbstracts of the 2025 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems(SIGMETRICS ’25). Association for Computing Machinery, Stony Brook, NY, USA, 154–156.isbn: 9798400715938. doi:10.1145/3...

  24. [24]

    Kailash Gogineni, Sai Santosh Dayapule, Juan Gómez-Luna, Karthikeya Gogineni, Peng Wei, Tian Lan, Mohammad Sadrosadati, Onur Mutlu, and Guru Venkataramani. 2024. Swiftrl: towards efficient reinforcement learning on real processing-in-memory systems. In2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 217–229. doi:...

  25. [25]

    Gómez-Luna, Y

    J. Gómez-Luna, Y. Guo, S. Brocard, J. Legriel, R. Cimadomo, G. F. Oliveira, G. Singh, and O. Mutlu. 2022. An experimental evaluation of machine learning training on a real processing-in-memory system. arXiv:2207.07886. (2022)

  26. [26]

    Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

    Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, and Onur Mutlu. 2023. Evaluating machine learningworkloads on memory-centric computing systems. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 35–49. doi:10.1109 /ISPASS57527.2023.00013

  27. [27]

    Oliveira, and Onur Mutlu

    Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu. 2022. Benchmarking a new paradigm: experimental analysis and characterization of a real processing-in-memory system. IEEE Access, 10, 52565–52608. doi:10.1109/ACCESS.2022.3174101

  28. [28]

    Oliveira, and Onur Mutlu

    Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu. 2023. Benchmarking memory-centric computing systems: analysis of real processing-in-memory hardware.arXiv preprint arXiv:2110.01709

  29. [29]

    Juan Gómez-Luna and Onur Mutlu. 2022. P&s processing-in-memory. InReal-World Processing-in-Memory Architec- tures: UPMEM PIM Architecture. ETH Zürich

  30. [30]

    Harshita Gupta et al. 2026. He-pim: demystifying homomorphic operations on a real-world processing-in-memory system. (2026). https://arxiv.org/abs/2605.12841 arXiv: 2605.12841[cs.CR]

  31. [31]

    B. Hyun, T. Kim, D. Lee, and M. Rhu. 2023. Pathfinding future pim architectures by demystifying a commercial pim technology.arXiv:2308.00846

  32. [32]

    Bongjoon Hyun, Taehun Kim, Dongjae Lee, and Minsoo Rhu. 2024. Pathfinding future pim architectures by de- mystifying a commercial pim technology. In2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 263–279. doi:10.1109/HPCA57654.2024.00029

  33. [33]

    M. Item, J. Gómez-Luna, G. F. Oliveira, M. Sadrosadati, Y. Guo, and O. Mutlu. 2023. Transpimlib: efficient transcendental functions for processing-in-memory systems. InISPASS

  34. [34]

    Jonatan et al

    G. Jonatan et al. 2024. Scalability limitations of processing-in-memory using real system evaluations.POMACS

  35. [35]

    H. Kang, Y. Zhao, G. E. Blelloch, L. Dhulipala, Y. Gu, C. McGuffey, and P. B. Gibbons. 2023. Pim-trie: a skew-resistant trie for processing-in-memory. InSPAA

  36. [36]

    Ke et al

    L. Ke et al. 2020. Recnmp: accelerating personalized recommendation with near-memory processing. InISCA

  37. [37]

    Liu Ke et al. 2022. Near-memory processing in action: accelerating personalized recommendation with axdimm.IEEE Micro, 42, 1, 116–127. doi:10.1109/MM.2021.3097700

  38. [38]

    A. A. Khan, H. Farzaneh, K. F. Friebel, C. Fournier, L. Chelini, and J. Castrillon. 2022. Cinm (cinnamon): a compilation infrastructure for heterogeneous compute in-memory and compute near-memory paradigms.arXiv:2301.07486

  39. [39]

    A. A. Khan, J. P. C. De Lima, H. Farzaneh, and J. Castrillon. 2024. The landscape of compute-near-memory and compute-in-memory: a research and commercial overview.arXiv:2401.14428

  40. [40]

    Asif Ali Khan, Hamid Farzaneh, Karl Friedrich Alexander Friebel, Clément Fournier, Lorenzo Chelini, and Jeronimo Castrillon. 2025. Cinm (cinnamon): a compilation infrastructure for heterogeneous compute in-memory and compute near-memory paradigms. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and...

  41. [41]

    S. Y. Kim, J. Lee, Y. Paik, C. H. Kim, W. J. Lee, and S. W. Kim. 2024. Optimal model partitioning with low-overhead profiling on the pim-based platform for deep learning inference.TODAES

  42. [42]

    Y. Kwon, Y. Lee, and M. Rhu. 2019. Tensordimm: a practical near-memory processing architecture for embeddings and tensor operations in deep learning. InMICRO

  43. [43]

    Young-Cheon Kwon et al. 2021. 25.4 a 20nm 6gb function-in-memory dram, based on hbm2 with a 1.2tflops pro- grammable computing unit using bank-level parallelism, for machine learning applications. In2021 IEEE International Solid- State Circuits Conference (ISSCC). Vol. 64, 350–352. doi:10.1109/ISSCC42613.2021.9365862

  44. [44]

    Labbe, A

    A. Labbe, A. Perez, and J.-M. Portal. 2004. Efficient hardware implementation of a crypto-memory based on aes algo- rithm and sram architecture. In2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No.04CH37512). Vol. 2, II–637. doi:10.1109/ISCAS.2004.1329352

  45. [45]

    Lavenier, R

    D. Lavenier, R. Cimadomo, and R. Jodin. 2020. Variant calling parallelization on processor-in-memory architecture. InBIBM

  46. [46]

    Lavenier, C

    D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. 2016.BLAST on UPMEM. Ph.D. Dissertation. INRIA Rennes-Bretagne Atlantique

  47. [47]

    Dominique Lavenier, Jean-Francois Roy, and David Furodet. 2016. Dna mapping using processor-in-memory ar- chitecture. In2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 1429–1435. doi:10.1109 /BIBM.2016.7822732

  48. [48]

    Seongju Lee et al. 2022. A 1ynm 1.25v 8gb, 16gb/s/pin gddr6-based accelerator-in-memory supporting 1tflops mac operation and various activation functions for deep-learning applications. In2022 IEEE International Solid-State Circuits Conference (ISSCC). Vol. 65, 1–3. doi:10.1109/ISSCC42614.2022.9731711

  49. [49]

    Sukhan Lee et al. 2021. Hardware architecture and software stack for pim based on commercial dram technology : industrial product. In2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 43–56. doi:10.1109/ISCA52012.2021.00013

  50. [50]

    Y. S. Lee and T. H. Han. 2021. Task parallelism-aware deep neural network scheduling on multiple hybrid memory cube-based processing-in-memory.IEEE Access

  51. [51]

    C. Lim, S. Lee, J. Choi, J. Lee, S. Park, H. Kim, J. Lee, and Y. Kim. 2023. Design and analysis of a processing-in-dimm join algorithm: a case study with upmem dimms.PACMMOD

  52. [52]

    Héctor Martínez, Juan Gómez-Luna, Rafael Palomar, and Joaquín Olivares. 2026. In-memory operators for medical image processing.Future Generation Computer Systems, 174, 107939. doi:https://doi.org/10.1016/j.future.2025.107939

  53. [53]

    O. Mutlu. 2023. Evaluating machine learning workloads on memory-centric computing systems. InISPASS

  54. [54]

    O. Mutlu. 2021. Intelligent architectures for intelligent computing systems. InDATE

  55. [55]

    O. Mutlu. 2023. Memory-centric computing. InDAC

  56. [56]

    Mutlu, S

    O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun. 2019. Processing data where it makes sense: enabling in-memory computation.Microprocessors and Microsystems

  57. [57]

    Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. 2019. Enabling practical processing in and near memory for data-intensive computing. InProceedings of the 56th Annual Design Automation Conference 2019(DAC ’19) Article 21. Association for Computing Machinery, Las Vegas, NV, USA, 4 pages.isbn: 9781450367257. doi:10.1145/3316781.3323476

  58. [58]

    Onur Mutlua, Saugata Ghoseb, Juan Gomez-Luna, and Rachata Ausavarungnirund. 2020. A modern primer on processing in memory.arXiv preprint arXiv:2012.03112

  59. [59]

    Joel Nider et al. 2021. A case study of Processing-in-Memory in off-the-Shelf systems. In2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, (July 2021), 117–130.isbn: 978-1-939133-23-6. https://w ww.usenix.org/conference/atc21/presentation/nider

  60. [60]

    Dimin Niu et al. 2022. 184qps/w 64mb/mm23d logic-to-dram hybrid bonding with process-near-memory engine for recommendation system. In2022 IEEE International Solid-State Circuits Conference (ISSCC). Vol. 65, 1–3. doi:10.1109 /ISSCC42614.2022.9731694

  61. [61]

    J. Park, B. Kim, S. Yun, E. Lee, M. Rhu, and J. H. Ahn. 2021. Trim: enhancing processor-memory interfaces with scalable tensor reduction in memory. InMICRO

  62. [62]

    N. Park, S. Ryu, J. Kung, and J.-J. Kim. 2021. High-throughput near-memory processing on cnns with 3d hbm-like memory.TODAES

  63. [63]

    Peccerillo, M

    B. Peccerillo, M. Mannino, A. Mondelli, and S. Bartolini. 2022. A survey on hardware accelerators: taxonomy, trends, challenges, and perspectives.Journal of Systems Architecture

  64. [64]

    Dayane Reis, Haoran Geng, Michael Niemier, and Xiaobo Sharon Hu. 2022. Imcrypto: an in-memory computing fabric for aes encryption and decryption.IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 30, 5, 553–565. doi:10.1109/TVLSI.2022.3157270

  65. [65]

    Saikia, S

    J. Saikia, S. Yin, Z. Jiang, M. Seok, and J.-s. Seo. 2019. K-nearest neighbor hardware accelerator using in-memory computing sram. InISLPED

  66. [66]

    Vivek Seshadri and Onur Mutlu. 2019. In-dram bulk bitwise execution engine.CoRR, abs/1905.09822. http://arxiv.org /abs/1905.09822 arXiv: 1905.09822

  67. [67]

    C. F. Shelor and K. M. Kavi. 2019. Reconfigurable dataflow graphs for processing-in-memory. InICDCN

  68. [68]

    H. Shin, D. Kim, E. Park, S. Park, Y. Park, and S. Yoo. 2018. Mcdram: low latency and energy-efficient matrix computations in dram.TCAD

  69. [69]

    Z. Sun, G. Pedretti, A. Bricalli, and D. Ielmini. 2020. One-step regression and classification with cross-point resistive memory arrays.Science Advances

  70. [70]

    UPMEM. 2022. Product sheet upmem. (2022)

  71. [71]

    UPMEM. 2023. Upmem pim platform for data-intensive applications. InABUMPIMP Symposium as part of Euro-Par

  72. [72]

    UPMEM. 2022. Upmem processing in-memory (pim). UPMEM PIM Tech Paper. (2022)

  73. [73]

    UPMEM. [n. d.] Upmem software development kit documentation. https://sdk.upmem.com/2023.2.0. ()

  74. [74]

    UPMEM. [n. d.] Upmem website: technology. https://www.upmem.com/technology/. ()

  75. [75]

    Vieira, N

    J. Vieira, N. Roma, P. Tomás, P. Ienne, and G. Falcao. 2018. Exploiting compute caches for memory bound vector operations. InSBAC-PAD

  76. [76]

    Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures.Commun. ACM, 52, 4, (Apr. 2009), 65–76. doi:10.1145/1498765.1498785

  77. [77]

    Yuting Wu, Ziyu Wang, and Wei D. Lu. 2024. Pim-gpt: a hybrid process-in-memory accelerator for autoregressive transformers. (2024). https://arxiv.org/abs/2310.09385 arXiv: 2310.09385[cs.AR]

  78. [78]

    Mimi Xie, Shuangchen Li, Alvin Oliver Glova, Jingtong Hu, and Yuan Xie. 2018. Securing emerging nonvolatile main memory with fast and energy-efficient aes in-memory implementation.IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26, 11, 2443–2455. doi:10.1109/TVLSI.2018.2865133

  79. [79]

    N. Zarif. 2023.Offloading Embedding Lookups to Processing-In-Memory for Deep Learning Recommender Models. Master’s thesis. University of British Columbia