pith. machine review for the scientific record.

arxiv: 2604.04523 · v1 · submitted 2026-04-06 · 💻 cs.AR

Recognition: no theorem link

LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:01 UTC · model grok-4.3

classification 💻 cs.AR
keywords: LUT-based computation · DRAM processing-in-memory · quantized DNN inference · capacity-computation tradeoff · LUT canonicalization · operation packing · UPMEM evaluation

The pith

Packing multiple operations into LUTs lets DRAM-PIM hardware replace arithmetic with memory lookups for faster low-bit DNN inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that lookup tables can stand in for arithmetic circuits in processing-in-memory systems by precomputing results for groups of multiply-accumulate steps, which turns extra memory capacity into higher throughput. A reader would care because DRAM-PIM chips already supply high memory bandwidth and large capacity but very limited logic, and quantized neural networks need many different low precisions that standard ALUs handle poorly. By cutting redundant table entries through canonicalization, adding a small remapping table for weights, and streaming only the needed slices of each table into the on-chip buffer, the design keeps the memory footprint manageable while delivering measured speedups on actual hardware. The core mechanism is therefore the explicit tradeoff of memory space for computation that fits the physical constraints of DRAM-based PIM.

Core claim

LOCALUT packs several MAC operations into each LUT lookup so that a single memory read produces multiple results, then removes duplicate entries inside those tables via canonicalization to shrink their size. An auxiliary reordering LUT translates original weight vectors into the canonical index form with one extra lookup, and LUT slice streaming brings only the active columns of each table into the DRAM buffer for reuse across many weight vectors. On real UPMEM hardware this combination yields a geometric-mean speedup of 1.82x across several numeric precisions and DNN models while preserving correctness.
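
To make the mechanism concrete, here is a toy sketch, not the paper's implementation: the packing degree, the ternary weight levels, and the sign-symmetry canonicalization rule below are illustrative assumptions, standing in for the paper's richer canonicalization.

```python
import numpy as np
from itertools import product

# Minimal sketch of operation packing + canonicalization + reordering,
# under assumed parameters; the paper's canonicalization rule is richer.
p = 3                       # packing degree: MACs folded into one lookup
levels = (-1, 0, 1)         # assumed symmetric low-bit weight levels

def canonical(w):
    """Map w to (canonical vector, sign). Flipping the whole vector when
    its first nonzero element is negative lets w and -w share one LUT
    entry, since dot(-w, a) = -dot(w, a)."""
    for x in w:
        if x:
            return (w, 1) if x > 0 else (tuple(-y for y in w), -1)
    return (w, 1)

all_w = list(product(levels, repeat=p))
canon_set = sorted({canonical(w)[0] for w in all_w})
slot = {w: i for i, w in enumerate(canon_set)}

# Reordering LUT: one extra lookup maps any raw weight vector to its
# canonical slot plus the sign needed to undo the flip.
reorder_lut = {w: (slot[canonical(w)[0]], canonical(w)[1]) for w in all_w}

# Packed LUT: canonical entries only, precomputed for one activation group.
acts = np.array([5, -3, 7], dtype=np.int64)
packed_lut = np.array([np.dot(w, acts) for w in canon_set])

def packed_mac(w):
    i, s = reorder_lut[w]       # lookup 1: reordering LUT
    return s * packed_lut[i]    # lookup 2: packed LUT, replacing p MACs

assert all(packed_mac(w) == np.dot(w, acts) for w in all_w)
print(f"{len(all_w)} raw entries -> {len(canon_set)} after canonicalization")
```

Even this toy symmetry halves the table; the paper's claim is that real quantized weight distributions expose far more redundancy than a single sign flip.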

What carries the argument

Operation-packed LUTs with canonicalization, which collapse duplicate precomputed entries so that memory capacity directly substitutes for repeated arithmetic logic.

If this is right

  • Memory capacity becomes a direct substitute for arithmetic logic, raising effective throughput in logic-sparse PIM chips.
  • The same LUT structure supports arbitrary bit widths without extra hardware, suiting the variety of precisions in quantized networks.
  • Canonicalization and slice streaming together cut the memory footprint of the tables while keeping execution correct (see the streaming sketch after this list).
  • Real-hardware measurements confirm a net geometric-mean gain of 1.82x across models and precisions.
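
A minimal sketch of the slice-streaming idea, under assumed parameters: the buffer size, slice width, and row/column layout below are illustrative choices, not taken from the paper.

```python
import numpy as np

# Hedged sketch of LUT slice streaming. The full packed LUT stays in DRAM;
# only the column slice needed by the current tile is copied into the small
# local buffer, where it is reused across many weight vectors before the
# next slice is streamed in.
rng = np.random.default_rng(0)
ROWS, COLS = 1024, 256          # rows: weight indices; cols: activation groups
dram_lut = rng.integers(-8, 8, size=(ROWS, COLS), dtype=np.int32)

SLICE = 16                      # columns per streamed slice (assumed)

def process_tile(weight_idx, col_start):
    buf = dram_lut[:, col_start:col_start + SLICE].copy()  # one DRAM transfer
    # Many buffer-resident reads amortize that single transfer.
    return np.stack([buf[w] for w in weight_idx])

tile_out = process_tile(rng.integers(0, ROWS, size=64), col_start=0)
```

The design choice this illustrates: the transfer cost of a slice is paid once per tile, while the lookups against it are paid per weight vector, so reuse is what makes streaming cheaper than keeping the whole table buffer-resident.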

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of future PIM chips could deliberately allocate more on-die memory per processing element if they expect LUT-based workloads.
  • The reordering and streaming techniques might transfer to other table-driven computations such as activation functions or sparse matrix operations.
  • If weight distributions in newer models retain similar redundancy, the size-reduction benefit would scale to larger networks without redesign.

Load-bearing premise

Typical DNN weight vectors produce enough duplicate entries inside the packed LUTs for canonicalization to shrink the tables by more than the combined cost of reordering and streaming.

What would settle it

Running the same quantized models on the UPMEM platform while measuring the actual LUT size after canonicalization and the cycle counts of the reordering and streaming steps; if net memory traffic or latency rises rather than falls, the claimed speedup disappears.
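
One way to phrase that break-even test compactly, in assumed notation rather than the paper's own cost model (which its Eq. (2) and Eq. (4) spell out in more detail): with packing degree p, the lookup path wins only when the measured per-group costs satisfy

```latex
% Illustrative break-even condition in assumed notation; not the paper's
% cost model, which accounts for matrix dimensions and buffer residency.
\[
  L_{\mathrm{reorder}} + L_{\mathrm{stream}} + L_{\mathrm{lut}}
  \;<\; p \cdot L_{\mathrm{mac}}
\]
```

where L_stream is the slice-transfer cost amortized over its reuse. Measuring these terms directly on UPMEM, as proposed above, settles whether the inequality holds.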

Figures

Figures reproduced from arXiv: 2604.04523 by Changmin Shin, Hanjun Kim, Jinho Lee, Junguk Hong, Seongyeon Park, Si Ung Noh, Sukjin Kim, Taehee Kwon, Youngsok Kim.

Figure 1: DRAM-PIM architecture abstraction.
Figure 2: Operation-packed LUT with packing degree …
Figure 3: Baseline LUT-based approaches to perform matrix multiplications …
Figure 5: Illustration of reordering LUT design. (a) Index reordering is replaced …
Figure 6: Capacity requirement for LUTs, varying packing degrees.
Figure 8: Transformer layer execution flow of LOCALUT.
Figure 10: Performance comparison among different DNN models.
Figure 11: Sensitivity study of LOCALUT with varying weight matrix dimensions (N = 128).
Figure 14: Energy comparison between baseline methods and LOCALUT.
Figure 13: Sensitivity to the k values.
Figure 18: Validating the cost model's predicted execution time against real …
Figure 17: Execution time and energy comparison with CPU and GPU across …
Figure 21: Floating-point support in LOCALUT. (a) GEMM speedup across different precisions (FP4, FP8, FP16), where matrix size indicates M=N=K (e.g., 1K = 1K×1K×1K). (b) ViT accuracy with varying packing degrees.
original abstract

Lookup tables (LUTs) have recently gained attention as an alternative compute mechanism that maps input operands to precomputed results, eliminating the need for arithmetic logic. LUTs not only reduce logic complexity, but also naturally support diverse numerical precisions without requiring separate circuits for each bitwidth, an increasingly important feature in quantized DNNs. This creates a favorable tradeoff in PIM: memory capacity can be used in place of logic to increase computational throughput, aligning well with DRAM-PIM architectures that offer high bandwidth and easily available memory but limited logic density. In this work, we explore this capacity-computation tradeoff in LUT-based PIM designs, where memory capacity is traded for performance by packing multiple MAC operations into a single LUT lookup. Building on this insight, we propose LOCALUT, a PIM-based design for efficient low-bit quantized DNN inference using operation-packed LUTs. First, we observe that these LUTs contain extensive redundancy and introduce LUT canonicalization, which eliminates duplicate entries to reduce LUT size. Second, we propose reordering LUT, a lightweight auxiliary LUT that remaps weight vectors to their canonical form required by LUT canonicalization with a simple LUT lookup. Third, we propose LUT slice streaming, a novel execution strategy that exploits the DRAM-buffer hierarchy by streaming only relevant LUT columns into the buffer and reusing them across multiple weight vectors. Evaluated on a real system based on UPMEM devices, we demonstrate a geometric mean speedup of 1.82x across various numeric precisions and DNN models. We believe LOCALUT opens a path toward scalable, low-logic PIM designs tailored for LUT-based DNN inference. Our implementation of LOCALUT is available at https://github.com/AIS-SNU/LoCaLUT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LOCALUT, a LUT-based architecture for low-bit quantized DNN inference on DRAM-PIM hardware. It exploits capacity-computation tradeoffs by packing multiple MAC operations into single LUT lookups, introduces LUT canonicalization to remove duplicate entries and shrink table size, adds a lightweight reordering LUT to map weight vectors to canonical forms, and proposes LUT slice streaming to move only relevant columns into the per-DPU buffer. Real-system measurements on UPMEM devices report a geometric mean speedup of 1.82x across numeric precisions and DNN models, with the implementation released on GitHub.

Significance. If the net speedup survives detailed overhead accounting, the work demonstrates a concrete way to trade abundant DRAM capacity for reduced logic in PIM systems, which is timely for quantized inference. The real-hardware evaluation on UPMEM and public code release are clear strengths that increase confidence relative to simulation-only studies.

major comments (2)
  1. [Evaluation] Evaluation section: the 1.82x geometric-mean speedup is reported as a net figure, yet no per-component timing breakdown is supplied for the reordering LUT lookup, canonicalization, or slice-streaming operations. Because these auxiliary steps consume memory bandwidth and control logic on UPMEM, it is impossible to verify that their cost remains negligible relative to the capacity gains, particularly at low batch sizes or limited LUT reuse.
  2. [LUT canonicalization] Section on LUT canonicalization: the central claim that redundancy is 'extensive' and can be eliminated without correctness issues or offsetting overhead rests on an unquantified assumption. The manuscript should report the measured fraction of duplicate entries before/after canonicalization and confirm that the reordering LUT itself does not introduce new errors or scale poorly with model depth.
minor comments (2)
  1. [Abstract] Abstract: no information is given on the exact baselines, model suite, batch sizes, or presence of error bars; these details should be added for reproducibility.
  2. [Introduction] Notation: the distinction between 'operation-packed LUTs' and conventional LUTs is introduced without a compact equation or diagram showing how multiple MACs map to one lookup; a small illustrative figure would clarify the capacity-computation tradeoff.
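
For intuition on what the diagram requested in minor comment 2 would show, here is a compact version of the mapping in assumed notation (the paper may define it differently): a packed LUT with packing degree p over b-bit weights precomputes

```latex
% Assumed notation, for illustration only; not taken from the manuscript.
\[
  \mathrm{LUT}[\,w_1,\dots,w_p\,] \;=\; \sum_{i=1}^{p} w_i\,a_i ,
  \qquad w_i \in \{0,\dots,2^{b}-1\}
\]
```

so one memory read retires p MACs, a length-K dot product costs K/p lookups, and the table holds 2^(pb) entries: the capacity-computation tradeoff in one line.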

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the revisions made to the manuscript.

point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the 1.82x geometric-mean speedup is reported as a net figure, yet no per-component timing breakdown is supplied for the reordering LUT lookup, canonicalization, or slice-streaming operations. Because these auxiliary steps consume memory bandwidth and control logic on UPMEM, it is impossible to verify that their cost remains negligible relative to the capacity gains, particularly at low batch sizes or limited LUT reuse.

    Authors: We agree that a per-component breakdown would improve transparency and verifiability. In the revised manuscript we have added timing measurements for the reordering LUT lookup, canonicalization, and slice-streaming operations, reported across multiple batch sizes and LUT-reuse levels on the same UPMEM hardware. These data show that the auxiliary costs remain a small fraction of total execution time, with the dominant savings coming from the packed LUT-based MACs. Because the 1.82x geometric-mean figure is an end-to-end real-system measurement, it already includes all overheads. revision: yes

  2. Referee: [LUT canonicalization] Section on LUT canonicalization: the central claim that redundancy is 'extensive' and can be eliminated without correctness issues or offsetting overhead rests on an unquantified assumption. The manuscript should report the measured fraction of duplicate entries before/after canonicalization and confirm that the reordering LUT itself does not introduce new errors or scale poorly with model depth.

    Authors: We have revised the LUT canonicalization section to include the measured fractions of duplicate entries before and after canonicalization for every evaluated model and precision. These measurements confirm that redundancy is extensive and yields substantial table-size reductions. We also added verification that the reordering LUT performs an exact mapping with no introduced errors (validated by bit-exact equivalence checks against the original computation) and that its per-vector cost scales linearly with the number of weight vectors but remains negligible relative to the main LUT lookups even for the deepest models tested. revision: yes
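
A minimal sketch of the kind of bit-exact check the authors describe; the function name and scope are illustrative assumptions, and lut_path stands for any reordering-plus-packed-LUT read, such as the packed_mac sketch shown earlier.

```python
import numpy as np
from itertools import product

def check_bit_exact(levels, acts, lut_path):
    """Exhaustively compare the LUT path (reordering lookup + packed-LUT
    read) against direct integer MACs for every representable weight
    group; returns True only on bit-exact agreement."""
    p = len(acts)
    return all(int(lut_path(w)) == int(np.dot(w, acts))
               for w in product(levels, repeat=p))

# e.g., check_bit_exact((-1, 0, 1), np.array([5, -3, 7]), packed_mac)
```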

Circularity Check

0 steps flagged

No circularity in the paper's design and evaluation chain

full rationale

The paper presents an empirical systems design for LUT-based inference on DRAM-PIM hardware. The speedup is measured directly on UPMEM devices rather than derived from equations or fitted parameters. The proposals for LUT canonicalization, reordering LUT, and slice streaming are design choices justified by observations of redundancy in LUTs, not by self-definitional or fitted inputs. No load-bearing self-citations or uniqueness theorems are invoked in a circular manner. The central claim rests on hardware evaluation against external baselines rather than on a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on the empirical observation that LUTs for packed MACs contain extensive redundancy; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption LUTs for quantized DNN operations contain extensive redundancy that can be safely removed by canonicalization.
    Stated as the basis for the first contribution in the abstract.

pith-pipeline@v0.9.0 · 5643 in / 1145 out tokens · 48008 ms · 2026-05-10T20:01:34.395733+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

104 extracted references · 14 canonical work pages · 2 internal anchors
