LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM
Pith reviewed 2026-05-10 20:01 UTC · model grok-4.3
The pith
Packing multiple operations into LUTs lets DRAM-PIM hardware replace arithmetic with memory lookups for faster low-bit DNN inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LOCALUT packs several MAC operations into each LUT lookup so that a single memory read produces multiple results, then shrinks those tables via canonicalization, which removes duplicate entries. An auxiliary reordering LUT translates original weight vectors into the canonical index form with one extra lookup, and LUT slice streaming brings only the active columns of each table into the DRAM buffer for reuse across many weight vectors. On real UPMEM hardware this combination produces a geometric-mean speedup of 1.82x across several numeric precisions and DNN models while preserving correctness.
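The packing step in the core claim can be sketched as follows. This is an illustrative toy, not the paper's implementation: the chunk size K, the bit width, and all helper names are assumptions.

```python
# Sketch: packing K MAC operations into one LUT lookup. For a fixed
# activation chunk a[0..K-1] and b-bit weights, the table precomputes
# T[idx] = sum_k w_k * a_k for every packed weight index idx, so one
# table read replaces K multiply-accumulates.

K = 3          # MACs packed per lookup (assumed value)
B = 2          # weight bit width (assumed value)
LEVELS = 1 << B

def build_packed_lut(acts):
    """Precompute the K-term partial sum for all LEVELS**K weight patterns."""
    assert len(acts) == K
    lut = []
    for idx in range(LEVELS ** K):
        total, rest = 0, idx
        for k in range(K):
            w = rest % LEVELS          # k-th packed weight field
            rest //= LEVELS
            total += w * acts[k]       # one MAC folded into the table
        lut.append(total)
    return lut

def pack_index(weights):
    """Encode a K-vector of b-bit weights as a single LUT index."""
    idx = 0
    for k, w in enumerate(weights):
        idx += w * (LEVELS ** k)
    return idx

# One lookup now yields the K-term partial sum.
acts = [3, 1, 2]
lut = build_packed_lut(acts)
weights = [2, 0, 3]
assert lut[pack_index(weights)] == 2*3 + 0*1 + 3*2   # = 12
```

The tradeoff the paper names is visible here: the table has `LEVELS**K` entries, so memory capacity grows exponentially in K while arithmetic per lookup shrinks linearly.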
What carries the argument
Operation-packed LUTs with canonicalization, which collapse duplicate precomputed entries so that memory capacity directly substitutes for repeated arithmetic logic.
If this is right
- Memory capacity becomes a direct substitute for arithmetic logic, raising effective throughput in logic-sparse PIM chips.
- The same LUT structure supports arbitrary bit widths without extra hardware, suiting the variety of precisions in quantized networks.
- Canonicalization and slice streaming together cut the memory footprint of the tables while keeping execution correct.
- Real-hardware measurements confirm a net geometric-mean gain of 1.82x across models and precisions.
Where Pith is reading between the lines
- Designers of future PIM chips could deliberately allocate more on-die memory per processing element if they expect LUT-based workloads.
- The reordering and streaming techniques might transfer to other table-driven computations such as activation functions or sparse matrix operations.
- If weight distributions in newer models retain similar redundancy, the size-reduction benefit would scale to larger networks without redesign.
Load-bearing premise
Typical DNN weight vectors produce enough duplicate entries inside the packed LUTs for canonicalization to shrink the tables by more than the added cost of reordering and streaming.
What would settle it
Running the same quantized models on the UPMEM platform while measuring the actual LUT size after canonicalization and the cycle count of the reordering and streaming steps; if the net memory traffic or latency increases instead of decreases, the claimed speedup disappears.
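The duplicate-fraction measurement proposed above can be sketched in a few lines. The parameters (3 packed MACs, 2-bit weights) and the toy activation chunk are assumptions, not the paper's setup:

```python
# Build an operation-packed LUT for one activation chunk and measure
# what fraction of its entries are duplicates.

K, B = 3, 2          # 3 packed MACs, 2-bit weights (assumed)
LEVELS = 1 << B

def packed_lut(acts):
    """Entry idx holds the K-term partial sum for packed weight pattern idx."""
    lut = []
    for idx in range(LEVELS ** K):
        total, rest = 0, idx
        for a in acts:
            total += (rest % LEVELS) * a
            rest //= LEVELS
        lut.append(total)
    return lut

lut = packed_lut([3, 1, 2])
distinct = len(set(lut))
dup_fraction = 1 - distinct / len(lut)
# With these toy activations most entries are duplicates, so a canonical
# table needs far fewer entries than the full 2**(B*K) = 64.
assert distinct < len(lut)
```

Whether real weight distributions show comparable redundancy is exactly the empirical question the premise turns on.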
Original abstract
Lookup tables (LUTs) have recently gained attention as an alternative compute mechanism that maps input operands to precomputed results, eliminating the need for arithmetic logic. LUTs not only reduce logic complexity, but also naturally support diverse numerical precisions without requiring separate circuits for each bitwidth-an increasingly important feature in quantized DNNs. This creates a favorable tradeoff in PIM: memory capacity can be used in place of logic to increase computational throughput, aligning well with DRAM-PIM architectures that offer high bandwidth and easily available memory but limited logic density. In this work, we explore this capacity-computation tradeoff in LUT-based PIM designs, where memory capacity is traded for performance by packing multiple MAC operations into a single LUT lookup. Building on this insight, we propose LOCALUT, a PIM-based design for efficient low-bit quantized DNN inference using operation-packed LUTs. First, we observe that these LUTs contain extensive redundancy and introduce LUT canonicalization, which eliminates duplicate entries to reduce LUT size. Second, we propose reordering LUT, a lightweight auxiliary LUT that remaps weight vectors to their canonical form required by LUT canonicalization with a simple LUT lookup. Third, we propose LUT slice streaming, a novel execution strategy that exploits the DRAM-buffer hierarchy by streaming only relevant LUT columns into the buffer and reusing them across multiple weight vectors. Evaluated on a real system based on UPMEM devices, we demonstrate a geometric mean speedup of 1.82x across various numeric precisions and DNN models. We believe LOCALUT opens a path toward scalable, low-logic PIM designs tailored for LUT-based DNN inference. Our implementation of LOCALUT is available at https://github.com/AIS-SNU/LoCaLUT.
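The slice-streaming strategy described in the abstract can be sketched roughly as follows. The MRAM/WRAM naming, the slice granularity, and the access pattern are assumptions about a UPMEM-style memory hierarchy, not the paper's code:

```python
# Sketch: a large packed LUT lives in slow bulk memory ("MRAM"); only the
# slices actually referenced are copied once into a small fast buffer
# ("WRAM") and then reused across many weight vectors, amortizing the
# transfer cost.

SLICE = 8  # entries per LUT slice (assumed granularity)

def stream_and_lookup(mram_lut, weight_indices):
    wram = {}                              # small buffer: slice id -> slice
    results, transfers = [], 0
    for idx in weight_indices:
        sid = idx // SLICE
        if sid not in wram:                # stream a slice on first touch only
            wram[sid] = mram_lut[sid * SLICE:(sid + 1) * SLICE]
            transfers += 1
        results.append(wram[sid][idx % SLICE])
    return results, transfers

mram_lut = list(range(64))                 # stand-in packed LUT
weights = [5, 6, 5, 20, 21, 6]             # repeated hits in the same slices
res, transfers = stream_and_lookup(mram_lut, weights)
assert transfers == 2                      # slices 0 and 2 streamed once each
```

The win depends on locality: if consecutive weight vectors index disjoint slices, every lookup pays a full transfer and the buffering adds overhead instead of removing it.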
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LOCALUT, a LUT-based architecture for low-bit quantized DNN inference on DRAM-PIM hardware. It exploits capacity-computation tradeoffs by packing multiple MAC operations into single LUT lookups, introduces LUT canonicalization to remove duplicate entries and shrink table size, adds a lightweight reordering LUT to map weight vectors to canonical forms, and proposes LUT slice streaming to move only relevant columns into the per-DPU buffer. Real-system measurements on UPMEM devices report a geometric mean speedup of 1.82x across numeric precisions and DNN models, with the implementation released on GitHub.
Significance. If the net speedup survives detailed overhead accounting, the work demonstrates a concrete way to trade abundant DRAM capacity for reduced logic in PIM systems, which is timely for quantized inference. The real-hardware evaluation on UPMEM and public code release are clear strengths that increase confidence relative to simulation-only studies.
major comments (2)
- [Evaluation] Evaluation section: the 1.82x geometric-mean speedup is reported as a net figure, yet no per-component timing breakdown is supplied for the reordering LUT lookup, canonicalization, or slice-streaming operations. Because these auxiliary steps consume memory bandwidth and control logic on UPMEM, it is impossible to verify that their cost remains negligible relative to the capacity gains, particularly at low batch sizes or limited LUT reuse.
- [LUT canonicalization] Section on LUT canonicalization: the central claim that redundancy is 'extensive' and can be eliminated without correctness issues or offsetting overhead rests on an unquantified assumption. The manuscript should report the measured fraction of duplicate entries before/after canonicalization and confirm that the reordering LUT itself does not introduce new errors or scale poorly with model depth.
minor comments (2)
- [Abstract] Abstract: no information is given on the exact baselines, model suite, batch sizes, or presence of error bars; these details should be added for reproducibility.
- [Introduction] Notation: the distinction between 'operation-packed LUTs' and conventional LUTs is introduced without a compact equation or diagram showing how multiple MACs map to one lookup; a small illustrative figure would clarify the capacity-computation tradeoff.
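One compact way to write the mapping this comment asks for, sketched with assumed notation (K MACs packed per lookup, b-bit weights, chunk index c), would be:

```latex
% Packed index and table entry (notation assumed, not the paper's):
i^{(c)} \;=\; \sum_{k=1}^{K} w^{(c)}_{k}\, 2^{\,b(k-1)},
\qquad
T_{a^{(c)}}\!\bigl[i^{(c)}\bigr] \;=\; \sum_{k=1}^{K} w^{(c)}_{k}\, a^{(c)}_{k},
\qquad
y \;=\; \sum_{c} T_{a^{(c)}}\!\bigl[i^{(c)}\bigr]
% Each table holds 2^{bK} entries: K lookups' worth of MACs
% cost one read, at the price of exponential table growth in K.
```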
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate the revisions made to the manuscript.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: the 1.82x geometric-mean speedup is reported as a net figure, yet no per-component timing breakdown is supplied for the reordering LUT lookup, canonicalization, or slice-streaming operations. Because these auxiliary steps consume memory bandwidth and control logic on UPMEM, it is impossible to verify that their cost remains negligible relative to the capacity gains, particularly at low batch sizes or limited LUT reuse.
Authors: We agree that a per-component breakdown would improve transparency and verifiability. In the revised manuscript we have added timing measurements for the reordering LUT lookup, canonicalization, and slice-streaming operations, reported across multiple batch sizes and LUT-reuse levels on the same UPMEM hardware. These data show that the auxiliary costs remain a small fraction of total execution time, with the dominant savings coming from the packed LUT-based MACs. Because the 1.82x geometric-mean figure is an end-to-end real-system measurement, it already includes all overheads. revision: yes
- Referee: [LUT canonicalization] Section on LUT canonicalization: the central claim that redundancy is 'extensive' and can be eliminated without correctness issues or offsetting overhead rests on an unquantified assumption. The manuscript should report the measured fraction of duplicate entries before/after canonicalization and confirm that the reordering LUT itself does not introduce new errors or scale poorly with model depth.
Authors: We have revised the LUT canonicalization section to include the measured fractions of duplicate entries before and after canonicalization for every evaluated model and precision. These measurements confirm that redundancy is extensive and yields substantial table-size reductions. We also added verification that the reordering LUT performs an exact mapping with no introduced errors (validated by bit-exact equivalence checks against the original computation) and that its per-vector cost scales linearly with the number of weight vectors but remains negligible relative to the main LUT lookups even for the deepest models tested. revision: yes
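The bit-exact equivalence check the authors describe can be sketched as follows; the parameters and activation chunk are illustrative assumptions, and the check compares the two-lookup path (reordering LUT, then canonical LUT) against direct integer MACs over the entire index space:

```python
# Verify that canonicalization plus a reordering LUT reproduces the
# direct computation exactly for every packed weight pattern.

K, B = 2, 2
LEVELS = 1 << B

def decode(idx):
    """Unpack a LUT index into its K weight fields."""
    return [(idx // LEVELS ** k) % LEVELS for k in range(K)]

def direct_macs(weights, acts):
    return sum(w * a for w, a in zip(weights, acts))

acts = [2, 1]                                   # toy activation chunk
full = [direct_macs(decode(i), acts) for i in range(LEVELS ** K)]

# Canonicalize: unique values plus an original-index -> canonical-index map.
canon, reorder, seen = [], [], {}
for v in full:
    if v not in seen:
        seen[v] = len(canon)
        canon.append(v)
    reorder.append(seen[v])

# Bit-exact check over the whole index space.
for i in range(LEVELS ** K):
    assert canon[reorder[i]] == direct_macs(decode(i), acts)
```

Because both tables hold exact precomputed integers, equality here is bitwise, mirroring the equivalence claim in the rebuttal.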
Circularity Check
No circularity in the paper's design and evaluation chain
Full rationale
The paper presents an empirical systems design for LUT-based inference on DRAM-PIM hardware. The speedup is measured directly on UPMEM devices rather than derived from equations or fitted parameters. The proposals for LUT canonicalization, reordering LUT, and slice streaming are design choices justified by observations of redundancy in LUTs, without reducing to self-definitional or fitted inputs. No load-bearing self-citations or uniqueness theorems are invoked in a circular manner. The central claim rests on hardware evaluation against external benchmarks rather than on any self-referential derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LUTs for quantized DNN operations contain extensive redundancy that can be safely removed by canonicalization.