pith. machine review for the scientific record. sign in

arxiv: 2604.18137 · v1 · submitted 2026-04-20 · 💻 cs.AR · cs.AI· cs.LG

Recognition: unknown

AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:05 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.LG
keywords processing in memoryactivation quantizationproduct quantizationlarge language modelskv cachememory capacityattention computationdata movement reduction
0
0 comments X

The pith

By running product quantization inside memory, AQPIM compresses LLM activations enough to fit within PIM hardware limits and enables direct computation on the compact form.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard PIM designs cannot hold the growing KV caches of long-context LLMs because activation memory exceeds on-chip capacity, and that moving quantization inside the memory array solves this by exploiting high internal bandwidth. Product quantization is applied directly to activations so that attention calculations run on the resulting indices rather than full vectors, cutting the need to ship data off-chip. A sympathetic reader would care because GPU-CPU transfers already dominate 90 to 98.5 percent of decoding time; removing most of that movement would let existing PIM chips run larger models and longer sequences without new hardware.

Core claim

AQPIM is a PIM-aware activation quantization framework based on product quantization that performs the clustering and indexing steps entirely inside memory, introduces algorithmic adjustments to keep accuracy acceptable for LLMs, and thereby shrinks the activation footprint while allowing attention to operate directly on the compressed representations.

What carries the argument

Product quantization executed in-memory on activation vectors, where learned codebooks replace each vector with a short index so that approximate inner products can be computed without restoring the original values.

If this is right

  • GPU-CPU communication that currently accounts for 90 to 98.5 percent of decoding latency drops sharply.
  • Overall inference reaches 3.4 times the speed of prior state-of-the-art PIM methods for the same models.
  • Larger KV caches fit inside fixed PIM memory capacity, supporting longer context lengths.
  • Attention operations execute directly on the compressed indices, avoiding full decompression overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same in-memory compression pattern could be reused for other memory-capacity-limited PIM workloads such as large graph analytics.
  • PIM chip designers may add dedicated codebook lookup units as a standard on-die feature rather than leaving quantization to software.
  • Pairing AQPIM-style quantization with sparsity patterns that preserve data locality could produce further efficiency gains beyond what either technique achieves alone.

Load-bearing premise

The accuracy loss from clustering-based product quantization on LLM activations stays small enough when the whole process runs inside memory that downstream model quality does not degrade noticeably.

What would settle it

Measure end-to-end accuracy of a standard LLM on a benchmark such as MMLU after replacing all activations with AQPIM-quantized versions and observe whether the score falls more than 1-2 percent below the full-precision baseline, or profile actual data-transfer volume and find no reduction in the claimed 90-98.5 percent range.

Figures

Figures reproduced from arXiv: 2604.18137 by Daichi Fujiki, Kosuke Matsushima, Masato Motomura, Yasuyuki Okoshi.

Figure 1
Figure 1. Figure 1: Scaling challenges of existing PIM designs for LLMs. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Locality within the projection weights (left-most) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of AQPIM and Product Quantization (PQ). [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The latency comparison of the prefilling and the [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: (1) a sequence is divided into multiple windows, and as the window advances, the previous centroids are copied to a new page and subsequently updated for the window. Then, (2) … (1) Cluster assignments are determined within the window (3) Indirect access is limited within the window (DRAM page) window1 window2 windowN Key and Value Matrices … x Indices … DRAM page size 512 part. prods codebook ・ query (2) … view at source ↗
Figure 7
Figure 7. Figure 7: AQPIM architecture and dataflow. TABLE I: List of AQPIM operations. Process Place Necessary Unit Distance Calculation (DC) BankPE ADD, MUL, SUM Cluster Assignment (CA) BufferPE MIN Centroid Calculation (CC) Both MUL, SUM, DIV Attention (ATNK, ATNV) BankPE MUL, SUM Softmax (SFM) BufferPE ADD, SUM, MAX, DIV, EXP we introduce two PE architectures: BankPE and BufferPE, as illustrated in [PITH_FULL_IMAGE:figur… view at source ↗
Figure 8
Figure 8. Figure 8: Intra-row indirection for efficient random access. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Data mapping strategy. Each head is assigned to [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Memory reduction ratio vs. accuracy. gpu attacc aqpim gpu attacc aqpim gpu attacc aqpim gpu attacc aqpim 0.00 0.25 0.50 0.75 1.00 Total execution time 1.00 0.70 0.68 1.00 0.61 0.58 1.00 0.53 0.49 1.00 0.48 0.43 S_in: 4096 S_out: 128 S_in: 4096 S_out: 256 S_in: 4096 S_out: 512 S_in: 4096 S_out: 1024 vs. Architecture gpu pqcache skvq snapkv aqpim gpu pqcache skvq snapkv aqpim gpu pqcache skvq snapkv aqpim g… view at source ↗
Figure 11
Figure 11. Figure 11: Normalized total execution time comparing different architectures (left) and algorithms (right). [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Normalized decoding time comparing different architectures (left) and algorithms (right). [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Decomposition analysis of decoding speedups. [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Normalized energy for decoding comparing different architectures (left) and algorithms (right). [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Accuracy & Speedup vs. Memory Reduction. [PITH_FULL_IMAGE:figures/full_fig_p012_15.png] view at source ↗
read the original abstract

Processing-in-Memory (PIM) architectures offer a promising solution to the memory bottlenecks in data-intensive machine learning, yet often overlook the growing challenge of activation memory footprint. Conventional PIM approaches struggle with massive KV cache sizes generated in long-context scenarios by Transformer-based models, frequently exceeding PIM's limited memory capacity, while techniques like sparse attention can conflict with PIM's need for data locality. Existing PIM approaches and quantization methods are often insufficient or poorly suited for leveraging the unique characteristics of activations. This work identifies an opportunity for PIM-specialized activation quantization to enhance bandwidth and compute efficiency. We explore clustering-based vector quantization approaches, which align well with activation characteristics and PIM's internal bandwidth capabilities. Building on this, we introduce AQPIM, a novel PIM-aware activation quantization framework based on Product Quantization (PQ), optimizing it for modern Large Language Models (LLMs). By performing quantization directly within memory, AQPIM leverages PIM's high internal bandwidth and enables direct computation on compressed data, significantly reducing both memory footprint and computational overhead for attention computation. AQPIM addresses PQ's accuracy challenges by introducing several algorithmic optimizations. Evaluations demonstrate that AQPIM achieves significant performance improvements, drastically reducing of GPU-CPU communication that can account for 90$\sim$98.5\% of decoding latency, together with 3.4$\times$ speedup over a SOTA PIM approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce AQPIM, a PIM-aware activation quantization framework based on Product Quantization (PQ) optimized for LLMs. By performing quantization and computation directly in memory, it reduces activation memory footprint, enables direct computation on compressed data, and minimizes GPU-CPU communication (claimed to account for 90-98.5% of decoding latency), achieving a 3.4× speedup over a state-of-the-art PIM approach.

Significance. If the quantization maintains acceptable model quality, this could meaningfully advance PIM architectures for long-context LLM inference by addressing activation capacity limits that exceed PIM memory and conflict with data locality requirements. The in-memory focus leverages PIM's internal bandwidth advantages and could inform hardware-software co-design for efficient inference.

major comments (2)
  1. The abstract reports performance gains and latency reductions from evaluations but provides no details on accuracy metrics, baselines, error bars, or data selection. This leaves the central empirical claim only partially supported without further evidence.
  2. The assumption that clustering-based product quantization (with unspecified algorithmic optimizations) produces activation representations whose error does not materially degrade attention or generation quality is not validated. Activations in long-context Transformers exhibit heavy-tailed distributions and high sensitivity in the attention softmax; without quantitative results on perplexity impact or quality loss, the reported communication reduction and speedup cannot be assessed for practical utility.
minor comments (1)
  1. The abstract contains a grammatical error in 'drastically reducing of GPU-CPU communication' which should be corrected to 'drastically reducing GPU-CPU communication'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the presentation of our empirical results and the validation of AQPIM's quantization effects. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: The abstract reports performance gains and latency reductions from evaluations but provides no details on accuracy metrics, baselines, error bars, or data selection. This leaves the central empirical claim only partially supported without further evidence.

    Authors: We agree that the abstract would be strengthened by including these details. In the revised version, we will expand the abstract to reference the accuracy metrics (perplexity and generation quality), the specific baselines and models evaluated, the use of multiple runs with reported standard deviations, and the datasets/context lengths used. The main text already contains the full evaluation setup and results; the abstract revision will ensure the central claims are better supported at a glance. revision: yes

  2. Referee: The assumption that clustering-based product quantization (with unspecified algorithmic optimizations) produces activation representations whose error does not materially degrade attention or generation quality is not validated. Activations in long-context Transformers exhibit heavy-tailed distributions and high sensitivity in the attention softmax; without quantitative results on perplexity impact or quality loss, the reported communication reduction and speedup cannot be assessed for practical utility.

    Authors: We acknowledge the need for explicit validation given the characteristics of LLM activations. Section 5 of the manuscript already reports perplexity results across LLaMA and other models for varying context lengths, showing limited degradation, along with the algorithmic optimizations in Section 4 that target attention computation. To directly address concerns about heavy-tailed distributions and softmax sensitivity, we will add a new subsection with quantitative analysis of attention score distributions and quality loss metrics in the revision. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation or claims

full rationale

The paper introduces AQPIM as a PIM-aware framework that applies product quantization to LLM activations with unspecified algorithmic optimizations to address accuracy. All load-bearing claims (90-98.5% communication reduction, 3.4× speedup) are presented as outcomes of experimental evaluation on hardware and models rather than quantities derived from equations, fitted parameters, or self-citations within the work. No self-definitional steps, fitted-input predictions, or uniqueness theorems appear; the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that activations in LLMs exhibit clustering properties amenable to product quantization and that PIM hardware can support direct computation on quantized data. No explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption PIM hardware provides sufficiently high internal bandwidth and supports the operations needed for in-memory quantization and computation on compressed activations.
    Invoked to justify performing quantization inside memory and claiming bandwidth and compute efficiency gains.

pith-pipeline@v0.9.0 · 5566 in / 1242 out tokens · 32836 ms · 2026-05-10T04:05:41.400195+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

82 extracted references · 24 canonical work pages · 9 internal anchors

  1. [1]

    Keyformer: Kv cache reduction through key tokens selection for efficient generative inference,

    M. Adnan, A. Arunkumar, G. Jain, P. Nair, I. Soloveychik, and P. Ka- math, “Keyformer: Kv cache reduction through key tokens selection for efficient generative inference,”Proceedings of Machine Learning and Systems, vol. 6, pp. 114–127, 2024

  2. [2]

    LongBench: A bilingual, multitask benchmark for long context understanding,

    Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y . Dong, J. Tang, and J. Li, “LongBench: A bilingual, multitask benchmark for long context understanding,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bang...

  3. [3]

    Longformer: The Long-Document Transformer

    I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long- document transformer,”arXiv preprint arXiv:2004.05150, 2020

  4. [4]

    Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023

    A. Bulatov, Y . Kuratov, Y . Kapushev, and M. S. Burtsev, “Scaling transformer to 1m tokens and beyond with rmt,” 2024. [Online]. Available: https://arxiv.org/abs/2304.11062

  5. [5]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,(2022),”arXiv preprint arXiv:2211.12588, 2022

  6. [6]

    Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,

    Y .-H. Chen, T.-J. Yang, J. Emer, and V . Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,”IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019

  7. [7]

    Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,

    P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y . Liu, Y . Wang, and Y . Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,”ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 27–39, 2016

  8. [8]

    Generating Long Sequences with Sparse Transformers

    R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,”arXiv preprint arXiv:1904.10509, 2019

  9. [9]

    Asap7: A 7-nm finfet predictive process design kit,

    L. T. Clark, V . Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric, “Asap7: A 7-nm finfet predictive process design kit,”Microelectronics Journal, vol. 53, pp. 105–115, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ S002626921630026X

  10. [10]

    Flashattention: Fast and memory-efficient exact attention with io-awareness,

    T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 16 344–16 359. [Online]. Available: https://proceedings.neuri...

  11. [11]

    SKVQ: Sliding-window key and value cache quantization for large language models,

    H. Duanmu, Z. Yuan, X. Li, J. Duan, X. ZHANG, and D. Lin, “SKVQ: Sliding-window key and value cache quantization for large language models,” inFirst Conference on Language Modeling, 2024. [Online]. Available: https://openreview.net/forum?id=nI6JyFSnyV

  12. [12]

    Neural cache: Bit-serial in-cache acceleration of deep neural networks,

    C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaaauw, and R. Das, “Neural cache: Bit-serial in-cache acceleration of deep neural networks,” in2018 ACM/IEEE 45Th annual international symposium on computer architecture (ISCA). IEEE, 2018, pp. 383–396

  13. [13]

    Mvc: Enabling fully coherent multi-data-views through the memory hierarchy with processing in memory,

    D. Fujiki, “Mvc: Enabling fully coherent multi-data-views through the memory hierarchy with processing in memory,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 800–814

  14. [14]

    Duality cache for data parallel acceleration,

    D. Fujiki, S. Mahlke, and R. Das, “Duality cache for data parallel acceleration,” inProceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 397–410

  15. [15]

    Fujiki, X

    D. Fujiki, X. Wang, A. Subramaniyan, and R. Das,In-/near-memory Computing. Springer, 2021

  16. [16]

    Pal: Program-aided language models,

    L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y . Yang, J. Callan, and G. Neubig, “Pal: Program-aided language models,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 10 764–10 799

  17. [17]

    A survey of quantization methods for efficient neural network infer- ence,

    A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network infer- ence,” inLow-power computer vision. Chapman and Hall/CRC, 2022, pp. 291–326

  18. [18]

    ToRA: A tool-integrated reasoning agent for mathe- matical problem solving,

    Z. Gou, Z. Shao, Y . Gong, yelong shen, Y . Yang, M. Huang, N. Duan, and W. Chen, “ToRA: A tool-integrated reasoning agent for mathe- matical problem solving,” inThe Twelfth International Conference on Learning Representations, 2024

  19. [19]

    The llama 3 herd of models,

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al- Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Ma...

  20. [20]

    The Llama 3 Herd of Models

    [Online]. Available: https://arxiv.org/abs/2407.21783

  21. [21]

    Ramulator2.0,

    S. R. Group, “Ramulator2.0,” 2023, https://github.com/CMU-SAFARI/ ramulator2

  22. [22]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  23. [23]

    Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects,

    M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, S. Mirjaliliet al., “Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects,”Authorea Preprints, vol. 1, pp. 1–26, 2023

  24. [24]

    Zipcache: Accurate and efficient KV cache quantization with salient token identification,

    Y . He, L. Zhang, W. Wu, J. Liu, H. Zhou, and B. Zhuang, “Zipcache: Accurate and efficient KV cache quantization with salient token identification,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=5t4ZAkPiJs

  25. [25]

    Neupims: Npu-pim heterogeneous acceleration for batched llm inferencing,

    G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, and J. Park, “Neupims: Npu-pim heterogeneous acceleration for batched llm inferencing,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS ’24. New York, NY , USA: Association for Computin...

  26. [26]

    Squeezed attention: Accelerating long context length llm inference,

    C. Hooper, S. Kim, H. Mohammadzadeh, M. Maheswaran, S. Zhao, J. Paik, M. W. Mahoney, K. Keutzer, and A. Gholami, “Squeezed attention: Accelerating long context length llm inference,” 2025. [Online]. Available: https://arxiv.org/abs/2411.09688

  27. [27]

    Kvquant: Towards 10 million context length llm inference with kv cache quantization,

    C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y . S. Shao, K. Keutzer, and A. Gholami, “Kvquant: Towards 10 million context length llm inference with kv cache quantization,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2...

  28. [28]

    M-ant: Efficient low-bit group quantization for llms via mathematically adaptive numerical type,

    W. Hu, H. Zhang, C. Guo, Y . Feng, R. Guan, Z. Hua, Z. Liu, Y . Guan, M. Guo, and J. Leng, “M-ant: Efficient low-bit group quantization for llms via mathematically adaptive numerical type,”arXiv preprint arXiv:2502.18755, 2025

  29. [29]

    Efficient attentions for long document summarization,

    L. Huang, S. Cao, N. Parulian, H. Ji, and L. Wang, “Efficient attentions for long document summarization,” inProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Cha...

  30. [30]

    Available: https://aclanthology.org/2021.naacl-main.112/

    [Online]. Available: https://aclanthology.org/2021.naacl-main.112/

  31. [31]

    Energy and AI – analysis

    IEA. Energy and AI – analysis. [Online]. Available: https://www.iea. org/reports/energy-and-ai

  32. [32]

    Intel xeon platinum 8480+ processor,

    Intel, “Intel xeon platinum 8480+ processor,” 2023, https: //www.intel.com/content/www/us/en/products/sku/231746/intel-xeon- platinum-8480-processor-105m-cache-2-00-ghz/specifications.html

  33. [33]

    High bandwidth memory dram (hbm3),

    JEDEC, “High bandwidth memory dram (hbm3),” 2022

  34. [34]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023. [Online]. Available: https://arxiv.org/abs/2310.06825

  35. [35]

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention,

    H. Jiang, Y . Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. Abdi, D. Li, C.-Y . Linet al., “Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention,”Advances in Neural Information Processing Systems, vol. 37, pp. 52 481–52 515, 2024

  36. [36]

    Product quantization for nearest neighbor search,

    H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011

  37. [37]

    Aquabolt-xl: Samsung hbm2-pim with in- memory processing for ml accelerators and beyond,

    J. H. Kim, S.-h. Kang, S. Lee, H. Kim, W. Song, Y . Ro, S. Lee, D. Wang, H. Shin, B. Phuahet al., “Aquabolt-xl: Samsung hbm2-pim with in- memory processing for ml accelerators and beyond,” in2021 IEEE Hot Chips 33 Symposium (HCS). IEEE, 2021, pp. 1–26

  38. [38]

    Ramulator: A fast and extensible dram simulator,

    Y . Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible dram simulator,”IEEE Comput. Archit. Lett., vol. 15, no. 1, p. 45–49, Jan

  39. [39]

    Available: https://doi.org/10.1109/LCA.2015.2414456

    [Online]. Available: https://doi.org/10.1109/LCA.2015.2414456

  40. [40]

    The NarrativeQA reading comprehension challenge,

    T. Ko ˇciský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette, “The NarrativeQA reading comprehension challenge,”Transactions of the Association for Computational Linguistics, vol. 6, pp. 317–328, 2018. [Online]. Available: https://aclanthology.org/Q18-1023/

  41. [41]

    Hardware architecture and software stack for pim based on commercial dram technology: Industrial product,

    S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shinet al., “Hardware architecture and software stack for pim based on commercial dram technology: Industrial product,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 43–56

  42. [42]

    Hardware architecture and software stack for pim based on commercial dram technology : Industrial product,

    S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, “Hardware architecture and software stack for pim based on commercial dram technology : Industrial product,” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 43–56

  43. [43]

    SnapKV: LLM knows what you are looking for before generation,

    Y . Li, Y . Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen, “SnapKV: LLM knows what you are looking for before generation,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=poE54GOq2l

  44. [44]

    Duquant: Distributing outliers via dual transformation makes stronger quantized LLMs,

    H. Lin, H. Xu, Y . Wu, J. Cui, Y . Zhang, L. Mou, L. Song, Z. Sun, and Y . Wei, “Duquant: Distributing outliers via dual transformation makes stronger quantized LLMs,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=mp8u2Pcmqz

  45. [45]

    Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

    Y . Lin*, H. Tang*, S. Yang*, Z. Zhang, G. Xiao, C. Gan, and S. Han, “Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,”arXiv preprint arXiv:2405.04532, 2024

  46. [46]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruanet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

  47. [47]

    RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval, December 2024

    D. Liu, M. Chen, B. Lu, H. Jiang, Z. Han, Q. Zhang, Q. Chen, C. Zhang, B. Ding, K. Zhanget al., “Retrievalattention: Accelerating long-context llm inference via vector retrieval,”arXiv preprint arXiv:2409.10516, 2024

  48. [48]

    Clusterkv: Manipulating llm kv cache in semantic space for recallable compression.arXiv preprint arXiv:2412.03213, 2024

    G. Liu, C. Li, J. Zhao, C. Zhang, and M. Guo, “Clusterkv: Manipulating llm kv cache in semantic space for recallable compression,” 2024. [Online]. Available: https://arxiv.org/abs/2412.03213

  49. [49]

    S2ta: Exploiting structured sparsity for energy-efficient mobile cnn acceleration,

    Z.-G. Liu, P. N. Whatmough, Y . Zhu, and M. Mattina, “S2ta: Exploiting structured sparsity for energy-efficient mobile cnn acceleration,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 573–586

  50. [50]

    Kivi: a tuning-free asymmetric 2bit quantization for kv cache,

    Z. Liu, J. Yuan, H. Jin, S. H. Zhong, Z. Xu, V . Braverman, B. Chen, and X. Hu, “Kivi: a tuning-free asymmetric 2bit quantization for kv cache,” inProceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

  51. [51]

    Ramulator 2.0: A modern, modular, and extensible dram simulator,

    H. Luo, Y . C. Tu ˘grul, F. N. Bostancı, A. Olgun, A. G. Ya ˘glıkçı, and O. Mutlu, “Ramulator 2.0: A modern, modular, and extensible dram simulator,”IEEE Comput. Archit. Lett., vol. 23, no. 1, p. 112–116, Jan

  52. [52]

    Nisa Bostancı, Ataberk Olgun, A

    [Online]. Available: https://doi.org/10.1109/LCA.2023.3333759

  53. [53]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    L. McInnes, J. Healy, and J. Melville, “Umap: Uniform manifold approximation and projection for dimension reduction,” 2020. [Online]. Available: https://arxiv.org/abs/1802.03426

  54. [54]

    Pointer sentinel mixture models,

    S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” inInternational Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/ forum?id=Byj72udxe

  55. [55]

    Nvidia h100 tensor core gpu,

    NVIDIA, “Nvidia h100 tensor core gpu,” 2024, https://arc.net/l/quote/ btwhvenw

  56. [56]

    OpenAI, “Models,” 2024, https://platform.openai.com/docs/models/gpt- 4o

  57. [57]

    Transformers are multi-state RNNs , 2024

    M. Oren, M. Hassid, N. Yarden, Y . Adi, and R. Schwartz, “Transformers are multi-state rnns,”arXiv preprint arXiv:2401.06104, 2024

  58. [58]

    Scnn: An accelerator for compressed-sparse convolutional neural networks,

    A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,”ACM SIGARCH computer architecture news, vol. 45, no. 2, pp. 27–40, 2017

  59. [59]

    attacc_simulator,

    J. Park and J. Choi, “attacc_simulator,” 2024, https://github.com/scale- snu/attacc_simulator

  60. [60]

    Attacc! unleashing the power of pim for batched transformer- based generative model inference,

    J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “Attacc! unleashing the power of pim for batched transformer- based generative model inference,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’24. New York, NY , USA: Associati...

  61. [61]

    Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,

    A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra- chan, M. Hu, R. S. Williams, and V . Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26, 2016

  62. [62]

    Omniquant: Omnidirectionally calibrated quantization for large language models,

    W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y . Qiao, and P. Luo, “Omniquant: Omnidirectionally calibrated quantization for large language models,” inICLR, 2024. [Online]. Available: https://openreview.net/forum?id=8Wuvhh0LYW

  63. [63]

    FlexGen: High-throughput generative inference of large language models with a single GPU,

    Y . Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Re, I. Stoica, and C. Zhang, “FlexGen: High-throughput generative inference of large language models with a single GPU,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhard...

  64. [64]

    Sze, Y .-H

    V . Sze, Y .-H. Chen, T.-J. Yang, and J. S. Emer,Efficient processing of deep neural networks. Springer, 2020

  65. [65]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, S. Mariooryad, Y . Ding, X. Geng, F. Alcober, R. Frostig, M. Omernick, L. Walker, C. Paduraru, C. Sorokin, A. Tacchetti, C. Gaffney, S. Daruki, O. Sercinoglu, Z. Gleicher, J. Love, P. V oigtlaender, R. Jain, G. Surita, K. Mohamed, R. Blevins, J. Ahn, T...

  66. [66]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  67. [67]

    Memorizing transformers,

    Y . Wu, M. N. Rabe, D. Hutchins, and C. Szegedy, “Memorizing transformers,” inInternational Conference on Learning Representations,

  68. [68]

    Available: https://openreview.net/forum?id=TrjbxzRcnf-

    [Online]. Available: https://openreview.net/forum?id=TrjbxzRcnf-

  69. [69]

    Smoothquant: accurate and efficient post-training quantization for large language models,

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: accurate and efficient post-training quantization for large language models,” inProceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023

  70. [70]

    Efficient streaming language models with attention sinks,

    G. Xiao, Y . Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=NG7sS51zVF

  71. [71]

    Online product quantization,

    D. Xu, I. W. Tsang, and Y . Zhang, “Online product quantization,”IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 11, pp. 2185–2198, 2018

  72. [72]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y . X. Wei, L. Wang, Z. Xiao, Y . Wang, C. Ruan, M. Zhang, W. Liang, and W. Zeng, “Native sparse attention: Hardware-aligned and natively trainable sparse attention,” 2025. [Online]. Available: https://arxiv.org/abs/2502.11089

  73. [73]

    Gobo: Quan- tizing attention-based nlp models for low latency and energy efficient inference,

    A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “Gobo: Quan- tizing attention-based nlp models for low latency and energy efficient inference,” in2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 811–824

  74. [74]

    Thermometer: profile-guided btb replacement for data center applications,

    A. H. Zadeh, M. Mahmoud, A. Abdelhadi, and A. Moshovos, “Mokey: enabling narrow fixed-point inference for out-of-the-box floating-point transformer models,” inProceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 888–901. [Online]. Available: https:...

  75. [75]

    Big bird: Transformers for longer sequences,

    M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yanget al., “Big bird: Transformers for longer sequences,”Advances in neural information processing systems, vol. 33, pp. 17 283–17 297, 2020

  76. [76]

    Zenodo, https://zenodo.org/records/17378113

  77. [77]

    Pqcache: Product quantization-based kvcache for long context llm inference,

    H. Zhang, X. Ji, Y . Chen, F. Fu, X. Miao, X. Nie, W. Chen, and B. Cui, “Pqcache: Product quantization-based kvcache for long context llm inference,” 2024. [Online]. Available: https://arxiv.org/abs/2407.12820

  78. [78]

    Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization,

    T. Zhang, J. Yi, Z. Xu, and A. Shrivastava, “Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 3304–3331. [Online]. Avail...

  79. [79]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models,

    Z. Zhang, Y . Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y . Tian, C. Re, C. Barrett, Z. Wang, and B. Chen, “H2o: Heavy-hitter oracle for efficient generative inference of large language models,” inThirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https: //openreview.net/forum?id=RkRrPp7GKO

  80. [80]

    Atom: Low-bit quantization for efficient and accurate llm serving,

    Y . Zhao, C.-Y . Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Krishnamurthy, T. Chen, and B. Kasikci, “Atom: Low-bit quantization for efficient and accurate llm serving,”Proceedings of Machine Learning and Systems, vol. 6, pp. 196–209, 2024

Showing first 80 references.