arxiv: 2604.18137 · v1 · submitted 2026-04-20 · 💻 cs.AR · cs.AI· cs.LG

Recognition: unknown

AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

Kosuke Matsushima , Yasuyuki Okoshi , Masato Motomura , Daichi Fujiki

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:05 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.LG

keywords processing in memoryactivation quantizationproduct quantizationlarge language modelskv cachememory capacityattention computationdata movement reduction

0 comments

The pith

By running product quantization inside memory, AQPIM compresses LLM activations enough to fit within PIM hardware limits and enables direct computation on the compact form.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard PIM designs cannot hold the growing KV caches of long-context LLMs because activation memory exceeds on-chip capacity, and that moving quantization inside the memory array solves this by exploiting high internal bandwidth. Product quantization is applied directly to activations so that attention calculations run on the resulting indices rather than full vectors, cutting the need to ship data off-chip. A sympathetic reader would care because GPU-CPU transfers already dominate 90 to 98.5 percent of decoding time; removing most of that movement would let existing PIM chips run larger models and longer sequences without new hardware.

Core claim

AQPIM is a PIM-aware activation quantization framework based on product quantization that performs the clustering and indexing steps entirely inside memory, introduces algorithmic adjustments to keep accuracy acceptable for LLMs, and thereby shrinks the activation footprint while allowing attention to operate directly on the compressed representations.

What carries the argument

Product quantization executed in-memory on activation vectors, where learned codebooks replace each vector with a short index so that approximate inner products can be computed without restoring the original values.

If this is right

GPU-CPU communication that currently accounts for 90 to 98.5 percent of decoding latency drops sharply.
Overall inference reaches 3.4 times the speed of prior state-of-the-art PIM methods for the same models.
Larger KV caches fit inside fixed PIM memory capacity, supporting longer context lengths.
Attention operations execute directly on the compressed indices, avoiding full decompression overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same in-memory compression pattern could be reused for other memory-capacity-limited PIM workloads such as large graph analytics.
PIM chip designers may add dedicated codebook lookup units as a standard on-die feature rather than leaving quantization to software.
Pairing AQPIM-style quantization with sparsity patterns that preserve data locality could produce further efficiency gains beyond what either technique achieves alone.

Load-bearing premise

The accuracy loss from clustering-based product quantization on LLM activations stays small enough when the whole process runs inside memory that downstream model quality does not degrade noticeably.

What would settle it

Measure end-to-end accuracy of a standard LLM on a benchmark such as MMLU after replacing all activations with AQPIM-quantized versions and observe whether the score falls more than 1-2 percent below the full-precision baseline, or profile actual data-transfer volume and find no reduction in the claimed 90-98.5 percent range.

Figures

Figures reproduced from arXiv: 2604.18137 by Daichi Fujiki, Kosuke Matsushima, Masato Motomura, Yasuyuki Okoshi.

**Figure 2.** Figure 2: Locality within the projection weights (left-most) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of AQPIM and Product Quantization (PQ). [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: The latency comparison of the prefilling and the [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 6.** Figure 6: (1) a sequence is divided into multiple windows, and as the window advances, the previous centroids are copied to a new page and subsequently updated for the window. Then, (2) … (1) Cluster assignments are determined within the window (3) Indirect access is limited within the window (DRAM page) window1 window2 windowN Key and Value Matrices … x Indices … DRAM page size 512 part. prods codebook ・ query (2) … view at source ↗

**Figure 7.** Figure 7: AQPIM architecture and dataflow. TABLE I: List of AQPIM operations. Process Place Necessary Unit Distance Calculation (DC) BankPE ADD, MUL, SUM Cluster Assignment (CA) BufferPE MIN Centroid Calculation (CC) Both MUL, SUM, DIV Attention (ATNK, ATNV) BankPE MUL, SUM Softmax (SFM) BufferPE ADD, SUM, MAX, DIV, EXP we introduce two PE architectures: BankPE and BufferPE, as illustrated in [PITH_FULL_IMAGE:figur… view at source ↗

**Figure 8.** Figure 8: Intra-row indirection for efficient random access. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Data mapping strategy. Each head is assigned to [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: Memory reduction ratio vs. accuracy. gpu attacc aqpim gpu attacc aqpim gpu attacc aqpim gpu attacc aqpim 0.00 0.25 0.50 0.75 1.00 Total execution time 1.00 0.70 0.68 1.00 0.61 0.58 1.00 0.53 0.49 1.00 0.48 0.43 S_in: 4096 S_out: 128 S_in: 4096 S_out: 256 S_in: 4096 S_out: 512 S_in: 4096 S_out: 1024 vs. Architecture gpu pqcache skvq snapkv aqpim gpu pqcache skvq snapkv aqpim gpu pqcache skvq snapkv aqpim g… view at source ↗

**Figure 11.** Figure 11: Normalized total execution time comparing different architectures (left) and algorithms (right). [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

**Figure 12.** Figure 12: Normalized decoding time comparing different architectures (left) and algorithms (right). [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗

**Figure 13.** Figure 13: Decomposition analysis of decoding speedups. [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗

**Figure 14.** Figure 14: Normalized energy for decoding comparing different architectures (left) and algorithms (right). [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗

**Figure 15.** Figure 15: Accuracy & Speedup vs. Memory Reduction. [PITH_FULL_IMAGE:figures/full_fig_p012_15.png] view at source ↗

read the original abstract

Processing-in-Memory (PIM) architectures offer a promising solution to the memory bottlenecks in data-intensive machine learning, yet often overlook the growing challenge of activation memory footprint. Conventional PIM approaches struggle with massive KV cache sizes generated in long-context scenarios by Transformer-based models, frequently exceeding PIM's limited memory capacity, while techniques like sparse attention can conflict with PIM's need for data locality. Existing PIM approaches and quantization methods are often insufficient or poorly suited for leveraging the unique characteristics of activations. This work identifies an opportunity for PIM-specialized activation quantization to enhance bandwidth and compute efficiency. We explore clustering-based vector quantization approaches, which align well with activation characteristics and PIM's internal bandwidth capabilities. Building on this, we introduce AQPIM, a novel PIM-aware activation quantization framework based on Product Quantization (PQ), optimizing it for modern Large Language Models (LLMs). By performing quantization directly within memory, AQPIM leverages PIM's high internal bandwidth and enables direct computation on compressed data, significantly reducing both memory footprint and computational overhead for attention computation. AQPIM addresses PQ's accuracy challenges by introducing several algorithmic optimizations. Evaluations demonstrate that AQPIM achieves significant performance improvements, drastically reducing of GPU-CPU communication that can account for 90$\sim$98.5\% of decoding latency, together with 3.4$\times$ speedup over a SOTA PIM approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AQPIM applies product quantization to activations inside PIM hardware to shrink LLM KV caches and cut communication, but the accuracy impact on model quality is the unproven part.

read the letter

The paper's main point is that standard PIM setups hit a wall with large activation memories in long-context LLMs, and AQPIM tries to fix it by doing product quantization on activations in-memory so you can compute directly on the compressed form and avoid shipping data off-chip. They add some targeted tweaks to PQ to keep accuracy reasonable for this use case, then show big drops in GPU-CPU traffic (90-98.5% of decoding time) and a 3.4x speedup versus prior PIM work. That lines up with a real bottleneck people are seeing as context lengths grow. The approach feels like a sensible hardware-software fit because it plays to PIM's internal bandwidth instead of fighting it with sparse attention or offloading. The optimizations for accuracy are the part that could make it distinct from off-the-shelf quantization papers. The soft spot is still the accuracy side. Activations in attention layers have heavy tails and the softmax is sensitive, so clustering-based PQ needs to prove it keeps perplexity or generation quality within acceptable bounds. The abstract gives no numbers on that, no error bars, and no clear comparison to simple baselines or existing quantization methods. If the quality loss turns out material, the latency wins become irrelevant. The hardware mapping for codebook lookups and partial sums also needs to be shown working without hidden CPU steps. This is for people building or evaluating PIM accelerators for transformers. Anyone working on memory-bound LLM inference will get something out of the framing, even if they end up adapting the ideas rather than using the exact method. The work deserves a serious referee because the problem is timely and the proposal is concrete enough to review and improve, though the authors will likely need to add detailed accuracy results and hardware validation in revision.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce AQPIM, a PIM-aware activation quantization framework based on Product Quantization (PQ) optimized for LLMs. By performing quantization and computation directly in memory, it reduces activation memory footprint, enables direct computation on compressed data, and minimizes GPU-CPU communication (claimed to account for 90-98.5% of decoding latency), achieving a 3.4× speedup over a state-of-the-art PIM approach.

Significance. If the quantization maintains acceptable model quality, this could meaningfully advance PIM architectures for long-context LLM inference by addressing activation capacity limits that exceed PIM memory and conflict with data locality requirements. The in-memory focus leverages PIM's internal bandwidth advantages and could inform hardware-software co-design for efficient inference.

major comments (2)

The abstract reports performance gains and latency reductions from evaluations but provides no details on accuracy metrics, baselines, error bars, or data selection. This leaves the central empirical claim only partially supported without further evidence.
The assumption that clustering-based product quantization (with unspecified algorithmic optimizations) produces activation representations whose error does not materially degrade attention or generation quality is not validated. Activations in long-context Transformers exhibit heavy-tailed distributions and high sensitivity in the attention softmax; without quantitative results on perplexity impact or quality loss, the reported communication reduction and speedup cannot be assessed for practical utility.

minor comments (1)

The abstract contains a grammatical error in 'drastically reducing of GPU-CPU communication' which should be corrected to 'drastically reducing GPU-CPU communication'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the presentation of our empirical results and the validation of AQPIM's quantization effects. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: The abstract reports performance gains and latency reductions from evaluations but provides no details on accuracy metrics, baselines, error bars, or data selection. This leaves the central empirical claim only partially supported without further evidence.

Authors: We agree that the abstract would be strengthened by including these details. In the revised version, we will expand the abstract to reference the accuracy metrics (perplexity and generation quality), the specific baselines and models evaluated, the use of multiple runs with reported standard deviations, and the datasets/context lengths used. The main text already contains the full evaluation setup and results; the abstract revision will ensure the central claims are better supported at a glance. revision: yes
Referee: The assumption that clustering-based product quantization (with unspecified algorithmic optimizations) produces activation representations whose error does not materially degrade attention or generation quality is not validated. Activations in long-context Transformers exhibit heavy-tailed distributions and high sensitivity in the attention softmax; without quantitative results on perplexity impact or quality loss, the reported communication reduction and speedup cannot be assessed for practical utility.

Authors: We acknowledge the need for explicit validation given the characteristics of LLM activations. Section 5 of the manuscript already reports perplexity results across LLaMA and other models for varying context lengths, showing limited degradation, along with the algorithmic optimizations in Section 4 that target attention computation. To directly address concerns about heavy-tailed distributions and softmax sensitivity, we will add a new subsection with quantitative analysis of attention score distributions and quality loss metrics in the revision. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation or claims

full rationale

The paper introduces AQPIM as a PIM-aware framework that applies product quantization to LLM activations with unspecified algorithmic optimizations to address accuracy. All load-bearing claims (90-98.5% communication reduction, 3.4× speedup) are presented as outcomes of experimental evaluation on hardware and models rather than quantities derived from equations, fitted parameters, or self-citations within the work. No self-definitional steps, fitted-input predictions, or uniqueness theorems appear; the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that activations in LLMs exhibit clustering properties amenable to product quantization and that PIM hardware can support direct computation on quantized data. No explicit free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption PIM hardware provides sufficiently high internal bandwidth and supports the operations needed for in-memory quantization and computation on compressed activations.
Invoked to justify performing quantization inside memory and claiming bandwidth and compute efficiency gains.

pith-pipeline@v0.9.0 · 5566 in / 1242 out tokens · 32836 ms · 2026-05-10T04:05:41.400195+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

82 extracted references · 24 canonical work pages · 9 internal anchors

[1]

Keyformer: Kv cache reduction through key tokens selection for efficient generative inference,

M. Adnan, A. Arunkumar, G. Jain, P. Nair, I. Soloveychik, and P. Ka- math, “Keyformer: Kv cache reduction through key tokens selection for efficient generative inference,”Proceedings of Machine Learning and Systems, vol. 6, pp. 114–127, 2024

2024
[2]

LongBench: A bilingual, multitask benchmark for long context understanding,

Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y . Dong, J. Tang, and J. Li, “LongBench: A bilingual, multitask benchmark for long context understanding,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bang...

2024
[3]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long- document transformer,”arXiv preprint arXiv:2004.05150, 2020

work page internal anchor Pith review arXiv 2004
[4]

Scaling transformer to 1m tokens and beyond with rmt.arXiv preprint arXiv:2304.11062, 2023

A. Bulatov, Y . Kuratov, Y . Kapushev, and M. S. Burtsev, “Scaling transformer to 1m tokens and beyond with rmt,” 2024. [Online]. Available: https://arxiv.org/abs/2304.11062

work page arXiv 2024
[5]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,(2022),”arXiv preprint arXiv:2211.12588, 2022

work page internal anchor Pith review arXiv 2022
[6]

Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,

Y .-H. Chen, T.-J. Yang, J. Emer, and V . Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,”IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019

2019
[7]

Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,

P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y . Liu, Y . Wang, and Y . Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,”ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 27–39, 2016

2016
[8]

Generating Long Sequences with Sparse Transformers

R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,”arXiv preprint arXiv:1904.10509, 2019

work page internal anchor Pith review arXiv 1904
[9]

Asap7: A 7-nm finfet predictive process design kit,

L. T. Clark, V . Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric, “Asap7: A 7-nm finfet predictive process design kit,”Microelectronics Journal, vol. 53, pp. 105–115, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ S002626921630026X

2016
[10]

Flashattention: Fast and memory-efficient exact attention with io-awareness,

T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 16 344–16 359. [Online]. Available: https://proceedings.neuri...

2022
[11]

SKVQ: Sliding-window key and value cache quantization for large language models,

H. Duanmu, Z. Yuan, X. Li, J. Duan, X. ZHANG, and D. Lin, “SKVQ: Sliding-window key and value cache quantization for large language models,” inFirst Conference on Language Modeling, 2024. [Online]. Available: https://openreview.net/forum?id=nI6JyFSnyV

2024
[12]

Neural cache: Bit-serial in-cache acceleration of deep neural networks,

C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaaauw, and R. Das, “Neural cache: Bit-serial in-cache acceleration of deep neural networks,” in2018 ACM/IEEE 45Th annual international symposium on computer architecture (ISCA). IEEE, 2018, pp. 383–396

2018
[13]

Mvc: Enabling fully coherent multi-data-views through the memory hierarchy with processing in memory,

D. Fujiki, “Mvc: Enabling fully coherent multi-data-views through the memory hierarchy with processing in memory,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 800–814

2023
[14]

Duality cache for data parallel acceleration,

D. Fujiki, S. Mahlke, and R. Das, “Duality cache for data parallel acceleration,” inProceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 397–410

2019
[15]

Fujiki, X

D. Fujiki, X. Wang, A. Subramaniyan, and R. Das,In-/near-memory Computing. Springer, 2021

2021
[16]

Pal: Program-aided language models,

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y . Yang, J. Callan, and G. Neubig, “Pal: Program-aided language models,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 10 764–10 799

2023
[17]

A survey of quantization methods for efficient neural network infer- ence,

A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network infer- ence,” inLow-power computer vision. Chapman and Hall/CRC, 2022, pp. 291–326

2022
[18]

ToRA: A tool-integrated reasoning agent for mathe- matical problem solving,

Z. Gou, Z. Shao, Y . Gong, yelong shen, Y . Yang, M. Huang, N. Duan, and W. Chen, “ToRA: A tool-integrated reasoning agent for mathe- matical problem solving,” inThe Twelfth International Conference on Learning Representations, 2024

2024
[19]

The llama 3 herd of models,

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al- Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Ma...
[20]

The Llama 3 Herd of Models

[Online]. Available: https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Ramulator2.0,

S. R. Group, “Ramulator2.0,” 2023, https://github.com/CMU-SAFARI/ ramulator2

2023
[22]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects,

M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, S. Mirjaliliet al., “Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects,”Authorea Preprints, vol. 1, pp. 1–26, 2023

2023
[24]

Zipcache: Accurate and efficient KV cache quantization with salient token identification,

Y . He, L. Zhang, W. Wu, J. Liu, H. Zhou, and B. Zhuang, “Zipcache: Accurate and efficient KV cache quantization with salient token identification,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=5t4ZAkPiJs

2024
[25]

Neupims: Npu-pim heterogeneous acceleration for batched llm inferencing,

G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, and J. Park, “Neupims: Npu-pim heterogeneous acceleration for batched llm inferencing,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS ’24. New York, NY , USA: Association for Computin...

work page doi:10.1145/3620666.3651380 2024
[26]

Squeezed attention: Accelerating long context length llm inference,

C. Hooper, S. Kim, H. Mohammadzadeh, M. Maheswaran, S. Zhao, J. Paik, M. W. Mahoney, K. Keutzer, and A. Gholami, “Squeezed attention: Accelerating long context length llm inference,” 2025. [Online]. Available: https://arxiv.org/abs/2411.09688

work page arXiv 2025
[27]

Kvquant: Towards 10 million context length llm inference with kv cache quantization,

C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y . S. Shao, K. Keutzer, and A. Gholami, “Kvquant: Towards 10 million context length llm inference with kv cache quantization,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2...

2024
[28]

M-ant: Efficient low-bit group quantization for llms via mathematically adaptive numerical type,

W. Hu, H. Zhang, C. Guo, Y . Feng, R. Guan, Z. Hua, Z. Liu, Y . Guan, M. Guo, and J. Leng, “M-ant: Efficient low-bit group quantization for llms via mathematically adaptive numerical type,”arXiv preprint arXiv:2502.18755, 2025

work page arXiv 2025
[29]

Efficient attentions for long document summarization,

L. Huang, S. Cao, N. Parulian, H. Ji, and L. Wang, “Efficient attentions for long document summarization,” inProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Cha...

2021
[30]

Available: https://aclanthology.org/2021.naacl-main.112/

[Online]. Available: https://aclanthology.org/2021.naacl-main.112/

2021
[31]

Energy and AI – analysis

IEA. Energy and AI – analysis. [Online]. Available: https://www.iea. org/reports/energy-and-ai
[32]

Intel xeon platinum 8480+ processor,

Intel, “Intel xeon platinum 8480+ processor,” 2023, https: //www.intel.com/content/www/us/en/products/sku/231746/intel-xeon- platinum-8480-processor-105m-cache-2-00-ghz/specifications.html

2023
[33]

High bandwidth memory dram (hbm3),

JEDEC, “High bandwidth memory dram (hbm3),” 2022

2022
[34]

Mistral 7B

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023. [Online]. Available: https://arxiv.org/abs/2310.06825

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention,

H. Jiang, Y . Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. Abdi, D. Li, C.-Y . Linet al., “Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention,”Advances in Neural Information Processing Systems, vol. 37, pp. 52 481–52 515, 2024

2024
[36]

Product quantization for nearest neighbor search,

H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011

2011
[37]

Aquabolt-xl: Samsung hbm2-pim with in- memory processing for ml accelerators and beyond,

J. H. Kim, S.-h. Kang, S. Lee, H. Kim, W. Song, Y . Ro, S. Lee, D. Wang, H. Shin, B. Phuahet al., “Aquabolt-xl: Samsung hbm2-pim with in- memory processing for ml accelerators and beyond,” in2021 IEEE Hot Chips 33 Symposium (HCS). IEEE, 2021, pp. 1–26

2021
[38]

Ramulator: A fast and extensible dram simulator,

Y . Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible dram simulator,”IEEE Comput. Archit. Lett., vol. 15, no. 1, p. 45–49, Jan
[39]

Available: https://doi.org/10.1109/LCA.2015.2414456

[Online]. Available: https://doi.org/10.1109/LCA.2015.2414456

work page doi:10.1109/lca.2015.2414456 2015
[40]

The NarrativeQA reading comprehension challenge,

T. Ko ˇciský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette, “The NarrativeQA reading comprehension challenge,”Transactions of the Association for Computational Linguistics, vol. 6, pp. 317–328, 2018. [Online]. Available: https://aclanthology.org/Q18-1023/

2018
[41]

Hardware architecture and software stack for pim based on commercial dram technology: Industrial product,

S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shinet al., “Hardware architecture and software stack for pim based on commercial dram technology: Industrial product,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 43–56

2021
[42]

Hardware architecture and software stack for pim based on commercial dram technology : Industrial product,

S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, “Hardware architecture and software stack for pim based on commercial dram technology : Industrial product,” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 43–56

2021
[43]

SnapKV: LLM knows what you are looking for before generation,

Y . Li, Y . Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen, “SnapKV: LLM knows what you are looking for before generation,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=poE54GOq2l

2024
[44]

Duquant: Distributing outliers via dual transformation makes stronger quantized LLMs,

H. Lin, H. Xu, Y . Wu, J. Cui, Y . Zhang, L. Mou, L. Song, Z. Sun, and Y . Wei, “Duquant: Distributing outliers via dual transformation makes stronger quantized LLMs,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=mp8u2Pcmqz

2024
[45]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

Y . Lin*, H. Tang*, S. Yang*, Z. Zhang, G. Xiao, C. Gan, and S. Han, “Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,”arXiv preprint arXiv:2405.04532, 2024

work page arXiv 2024
[46]

DeepSeek-V3 Technical Report

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruanet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval, December 2024

D. Liu, M. Chen, B. Lu, H. Jiang, Z. Han, Q. Zhang, Q. Chen, C. Zhang, B. Ding, K. Zhanget al., “Retrievalattention: Accelerating long-context llm inference via vector retrieval,”arXiv preprint arXiv:2409.10516, 2024

work page arXiv 2024
[48]

Clusterkv: Manipulating llm kv cache in semantic space for recallable compression.arXiv preprint arXiv:2412.03213, 2024

G. Liu, C. Li, J. Zhao, C. Zhang, and M. Guo, “Clusterkv: Manipulating llm kv cache in semantic space for recallable compression,” 2024. [Online]. Available: https://arxiv.org/abs/2412.03213

work page arXiv 2024
[49]

S2ta: Exploiting structured sparsity for energy-efficient mobile cnn acceleration,

Z.-G. Liu, P. N. Whatmough, Y . Zhu, and M. Mattina, “S2ta: Exploiting structured sparsity for energy-efficient mobile cnn acceleration,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 573–586

2022
[50]

Kivi: a tuning-free asymmetric 2bit quantization for kv cache,

Z. Liu, J. Yuan, H. Jin, S. H. Zhong, Z. Xu, V . Braverman, B. Chen, and X. Hu, “Kivi: a tuning-free asymmetric 2bit quantization for kv cache,” inProceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

2024
[51]

Ramulator 2.0: A modern, modular, and extensible dram simulator,

H. Luo, Y . C. Tu ˘grul, F. N. Bostancı, A. Olgun, A. G. Ya ˘glıkçı, and O. Mutlu, “Ramulator 2.0: A modern, modular, and extensible dram simulator,”IEEE Comput. Archit. Lett., vol. 23, no. 1, p. 112–116, Jan
[52]

Nisa Bostancı, Ataberk Olgun, A

[Online]. Available: https://doi.org/10.1109/LCA.2023.3333759

work page doi:10.1109/lca.2023.3333759 2023
[53]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

L. McInnes, J. Healy, and J. Melville, “Umap: Uniform manifold approximation and projection for dimension reduction,” 2020. [Online]. Available: https://arxiv.org/abs/1802.03426

work page internal anchor Pith review arXiv 2020
[54]

Pointer sentinel mixture models,

S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” inInternational Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/ forum?id=Byj72udxe

2017
[55]

Nvidia h100 tensor core gpu,

NVIDIA, “Nvidia h100 tensor core gpu,” 2024, https://arc.net/l/quote/ btwhvenw

2024
[56]

OpenAI, “Models,” 2024, https://platform.openai.com/docs/models/gpt- 4o

2024
[57]

Transformers are multi-state RNNs , 2024

M. Oren, M. Hassid, N. Yarden, Y . Adi, and R. Schwartz, “Transformers are multi-state rnns,”arXiv preprint arXiv:2401.06104, 2024

work page arXiv 2024
[58]

Scnn: An accelerator for compressed-sparse convolutional neural networks,

A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,”ACM SIGARCH computer architecture news, vol. 45, no. 2, pp. 27–40, 2017

2017
[59]

attacc_simulator,

J. Park and J. Choi, “attacc_simulator,” 2024, https://github.com/scale- snu/attacc_simulator

2024
[60]

Attacc! unleashing the power of pim for batched transformer- based generative model inference,

J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “Attacc! unleashing the power of pim for batched transformer- based generative model inference,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’24. New York, NY , USA: Associati...

work page doi:10.1145/3620665.3640422 2024
[61]

Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,

A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra- chan, M. Hu, R. S. Williams, and V . Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26, 2016

2016
[62]

Omniquant: Omnidirectionally calibrated quantization for large language models,

W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y . Qiao, and P. Luo, “Omniquant: Omnidirectionally calibrated quantization for large language models,” inICLR, 2024. [Online]. Available: https://openreview.net/forum?id=8Wuvhh0LYW

2024
[63]

FlexGen: High-throughput generative inference of large language models with a single GPU,

Y . Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Re, I. Stoica, and C. Zhang, “FlexGen: High-throughput generative inference of large language models with a single GPU,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhard...

2023
[64]

Sze, Y .-H

V . Sze, Y .-H. Chen, T.-J. Yang, and J. S. Emer,Efficient processing of deep neural networks. Springer, 2020

2020
[65]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, S. Mariooryad, Y . Ding, X. Geng, F. Alcober, R. Frostig, M. Omernick, L. Walker, C. Paduraru, C. Sorokin, A. Tacchetti, C. Gaffney, S. Daruki, O. Sercinoglu, Z. Gleicher, J. Love, P. V oigtlaender, R. Jain, G. Surita, K. Mohamed, R. Blevins, J. Ahn, T...

work page internal anchor Pith review arXiv 2024
[66]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

2022
[67]

Memorizing transformers,

Y . Wu, M. N. Rabe, D. Hutchins, and C. Szegedy, “Memorizing transformers,” inInternational Conference on Learning Representations,
[68]

Available: https://openreview.net/forum?id=TrjbxzRcnf-

[Online]. Available: https://openreview.net/forum?id=TrjbxzRcnf-
[69]

Smoothquant: accurate and efficient post-training quantization for large language models,

G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: accurate and efficient post-training quantization for large language models,” inProceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023

2023
[70]

Efficient streaming language models with attention sinks,

G. Xiao, Y . Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=NG7sS51zVF

2024
[71]

Online product quantization,

D. Xu, I. W. Tsang, and Y . Zhang, “Online product quantization,”IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 11, pp. 2185–2198, 2018

2018
[72]

Native sparse attention: Hardware-aligned and natively trainable sparse attention

J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y . X. Wei, L. Wang, Z. Xiao, Y . Wang, C. Ruan, M. Zhang, W. Liang, and W. Zeng, “Native sparse attention: Hardware-aligned and natively trainable sparse attention,” 2025. [Online]. Available: https://arxiv.org/abs/2502.11089

work page arXiv 2025
[73]

Gobo: Quan- tizing attention-based nlp models for low latency and energy efficient inference,

A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “Gobo: Quan- tizing attention-based nlp models for low latency and energy efficient inference,” in2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 811–824

2020
[74]

Thermometer: profile-guided btb replacement for data center applications,

A. H. Zadeh, M. Mahmoud, A. Abdelhadi, and A. Moshovos, “Mokey: enabling narrow fixed-point inference for out-of-the-box floating-point transformer models,” inProceedings of the 49th Annual International Symposium on Computer Architecture, ser. ISCA ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 888–901. [Online]. Available: https:...

work page doi:10.1145/3470496.3527438 2022
[75]

Big bird: Transformers for longer sequences,

M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yanget al., “Big bird: Transformers for longer sequences,”Advances in neural information processing systems, vol. 33, pp. 17 283–17 297, 2020

2020
[76]

Zenodo, https://zenodo.org/records/17378113

work page arXiv
[77]

Pqcache: Product quantization-based kvcache for long context llm inference,

H. Zhang, X. Ji, Y . Chen, F. Fu, X. Miao, X. Nie, W. Chen, and B. Cui, “Pqcache: Product quantization-based kvcache for long context llm inference,” 2024. [Online]. Available: https://arxiv.org/abs/2407.12820

work page arXiv 2024
[78]

Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization,

T. Zhang, J. Yi, Z. Xu, and A. Shrivastava, “Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 3304–3331. [Online]. Avail...

2024
[79]

H2o: Heavy-hitter oracle for efficient generative inference of large language models,

Z. Zhang, Y . Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y . Tian, C. Re, C. Barrett, Z. Wang, and B. Chen, “H2o: Heavy-hitter oracle for efficient generative inference of large language models,” inThirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https: //openreview.net/forum?id=RkRrPp7GKO

2023
[80]

Atom: Low-bit quantization for efficient and accurate llm serving,

Y . Zhao, C.-Y . Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Krishnamurthy, T. Chen, and B. Kasikci, “Atom: Low-bit quantization for efficient and accurate llm serving,”Proceedings of Machine Learning and Systems, vol. 6, pp. 196–209, 2024

2024

Showing first 80 references.