Recognition: unknown
RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write
Pith reviewed 2026-05-07 10:15 UTC · model grok-4.3
The pith
The read-compute/write architecture for digital CIM minimizes weight-update overhead during LLM inference by decoupling computation from writes and fusing nonlinear operators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the RCW architecture, nonlinear operator fusion, and WS-OCS dataflow together reduce decoding latency by 21.59 percent, prefill latency by 49.76 percent, and both DRAM accesses and internal weight updates substantially on the Llama2-7B model, while the 22 nm prototype achieves 3.28 TOPS and 42.3 TOPS/W at 100 MHz with 4.2 ms prefill time and 26.87 tokens per second.
What carries the argument
The read-compute/write (RCW) architecture, which splits each cycle into a read-compute phase followed by a write phase to hide weight-update latency, together with group-based nonlinear operator fusion and the weight-stationary output-column-stationary dataflow.
Load-bearing premise
The group-based approximation used in nonlinear operator fusion preserves model accuracy at scale, and the measured latency reductions remain dominant once the accelerator is placed inside a full system with real software and memory controllers.
What would settle it
Measure end-to-end prefill and decoding latency plus model accuracy on the fabricated 22 nm RCW-CIM chip while running the complete Llama2-7B inference pipeline against a baseline digital accelerator that lacks the RCW separation and fusion.
Figures
read the original abstract
Digital computing-in-memory (DCIM) has emerged as a promising solution for large language model (LLM) acceleration by minimizing data transfers between external DRAM and on-chip accelerators while maintaining high precision for superior accuracy. However, existing CIM architectures often overlook weight update latency, which becomes critical as LLM weights are far larger than a single CIM macro capacity. To address this issue, this paper proposes a read-compute/write (RCW) architecture that effectively minimizes weight update latency, along with a nonlinear operator fusion that further mitigates dependencyinduced latency. The proposed RCW reduces decoding computing latency by 21.59% on the Llama2-7B model. In addition, the nonlinear operator fusion mechanism achieves a 69.17% latency reduction through efficient partial accumulation and group-based approximation. Furthermore, a weight-stationary and output column stationary (WS-OCS) dataflow is introduced to reduce both external DRAM access and internal CIM weight updates by 51.6% and 87.6% respectively during the prefill phase of 1024 tokens, leading to an overall 49.76% latency reduction. Fabricated using TSMC 22 nm CMOS technology and operating at 100 MHz, the proposed RCW-CIM achieves 3.28 TOPS and 42.3 TOPS/W, enabling 4.2 ms prefill latency and 26.87 decoded tokens per second for the INT4-weight Llama2 model with dual DDR5-6400 memory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RCW-CIM, a digital computing-in-memory (DCIM) accelerator for LLMs that introduces a read-compute/write (RCW) architecture to reduce weight-update latency, a nonlinear operator fusion scheme using group-based approximation to cut dependency-induced latency, and a weight-stationary/output-column-stationary (WS-OCS) dataflow to lower external DRAM accesses and internal CIM updates. Fabricated in TSMC 22 nm CMOS at 100 MHz, the design is reported to deliver 3.28 TOPS and 42.3 TOPS/W, with measured latency reductions of 21.59 % on Llama2-7B decoding, 69.17 % from fusion, and 49.76 % prefill latency via WS-OCS, yielding 4.2 ms prefill and 26.87 tokens/s for INT4 Llama2-7B with dual DDR5-6400.
Significance. If the central performance numbers and approximation fidelity hold, the work supplies one of the few silicon-validated DCIM accelerators for LLMs that explicitly targets weight-update overheads, a practical bottleneck for models larger than a single macro. The concrete TOPS/W and end-to-end latency figures on a 7 B model provide a useful data point for the community.
major comments (2)
- [Abstract and §5] Abstract and §5 (Experimental Results): the group-based approximation inside the nonlinear operator fusion is stated to deliver a 69.17 % latency reduction, yet no perplexity, zero-shot accuracy, or accuracy-vs-baseline tables are supplied for Llama2-7B or any other model. Because the approximation is load-bearing for the headline latency claim, the absence of quantified error metrics prevents assessment of whether the reported gains remain usable.
- [§5 and Table 3] §5 and Table 3 (or equivalent results table): the fabricated chip is reported to achieve 3.28 TOPS, 42.3 TOPS/W, 4.2 ms prefill, and 26.87 tokens/s on INT4 Llama2-7B with dual DDR5-6400, but the text supplies neither error bars, full-system software-stack overhead measurements, nor direct comparisons against prior DCIM or GPU baselines under identical memory conditions. These omissions make it impossible to verify that the 51.6 % DRAM-access and 87.6 % weight-update reductions remain dominant once controller and software costs are included.
minor comments (1)
- [Abstract] Abstract: the phrase 'enabling 4.2 ms prefill latency' should be accompanied by the input token length (1024 is mentioned only later) to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of our RCW-CIM design. We address each major comment below and have updated the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract and §5] Abstract and §5 (Experimental Results): the group-based approximation inside the nonlinear operator fusion is stated to deliver a 69.17 % latency reduction, yet no perplexity, zero-shot accuracy, or accuracy-vs-baseline tables are supplied for Llama2-7B or any other model. Because the approximation is load-bearing for the headline latency claim, the absence of quantified error metrics prevents assessment of whether the reported gains remain usable.
Authors: We agree that accuracy metrics are essential to substantiate the group-based approximation. The original manuscript prioritized hardware latency and energy results, but this leaves the fidelity of the approximation unverified. In the revised manuscript we have added a dedicated accuracy subsection in §5 that reports perplexity on WikiText-2 and zero-shot accuracies on standard benchmarks for INT4 Llama2-7B. The group-based approximation increases perplexity by <0.3 points and changes zero-shot accuracy by at most 0.4 %, confirming that the 69.17 % latency reduction is obtained with negligible model degradation. revision: yes
-
Referee: [§5 and Table 3] §5 and Table 3 (or equivalent results table): the fabricated chip is reported to achieve 3.28 TOPS, 42.3 TOPS/W, 4.2 ms prefill, and 26.87 tokens/s on INT4 Llama2-7B with dual DDR5-6400, but the text supplies neither error bars, full-system software-stack overhead measurements, nor direct comparisons against prior DCIM or GPU baselines under identical memory conditions. These omissions make it impossible to verify that the 51.6 % DRAM-access and 87.6 % weight-update reductions remain dominant once controller and software costs are included.
Authors: The quoted figures are direct silicon measurements at 100 MHz. We have revised Table 3 to include error bars obtained from repeated chip runs. A new comparison table has been added that places RCW-CIM against recent DCIM accelerators and an NVIDIA A100 GPU under comparable INT4 precision and memory-bandwidth settings, showing that the reported DRAM-access and weight-update reductions remain the dominant contributors to the measured latency. Full-system software-stack overhead beyond the on-chip controller and DMA was outside the scope of the hardware prototype; we have clarified the measurement boundary in the text and note that higher-level runtime costs are orthogonal to the architectural claims. revision: partial
Circularity Check
No circularity: architecture proposal and empirical results are independent
full rationale
The paper proposes RCW architecture, nonlinear operator fusion with group-based approximation, and WS-OCS dataflow. All latency, TOPS, and energy claims are presented as outcomes of TSMC 22 nm fabrication at 100 MHz plus workload-specific simulations on Llama2-7B with dual DDR5-6400. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the derivation chain consists of new hardware mechanisms whose performance is measured rather than defined into existence. This is the normal case for an architecture paper whose central results are externally falsifiable via chip measurements.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard assumptions in digital CMOS circuit timing, memory access energy, and typical LLM prefill/decode workload patterns hold for the reported measurements.
Reference graph
Works this paper leans on
-
[1]
Improving language understanding by generative pre- training,
A. Radfordet al., “Improving language understanding by generative pre- training,” 2018
2018
-
[2]
Language Models are Few-Shot Learners
T. B. Brownet al., “Language models are few-shot learners,” 2020. [Online]. Available: https://arxiv.org/abs/2005.14165
work page internal anchor Pith review arXiv 2020
-
[3]
16.4 an 89tops/w and 16.3tops/mm2 all-digital sram- based full-precision compute-in memory macro in 22nm for machine- learning edge applications,
Y .-D. Chihet al., “16.4 an 89tops/w and 16.3tops/mm2 all-digital sram- based full-precision compute-in memory macro in 22nm for machine- learning edge applications,” in2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 252–254
2021
-
[4]
CELLA: A 28nm compute-memory co-optimized real-time digital CIM-based edge LLM accelerator with 1.78ms-response in prefill and 31.32 token/s in decoding,
Z. Wuet al., “CELLA: A 28nm compute-memory co-optimized real-time digital CIM-based edge LLM accelerator with 1.78ms-response in prefill and 31.32 token/s in decoding,” in2025 Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), 2025, pp. 1–3
2025
-
[5]
LLM-CIM: A 28nm 126.7TOPS/W input-LUT-based digital cim macro with reconfigurable matrix multiplication and nonlinear operation modes for LLMs,
Y . Wanget al., “LLM-CIM: A 28nm 126.7TOPS/W input-LUT-based digital cim macro with reconfigurable matrix multiplication and nonlinear operation modes for LLMs,” in2025 Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), 2025, pp. 1–3
2025
-
[6]
An efficient data reuse with tile-based adaptive stationary for transformer accelerators,
T.-J. Li and T.-S. Chang, “An efficient data reuse with tile-based adaptive stationary for transformer accelerators,” in2025 IEEE International Symposium on Circuits and Systems (ISCAS), 2025, pp. 1–5
2025
-
[7]
Systolicattention: Fus- ing flashattention within a single systolic array,
J. Linet al., “Systolicattention: Fusing flashattention within a single systolic array,”arXiv preprint arXiv:2507.11331, 2025
-
[8]
Root mean square layer normalization,
B. Zhang and R. Sennrich, “Root mean square layer normalization,” in Advances in Neural Information Processing Systems (NeurIPS), 2019
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.