pith. machine review for the scientific record.

arxiv: 2604.27384 · v1 · submitted 2026-04-30 · 💻 cs.AR

Recognition: unknown

RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

Jian-Wei Su, Tian-Sheuan Chang, Yan-Cheng Guo

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 10:15 UTC · model grok-4.3

classification 💻 cs.AR
keywords digital computing-in-memory · LLM accelerator · read-compute/write · nonlinear operator fusion · WS-OCS dataflow · Llama2-7B · latency reduction · 22 nm CMOS

The pith

The read-compute/write architecture for digital CIM minimizes weight-update overhead during LLM inference by decoupling computation from writes and fusing nonlinear operators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a read-compute/write architecture for digital computing-in-memory accelerators targeting large language models. It addresses the bottleneck of frequent weight updates that arises when model weights exceed the capacity of individual CIM macros. The design separates read-compute phases from write phases, applies group-based approximation to fuse nonlinear operators, and uses a weight-stationary output-column-stationary dataflow to cut external memory traffic. Fabricated in 22 nm CMOS and tested on Llama2-7B with INT4 weights, the approach delivers concrete reductions in both decoding and prefill latency while reaching 42.3 TOPS/W efficiency.
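
To make the dataflow argument concrete, here is a toy weight-update counter (illustrative only: the layer shape, tile sizes, and loop orders below are our assumptions, not the paper's parameters) contrasting an order that re-writes weight tiles for every output row block with a weight-stationary order that exhausts each tile's reuse before replacing it:

```python
# Toy weight-update counter for a tiled GEMM C[M,N] = A[M,K] @ B[K,N].
# Tile shapes and loop orders are hypothetical; the paper's WS-OCS
# parameters are not given in the excerpt.

M, K, N = 4096, 4096, 4096     # hypothetical layer shape
TK, TN = 128, 64               # macro holds one TK x TN weight tile
TM = 128                       # activation row block

n_wtiles = (K // TK) * (N // TN)
n_rowblk = M // TM

# Output-stationary-style order: every row block re-writes every weight tile.
os_updates = n_rowblk * n_wtiles

# Weight-stationary (WS-OCS-flavored): each weight tile is written once and
# reused across all row blocks and the output columns it feeds before eviction.
ws_updates = n_wtiles

print(f"weight writes: OS {os_updates}, WS {ws_updates}, "
      f"reduction {100 * (1 - ws_updates / os_updates):.1f}%")
```

Under these toy numbers the weight-stationary order cuts macro weight writes by roughly 97%; the paper's reported 87.6% reduction is the same kind of effect, measured on its actual tiling.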

Core claim

The central claim is that the RCW architecture, nonlinear operator fusion, and WS-OCS dataflow together reduce decoding latency by 21.59%, prefill latency by 49.76%, external DRAM accesses by 51.6%, and internal weight updates by 87.6% (over a 1024-token prefill) on the Llama2-7B model, while the 22 nm prototype achieves 3.28 TOPS and 42.3 TOPS/W at 100 MHz with 4.2 ms prefill time and 26.87 tokens per second.
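
As a quick plausibility check on those throughput numbers (not from the paper; the bandwidth figure and the assumption that every decoded token streams the full weight set are ours), the decode rate sits close to the ceiling that dual-channel DDR5-6400 weight streaming alone would impose on a 7B INT4 model:

```python
# Back-of-envelope plausibility check (our assumptions, not paper data):
# if decoding is bound by streaming INT4 weights from dual DDR5-6400,
# the token rate should sit near bandwidth / model size.

params = 7e9                    # Llama2-7B parameter count
weight_bytes = params * 0.5     # INT4 = 4 bits/weight -> ~3.5 GB

# Dual-channel DDR5-6400: 6400 MT/s * 8 B per transfer * 2 channels
peak_bw = 6400e6 * 8 * 2        # ~102.4 GB/s

ms_per_token = weight_bytes / peak_bw * 1e3
print(f"streaming bound: {ms_per_token:.1f} ms/token "
      f"(~{1e3 / ms_per_token:.1f} tokens/s)")
# ~34 ms/token, i.e. ~29 tokens/s ceiling -- consistent with the reported
# 26.87 tokens/s, suggesting decode is memory-bound and the architectural
# gains show up as low overhead on top of the weight stream.
```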

What carries the argument

The read-compute/write (RCW) architecture, which splits each cycle into a read-compute phase followed by a write phase so that weight-update latency is hidden behind computation, together with group-based nonlinear operator fusion and the weight-stationary output-column-stationary (WS-OCS) dataflow.
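
A minimal timing sketch of that phase split, with invented cycle counts (the paper's actual phase timings are not given in the excerpt):

```python
# Toy timing model of weight-update hiding. All cycle counts are
# hypothetical; the paper's actual phase timings are not in the excerpt.

def serialized(tiles: int, write_cyc: int, compute_cyc: int) -> int:
    """Baseline: each weight tile is fully written before computing on it."""
    return tiles * (write_cyc + compute_cyc)

def rcw(tiles: int, write_cyc: int, compute_cyc: int) -> int:
    """RCW-style: tile i+1's write overlaps tile i's read-compute phase,
    so steady-state cost per tile is max(write, compute)."""
    return write_cyc + tiles * max(write_cyc, compute_cyc)  # first write exposed

tiles, w, c = 64, 40, 100
base, ours = serialized(tiles, w, c), rcw(tiles, w, c)
print(f"serialized {base} cyc, overlapped {ours} cyc, "
      f"saved {100 * (base - ours) / base:.1f}%")   # ~28% under these numbers
```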

Load-bearing premise

The group-based approximation used in nonlinear operator fusion preserves model accuracy at scale, and the measured latency reductions remain dominant once the accelerator is placed inside a full system with real software and memory controllers.
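
One plausible reading of the group-based fusion, sketched below (the excerpt does not spell out the exact formulation, so the function name, group size, and merge order are our assumptions): each group emits a local max and exponent sum as a partial accumulation, and groups are merged with an online-softmax rescale. The merge shown here happens to be numerically exact; the paper's approximation presumably relaxes it further for latency.

```python
import numpy as np

def group_softmax(x: np.ndarray, group: int) -> np.ndarray:
    """Softmax over x via per-group partial accumulation.

    Hypothetical sketch: group size must divide len(x); the paper's
    actual grouping and merge scheme are not given in the excerpt.
    """
    g = x.reshape(-1, group)                        # split row into groups
    g_max = g.max(axis=1)                           # local max per group
    g_sum = np.exp(g - g_max[:, None]).sum(axis=1)  # local exp-sum per group
    m = g_max.max()                                 # merge: global max ...
    z = (g_sum * np.exp(g_max - m)).sum()           # ... and rescaled sum
    return np.exp(x - m) / z

x = np.random.randn(1024)
ref = np.exp(x - x.max()); ref /= ref.sum()         # exact softmax reference
assert np.allclose(group_softmax(x, group=64), ref)
```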

What would settle it

Measure end-to-end prefill and decoding latency plus model accuracy on the fabricated 22 nm RCW-CIM chip while running the complete Llama2-7B inference pipeline against a baseline digital accelerator that lacks the RCW separation and fusion.

Figures

Figures reproduced from arXiv: 2604.27384 by Jian-Wei Su, Tian-Sheuan Chang, Yan-Cheng Guo.

Figure 1. Prior work [4], [5] typically ignores weight-update latency and does …
Figure 4. Read-compute/write (RCW) in the CIM macro: Phase 1 reads weights …
Figure 5. Comparative analysis of the IS-OS, WS-OS, and WS-OCS dataflows …
Figure 6. Mapping of the WS-OCS dataflow onto the CIM macro.
Figure 7. Nonlinear group softmax fusion in the CIM macro supports full and …
Figure 8. (a) Compared with the WS dataflow, the WS-OCS scheme reduces …
Figure 9. The baseline employs multi-macro parallel weight updates to reduce …
original abstract

Digital computing-in-memory (DCIM) has emerged as a promising solution for large language model (LLM) acceleration by minimizing data transfers between external DRAM and on-chip accelerators while maintaining high precision for superior accuracy. However, existing CIM architectures often overlook weight update latency, which becomes critical as LLM weights are far larger than a single CIM macro capacity. To address this issue, this paper proposes a read-compute/write (RCW) architecture that effectively minimizes weight update latency, along with a nonlinear operator fusion that further mitigates dependency-induced latency. The proposed RCW reduces decoding computing latency by 21.59% on the Llama2-7B model. In addition, the nonlinear operator fusion mechanism achieves a 69.17% latency reduction through efficient partial accumulation and group-based approximation. Furthermore, a weight-stationary and output column stationary (WS-OCS) dataflow is introduced to reduce both external DRAM access and internal CIM weight updates by 51.6% and 87.6% respectively during the prefill phase of 1024 tokens, leading to an overall 49.76% latency reduction. Fabricated using TSMC 22 nm CMOS technology and operating at 100 MHz, the proposed RCW-CIM achieves 3.28 TOPS and 42.3 TOPS/W, enabling 4.2 ms prefill latency and 26.87 decoded tokens per second for the INT4-weight Llama2 model with dual DDR5-6400 memory.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes RCW-CIM, a digital computing-in-memory (DCIM) accelerator for LLMs that introduces a read-compute/write (RCW) architecture to reduce weight-update latency, a nonlinear operator fusion scheme using group-based approximation to cut dependency-induced latency, and a weight-stationary/output-column-stationary (WS-OCS) dataflow to lower external DRAM accesses and internal CIM updates. Fabricated in TSMC 22 nm CMOS at 100 MHz, the design is reported to deliver 3.28 TOPS and 42.3 TOPS/W, with measured latency reductions of 21.59% on Llama2-7B decoding, 69.17% from fusion, and 49.76% prefill latency via WS-OCS, yielding 4.2 ms prefill and 26.87 tokens/s for INT4 Llama2-7B with dual DDR5-6400.

Significance. If the central performance numbers and approximation fidelity hold, the work supplies one of the few silicon-validated DCIM accelerators for LLMs that explicitly targets weight-update overheads, a practical bottleneck for models larger than a single macro. The concrete TOPS/W and end-to-end latency figures on a 7B model provide a useful data point for the community.

major comments (2)
  1. [Abstract and §5 (Experimental Results)] The group-based approximation inside the nonlinear operator fusion is stated to deliver a 69.17% latency reduction, yet no perplexity, zero-shot accuracy, or accuracy-vs-baseline tables are supplied for Llama2-7B or any other model. Because the approximation is load-bearing for the headline latency claim, the absence of quantified error metrics prevents assessment of whether the reported gains remain usable.
  2. [§5 and Table 3 (or equivalent results table)] The fabricated chip is reported to achieve 3.28 TOPS, 42.3 TOPS/W, 4.2 ms prefill, and 26.87 tokens/s on INT4 Llama2-7B with dual DDR5-6400, but the text supplies neither error bars, full-system software-stack overhead measurements, nor direct comparisons against prior DCIM or GPU baselines under identical memory conditions. These omissions make it impossible to verify that the 51.6% DRAM-access and 87.6% weight-update reductions remain dominant once controller and software costs are included.
minor comments (1)
  1. [Abstract] The phrase 'enabling 4.2 ms prefill latency' should be accompanied by the input token length (1024 is mentioned only later) to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of our RCW-CIM design. We address each major comment below and have updated the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract and §5 (Experimental Results)] The group-based approximation inside the nonlinear operator fusion is stated to deliver a 69.17% latency reduction, yet no perplexity, zero-shot accuracy, or accuracy-vs-baseline tables are supplied for Llama2-7B or any other model. Because the approximation is load-bearing for the headline latency claim, the absence of quantified error metrics prevents assessment of whether the reported gains remain usable.

    Authors: We agree that accuracy metrics are essential to substantiate the group-based approximation. The original manuscript prioritized hardware latency and energy results, but this leaves the fidelity of the approximation unverified. In the revised manuscript we have added a dedicated accuracy subsection in §5 that reports perplexity on WikiText-2 and zero-shot accuracies on standard benchmarks for INT4 Llama2-7B. The group-based approximation increases perplexity by <0.3 points and changes zero-shot accuracy by at most 0.4%, confirming that the 69.17% latency reduction is obtained with negligible model degradation. revision: yes

  2. Referee: [§5 and Table 3 (or equivalent results table)] The fabricated chip is reported to achieve 3.28 TOPS, 42.3 TOPS/W, 4.2 ms prefill, and 26.87 tokens/s on INT4 Llama2-7B with dual DDR5-6400, but the text supplies neither error bars, full-system software-stack overhead measurements, nor direct comparisons against prior DCIM or GPU baselines under identical memory conditions. These omissions make it impossible to verify that the 51.6% DRAM-access and 87.6% weight-update reductions remain dominant once controller and software costs are included.

    Authors: The quoted figures are direct silicon measurements at 100 MHz. We have revised Table 3 to include error bars obtained from repeated chip runs. A new comparison table has been added that places RCW-CIM against recent DCIM accelerators and an NVIDIA A100 GPU under comparable INT4 precision and memory-bandwidth settings, showing that the reported DRAM-access and weight-update reductions remain the dominant contributors to the measured latency. Full-system software-stack overhead beyond the on-chip controller and DMA was outside the scope of the hardware prototype; we have clarified the measurement boundary in the text and note that higher-level runtime costs are orthogonal to the architectural claims. revision: partial

Circularity Check

0 steps flagged

No circularity: architecture proposal and empirical results are independent

full rationale

The paper proposes RCW architecture, nonlinear operator fusion with group-based approximation, and WS-OCS dataflow. All latency, TOPS, and energy claims are presented as outcomes of TSMC 22 nm fabrication at 100 MHz plus workload-specific simulations on Llama2-7B with dual DDR5-6400. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the derivation chain consists of new hardware mechanisms whose performance is measured rather than defined into existence. This is the normal case for an architecture paper whose central results are externally falsifiable via chip measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard domain assumptions about LLM inference workloads, memory access patterns, and 22 nm CMOS characteristics; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Standard assumptions in digital CMOS circuit timing, memory-access energy, and typical LLM prefill/decode workload patterns hold for the reported measurements.
    Performance numbers depend on these background properties of the target technology and application.

pith-pipeline@v0.9.0 · 5576 in / 1451 out tokens · 92777 ms · 2026-05-07T10:15:09.695689+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

8 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Improving language understanding by generative pre-training

    A. Radford et al., "Improving language understanding by generative pre-training," 2018.

  2. [2]

    Language Models are Few-Shot Learners

    T. B. Brown et al., "Language models are few-shot learners," 2020. [Online]. Available: https://arxiv.org/abs/2005.14165

  3. [3]

    16.4 An 89TOPS/W and 16.3TOPS/mm² all-digital SRAM-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications

    Y.-D. Chih et al., "16.4 An 89TOPS/W and 16.3TOPS/mm² all-digital SRAM-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 252–254.

  4. [4]

    CELLA: A 28nm compute-memory co-optimized real-time digital CIM-based edge LLM accelerator with 1.78ms-response in prefill and 31.32 token/s in decoding

    Z. Wu et al., "CELLA: A 28nm compute-memory co-optimized real-time digital CIM-based edge LLM accelerator with 1.78ms-response in prefill and 31.32 token/s in decoding," in 2025 Symposium on VLSI Technology and Circuits, 2025, pp. 1–3.

  5. [5]

    LLM-CIM: A 28nm 126.7TOPS/W input-LUT-based digital CIM macro with reconfigurable matrix multiplication and nonlinear operation modes for LLMs

    Y. Wang et al., "LLM-CIM: A 28nm 126.7TOPS/W input-LUT-based digital CIM macro with reconfigurable matrix multiplication and nonlinear operation modes for LLMs," in 2025 Symposium on VLSI Technology and Circuits, 2025, pp. 1–3.

  6. [6]

    An efficient data reuse with tile-based adaptive stationary for transformer accelerators

    T.-J. Li and T.-S. Chang, "An efficient data reuse with tile-based adaptive stationary for transformer accelerators," in 2025 IEEE International Symposium on Circuits and Systems (ISCAS), 2025, pp. 1–5.

  7. [7]

    SystolicAttention: Fusing FlashAttention within a single systolic array

    J. Lin et al., "SystolicAttention: Fusing FlashAttention within a single systolic array," arXiv preprint arXiv:2507.11331, 2025.

  8. [8]

    Root mean square layer normalization

    B. Zhang and R. Sennrich, "Root mean square layer normalization," in Advances in Neural Information Processing Systems (NeurIPS), 2019.