arxiv: 2605.09375 · v1 · submitted 2026-05-10 · 💻 cs.AR

Recognition: no theorem link

31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

Pingcheng Dong , Yonghao Tan , Xuejiao Liu , Peng Luo , Yu Liu , Di Pang , Songchen Ma , Xijie Huang

show 8 more authors

Shih-Yang Liu Dong Zhang Zhichao Lu Luhong Liang Chi-Ying Tsui Fengbin Tu Liang Zhao Kwang-Ting Cheng

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:40 UTC · model grok-4.3

classification 💻 cs.AR

keywords ReRAM-on-logic stackingLLM acceleratorspeculative decodingoutlier-free quantizationblock-clustered compressionparallel decodinghardware accelerator55nm CMOS

0 comments

The pith

A 55nm ReRAM-on-logic stacked accelerator runs large language models at 14 to 136 tokens per second using outlier-free quantization and parallel speculative decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a hardware chip that stacks resistive memory directly on logic circuits to accelerate inference for large language models. It combines a local rotation unit that enables low-bit quantization without outliers, blockwise vector quantization tailored to the stacking layout, and an adaptive parallel speculative decoding scheme with out-of-order scheduling. These elements together aim to deliver high throughput while keeping model accuracy intact. A sympathetic reader would care because current LLM inference demands enormous compute and energy; hardware that embeds memory closer to processing could make real-time, efficient AI feasible on edge devices if the approach scales.

Core claim

The central claim is that bumping-based face-to-face ReRAM-on-logic stacking, paired with a local rotation unit for outlier-free low-bit quantization, a stacking-aware PNM architecture co-designed with blockwise vector quantization, and an adaptive parallel speculative decoding scheme with out-of-order scheduler, enables the 55nm chip to reach 14.08-to-135.69 tokens per second and 4.46-to-7.17x speedup over vanilla speculative decoding.

What carries the argument

Bumping-based face-to-face ReRAM-on-logic stacking integrated with local rotation for outlier-free quantization and adaptive parallel speculative decoding with out-of-order scheduling.

If this is right

LLM inference can run at speeds suitable for interactive applications on a single compact chip.
Weight storage and computation overheads drop through co-optimized block-clustered compression matched to the stacked memory layout.
Parallel speculative decoding with out-of-order scheduling raises hardware resource and bandwidth utilization beyond sequential methods.
Outlier-free low-bit quantization becomes practical without retraining, lowering memory bandwidth demands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the stacking process proves manufacturable, similar hybrid memory-logic integration could extend to other memory technologies for denser AI accelerators.
The quantization approach might reduce the need for model-specific fine-tuning when deploying compressed LLMs on hardware, lowering overall development cost.
Adaptive parallel decoding techniques could transfer to software frameworks to improve efficiency on existing GPUs or CPUs.
Thermal and yield challenges in stacked designs may require new cooling or redundancy methods before widespread adoption.

Load-bearing premise

The design assumes the bumping-based face-to-face ReRAM-on-logic stacking can be fabricated at scale without yield or thermal problems and that local rotation plus block-clustered quantization preserves LLM accuracy across models without additional fine-tuning.

What would settle it

Fabricate multiple chips at scale, run standard LLM benchmarks such as Llama or GPT variants, and check whether sustained token throughput stays above 14 tokens per second while perplexity or accuracy remains within 1 percent of the full-precision baseline.

Figures

Figures reproduced from arXiv: 2605.09375 by Chi-Ying Tsui, Di Pang, Dong Zhang, Fengbin Tu, Kwang-Ting Cheng, Liang Zhao, Luhong Liang, Peng Luo, Pingcheng Dong, Shih-Yang Liu, Songchen Ma, Xijie Huang, Xuejiao Liu, Yonghao Tan, Yu Liu, Zhichao Lu.

**Figure 31.1.** Figure 31.1: 2 shows the overall architecture of the proposed LLM accelerator with 4 ReRAM [PITH_FULL_IMAGE:figures/full_fig_p001_31_1.png] view at source ↗

read the original abstract

This work presents a 55nm speculative decoding-based LLM accelerator with bumping-based face-to-face ReRAM-on-logic stacking technology. It features a local rotation unit for outlier-free low-bit quantization, a stacking-aware PNM architecture co-designed with blockwise vector quantization to reduce weight EMA overheads, and an adaptive parallel speculative decoding scheme with an out-of-order scheduler for high resource and bandwidth utilization. Our chip achieves 14.08-to-135.69token/s and 4.46-to-7.17x speedup over vanilla speculative decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They built and measured a 55nm ReRAM-stacked chip that hits 14-135 tokens/s on LLM inference with adaptive speculative decoding and outlier-free quantization.

read the letter

The core takeaway is that this is a working silicon prototype, not a simulation. The team taped out a 55nm chip using bumping-based face-to-face ReRAM-on-logic stacking, added a local rotation unit for quantization, applied block-clustered vector quantization, and ran an out-of-order adaptive parallel speculative decoder. The measured results show 4.46-7.17x speedup over basic speculative decoding and token rates in that 14-135 range. Real post-silicon numbers on a fabricated part carry weight here because they bypass the usual simulation-to-silicon gap worries.

Referee Report

1 major / 3 minor

Summary. The manuscript presents the design, implementation, and post-silicon evaluation of a 55 nm LLM accelerator that employs bumping-based face-to-face ReRAM-on-logic stacking. It introduces a local rotation unit to enable outlier-free low-bit quantization, a stacking-aware PNM architecture co-designed with block-clustered vector quantization to reduce EMA overhead, and an adaptive parallel speculative decoding scheme with out-of-order scheduling. Measured results on the fabricated prototype report token generation rates of 14.08–135.69 tokens/s and speedups of 4.46–7.17× relative to vanilla speculative decoding.

Significance. If the reported silicon measurements are reproducible under the stated conditions, the work constitutes a meaningful contribution to hardware acceleration for LLMs by providing concrete evidence of a functional ReRAM-logic stacked prototype that co-optimizes emerging memory technology with quantization and decoding algorithms. The post-silicon validation and per-model accuracy checks strengthen the feasibility argument for such hybrid architectures and offer practical data on throughput and resource utilization that could inform subsequent designs.

major comments (1)

The central performance claims rest on post-silicon measurements of token/s rates and speedups; however, the manuscript should explicitly state the LLM models (e.g., specific Llama or OPT variants), input/output sequence lengths, batch sizes, and the exact implementation of the vanilla speculative decoding baseline (on-chip or external reference) in the experimental results section to allow independent assessment of the 4.46–7.17× range.

minor comments (3)

Abstract: the range of token rates is given without reference to the models or sequence lengths that produce the extrema; adding one sentence would improve immediate context.
Notation: ensure that 'PNM', 'EMA', and 'out-of-order scheduler' are defined at first use in the main text and that all figure captions are self-contained.
References: consider adding citations to recent speculative decoding papers (e.g., on adaptive or parallel variants) to better situate the adaptive scheme.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment and the recommendation of minor revision. We address the point raised below.

read point-by-point responses

Referee: The central performance claims rest on post-silicon measurements of token/s rates and speedups; however, the manuscript should explicitly state the LLM models (e.g., specific Llama or OPT variants), input/output sequence lengths, batch sizes, and the exact implementation of the vanilla speculative decoding baseline (on-chip or external reference) in the experimental results section to allow independent assessment of the 4.46–7.17× range.

Authors: We agree with the referee that these details are necessary for a complete understanding and reproducibility of the results. The current manuscript provides the performance ranges but does not break down the specific configurations for each measurement. In the revised version, we will update the experimental results section to explicitly include the LLM models used, the input and output sequence lengths, batch sizes, and a clear description of how the vanilla speculative decoding baseline was implemented (as an on-chip reference without the proposed optimizations). revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on silicon measurements

full rationale

The paper reports measured token/s rates and speedups from a fabricated 55 nm prototype chip using ReRAM-on-logic stacking, local rotation quantization, block-clustered compression, and adaptive speculative decoding. No equations, fitted parameters, or derivation steps are presented that reduce the central performance claims to self-referential inputs or prior self-citations. The results are empirical hardware benchmarks, not predictions derived from models defined by the outcome itself. Self-citations, if present, are not load-bearing for the reported metrics.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 2 invented entities

With only the abstract available, specific free parameters such as exact quantization bit widths, block sizes for clustering, and scheduler thresholds cannot be enumerated; the design implicitly relies on engineering choices for stacking alignment and compression ratios that are tuned to achieve the stated metrics.

free parameters (2)

Quantization bit-width and rotation parameters
Chosen to achieve outlier-free low-bit representation while maintaining accuracy.
Block size and clustering factors for weight compression
Tuned to reduce EMA overheads in the stacking-aware PNM architecture.

invented entities (2)

Local rotation unit no independent evidence
purpose: Enable outlier-free low-bit quantization of weights
New hardware component introduced to handle outliers before quantization.
Stacking-aware PNM architecture no independent evidence
purpose: Co-design with blockwise vector quantization to cut weight storage overhead
Custom architecture tailored to the ReRAM-on-logic stack.

pith-pipeline@v0.9.0 · 5470 in / 1424 out tokens · 59248 ms · 2026-05-12T03:40:12.032151+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 4 internal anchors

[1]

The Llama 3 Herd of Models

A. Dubey et al., “The Llama 3 Herd of Models,” arXiv: 2407.21783, 2024. https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” arXiv: 2307.09288, 2023. https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

C-Transformer: A 2.6-18.1μ J/token Homogeneous DNN- Transformer/Spiking-Transformer Processor with Big-Little Network and Implicit Weight Generation for Large Language Models,

S. Kim et al., “C-Transformer: A 2.6-18.1μ J/token Homogeneous DNN- Transformer/Spiking-Transformer Processor with Big-Little Network and Implicit Weight Generation for Large Language Models,” ISSCC, pp. 368-370, 2024. https://doi.org/10.1109/ISSCC49657.2024.10454330

work page doi:10.1109/isscc49657.2024.10454330 2024
[4]

Chen, Phil C

Y. Qin et al., “An 88.36TOPS/W Bit-Level-Weight-Compressed Large-Language-Model Accelerator with Cluster-Aligned INT-FP-GEMM and Bi-Dimensional Workﬂow Reformulation,” ISSCC, pp. 420-422, 2025. https://doi.org/10.1109/ISSCC49661.2025.10904774

work page doi:10.1109/isscc49661.2025.10904774 2025
[5]

Chen, Phil C

S. Kim et al., “Slim-Llama: A 4.69mW Large-Language-Model Processor with Binary/Ternary Weights for Billion-Parameter Llama Model,” ISSCC, pp. 421-423, 2025. https://doi.org/10.1109/ISSCC49661.2025.10904761

work page doi:10.1109/isscc49661.2025.10904761 2025
[6]

LLM-CIM: A 28nm 126.7 TOPS/W Input-LUT-Based Digital CIM Macro with Reconﬁgurable Matrix Multiplication and Nonlinear Operation Modes for LLMs,

Y. Wang et al., “LLM-CIM: A 28nm 126.7 TOPS/W Input-LUT-Based Digital CIM Macro with Reconﬁgurable Matrix Multiplication and Nonlinear Operation Modes for LLMs,” IEEE Symp. VLSI Circuits, 2025. https://doi.org/10.23919/VLSITechnologyandCir65189.2025.11074939

work page doi:10.23919/vlsitechnologyandcir65189.2025.11074939 2025
[7]

LLM-CIM: A 28nm 126.7 TOPS/W Input-LUT-Based Digital CIM Macro with Reconﬁgurable Matrix Multiplication and Nonlinear Operation Modes for LLMs,

Z. Wu et al., “CELLA: A 28nm Compute-Memory Co-Optimized Real-Time Digital CIM- Based Edge LLM Accelerator with 1.78 ms-Response in Preﬁll and 31.32 Token/s in Decoding,” IEEE Symp. VLSI Circuits, 2025. https://doi.org/10.23919/VLSITechnologyandCir65189.2025.11075101

work page doi:10.23919/vlsitechnologyandcir65189.2025.11075101 2025
[8]

Fast inference from transformers via speculative decoding, 2023.URL https://arxiv

Y. Leviathan et al., “Fast Inference from Transformers via Speculative Decoding,” ICML, pp. 19274-19286, 2023. https://arxiv.org/abs/2211.17192

work page arXiv 2023
[9]

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

T. Li et al., “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty,” ICML, pp. 28935-28948, 2024. https://arxiv.org/abs/2401.15077

work page internal anchor Pith review arXiv 2024
[10]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

E. Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre- trained Transformers,” ICLR, 2023. https://arxiv.org/abs/2210.17323

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Quarot: Outlier-free 4-bit inference in rotated llms,

S. Ashkboos et al., “QuaRot: Outlier-Free 4-bit Inference in Rotated LLMs,” NeurIPS , pp.100213-100240, 2024. https://arxiv.org/abs/2404.00456

work page arXiv 2024
[12]

Spinquant–llm quantization with learned rotations,

Z. Liu et al., “SpinQuant: LLM Quantization with Learned Rotations,” ICLR, 2025. https://arxiv.org/abs/2405.16406

work page arXiv 2025
[13]

RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight- Activation Quantization,

X. Huang et al., “RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight- Activation Quantization,” Empirical Methods in Natural Language Proc., pp. 7563-7576,

work page
[14]

https://doi.org/10.18653/v1/2024.ﬁndings-emnlp.444

work page doi:10.18653/v1/2024 2024
[15]

Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, and Xiaoyan Sun

T. Liu et al., “PEARL: Parallel Speculative Decoding with Adaptive Draft Length,” ICLR, 2025. https://arxiv.org/pdf/2408.11850

work page arXiv 2025
[16]

A Library of Hadamard Matrices,

N. Sloane, “A Library of Hadamard Matrices,” 2024. http://neilsloane.com/hadamard/

work page 2024
[17]

GPTVQ: The Blessing of Dimensionality for LLM Quantization,

V. B. Mart et al., “GPTVQ: The Blessing of Dimensionality for LLM Quantization,” arXiv: 2402.15319, 2024. https://arxiv.org/abs/2402.15319

work page arXiv 2024
[18]

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models,

Y. Liu et al., “VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models,” ACL, pp. 8181-8196, 2024. https://doi.org/10.18653/v1/2024.emnlp-main.467

work page doi:10.18653/v1/2024.emnlp-main.467 2024
[19]

MVQ: Towards Efﬁcient DNN Compression and Acceleration with Masked Vector Quantization,

S. Li et al., “MVQ: Towards Efﬁcient DNN Compression and Acceleration with Masked Vector Quantization,” ACM ASPLOS, pp. 731-745, 2025. https://arxiv.org/abs/2412.10261

work page arXiv 2025
[20]

MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models,

F. Gong et al., “MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models,” NeurIPS, pp.7736-7758, 2024. https://arxiv.org/abs/2409.17481

work page arXiv 2024
[21]

Chen, Phil C

P. Dong et al., “A 28nm 0.22 μJ/Token Memory-Compute-Intensity-Aware CNN- Transformer Accelerator with Hybrid-Attention-Based Layer-Fusion and Cascaded Pruning for Semantic-Segmentation,” ISSCC, pp. 408-409, 2025. https://doi.org/10.1109/ISSCC49661.2025.10904499

work page doi:10.1109/isscc49661.2025.10904499 2025
[22]

Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy,

M. Gao et al., “Tetris: Scalable and Efﬁcient Neural Network Acceleration with 3D Memory,” ACM ASPLOS, pp. 751-764, 2017. https://doi.org/10.1145/3093337.3037702 Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on May 10,2026 at 06:49:26 UTC from IEEE Xplore. Restrictions apply. Figure 31.1.3: Local rotation u...

work page doi:10.1145/3093337.3037702 2017