Recognition: no theorem link
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
Pith reviewed 2026-05-12 03:40 UTC · model grok-4.3
The pith
A 55nm ReRAM-on-logic stacked accelerator runs large language models at 14 to 136 tokens per second using outlier-free quantization and parallel speculative decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that bumping-based face-to-face ReRAM-on-logic stacking, paired with a local rotation unit for outlier-free low-bit quantization, a stacking-aware processing-near-memory (PNM) architecture co-designed with blockwise vector quantization, and an adaptive parallel speculative decoding scheme with an out-of-order scheduler, enables the 55nm chip to reach 14.08-to-135.69 tokens per second and a 4.46-to-7.17x speedup over vanilla speculative decoding.
What carries the argument
Bumping-based face-to-face ReRAM-on-logic stacking integrated with local rotation for outlier-free quantization and adaptive parallel speculative decoding with out-of-order scheduling.
If this is right
- LLM inference can run at speeds suitable for interactive applications on a single compact chip.
- Weight storage and computation overheads drop through co-optimized block-clustered compression matched to the stacked memory layout.
- Parallel speculative decoding with out-of-order scheduling raises hardware resource and bandwidth utilization beyond sequential methods.
- Outlier-free low-bit quantization becomes practical without retraining, lowering memory bandwidth demands.
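The rotation idea behind the last point can be sketched in a few lines: multiplying activations (or weights) by an orthogonal matrix such as a normalized Hadamard spreads outlier energy across channels, so a low-bit uniform quantizer has to cover a much smaller dynamic range. A minimal numpy sketch of the principle, not the chip's local rotation unit:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal rows/columns

def quantize(x, bits=4):
    # Symmetric per-tensor uniform quantization.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 64))
x[:, 3] *= 50.0                      # plant a channel outlier

H = hadamard(64)
plain = np.abs(x - quantize(x)).mean()          # quantize directly
rot = np.abs(x @ H - quantize(x @ H)).mean()    # rotate first, then quantize
assert rot < plain  # rotation shrinks max |value|, so the 4-bit grid is finer
```

Because the rotation is orthogonal it can be folded into adjacent weight matrices, which is the usual argument (in QuaRot-style methods) for why this needs no retraining.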
Where Pith is reading between the lines
- If the stacking process proves manufacturable, similar hybrid memory-logic integration could extend to other memory technologies for denser AI accelerators.
- The quantization approach might reduce the need for model-specific fine-tuning when deploying compressed LLMs on hardware, lowering overall development cost.
- Adaptive parallel decoding techniques could transfer to software frameworks to improve efficiency on existing GPUs or CPUs.
- Thermal and yield challenges in stacked designs may require new cooling or redundancy methods before widespread adoption.
Load-bearing premise
The design assumes the bumping-based face-to-face ReRAM-on-logic stacking can be fabricated at scale without yield or thermal problems and that local rotation plus block-clustered quantization preserves LLM accuracy across models without additional fine-tuning.
What would settle it
Fabricate multiple chips at scale, run standard LLM benchmarks such as Llama or GPT variants, and check whether sustained token throughput stays above 14 tokens per second while perplexity or accuracy remains within 1 percent of the full-precision baseline.
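The acceptance criterion above is straightforward to operationalize: perplexity is the exponential of the mean negative log-likelihood per token, and the quantized model passes if its perplexity is within 1 percent of the baseline's. A sketch assuming per-token log-probabilities have already been extracted from each model:

```python
import math

def perplexity(token_logprobs):
    # exp of the average negative log-likelihood over evaluation tokens
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def within_tolerance(baseline_lps, quantized_lps, tol=0.01):
    base, quant = perplexity(baseline_lps), perplexity(quantized_lps)
    return (quant - base) / base <= tol  # quantized ppl at most 1% worse

# Hypothetical log-probabilities for the same evaluation tokens.
base = [-2.0, -1.5, -2.2, -1.8]
quant = [-2.005, -1.505, -2.205, -1.805]
print(within_tolerance(base, quant))  # -> True
```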
Figures
Original abstract
This work presents a 55nm speculative decoding-based LLM accelerator with bumping-based face-to-face ReRAM-on-logic stacking technology. It features a local rotation unit for outlier-free low-bit quantization, a stacking-aware PNM architecture co-designed with blockwise vector quantization to reduce weight EMA overheads, and an adaptive parallel speculative decoding scheme with an out-of-order scheduler for high resource and bandwidth utilization. Our chip achieves 14.08-to-135.69token/s and 4.46-to-7.17x speedup over vanilla speculative decoding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the design, implementation, and post-silicon evaluation of a 55 nm LLM accelerator that employs bumping-based face-to-face ReRAM-on-logic stacking. It introduces a local rotation unit to enable outlier-free low-bit quantization, a stacking-aware PNM architecture co-designed with block-clustered vector quantization to reduce EMA overhead, and an adaptive parallel speculative decoding scheme with out-of-order scheduling. Measured results on the fabricated prototype report token generation rates of 14.08–135.69 tokens/s and speedups of 4.46–7.17× relative to vanilla speculative decoding.
Significance. If the reported silicon measurements are reproducible under the stated conditions, the work constitutes a meaningful contribution to hardware acceleration for LLMs by providing concrete evidence of a functional ReRAM-logic stacked prototype that co-optimizes emerging memory technology with quantization and decoding algorithms. The post-silicon validation and per-model accuracy checks strengthen the feasibility argument for such hybrid architectures and offer practical data on throughput and resource utilization that could inform subsequent designs.
major comments (1)
- The central performance claims rest on post-silicon measurements of token/s rates and speedups; however, the manuscript should explicitly state the LLM models (e.g., specific Llama or OPT variants), input/output sequence lengths, batch sizes, and the exact implementation of the vanilla speculative decoding baseline (on-chip or external reference) in the experimental results section to allow independent assessment of the 4.46–7.17× range.
minor comments (3)
- Abstract: the range of token rates is given without reference to the models or sequence lengths that produce the extrema; adding one sentence would improve immediate context.
- Notation: ensure that 'PNM', 'EMA', and 'out-of-order scheduler' are defined at first use in the main text and that all figure captions are self-contained.
- References: consider adding citations to recent speculative decoding papers (e.g., on adaptive or parallel variants) to better situate the adaptive scheme.
Simulated Author's Rebuttal
We thank the referee for the constructive comment and the recommendation of minor revision. We address the point raised below.
Point-by-point responses
- Referee: The central performance claims rest on post-silicon measurements of token/s rates and speedups; however, the manuscript should explicitly state the LLM models (e.g., specific Llama or OPT variants), input/output sequence lengths, batch sizes, and the exact implementation of the vanilla speculative decoding baseline (on-chip or external reference) in the experimental results section to allow independent assessment of the 4.46–7.17× range.
Authors: We agree with the referee that these details are necessary for a complete understanding and reproducibility of the results. The current manuscript provides the performance ranges but does not break down the specific configurations for each measurement. In the revised version, we will update the experimental results section to explicitly include the LLM models used, the input and output sequence lengths, batch sizes, and a clear description of how the vanilla speculative decoding baseline was implemented (as an on-chip reference without the proposed optimizations).
Revision: yes
Circularity Check
No significant circularity; claims rest on silicon measurements
full rationale
The paper reports measured token/s rates and speedups from a fabricated 55 nm prototype chip using ReRAM-on-logic stacking, local rotation quantization, block-clustered compression, and adaptive speculative decoding. No equations, fitted parameters, or derivation steps are presented that reduce the central performance claims to self-referential inputs or prior self-citations. The results are empirical hardware benchmarks, not predictions derived from models defined by the outcome itself. Self-citations, if present, are not load-bearing for the reported metrics.
Axiom & Free-Parameter Ledger
free parameters (2)
- Quantization bit-width and rotation parameters
- Block size and clustering factors for weight compression
invented entities (2)
- Local rotation unit (no independent evidence)
- Stacking-aware PNM architecture (no independent evidence)
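The block-size and clustering free parameters in the ledger come from blockwise vector quantization: weight blocks are clustered against a small shared codebook, so only codeword indices (plus the codebook) need to be stored. A toy k-means illustration, not the chip's actual codec:

```python
import numpy as np

def vq_compress(W, block=4, codebook_size=16, iters=20, seed=0):
    """Cluster length-`block` weight vectors into a small codebook (toy k-means)."""
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, block)
    centers = vecs[rng.choice(len(vecs), codebook_size, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest codeword
        d = ((vecs[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        for c in range(codebook_size):
            if (idx == c).any():
                centers[c] = vecs[idx == c].mean(0)
    return centers, idx  # store: codebook + per-block indices

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64)).astype(np.float32)
centers, idx = vq_compress(W)
W_hat = centers[idx].reshape(W.shape)
# indices cost log2(16) = 4 bits per block of 4 weights -> ~1 bit/weight, plus codebook
err = np.abs(W - W_hat).mean()
```

The block size and codebook size trade reconstruction error against compression ratio, which is presumably what the stacking-aware co-design tunes against the ReRAM layout.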
Reference graph
Works this paper leans on
- [1] A. Dubey et al., “The Llama 3 Herd of Models,” arXiv:2407.21783, 2024. https://arxiv.org/abs/2407.21783
- [2] H. Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” arXiv:2307.09288, 2023. https://arxiv.org/abs/2307.09288
- [3] S. Kim et al., “C-Transformer: A 2.6-18.1μJ/Token Homogeneous DNN-Transformer/Spiking-Transformer Processor with Big-Little Network and Implicit Weight Generation for Large Language Models,” ISSCC, pp. 368-370, 2024. https://doi.org/10.1109/ISSCC49657.2024.10454330
- [4] Y. Qin et al., “An 88.36TOPS/W Bit-Level-Weight-Compressed Large-Language-Model Accelerator with Cluster-Aligned INT-FP-GEMM and Bi-Dimensional Workflow Reformulation,” ISSCC, pp. 420-422, 2025. https://doi.org/10.1109/ISSCC49661.2025.10904774
- [5] S. Kim et al., “Slim-Llama: A 4.69mW Large-Language-Model Processor with Binary/Ternary Weights for Billion-Parameter Llama Model,” ISSCC, pp. 421-423, 2025. https://doi.org/10.1109/ISSCC49661.2025.10904761
- [6] Y. Wang et al., “LLM-CIM: A 28nm 126.7TOPS/W Input-LUT-Based Digital CIM Macro with Reconfigurable Matrix Multiplication and Nonlinear Operation Modes for LLMs,” IEEE Symp. VLSI Circuits, 2025. https://doi.org/10.23919/VLSITechnologyandCir65189.2025.11074939
- [7] Z. Wu et al., “CELLA: A 28nm Compute-Memory Co-Optimized Real-Time Digital CIM-Based Edge LLM Accelerator with 1.78ms-Response in Prefill and 31.32Token/s in Decoding,” IEEE Symp. VLSI Circuits, 2025. https://doi.org/10.23919/VLSITechnologyandCir65189.2025.11075101
- [8] Y. Leviathan et al., “Fast Inference from Transformers via Speculative Decoding,” ICML, pp. 19274-19286, 2023. https://arxiv.org/abs/2211.17192
- [9] Y. Li et al., “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty,” ICML, pp. 28935-28948, 2024. https://arxiv.org/abs/2401.15077
- [10] E. Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,” ICLR, 2023. https://arxiv.org/abs/2210.17323
- [11] S. Ashkboos et al., “QuaRot: Outlier-Free 4-bit Inference in Rotated LLMs,” NeurIPS, pp. 100213-100240, 2024. https://arxiv.org/abs/2404.00456
- [12] Z. Liu et al., “SpinQuant: LLM Quantization with Learned Rotations,” ICLR, 2025. https://arxiv.org/abs/2405.16406
- [13] X. Huang et al., “RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization,” Findings of EMNLP, pp. 7563-7576, 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.444
- [15] T. Liu et al., “PEARL: Parallel Speculative Decoding with Adaptive Draft Length,” ICLR, 2025. https://arxiv.org/pdf/2408.11850
- [16] N. Sloane, “A Library of Hadamard Matrices,” 2024. http://neilsloane.com/hadamard/
- [17] M. van Baalen et al., “GPTVQ: The Blessing of Dimensionality for LLM Quantization,” arXiv:2402.15319, 2024. https://arxiv.org/abs/2402.15319
- [18] Y. Liu et al., “VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models,” EMNLP, pp. 8181-8196, 2024. https://doi.org/10.18653/v1/2024.emnlp-main.467
- [19] S. Li et al., “MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization,” ACM ASPLOS, pp. 731-745, 2025. https://arxiv.org/abs/2412.10261
- [20] G. Fang et al., “MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models,” NeurIPS, pp. 7736-7758, 2024. https://arxiv.org/abs/2409.17481
- [21] P. Dong et al., “A 28nm 0.22μJ/Token Memory-Compute-Intensity-Aware CNN-Transformer Accelerator with Hybrid-Attention-Based Layer-Fusion and Cascaded Pruning for Semantic-Segmentation,” ISSCC, pp. 408-409, 2025. https://doi.org/10.1109/ISSCC49661.2025.10904499
- [22] M. Gao et al., “Tetris: Scalable and Efficient Neural Network Acceleration with 3D Memory,” ACM ASPLOS, pp. 751-764, 2017. https://doi.org/10.1145/3093337.3037702