pith. machine review for the scientific record.

arxiv: 2605.08615 · v1 · submitted 2026-05-09 · 💻 cs.AR

Recognition: no theorem link

DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:49 UTC · model grok-4.3

classification 💻 cs.AR
keywords: edge processor · DeepSeek inference · energy efficiency · Merkle tree pruning · approximate multiplication · posit arithmetic · hardware accelerator

The pith

DSPE combines MerkleTree pruning, multi-stage boothing lookup, and a new adaptive posit format to reach 109.4 TFLOPS/W for DeepSeek inference on edge hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the DeepSeek Processing Element, an architecture built for running the DeepSeek model on devices with tight power limits. It proposes three specific techniques to cut redundant computation and optimize arithmetic while keeping the model usable. The design is implemented in TSMC 28nm CMOS and reports 109.4 TFLOPS/W energy efficiency relative to prior designs. If the techniques hold up, this would let large-model inference move from data centers to everyday edge platforms without requiring large batteries or cooling.

Core claim

The DSPE architecture integrates the MerkleTree-based Incremental Pruning Scheme for secure reduction of redundant vectors, the Multi-Stage Boothing Lookup Method for bit-flip-aware approximate multiplication, and the Dynamic Adaptive Posit Processing Mechanism that introduces a new DA-Posit data format together with its matching hardware multiplier. When realized in 28nm CMOS, the combined processor delivers 109.4 TFLOPS/W energy efficiency for DeepSeek inference and is presented as a scalable base for edge deployment.
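The abstract names MIPS but does not spell out its mechanics. As a point of reference, here is a minimal sketch of the general pattern the name suggests: hash each vector into a Merkle tree so repeated vectors can be skipped while a single root still lets a checker confirm that nothing else was dropped. The SHA-256 choice, the function names, and the dedup-by-hash policy are illustrative assumptions, not details taken from the paper.

```python
import hashlib

def leaf_hash(vec: bytes) -> bytes:
    """Hash one serialized vector (a Merkle leaf)."""
    return hashlib.sha256(vec).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise up to a single root."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2:                       # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def prune_redundant(vectors: list[bytes]) -> tuple[list[bytes], bytes]:
    """Keep only the first occurrence of each vector; the Merkle root over
    all original leaves lets a verifier audit what was pruned."""
    seen, kept = set(), []
    for v in vectors:
        h = leaf_hash(v)
        if h not in seen:                        # first occurrence: keep and compute it
            seen.add(h)
            kept.append(v)
    return kept, merkle_root([leaf_hash(v) for v in vectors])
```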

What carries the argument

Three mechanisms carry it: the MerkleTree-based Incremental Pruning Scheme, the Multi-Stage Boothing Lookup Method, and the Dynamic Adaptive Posit Processing Mechanism with its DA-Posit format. Together they prune redundant work, approximate multiplications safely, and adapt numeric precision at runtime to cut energy use.

If this is right

  • DeepSeek inference becomes practical on battery-powered or thermally limited edge devices.
  • The architecture supplies a concrete template for scaling similar large models to edge hardware.
  • Secure incremental pruning reduces vector counts while preserving model integrity during execution.
  • Approximate multiplication and adaptive precision together lower the dominant energy cost of matrix operations (a Booth-recoding sketch follows this list).
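The approximate-multiplication bullet above refers to what the referee reads as Booth-style arithmetic (see the minor comment below). For readers who want the baseline, this is a textbook radix-4 Booth recoding in Python; the paper's multi-stage lookup and bit-flip-aware approximations are not described in the abstract and are not modeled here.

```python
def booth_radix4_digits(y: int, bits: int = 8) -> list[int]:
    """Recode an unsigned multiplier y into radix-4 Booth digits in {-2,...,2}.
    Each digit is derived from an overlapping triplet of bits of y."""
    assert bits % 2 == 0 and 0 <= y < (1 << bits)
    digits, prev = [], 0                         # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0, b1 = (y >> i) & 1, (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)       # one signed digit per triplet
        prev = b1
    digits.append(prev)                          # carry digit for unsigned inputs
    return digits

def booth_multiply(x: int, y: int, bits: int = 8) -> int:
    """Shift-add over Booth digits: zero digits cost no partial product,
    which is the property lookup-based Booth multipliers exploit."""
    return sum((d * x) << (2 * i)                # d * x * 4**i
               for i, d in enumerate(booth_radix4_digits(y, bits)))

assert booth_multiply(37, 201) == 37 * 201       # quick self-check
```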

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same combination of tree-structured pruning and low-precision adaptive arithmetic could be tested on other large language models to check broader applicability.
  • Hardware support for the new DA-Posit format might encourage adoption of similar variable-precision formats in future low-power accelerators.
  • If the techniques prove robust, they could be combined with existing quantization flows to further reduce memory traffic on edge platforms.

Load-bearing premise

The three techniques can be combined in real silicon to produce the reported energy-efficiency gains without unacceptable drops in inference accuracy or prohibitive hardware overhead.

What would settle it

Fabricated-chip measurements showing energy efficiency well below 109.4 TFLOPS/W or accuracy loss exceeding a few percent on standard DeepSeek benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.08615.

Figure 1
Figure 1: Diagram of DeepSeek-V3. In the development of large-scale language models, as parameter redundancy becomes increasingly significant, both academia and industry have proposed various optimization directions to improve inference efficiency [10]. On one hand, algorithm-level approaches include model distillation, low-rank matrix factorization, sparsification, and MoE, which enhance inference performance by r… view at source ↗
Figure 3
Figure 3: Challenges of DeepSeek Inference on Edge Devices. view at source ↗
Figure 4
Figure 4: The Architecture of the DeepSeek Processing Element. view at source ↗
Figure 5
Figure 5: MerkleTree-based Incremental Pruning Scheme. view at source ↗
Figure 6
Figure 6: Multi-Stage Boothing Lookup Method. model for redundant classification, and a branched boothing lookup pipeline, and is used to differentiate vector redundancy caused by spatiotemporal locality. First, MBLM takes 8 multiplication operands at a time into the invalid computation detector. If the weight or activation values are very small, their contribution can be ignored. Based on this observation, the mo… view at source ↗ (a sketch of this detector follows the figure list)
Figure 8
Figure 8: The layout of the proposed processor. and energy consumption have limited edge-side deployment. This paper proposes the DeepSeek Processing Element (DSPE)—a high-efficiency edge inference processor architecture designed for the DeepSeek model. This paper introduces three architectural innovations: First, the MerkleTree-based Incremental Pruning Scheme (MIPS), which reduces redundant vector computation and… view at source ↗
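The Figure 6 caption sketches an invalid computation detector: 8 multiplication operand pairs enter at a time, and pairs where the weight or activation is very small are dropped as contributing nothing. A minimal software rendering of that gating step follows; the batch size of 8 comes from the caption, while the eps threshold is an illustrative assumption.

```python
def invalid_computation_detector(weights, activations, eps=2 ** -6):
    """Gate a batch of 8 multiplications: a pair is skipped (its product
    taken as zero) when either operand's magnitude falls below eps."""
    assert len(weights) == len(activations) == 8
    products, skipped = [], 0
    for w, a in zip(weights, activations):
        if abs(w) < eps or abs(a) < eps:         # near-zero operand: skip the multiply
            products.append(0.0)
            skipped += 1
        else:
            products.append(w * a)
    return products, skipped                     # results plus count of saved multiplies
```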
read the original abstract

In recent years, DeepSeek has achieved strong inference performance but remains hard to deploy on energy-constrained edge devices. This paper presents the DeepSeek Processing Element (DSPE), an edge-oriented architecture that alleviates the model's heavy computational and energy demands. DSPE introduces three techniques: the MerkleTree-based Incremental Pruning Scheme (MIPS) for secure redundant-vector reduction, the Multi-Stage Boothing Lookup Method (MBLM) for bit-flip-aware approximate multiplication, and the Dynamic Adaptive Posit Processing Mechanism (DAPPM), which introduces a new DA-Posit format and its corresponding hardware multiplication architecture. Implemented in TSMC 28nm CMOS, DSPE achieves 109.4 TFLOPS/W energy efficiency compared with state-of-the-art designs and offers a scalable foundation for edge deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes the DeepSeek Processing Element (DSPE), an edge-oriented hardware architecture for efficient inference of the DeepSeek model. It introduces three techniques: MerkleTree-based Incremental Pruning Scheme (MIPS) for secure redundant vector reduction, Multi-Stage Boothing Lookup Method (MBLM) for bit-flip-aware approximate multiplication, and Dynamic Adaptive Posit Processing Mechanism (DAPPM) that defines a new DA-Posit format along with its hardware multiplier. The design is implemented in TSMC 28nm CMOS and claims 109.4 TFLOPS/W energy efficiency relative to prior art, offering a scalable foundation for edge deployment of DeepSeek.

Significance. If the headline efficiency figure is substantiated with measured silicon data, accuracy preservation for the target DeepSeek model, and per-technique breakdowns, the work would represent a meaningful contribution to energy-efficient LLM accelerators for edge devices. The integration of tree-based pruning, approximate Booth-style multiplication, and a custom posit format targets relevant bottlenecks in compute and memory energy. No machine-checked proofs, open code, or parameter-free derivations are present to strengthen the assessment.

major comments (3)
  1. [Abstract] Abstract: The central claim that 'DSPE achieves 109.4 TFLOPS/W energy efficiency' is stated without reference to any table, figure, section, or measurement conditions (voltage, frequency, workload, power breakdown, or post-layout vs. silicon results). This directly undermines the headline result and prevents comparison with state-of-the-art designs.
  2. [Evaluation (absent)] No accuracy evaluation section or table: The manuscript provides no quantitative accuracy-vs-baseline results for DeepSeek inference under the combined MIPS + MBLM + DAPPM/DA-Posit scheme, nor any error analysis showing that accuracy loss remains acceptable. This is load-bearing for the claim that the techniques deliver efficiency 'without unacceptable accuracy loss'.
  3. [Implementation] Implementation section: No breakdown of area, power, or latency overheads for the new DA-Posit hardware multiplier or the Merkle-tree pruning logic is supplied, nor any comparison against standard posit or FP16 baselines on the same TSMC 28nm process.
minor comments (1)
  1. [Title] Title and abstract: 'Boothing' appears to be a typographical variant of 'Booth'; clarify whether this refers to a modified Booth multiplication algorithm.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript requires clarifications and additions to strengthen the presentation of results. We will revise the paper accordingly to address each point.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'DSPE achieves 109.4 TFLOPS/W energy efficiency' is stated without reference to any table, figure, section, or measurement conditions (voltage, frequency, workload, power breakdown, or post-layout vs. silicon results). This directly undermines the headline result and prevents comparison with state-of-the-art designs.

    Authors: We agree that the abstract should explicitly reference the supporting data. In the revised version, we will add citations to the specific evaluation tables and figures that report the 109.4 TFLOPS/W figure, along with the associated conditions (voltage, frequency, workload, power breakdown) and clarification on whether results are post-layout estimates or silicon measurements. This will enable direct comparisons. revision: yes

  2. Referee: [Evaluation (absent)] No accuracy evaluation section or table: The manuscript provides no quantitative accuracy-vs-baseline results for DeepSeek inference under the combined MIPS + MBLM + DAPPM/DA-Posit scheme, nor any error analysis showing that accuracy loss remains acceptable. This is load-bearing for the claim that the techniques deliver efficiency 'without unacceptable accuracy loss'.

    Authors: We acknowledge the absence of a dedicated accuracy evaluation and agree it is necessary. We will add a new section with quantitative accuracy results for DeepSeek inference under the combined techniques, including baseline comparisons and error analysis to demonstrate that accuracy degradation remains within acceptable bounds for the target application. revision: yes

  3. Referee: [Implementation] Implementation section: No breakdown of area, power, or latency overheads for the new DA-Posit hardware multiplier or the Merkle-tree pruning logic is supplied, nor any comparison against standard posit or FP16 baselines on the same TSMC 28nm process.

    Authors: We will expand the implementation section to include detailed breakdowns of area, power, and latency overheads for the DA-Posit multiplier and Merkle-tree pruning logic. We will also add direct comparisons to standard posit and FP16 implementations synthesized on the same TSMC 28nm process to quantify the overheads and benefits. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on hardware implementation rather than self-referential derivation

full rationale

The paper describes three novel techniques (MIPS, MBLM, DAPPM with new DA-Posit format) and reports a measured energy efficiency of 109.4 TFLOPS/W from TSMC 28nm CMOS implementation. No mathematical derivation chain, equations, or first-principles predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The efficiency figure is positioned as an empirical hardware result, not a tautological output of internal definitions. No load-bearing self-citations or ansatz smuggling appear in the provided text. The architecture is self-contained via design and measurement, with no evidence of the central claims looping back to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no equations, derivations, or validation data; the DA-Posit format is presented as a new entity without independent evidence or derivation details.

invented entities (1)
  • DA-Posit format (no independent evidence)
    purpose: Dynamic adaptive posit representation for efficient hardware multiplication in the DAPPM mechanism
    Introduced as a new format in the Dynamic Adaptive Posit Processing Mechanism.
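For context on what a "dynamic adaptive" posit would adapt, the decoder below walks the standard posit fields (sign, regime, exponent, fraction). This is the conventional posit<nbits, es> encoding only; the DA-Posit layout is not specified in the abstract and is not reproduced here.

```python
def decode_posit(word: int, nbits: int = 8, es: int = 1) -> float:
    """Decode a standard posit<nbits, es> bit pattern into a float."""
    mask = (1 << nbits) - 1
    word &= mask
    if word == 0:
        return 0.0
    if word == 1 << (nbits - 1):
        return float("nan")                      # NaR (not-a-real) pattern
    sign = (word >> (nbits - 1)) & 1
    if sign:
        word = (-word) & mask                    # negatives decode via 2's complement
    bits = [(word >> i) & 1 for i in range(nbits - 2, -1, -1)]  # sign bit dropped
    run = 1                                      # regime = leading run of equal bits
    while run < len(bits) and bits[run] == bits[0]:
        run += 1
    k = run - 1 if bits[0] == 1 else -run
    rest = bits[run + 1:]                        # skip the regime terminator bit
    exp_bits = rest[:es]
    exp = 0
    for b in exp_bits:
        exp = (exp << 1) | b
    exp <<= es - len(exp_bits)                   # cut-off exponent bits read as zero
    frac = sum(b / 2.0 ** (i + 1) for i, b in enumerate(rest[es:]))
    return (-1.0) ** sign * (1.0 + frac) * 2.0 ** (k * (1 << es) + exp)

assert decode_posit(0x40) == 1.0 and decode_posit(0xC0) == -1.0  # sanity checks
```

An adaptive variant would presumably vary nbits or es per tensor or per layer at runtime; that is the design space the DAPPM hardware multiplier targets.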

pith-pipeline@v0.9.0 · 5660 in / 1302 out tokens · 50916 ms · 2026-05-12T00:49:50.292826+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 2 internal anchors

  1. [1]

X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, and H. Gao. 2024. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)

  2. [2]

J. Choquette. 2022. NVIDIA Hopper GPU: Scaling performance. In 2022 IEEE Hot Chips 34 Symposium (HCS). IEEE Computer Society, 1–46

  3. [3]

J. Choquette and R. Krashinsky. 2022. NVIDIA Hopper GPU: Scaling performance. In Proceedings of the IEEE Hot Chips Symposium (HCS)

  4. [4]

M.-C. Huang, W. W. Mar, S. Kanade, B. Bai, A. Gayatri, K. Khairnar, A. Lai, Y.-H. Hsu, H.-J. Liao, Y. Wang, and T.-Y. J. Chang. 2024. A 3.3 GHz 1024×640 multi-bank single-port SRAM with frequency enhancing techniques and 0.55 V–1.35 V wide voltage range operation in 3nm FinFET for HPC applications. In Proceedings of the IEEE Symposium on VLSI Technology an...

  5. [5]

    N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, and C. Young. 2023. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–14

  6. [6]

N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, and R. Boyle. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 1–12

  7. [8]

B. Keller, R. Venkatesan, S. Dai, et al. 2022. A 17–95.6 TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization for transformers in 5nm. In 2022 IEEE Symposium on VLSI Technology and Circuits. IEEE, 16–17

  8. [9]

S. Lee, K. Kim, S. Oh, et al. 2022. A 1y-nm 1.25 V 8Gb, 16Gb/s/pin GDDR6-based accelerator-in-memory supporting 1TFLOPS MAC operation and various activation functions for deep-learning applications. In 2022 IEEE International Solid-State Circuits Conference (ISSCC), Vol. 65. IEEE, 1–3

  9. [10]

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, and D. Dai. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  10. [11]

H. Peng, S. Huang, T. Geng, A. Li, W. Jiang, H. Liu, S. Wang, and C. Ding. 2021. Accelerating transformer-based deep learning models on FPGAs using column balanced block pruning. In 2021 22nd International Symposium on Quality Electronic Design (ISQED). IEEE, 142–148

  11. [12]

A. Pinkus. 1999. Approximation theory of the MLP model in neural networks. Acta Numerica 8 (1999), 143–195

  12. [13]

T. Tambe, J. Zhang, C. Hooper, T. Jia, P. N. Whatmough, J. Zuckerman, M. C. Dos Santos, E. J. Loscalzo, D. Giri, K. Shepard, L. Carloni, A. Rush, D. Brooks, and G.-Y. Wei. 2023. 22.9 A 12nm 18.1 TFLOPs/W sparse transformer processor with entropy-based early exit, mixed-precision predication and fine-grained power management. In Proceedings of the IEEE Inte...

  13. [14]

H. Taud and J. F. Mas. 2017. Multilayer perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios. Springer International Publishing, Cham, 451–455

  14. [15]

F. Tu, Z. Wu, Y. Wang, et al. 2022. A 28nm 15.59 μJ/token full-digital bitline-transpose CIM-based sparse transformer accelerator with pipeline/parallel reconfigurable modes. In 2022 IEEE International Solid-State Circuits Conference (ISSCC), Vol. 65. IEEE, 466–468

  15. [16]

F. Tu, Z. Wu, Y. Wang, W. Wu, L. Liu, Y. Hu, S. Wei, and S. Yin. 2023. 16.1 MulTCIM: A 28nm 2.24 μJ/token attention-token-bit hybrid sparse digital CIM-based accelerator for multimodal transformers. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC). 248–250. doi:10.1109/ISSCC42615.2023.10067842

  16. [17]

B. Varghese, N. Wang, S. Barbhuiya, P. Kilpatrick, and D. S. Nikolopoulos. 2016. Challenges and opportunities in edge computing. In 2016 IEEE International Conference on Smart Cloud (SmartCloud). IEEE, 20–26

  17. [18]

H. Wang, Z. Zhang, and S. Han. 2021. SpAtten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 97–110

  18. [19]

Y. Wang, Y. Qin, D. Deng, et al. 2022. A 28nm 27.5 TOPS/W approximate-computing-based transformer processor with asymptotic sparsity speculating and out-of-order computing. In 2022 IEEE International Solid-State Circuits Conference (ISSCC), Vol. 65. IEEE, 1–3

  19. [20]

Z. Wang, J. Wei, B. Han, H. He, L. Liu, S. Wei, and S. Yin. 2023. CPE: An energy-efficient edge-device training with multi-dimensional compression mechanism. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6

  20. [21]

Z. Wang, J. Wei, X. Tang, B. Han, H. He, L. Liu, S. Wei, and S. Yin. 2023. TPE: A high-performance edge-device inference with multi-level transformational mechanism. In 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 1–5