pith. machine review for the scientific record.

arxiv: 2604.11659 · v1 · submitted 2026-04-13 · 💻 cs.CR · cs.DC · cs.DS · cs.LG · cs.PF

Recognition: unknown

GPU Acceleration of Sparse Fully Homomorphic Encrypted DNNs

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:11 UTC · model grok-4.3

classification 💻 cs.CR · cs.DC · cs.DS · cs.LG · cs.PF
keywords fully homomorphic encryption · GPU acceleration · sparse matrix multiplication · encrypted neural networks · ciphertext computation · FHE libraries · privacy-preserving inference

The pith

Sparse matrix multiplication on GPUs for fully homomorphic encrypted DNNs reduces complexity from cubic to semi-linear and outperforms CPU implementations by up to 3x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets matrix multiplication as the dominant cost in deep neural networks under fully homomorphic encryption. It adapts the operation to AMD GPUs using an open-source FHE library and exploits sparsity present in both operands of the ciphertext multiplication. This yields a measured 3x speedup over a CPU baseline while lowering the asymptotic cost from cubic to semi-linear. The approach is presented as a practical improvement for running encrypted inference workloads. A sympathetic reader would care because FHE enables privacy-preserving computation on sensitive data, yet its high overhead has limited real deployment.

Core claim

By exploiting sparsity in both operands of ciphertext matrix multiplication, the proposed GPU implementation achieves up to 3.0x better runtime than its CPU counterpart and reduces the time complexity of FHE matmul from cubic to semi-linear, providing a concrete improvement over prior FHE matrix-multiplication techniques.

What carries the argument

Sparse ciphertext matrix multiplication on GPUs that processes only the non-zero elements of both FHE operands inside the FIDESlib library.
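The complexity argument can be illustrated with a plaintext sketch. This is not FIDESlib's kernel, which operates on CKKS ciphertext slots and rotations; the coordinate-dict format and the `sparse_matmul` name are illustrative only. The point is that multiplication work scales with the number of matched non-zero pairs rather than with n³:

```python
from collections import defaultdict

def sparse_matmul(a_nz, b_nz):
    """Multiply two sparse matrices given as {(row, col): value} dicts
    of their non-zero values (NZVs); returns (product, mult_count)."""
    b_rows = defaultdict(list)           # index B's NZVs by row
    for (k, j), v in b_nz.items():
        b_rows[k].append((j, v))

    c = defaultdict(float)
    mults = 0
    for (i, k), av in a_nz.items():      # loop over NZVs of A only...
        for j, bv in b_rows[k]:          # ...and the matching NZVs of B
            c[(i, j)] += av * bv
            mults += 1                   # work ~ matched NZV pairs
    return dict(c), mults

a = {(0, 0): 1.0, (1, 2): 2.0}           # 2 NZVs of a 4x4 matrix
b = {(0, 1): 3.0, (2, 3): 4.0}           # 2 NZVs of another
c, mults = sparse_matmul(a, b)
# a dense 4x4 matmul performs 64 multiplications; here mults == 2
```

In the encrypted setting the same skipping must be done without revealing which ciphertext slots are zero, which is where the paper's SIMD/rotation machinery comes in.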

If this is right

  • Encrypted DNN inference can execute faster on commodity GPUs without decrypting intermediate values.
  • The semi-linear complexity enables scaling encrypted models to larger layer widths than previously feasible on CPUs.
  • FHE libraries can incorporate sparsity-aware kernels as a standard optimization for DNN workloads.
  • Privacy-preserving machine learning becomes more practical for edge or cloud deployment where GPUs are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar sparsity exploitation could be applied to other linear algebra primitives inside FHE, such as convolutions.
  • The technique may extend to other GPU vendors if the underlying FHE library is ported.
  • Lower latency for encrypted inference could encourage adoption in regulated domains such as healthcare or finance.

Load-bearing premise

The matrix multiplications inside FHE-based DNNs contain enough sparsity in the ciphertexts to deliver the claimed complexity reduction and speedup while preserving correctness and security.

What would settle it

Run the same sparse matmul kernel on representative FHE DNN layers with measured sparsity below 50 percent and check whether the observed speedup drops below 1.5x or the complexity reverts to cubic scaling.
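A hypothetical plaintext harness for that test might sweep operand density and record the sparse-versus-dense speedup; `speedup_at`, the 256x256 size, and the use of scipy are assumptions for illustration, since timing the encrypted kernel itself would require the FIDESlib setup the paper describes.

```python
import time
import numpy as np
import scipy.sparse as sp

def speedup_at(density, n=256, seed=0):
    """Time dense vs. CSR sparse matmul at a given operand density."""
    rng = np.random.default_rng(seed)
    a = sp.random(n, n, density=density, format="csr", random_state=rng)
    b = sp.random(n, n, density=density, format="csr", random_state=rng)
    ad, bd = a.toarray(), b.toarray()

    t0 = time.perf_counter(); _ = ad @ bd; t_dense = time.perf_counter() - t0
    t0 = time.perf_counter(); _ = a @ b;   t_sparse = time.perf_counter() - t0
    return t_dense / t_sparse

# Flag densities where the claimed advantage would not materialize.
for density in (0.1, 0.3, 0.5):
    s = speedup_at(density)
    print(f"density={density:.1f} speedup={s:.2f}",
          "(below 1.5x)" if s < 1.5 else "")
```

If the sparse path's advantage collapses as density approaches 50 percent in plaintext, the encrypted kernel (with its far higher per-multiplication cost) would need its own sweep to confirm where the crossover sits.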

Figures

Figures reproduced from arXiv: 2604.11659 by Ardhi W. B. Yudha, Carlos Agulló-Domingo, David Kaeli, Ferhat Yaman, Ian Colbert, José Cano, José L. Abellán, Kaustubh Shivdikar, Lara D'Agata, Óscar Vera-López.

Figure 1: Visualization of a request to a third-party cloud environment with and without FHE; with FHE, data remains encrypted throughout processing.
Figure 2: Example of a matrix represented in the CSR sparsity representation format.
Figure 3: Overview of SIMD operations used for the FHE CKKS matmul computation, with example value matrices A and B. The encrypted vectors are drawn with a blue outline; the ellipses (...) represent the ciphertext padding.
Figure 4: Runtime comparison of the different algorithms used to multiply two 16x16 matrices on (a) an AMD Radeon RX 7900 GPU, (b) a comparison with OpenFHE on the AMD Radeon GPU, and (c) runtimes of each method on the AMD Radeon and AMD MI300 GPUs. The matrix sparsity is represented as a fraction, and the runtimes are normalized.
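The CSR layout shown in Figure 2 can be reproduced in a few lines with scipy; the matrix values here are illustrative, not the paper's.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[5, 0, 0, 0],
                  [0, 8, 0, 6],
                  [0, 0, 3, 0],
                  [0, 0, 0, 0]])
m = sp.csr_matrix(dense)

print(m.data)     # non-zero values, row by row: [5 8 6 3]
print(m.indices)  # column index of each value:  [0 1 3 2]
print(m.indptr)   # row i's values live in data[indptr[i]:indptr[i+1]]
```

The `indptr` array is what makes row access O(1), which is why CSR suits the row-major access pattern of matmul; CSC and block formats trade this for column or tile locality, as the Figure 2 caption notes.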
Original abstract

Fully homomorphic encryption (FHE) has recently attracted significant attention as both a cryptographic primitive and a systems challenge. Given the latest advances in accelerated computing, FHE presents a promising opportunity for progress, with applications ranging from machine learning to information security. We target the most computationally intensive operation in deep neural networks from a hardware perspective, matrix multiplication (matmul), and adapt it for execution on AMD GPUs. We propose a new optimized method that improves the runtime and complexity of ciphertext matmul by using FIDESlib, a recent open-source FHE library designed specifically for GPUs. By exploiting sparsity in both operands, our sparse matmul implementation outperforms its CPU counterpart by up to $3.0\times$ and reduces the time complexity from cubic to semi-linear, demonstrating an improvement over existing FHE matmul implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents a GPU-accelerated sparse matrix multiplication method for fully homomorphic encrypted (FHE) deep neural networks, implemented on AMD GPUs using the open-source FIDESlib library. It claims that exploiting sparsity in both operands yields up to 3.0× speedup over CPU implementations and reduces ciphertext matmul time complexity from cubic to semi-linear, improving on prior FHE matmul approaches.

Significance. If the claimed speedups and complexity reduction are rigorously demonstrated with detailed algorithms and experiments, the work would be significant for practical FHE-based machine learning, as it targets the core computational bottleneck of encrypted DNN inference through hardware acceleration and sparsity. The reliance on an open-source GPU FHE library supports reproducibility in systems cryptography research.

major comments (2)
  1. [Abstract] Abstract: The central claim that exploiting sparsity in both operands reduces time complexity from cubic to semi-linear is load-bearing but unsupported; the text supplies no sparsity ratios for DNN weights/activations, no pseudocode or equations for the sparse ciphertext matmul, and no analysis of how zero positions are identified and skipped in the encrypted domain while preserving FHE correctness and security (see skeptic note on unquantified sparsity exploitation).
  2. [Abstract] Abstract and full text: No experimental methodology, baseline comparisons (e.g., dense FHE matmul or other GPU FHE libraries), error analysis, or verification is provided to confirm that the reported 3.0× speedup and complexity improvement derive from the described sparsity method rather than unstated optimizations or implementation details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the manuscript accordingly to strengthen the presentation of our claims and experiments.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that exploiting sparsity in both operands reduces time complexity from cubic to semi-linear is load-bearing but unsupported; the text supplies no sparsity ratios for DNN weights/activations, no pseudocode or equations for the sparse ciphertext matmul, and no analysis of how zero positions are identified and skipped in the encrypted domain while preserving FHE correctness and security (see skeptic note on unquantified sparsity exploitation).

    Authors: We agree the claim requires explicit support. In the revision we will add typical DNN sparsity ratios (e.g., 70-90% zero weights after pruning), pseudocode and equations for the sparse matmul kernel in FIDESlib, and a complexity analysis. Zero positions are identified from known plaintext weights (one operand); corresponding ciphertext multiplications are omitted, which is equivalent and leaks no information, preserving FHE correctness and security. This yields the stated semi-linear scaling in the number of non-zeros. revision: yes
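The zero-skipping argument in this response can be sketched as follows; `ct_mult` and `ct_add` are hypothetical stand-ins for an FHE library's ciphertext operations, counted here on plain lists rather than real ciphertexts.

```python
# The weight matrix is plaintext, so zero positions are public and the
# expensive ciphertext multiply can simply be skipped without leakage.
ops = {"mult": 0}

def ct_mult(ct, w):            # stand-in for a ciphertext-plaintext mult
    ops["mult"] += 1
    return [x * w for x in ct]

def ct_add(a, b):              # stand-in for ciphertext addition
    return [x + y for x, y in zip(a, b)]

def encrypted_matvec(weights, cts):
    """weights: plaintext rows; cts: one 'ciphertext' per input column."""
    out = []
    for row in weights:
        acc = [0.0] * len(cts[0])
        for w, ct in zip(row, cts):
            if w == 0.0:       # public zero: no ciphertext op needed
                continue
            acc = ct_add(acc, ct_mult(ct, w))
        out.append(acc)
    return out

weights = [[0.0, 2.0], [0.0, 0.0]]   # 75% sparse plaintext weights
cts = [[1.0, 1.0], [3.0, 4.0]]       # toy stand-ins for ciphertexts
result = encrypted_matvec(weights, cts)
# only the single non-zero weight triggers a "ciphertext" multiply
```

Note this covers only the plaintext-weight operand; the paper's stronger claim of exploiting sparsity in both operands of a ciphertext-ciphertext multiply is exactly what the referee asks to see specified.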

  2. Referee: [Abstract] Abstract and full text: No experimental methodology, baseline comparisons (e.g., dense FHE matmul or other GPU FHE libraries), error analysis, or verification is provided to confirm that the reported 3.0× speedup and complexity improvement derive from the described sparsity method rather than unstated optimizations or implementation details.

    Authors: We acknowledge the need for fuller experimental rigor. The revision will include a dedicated methodology section, direct comparisons against dense FHE matmul (CPU and GPU) and other libraries, noise/error analysis confirming equivalence to dense results, and ablation studies isolating the contribution of sparsity to the observed 3.0× speedup. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims derive from implementation and benchmarking

full rationale

The paper presents an empirical systems contribution: a GPU-accelerated sparse ciphertext matrix multiplication for FHE DNNs built on FIDESlib. The claimed 3.0× speedup and reduction from cubic to semi-linear complexity are stated as direct outcomes of the implemented algorithm that exploits sparsity in both operands. No equations, parameters, or predictions are fitted to target results and then re-presented as independent derivations; no self-citations supply load-bearing uniqueness theorems or ansatzes; and no renaming of known patterns occurs. The derivation chain is therefore self-contained in the described implementation and its measured behavior, which remains externally verifiable through code and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central performance claim rests on standard FHE security and correctness properties plus the assumption that DNN computations under encryption contain exploitable sparsity; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption FHE schemes maintain semantic security and correct decryption after operations
    Invoked implicitly as the foundation for all FHE DNN work.
  • domain assumption Matrix operands in encrypted DNNs contain sufficient zero entries to yield semi-linear complexity
    Core premise enabling the complexity reduction and speedup.

pith-pipeline@v0.9.0 · 5497 in / 1177 out tokens · 73666 ms · 2026-05-10T15:11:48.780458+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 6 canonical work pages

  1. [1]

    Rashmi Agrawal and Ajay Joshi. 2023. On Architecting Fully Homomorphic Encryption-based Computing Systems. Springer Cham

  2. [2]

    Carlos Agulló-Domingo, Óscar Vera-López, Seyda Guzelhan, Lohit Daksha, Aymane El Jerari, Kaustubh Shivdikar, Rashmi Agrawal, David Kaeli, Ajay Joshi, and José L. Abellán. 2025. FIDESlib: A Fully-Fledged Open-Source FHE Library for Efficient CKKS on GPUs. arXiv:2507.04775

  3. [3]

    Nir Ailon, Akhiad Bercovich, Yahel Uffenheimer, and Omri Weinstein. 2025. Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs. arXiv:2503.12211

  4. [4]

    Ahmad Al Badawi, Jack Bates, Flavio Bergamaschi, David Bruce Cousins, Saroja Erabelli, Nicholas Genise, Shai Halevi, Hamish Hunt, Andrey Kim, Yongwoo Lee, Zeyu Liu, Daniele Micciancio, Ian Quah, Yuriy Polyakov, Saraswathy R.V., Kurt Rohloff, Jonathan Saylor, Dmitriy Suponitsky, Matthew Triplett, Vinod Vaikuntanathan, and Vincent Zucca. 2022. OpenFHE: Open...

  5. [5]

    Pedro Alves, Jheyne Ortiz, and Diego Aranha. 2023. Performance of hierarchical transforms in homomorphic encryption: a case study on logistic regression inference. J. of Cryptographic Engineering (2023)

  6. [6]

    Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. 2014. (Leveled) Fully Homomorphic Encryption without Bootstrapping. ACM Trans. Comput. Theory 6, 3, Article 13 (July 2014), 36 pages

  7. [7]

    Chaochao Chen, Jun Zhou, Li Wang, Xibin Wu, Wenjing Fang, Jin Tan, Lei Wang, Alex X. Liu, Hao Wang, and Cheng Hong. 2021. When Homomorphic Encryption Marries Secret Sharing: Secure Large-Scale Sparse Logistic Regression and Applications in Risk Control. In 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining

  8. [8]

    Xiaofeng Chen, Xinyi Huang, Jin Li, Jianfeng Ma, Wenjing Lou, and Duncan S. Wong. 2015. New Algorithms for Secure Outsourcing of Large-Scale Systems of Linear Equations. IEEE Transactions on Information Forensics and Security 10, 1 (2015), 69–78

  9. [9]

    Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. 2017. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Advances in Cryptology – ASIACRYPT 2017. 409–437

  10. [10]

    Moontaha Nishat Chowdhury, André Bauer, and Minxuan Zhou. 2025. Efficient Privacy-Preserving Recommendation on Sparse Data using Fully Homomorphic Encryption. In 2025 IEEE International Conference on eScience (eScience). 1–9

  11. [11]

    Jinming Cui, Chaochao Chen, Lingjuan Lyu, Carl Yang, and Wang Li

  12. [12]

    Exploiting Data Sparsity in Secure Cross-Platform Social Recommendation. In Advances in Neural Information Processing Systems

  13. [13]

    Junfeng Fan and Frederik Vercauteren. 2012. Somewhat Practical Fully Homomorphic Encryption. Cryptology ePrint Archive, 2012/144

  14. [14]

    Shengyu Fan, Zhiwei Wang, Weizhi Xu, Rui Hou, Dan Meng, and Mingzhe Zhang. 2023. TensorFHE: Achieving Practical Computation on Encrypted Data Using GPGPU. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 922–934

  15. [15]

    Aidan Ferguson, Perry Gibson, Lara D'Agata, Parker McLeod, Ferhat Yaman, Amitabh Das, Ian Colbert, and José Cano. 2025. Exploiting Unstructured Sparsity in Fully Homomorphic Encrypted DNNs. In 5th Workshop on Machine Learning and Systems (EuroMLSys '25). 31–38

  16. [16]

    Craig Gentry. 2010. Computing arbitrary functions of encrypted data. Commun. ACM 53, 3 (March 2010), 97–105

  17. [17]

    Perry Gibson, Jose Cano, Elliot Crowley, Amos Storkey, and Michael O'Boyle. 2025. DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration. ACM Trans. Archit. Code Optim. (2025)

  18. [18]

    Shai Halevi and Victor Shoup. 2015. Bootstrapping for HElib. In Advances in Cryptology – EUROCRYPT 2015. Springer Berlin Heidelberg

  19. [19]

    Jude Haris, Rappy Saha, Wenhao Hu, and José Cano. 2024. Designing Efficient LLM Accelerators for Edge Devices. arXiv:2408.00462

  20. [20]

    Ming-Chien Ho, Yu-Te Ku, Yu Xiao, Feng-Hao Liu, Chih-Fan Hsu, Ming-Ching Chang, Shih-Hao Hung, and Wei-Chao Chen. 2025. Invited Paper: Efficient Design of FHEW/TFHE Bootstrapping Implementation with Scalable Parameters. In 43rd IEEE/ACM International Conference on Computer-Aided Design

  21. [21]

    Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. 2016. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures. arXiv:1607.03250

  22. [22]

    Elmira Karimi, Nicolas Bohm Agostini, Shi Dong, and David Kaeli

  23. [23]

    VCSR: An Efficient GPU Memory-Aware Sparse Format. IEEE Trans. Parallel Distrib. Syst. 33, 12 (Dec. 2022), 3977–3989

  24. [24]

    Jongmin Kim, Wonseok Choi, and Jung Ho Ahn. 2024. Cheddar: A swift fully homomorphic encryption library for cuda gpus. arXiv:2407.13055

  25. [25]

    Francisco Muñoz Martínez, Raveesh Garg, Michael Pellauer, José L. Abellán, Manuel E. Acacio, and Tushar Krishna. 2023. Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing. In 28th ACM Int. Conference on Architectural Support for Programming Languages and Operating Systems

  26. [26]

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shiv- ani Agrawal, and Jeff Dean. 2022. Efficiently Scaling Transformer Inference. arXiv:2211.05102

  27. [27]

    T Thammi Reddy, Silpakesav Velagaleti, BVV Satyanarayana, and G Prasanna Kumar. 2025. Hardware efficient arithmetic reconfigurable fully homomorphic encryption (ARFHE) accelerator of low power IoT based RISC-V processor. Analog Integrated Circuits and Signal Processing 124, 1 (2025), 20

  28. [28]

    Kaustubh Shivdikar, Yuhui Bao, Rashmi Agrawal, Michael Shen, Gilbert Jonatan, Evelio Mora, Alexander Ingare, Neal Livesay, José L. Abellán, John Kim, Ajay Joshi, and David Kaeli. 2023. GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption. In 56th Annual IEEE/ACM International Symposium on Microarchitecture

  29. [29]

    Emmanuel Vintimilla-Tapia, Alexander Rojas, Marco Sigüenza, Andrea Paulina Rodríguez Zúñiga, and Priscila Cedillo. 2026. A Systematic Literature Review on the Security Weaknesses of Fully Homomorphic Encryption Schemes. In Information and Communication Technologies. Springer Nature Switzerland, Cham, 312–328

  30. [30]

    Qingfeng Wang and Li-Ping Wang. 2025. A Novel Asymmetric BSGS Polynomial Evaluation Algorithm under Homomorphic Encryption. In 20th ACM Asia Conference on Computer and Communications Security

  31. [31]

    Yannan Nellie Wu, Po-An Tsai, Saurav Muralidharan, Angshuman Parashar, Vivienne Sze, and Joel Emer. 2023. HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity. In 56th Annual IEEE/ACM International Symposium on Microarchitecture

  32. [32]

    Weizhi Xu, Yintai Sun, Shengyu Fan, Hui Yu, and Xin Fu. 2023. Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs. ACM Trans. Archit. Code Optim. 20, 3, Article 36 (July 2023)

  33. [33]

    Keren Zhou, Karthik Ganapathi Subramanian, Po-Hsun Lin, Matthias Fey, Binqian Yin, and Jiajia Li. 2024. FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks. In 38th ACM International Conference on Supercomputing

  34. [34]

    Ali Şah Özcan and Erkay Savaş. 2024. HEonGPU: a GPU-based Fully Homomorphic Encryption Library 1.0. Cryptology ePrint Archive, 2024/1543