Triple-Hoisted Baby-Step Giant-Step Linear Transformation over CKKS Homomorphic Encryption and Hardware Accelerator

Sajjad Akherati; Xinmiao Zhang

arxiv: 2605.17222 · v1 · pith:CNIZZ6BBnew · submitted 2026-05-17 · 💻 cs.CR

Triple-Hoisted Baby-Step Giant-Step Linear Transformation over CKKS Homomorphic Encryption and Hardware Accelerator

Sajjad Akherati , Xinmiao Zhang This is my paper

Pith reviewed 2026-05-20 00:00 UTC · model grok-4.3

classification 💻 cs.CR

keywords homomorphic encryptionCKKSlinear transformationbaby-step giant-stepciphertext rotationhardware acceleratorFPGAmemory optimization

0 comments

The pith

A triple-hoisted baby-step giant-step algorithm reduces the rotations and off-chip memory accesses needed for linear transformations under CKKS homomorphic encryption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a further decomposition of the baby steps inside the standard baby-step giant-step method so that fewer ciphertext rotations are required to evaluate a linear transformation on encrypted data. It then partitions the computation into phases that keep most data on-chip and thereby cut the dominant source of hardware latency. An FPGA accelerator with a specialized permutation network is built to route the remaining data efficiently under this scheme. For typical parameter sets the design lowers off-chip memory traffic 2.9 times relative to the best previous implementation and cuts total computational latency 5.8 times relative to an unoptimized baseline.

Core claim

By applying three levels of hoisting to the baby-step giant-step decomposition of a linear transformation, the number of ciphertext rotations drops sharply; phase partitioning of the resulting schedule then minimizes off-chip DRAM accesses, and an FPGA implementation with an optimized message-permutation circuit realizes these reductions in hardware.

What carries the argument

The triple-hoisted baby-step giant-step decomposition with phase partitioning that separates rotation-heavy stages from memory-bound stages.

If this is right

Fewer rotations directly lower the time spent on matrix-vector products inside encrypted neural-network layers.
Reduced off-chip traffic makes hardware accelerators for homomorphic encryption more practical for repeated linear operations.
The FPGA data path can be reused or scaled when larger decomposition depths are chosen for higher security parameters.
The same phase-partitioning idea applies to other rotation-heavy primitives such as bootstrapping or convolution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rotation savings could compound across many layers in a deep encrypted model, amplifying the benefit beyond single-layer measurements.
Similar hoisting and partitioning may transfer to other homomorphic schemes that also rely on rotation-based linear algebra.
An ASIC version of the permutation circuit could further increase throughput once the FPGA prototype validates the approach.
Integration with ciphertext-packing strategies already used in software libraries could produce additional multiplicative gains.

Load-bearing premise

The reported speedups assume that the selected decomposition parameters and phase boundaries introduce no extra overheads that would appear when the method is embedded inside a complete end-to-end encrypted inference workload.

What would settle it

An end-to-end measurement of a full encrypted neural-network inference workload that either reproduces the 2.9x memory-access and 5.8x latency gains or shows that hidden costs erase them.

Figures

Figures reproduced from arXiv: 2605.17222 by Sajjad Akherati, Xinmiao Zhang.

**Figure 2.** Figure 2: Number of modular multiplications and switching key size of [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Block diagrams of the proposed memory-optimized data path of (a) Lines 1–5 (Phase 1); (b) Lines 6–7 (Phase 2); (c) Lines 8–11 (Phase 3); (d) [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Comparisons of off-chip memory access (left axis) and number of modular multiplications (right axis) for the diagonal method [ [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Top-level architecture of the proposed accelerator for HE-LT. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Architecture for the PE. capacity for the polynomials, and consequently the number of coefficients in each polynomial that can be processed simultaneously, denoted as dp, is determined. The product of dp and the number of polynomials processed concurrently determines nPE. By analyzing the data flow across the six phases of Algorithm 2, the required number of TAs for the summations in Decompose, ModDown, th… view at source ↗

read the original abstract

Computations can be directly carried out over ciphertexts using homomorphic encryption (HE), which is indispensable for privacy-preserving cloud computing. Linear transformation is widely used in neural networks, including large language models. However, the implementation of linear transformation over HE requires a large number of ciphertext rotations, which incur significant memory and hardware overhead despite existing simplification techniques. This paper proposes a triple-hoisted baby-step giant-step algorithm that decomposes the baby step further to substantially reduce the number of ciphertext rotations needed for the CKKS HE evaluation of linear transformation. Moreover, to reduce off-chip memory access, which contributes to the majority of the latency, a memory-optimized data path is proposed by partitioning the algorithm into multiple phases. Furthermore, an efficient FPGA-based hardware accelerator with an optimized permutation circuit for message routing is designed for the proposed scheme. For a set of typical parameters, the proposed design reduces the off-chip memory access by 2.9x compared to the best prior design. Synthesized for Xilinx Virtex UltraScale+ devices, the proposed design achieves a 5.8x reduction in computational latency compared with the baseline design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This refines BSGS for CKKS linear transforms via triple hoisting plus phased memory and FPGA permutation, delivering 2.9x memory and 5.8x latency cuts on isolated typical parameters.

read the letter

The core advance is a triple-hoisted baby-step giant-step decomposition that breaks the baby step further to cut ciphertext rotations for CKKS linear transforms. They add a memory-optimized data path that splits the work into phases to reduce off-chip accesses, plus an FPGA permutation circuit for routing. For typical parameters the synthesis on Virtex UltraScale+ reports 2.9x less off-chip memory traffic than the best prior design and 5.8x lower computational latency than baseline. That is concrete engineering progress on a bottleneck that shows up in every encrypted neural net layer. The algorithmic tweak and the phase partitioning are the parts that feel new relative to earlier BSGS papers. The numbers are stated clearly enough that a referee can check the rotation counts and the memory model against the claimed decomposition factors. The main limitation is that all results stay inside the isolated linear-transform primitive. A full inference workload brings bootstrapping, activations, and inter-layer ciphertext movement that could change how the extra hoisting steps amortize or add control overhead. The abstract does not report those end-to-end measurements, so the practical speedups remain an assumption rather than a demonstrated fact. This paper is for hardware designers and systems people who already work on CKKS accelerators for privacy-preserving ML. Someone building or tuning an FPGA or ASIC datapath for linear transforms would get usable ideas from the phase split and the permutation circuit. It is worth sending to peer review. The claims are narrow enough to verify, the hardware synthesis is reproducible in principle, and the subfield needs these targeted reductions even if the gains need workload-level confirmation later.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a triple-hoisted baby-step giant-step (BSGS) decomposition for linear transformations under CKKS homomorphic encryption, further partitioned into phases with a memory-optimized data path and implemented via an FPGA accelerator with an optimized permutation circuit. For a set of typical CKKS parameters, synthesis on Xilinx Virtex UltraScale+ reports a 2.9× reduction in off-chip memory access versus the best prior design and a 5.8× reduction in computational latency versus baseline.

Significance. If the reported reductions hold under the stated assumptions, the co-design of algorithmic hoisting and phased hardware partitioning could meaningfully lower the dominant costs of rotations and DRAM traffic in HE linear layers, supporting more efficient privacy-preserving inference. The work supplies concrete synthesis numbers and an explicit hardware architecture, which are positive for reproducibility in the hardware-HE community.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: the 2.9× off-chip memory and 5.8× latency claims are derived from isolated linear-transform synthesis under 'typical parameters'; the manuscript does not quantify additional rotation or key-switching traffic that would arise when the primitive is embedded in a full inference graph containing bootstraps and inter-layer movements, leaving the practical end-to-end gain unverified.
[Algorithm / §3] Algorithm description: the triple-hoisting decomposition is presented as reducing rotation count, yet the paper does not supply a closed-form expression or table showing the exact rotation count versus prior BSGS variants for the same matrix dimension and ring parameters, making it difficult to confirm the claimed improvement is not an artifact of parameter selection.

minor comments (2)

Notation for the hoisting factors and phase boundaries could be made more consistent between the algorithmic pseudocode and the hardware data-path diagram.
[Results] The baseline design used for the 5.8× latency comparison should be explicitly referenced (e.g., prior work citation and parameter set) in the results tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of scope and clarity that we address below. We propose targeted revisions to strengthen the manuscript while maintaining focus on the linear-transform primitive.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and Evaluation section: the 2.9× off-chip memory and 5.8× latency claims are derived from isolated linear-transform synthesis under 'typical parameters'; the manuscript does not quantify additional rotation or key-switching traffic that would arise when the primitive is embedded in a full inference graph containing bootstraps and inter-layer movements, leaving the practical end-to-end gain unverified.

Authors: We agree that the reported gains are measured for the isolated linear-transformation primitive, which is the dominant cost in many CKKS workloads. A full end-to-end quantification would require selecting a specific neural-network architecture, bootstrap schedule, and inter-layer data movement pattern, none of which are standardized for this primitive. In the revised manuscript we will add a dedicated paragraph in the Evaluation section that (i) explicitly states the scope of the claims, (ii) discusses how the triple-hoisted BSGS primitive composes with bootstrapping and key-switching, and (iii) provides a back-of-the-envelope estimate of additional rotation traffic for a representative ResNet-style workload. We will not claim end-to-end numbers without new experiments. revision: partial
Referee: [Algorithm / §3] Algorithm description: the triple-hoisting decomposition is presented as reducing rotation count, yet the paper does not supply a closed-form expression or table showing the exact rotation count versus prior BSGS variants for the same matrix dimension and ring parameters, making it difficult to confirm the claimed improvement is not an artifact of parameter selection.

Authors: Section 3 derives the rotation count for the triple-hoisted decomposition by recursively partitioning the baby-step index set, but we did not tabulate the resulting counts against the classic and double-hoisted BSGS baselines. We will insert a new table (Table X) that lists, for representative matrix dimensions N and ring parameters (N, logQ, etc.), the exact number of rotations required by each variant. A short closed-form expression for the triple-hoisted case will also be added to the text preceding the table. revision: yes

Circularity Check

0 steps flagged

No circularity: new algorithmic decomposition and hardware synthesis rest on independent design choices

full rationale

The paper proposes a triple-hoisted BSGS decomposition for CKKS linear transforms, a phased memory-optimized datapath, and an FPGA accelerator with permutation circuit. Reported gains (2.9x off-chip access, 5.8x latency) are obtained from synthesis on Xilinx Virtex UltraScale+ for a set of typical parameters and compared against external prior designs. No equations, fitted parameters, or self-citations in the abstract or described structure reduce these results to quantities defined by the authors' own prior work or by construction from the inputs. The derivation chain consists of novel algorithmic partitioning and hardware mapping whose correctness is evaluated externally rather than tautologically.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The performance numbers rest on the existence of a valid CKKS parameter set and on the assumption that rotation costs dominate; no new mathematical axioms or invented entities are introduced beyond standard CKKS ring operations.

free parameters (1)

hoisting decomposition factors
The triple-hoist step sizes are chosen to minimize rotations for the target matrix dimensions; these are free parameters tuned for the reported 'typical parameters'.

axioms (1)

domain assumption CKKS rotation cost model remains accurate after triple hoisting
The abstract implicitly assumes that further decomposition does not introduce new noise or correctness issues that would invalidate the rotation savings.

pith-pipeline@v0.9.0 · 5732 in / 1468 out tokens · 43195 ms · 2026-05-20T00:00:06.000135+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

This paper proposes a triple-hoisted baby-step giant-step algorithm that decomposes the baby step further to substantially reduce the number of ciphertext rotations... partitioning the algorithm into multiple phases.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

For a set of typical parameters, the proposed design reduces the off-chip memory access by 2.9× compared to the best prior design.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

[1]

Advancing sensor network optimization and data analytics for smart wastewater and stormwater collection systems: An overview

M. Moradi, M. Najafi, Z. Lin, and H. Pan, “Advancing sensor network optimization and data analytics for smart wastewater and stormwater collection systems: An overview”, inPipelines 2025, pp. 463–472

work page 2025
[2]

BioDeepHash: Generating consistent templates for se- cure biometric recognition

B. Song, D. Zhao, J. Yan, H. Li, and H. Jiang, “BioDeepHash: Generating consistent templates for se- cure biometric recognition”,IEEE Trans. on Depend- able and Secure Comp., pp. 1–18, 2026

work page 2026
[3]

Privacy-preserving machine learning techniques based on homomorphic encryption for credit risk analysis

V . V . L. D. Allavarpu, V . S. Naresh, and A. K. Mohan, “Privacy-preserving machine learning techniques based on homomorphic encryption for credit risk analysis”, Electronic Commerce Research, 2026

work page 2026
[4]

Ho- momorphic encryption for arithmetic of approximate numbers,

J. H. Cheon, A. Kim, M. Kim, and Y . Song, “Ho- momorphic encryption for arithmetic of approximate numbers,” inProc. of Intl. Conf. on the Theory and Appl. of Cryptol. and Info. Secur ., Cham, Switzerland, 2017, pp. 409–437

work page 2017
[5]

(Lev- eled) fully homomorphic encryption without bootstrap- ping,

Z. Brakerski, C. Gentry, and V . Vaikuntanathan, “(Lev- eled) fully homomorphic encryption without bootstrap- ping,” inProc. of Innov. in Theoret. Comp. Sci. Conf., Cambridge, Massachusetts, 2012, pp. 309–325

work page 2012
[6]

An improved RNS variant of the BFV homomorphic encryption scheme

S. Halevi, Y . Polyakov, and V . Shoup, “An improved RNS variant of the BFV homomorphic encryption scheme”, inTopics in Cryptology, M. Matsui, Ed., Cham: Springer International Publishing, 2019, pp. 83– 105

work page 2019
[7]

Faster packed homomorphic operations and efficient circuit bootstrapping for TFHE,

I. Chillotti, N. Gama, M. Georgieva, and M. Izabach `ene, “Faster packed homomorphic operations and efficient circuit bootstrapping for TFHE,” inAdvances in Cryp- tology, Cham, Switzeland, 2017, pp. 377–408. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS , VOL. X, NO. X, FEBRUARY 2026 11

work page 2017
[8]

Polynomial multi- plication architecture with integrated modular reduction for R-LWE cryptosystems,

X. Zhang, Z. Huai, and K. K. Parhi, “Polynomial multi- plication architecture with integrated modular reduction for R-LWE cryptosystems,”Journ. of Sig. Process. Syst., vol. 94, no. 8, pp. 799–809, 2022

work page 2022
[9]

An area- efficient, conflict-free, and configurable architecture for accelerating NTT/INTT,

S.-H. Liu, C.-Y . Kuo, Y .-N. Mo, and T. Su, “An area- efficient, conflict-free, and configurable architecture for accelerating NTT/INTT,”IEEE Trans. on V ery Large Scale Integ. (VLSI) Syst., vol. 32, no. 3, pp. 519–529, 2024

work page 2024
[10]

Area- efficient number theoretic transform architecture for homomorphic encryption,

P. Duong-Ngoc, S. Kwon, D. Yoo, and H. Lee, “Area- efficient number theoretic transform architecture for homomorphic encryption,”IEEE Trans. on Circ. and Syst. I, vol. 70, no. 3, pp. 1270–1283, 2023

work page 2023
[11]

Low-complexity NTT and INTT structures via twiddle shifting

S.-W. Chiu and K. K. Parhi, “Low-complexity NTT and INTT structures via twiddle shifting”, inIEEE Intl. Midwest Symp. on Circ. and Syst., 2025, pp. 444–448

work page 2025
[12]

PaReNTT: Low-latency parallel residue number system and NTT-based long polynomial modular multi- plication for homomorphic encryption,

W. Tan, S. W. Chiu, A. Wang, Y . Lao, and K. K. Parhi, “PaReNTT: Low-latency parallel residue number system and NTT-based long polynomial modular multi- plication for homomorphic encryption,”IEEE Trans. on Info. F oren. and Secur ., vol. 19, pp. 1646–1659, 2024

work page 2024
[13]

Efficient generalized integer division and modular reduction architectures for homomorphic encryption

S. Akherati, J. Cai, and X. Zhang, “Efficient generalized integer division and modular reduction architectures for homomorphic encryption”,Journ. of Sig. Process. Syst., vol. 97, no. 2, pp. 157–171, 2025

work page 2025
[14]

FPGA-based accelerators of fully pipelined modular multipliers for homomorphic encryption

S. Kim, K. Lee, W. Cho, J. H. Cheon, and R. A. Rutenbar, “FPGA-based accelerators of fully pipelined modular multipliers for homomorphic encryption”, in Intl. Conf. on ReConFigurable Comp. and FPGAs, 2019, pp. 1–8

work page 2019
[15]

A full RNS variant of approximate homomorphic encryption,

J. H. Cheon, K. Han, A. Kim, M. Kim, and Y . Song, “A full RNS variant of approximate homomorphic encryption,” inProc. of Select. Areas in Cryptog. Intl. Conf., Springer, 2019, pp. 347–368

work page 2019
[16]

Pan- ther: Practical secure two-party neural network infer- ence

J. Feng, Y . Wu, H. Sun, S. Zhang, and D. Liu, “Pan- ther: Practical secure two-party neural network infer- ence”,IEEE Trans. on Inf. F orensics and Sec., vol. 20, pp. 1149–1162, 2025

work page 2025
[17]

Practical secure inference algorithm for a fine-tuned large language model based on fully homomorphic encryption

R. Zhang, Z. Zheng, and W. Bao, “Practical secure inference algorithm for a fine-tuned large language model based on fully homomorphic encryption”,IEEE Trans. on Inf. F orensics and Sec., vol. 21, pp. 17–29, 2026

work page 2026
[18]

Efficient bootstrapping for approxi- mate homomorphic encryption with non-sparse keys,

J. P. Bossuat, C. Mouchet, J. Troncoso Pastoriza, and J. P. Hubaux, “Efficient bootstrapping for approxi- mate homomorphic encryption with non-sparse keys,” inAdvances in Cryptology, Cham, Switzerland, 2021, pp. 587–617

work page 2021
[19]

GAZELLE: A low latency framework for secure neural network inference,

C. Juvekar, V . Vaikuntanathan, and A. Chandrakasan, “GAZELLE: A low latency framework for secure neural network inference,” inProc. of the USENIX Conf. on Secur . Symp., Baltimore, MD, USA, 2018, pp. 1651– 1668

work page 2018
[20]

Bootstrapping for HElib

S. Halevi and V . Shoup, “Bootstrapping for HElib”,J. Cryptol., vol. 34, no. 1, 2021

work page 2021
[21]

Faster homomorphic linear transformations in HElib

S. Halevi and V . Shoup, “Faster homomorphic linear transformations in HElib”, inAdvances in Cryptol- ogy, Santa Barbara, CA, USA: Springer-Verlag, 2018, pp. 93–120

work page 2018
[22]

APAS: Application-specific accelerators for RLWE-based homomorphic linear transformations

S. Bian, D. E. S. Kundi, K. Hirozawa, W. Liu, and T. Sato, “APAS: Application-specific accelerators for RLWE-based homomorphic linear transformations”, IEEE Trans. on Inf. F orensics and Sec., vol. 16, pp. 4663–4678, 2021

work page 2021
[23]

FPGA accelerator for homomorphic en- crypted sparse convolutional neural network inference

Y . Yang, S. R. Kuppannagari, R. Kannan, and V . K. Prasanna, “FPGA accelerator for homomorphic en- crypted sparse convolutional neural network inference”, inIEEE Intl. Symp. on Field-Program. Custom Comp. Machines, 2022, pp. 1–9

work page 2022
[24]

CHAM: A customized homomorphic encryption accelerator for fast matrix-vector product

X. Ren, Z. Chen, Z. Gu, Y . Lu, R. Zhong, W.-J. Lu, J. Zhang, Y . Zhang, H. Wu, X. Zheng, H. Liu, T. Chu, C. Hong, C. Wei, D. Niu, and Y . Xie, “CHAM: A customized homomorphic encryption accelerator for fast matrix-vector product”, inACM/IEEE Design Au- tomation Conf., 2023, pp. 1–6

work page 2023
[25]

Bandwidth efficient homomorphic encrypted matrix vector multiplication accelerator on FPGA

Y . Yang, S. R. Kuppannagari, R. Kannan, and V . K. Prasanna, “Bandwidth efficient homomorphic encrypted matrix vector multiplication accelerator on FPGA”, in Intl. Conf. on Field-Program. Tech., 2022, pp. 1–9

work page 2022
[26]

Z. Xu, R. Kannan, and V . K. Prasanna,F AME: FPGA acceleration of secure matrix multiplication with homomorphic encryption, 2025. arXiv: 2512 . 15515 [cs.AR]. [Online]. Available: https://arxiv.org/abs/ 2512.15515

work page arXiv 2025
[27]

Efficient homomorphic conversion between (Ring) LWE cipher- texts,

H. Chen, W. Dai, M. Kim, and Y . Song, “Efficient homomorphic conversion between (Ring) LWE cipher- texts,” inProc. of Appli. Cryptog. and Network Secur ., Kamakura, Japan, 2021, pp. 460–479

work page 2021
[28]

Secure and efficient general matrix multiplication on cloud using homo- morphic encryption

Y . Gao, G. Quan, S. Homsi, et al., “Secure and efficient general matrix multiplication on cloud using homo- morphic encryption”,The Journal of Supercomputing, vol. 80, pp. 26 394–26 434, 2024

work page 2024
[29]

Improved ciphertext multi- plication for RNS-CKKS homomorphic encryption

S. Akherati and X. Zhang, “Improved ciphertext multi- plication for RNS-CKKS homomorphic encryption”, in IEEE Work. on Sig. Process. Syst., 2024, pp. 136–140

work page 2024
[30]

Better bootstrapping for approxi- mate homomorphic encryption

K. Han and D. Ki, “Better bootstrapping for approxi- mate homomorphic encryption”, inTopics in Cryptol- ogy, San Francisco, CA, USA, 2020, pp. 364–390

work page 2020
[31]

Multi-input ciphertext mul- tiplication for homomorphic encryption

S. Akherati and X. Zhang, “Multi-input ciphertext mul- tiplication for homomorphic encryption”,IEEE Trans. on Circ. and Syst. I: Regular Papers, pp. 1–12, 2026

work page 2026
[32]

Algorithms in HElib

S. Halevi and V . Shoup, “Algorithms in HElib”, in Advances in Cryptology, J. A. Garay and R. Gennaro, Eds., Berlin, Heidelberg, 2014, pp. 554–571

work page 2014
[33]

MAD: Memory- aware design techniques for accelerating fully homo- morphic encryption

R. Agrawal, L. d. Castro, C. Juvekar, A. Chandrakasan, V . Vaikuntanathan, and A. Joshi, “MAD: Memory- aware design techniques for accelerating fully homo- morphic encryption”, inIEEE/ACM Intl. Symp. on Mi- croarchitecture, 2023, pp. 685–697

work page 2023
[34]

Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations

R. Chen and V . K. Prasanna, “Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations”, inIntl. Conf. on Field Program. Logic and App., 2015, pp. 1–8

work page 2015

[1] [1]

Advancing sensor network optimization and data analytics for smart wastewater and stormwater collection systems: An overview

M. Moradi, M. Najafi, Z. Lin, and H. Pan, “Advancing sensor network optimization and data analytics for smart wastewater and stormwater collection systems: An overview”, inPipelines 2025, pp. 463–472

work page 2025

[2] [2]

BioDeepHash: Generating consistent templates for se- cure biometric recognition

B. Song, D. Zhao, J. Yan, H. Li, and H. Jiang, “BioDeepHash: Generating consistent templates for se- cure biometric recognition”,IEEE Trans. on Depend- able and Secure Comp., pp. 1–18, 2026

work page 2026

[3] [3]

Privacy-preserving machine learning techniques based on homomorphic encryption for credit risk analysis

V . V . L. D. Allavarpu, V . S. Naresh, and A. K. Mohan, “Privacy-preserving machine learning techniques based on homomorphic encryption for credit risk analysis”, Electronic Commerce Research, 2026

work page 2026

[4] [4]

Ho- momorphic encryption for arithmetic of approximate numbers,

J. H. Cheon, A. Kim, M. Kim, and Y . Song, “Ho- momorphic encryption for arithmetic of approximate numbers,” inProc. of Intl. Conf. on the Theory and Appl. of Cryptol. and Info. Secur ., Cham, Switzerland, 2017, pp. 409–437

work page 2017

[5] [5]

(Lev- eled) fully homomorphic encryption without bootstrap- ping,

Z. Brakerski, C. Gentry, and V . Vaikuntanathan, “(Lev- eled) fully homomorphic encryption without bootstrap- ping,” inProc. of Innov. in Theoret. Comp. Sci. Conf., Cambridge, Massachusetts, 2012, pp. 309–325

work page 2012

[6] [6]

An improved RNS variant of the BFV homomorphic encryption scheme

S. Halevi, Y . Polyakov, and V . Shoup, “An improved RNS variant of the BFV homomorphic encryption scheme”, inTopics in Cryptology, M. Matsui, Ed., Cham: Springer International Publishing, 2019, pp. 83– 105

work page 2019

[7] [7]

Faster packed homomorphic operations and efficient circuit bootstrapping for TFHE,

I. Chillotti, N. Gama, M. Georgieva, and M. Izabach `ene, “Faster packed homomorphic operations and efficient circuit bootstrapping for TFHE,” inAdvances in Cryp- tology, Cham, Switzeland, 2017, pp. 377–408. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS , VOL. X, NO. X, FEBRUARY 2026 11

work page 2017

[8] [8]

Polynomial multi- plication architecture with integrated modular reduction for R-LWE cryptosystems,

X. Zhang, Z. Huai, and K. K. Parhi, “Polynomial multi- plication architecture with integrated modular reduction for R-LWE cryptosystems,”Journ. of Sig. Process. Syst., vol. 94, no. 8, pp. 799–809, 2022

work page 2022

[9] [9]

An area- efficient, conflict-free, and configurable architecture for accelerating NTT/INTT,

S.-H. Liu, C.-Y . Kuo, Y .-N. Mo, and T. Su, “An area- efficient, conflict-free, and configurable architecture for accelerating NTT/INTT,”IEEE Trans. on V ery Large Scale Integ. (VLSI) Syst., vol. 32, no. 3, pp. 519–529, 2024

work page 2024

[10] [10]

Area- efficient number theoretic transform architecture for homomorphic encryption,

P. Duong-Ngoc, S. Kwon, D. Yoo, and H. Lee, “Area- efficient number theoretic transform architecture for homomorphic encryption,”IEEE Trans. on Circ. and Syst. I, vol. 70, no. 3, pp. 1270–1283, 2023

work page 2023

[11] [11]

Low-complexity NTT and INTT structures via twiddle shifting

S.-W. Chiu and K. K. Parhi, “Low-complexity NTT and INTT structures via twiddle shifting”, inIEEE Intl. Midwest Symp. on Circ. and Syst., 2025, pp. 444–448

work page 2025

[12] [12]

PaReNTT: Low-latency parallel residue number system and NTT-based long polynomial modular multi- plication for homomorphic encryption,

W. Tan, S. W. Chiu, A. Wang, Y . Lao, and K. K. Parhi, “PaReNTT: Low-latency parallel residue number system and NTT-based long polynomial modular multi- plication for homomorphic encryption,”IEEE Trans. on Info. F oren. and Secur ., vol. 19, pp. 1646–1659, 2024

work page 2024

[13] [13]

Efficient generalized integer division and modular reduction architectures for homomorphic encryption

S. Akherati, J. Cai, and X. Zhang, “Efficient generalized integer division and modular reduction architectures for homomorphic encryption”,Journ. of Sig. Process. Syst., vol. 97, no. 2, pp. 157–171, 2025

work page 2025

[14] [14]

FPGA-based accelerators of fully pipelined modular multipliers for homomorphic encryption

S. Kim, K. Lee, W. Cho, J. H. Cheon, and R. A. Rutenbar, “FPGA-based accelerators of fully pipelined modular multipliers for homomorphic encryption”, in Intl. Conf. on ReConFigurable Comp. and FPGAs, 2019, pp. 1–8

work page 2019

[15] [15]

A full RNS variant of approximate homomorphic encryption,

J. H. Cheon, K. Han, A. Kim, M. Kim, and Y . Song, “A full RNS variant of approximate homomorphic encryption,” inProc. of Select. Areas in Cryptog. Intl. Conf., Springer, 2019, pp. 347–368

work page 2019

[16] [16]

Pan- ther: Practical secure two-party neural network infer- ence

J. Feng, Y . Wu, H. Sun, S. Zhang, and D. Liu, “Pan- ther: Practical secure two-party neural network infer- ence”,IEEE Trans. on Inf. F orensics and Sec., vol. 20, pp. 1149–1162, 2025

work page 2025

[17] [17]

Practical secure inference algorithm for a fine-tuned large language model based on fully homomorphic encryption

R. Zhang, Z. Zheng, and W. Bao, “Practical secure inference algorithm for a fine-tuned large language model based on fully homomorphic encryption”,IEEE Trans. on Inf. F orensics and Sec., vol. 21, pp. 17–29, 2026

work page 2026

[18] [18]

Efficient bootstrapping for approxi- mate homomorphic encryption with non-sparse keys,

J. P. Bossuat, C. Mouchet, J. Troncoso Pastoriza, and J. P. Hubaux, “Efficient bootstrapping for approxi- mate homomorphic encryption with non-sparse keys,” inAdvances in Cryptology, Cham, Switzerland, 2021, pp. 587–617

work page 2021

[19] [19]

GAZELLE: A low latency framework for secure neural network inference,

C. Juvekar, V . Vaikuntanathan, and A. Chandrakasan, “GAZELLE: A low latency framework for secure neural network inference,” inProc. of the USENIX Conf. on Secur . Symp., Baltimore, MD, USA, 2018, pp. 1651– 1668

work page 2018

[20] [20]

Bootstrapping for HElib

S. Halevi and V . Shoup, “Bootstrapping for HElib”,J. Cryptol., vol. 34, no. 1, 2021

work page 2021

[21] [21]

Faster homomorphic linear transformations in HElib

S. Halevi and V . Shoup, “Faster homomorphic linear transformations in HElib”, inAdvances in Cryptol- ogy, Santa Barbara, CA, USA: Springer-Verlag, 2018, pp. 93–120

work page 2018

[22] [22]

APAS: Application-specific accelerators for RLWE-based homomorphic linear transformations

S. Bian, D. E. S. Kundi, K. Hirozawa, W. Liu, and T. Sato, “APAS: Application-specific accelerators for RLWE-based homomorphic linear transformations”, IEEE Trans. on Inf. F orensics and Sec., vol. 16, pp. 4663–4678, 2021

work page 2021

[23] [23]

FPGA accelerator for homomorphic en- crypted sparse convolutional neural network inference

Y . Yang, S. R. Kuppannagari, R. Kannan, and V . K. Prasanna, “FPGA accelerator for homomorphic en- crypted sparse convolutional neural network inference”, inIEEE Intl. Symp. on Field-Program. Custom Comp. Machines, 2022, pp. 1–9

work page 2022

[24] [24]

CHAM: A customized homomorphic encryption accelerator for fast matrix-vector product

X. Ren, Z. Chen, Z. Gu, Y . Lu, R. Zhong, W.-J. Lu, J. Zhang, Y . Zhang, H. Wu, X. Zheng, H. Liu, T. Chu, C. Hong, C. Wei, D. Niu, and Y . Xie, “CHAM: A customized homomorphic encryption accelerator for fast matrix-vector product”, inACM/IEEE Design Au- tomation Conf., 2023, pp. 1–6

work page 2023

[25] [25]

Bandwidth efficient homomorphic encrypted matrix vector multiplication accelerator on FPGA

Y . Yang, S. R. Kuppannagari, R. Kannan, and V . K. Prasanna, “Bandwidth efficient homomorphic encrypted matrix vector multiplication accelerator on FPGA”, in Intl. Conf. on Field-Program. Tech., 2022, pp. 1–9

work page 2022

[26] [26]

Z. Xu, R. Kannan, and V . K. Prasanna,F AME: FPGA acceleration of secure matrix multiplication with homomorphic encryption, 2025. arXiv: 2512 . 15515 [cs.AR]. [Online]. Available: https://arxiv.org/abs/ 2512.15515

work page arXiv 2025

[27] [27]

Efficient homomorphic conversion between (Ring) LWE cipher- texts,

H. Chen, W. Dai, M. Kim, and Y . Song, “Efficient homomorphic conversion between (Ring) LWE cipher- texts,” inProc. of Appli. Cryptog. and Network Secur ., Kamakura, Japan, 2021, pp. 460–479

work page 2021

[28] [28]

Secure and efficient general matrix multiplication on cloud using homo- morphic encryption

Y . Gao, G. Quan, S. Homsi, et al., “Secure and efficient general matrix multiplication on cloud using homo- morphic encryption”,The Journal of Supercomputing, vol. 80, pp. 26 394–26 434, 2024

work page 2024

[29] [29]

Improved ciphertext multi- plication for RNS-CKKS homomorphic encryption

S. Akherati and X. Zhang, “Improved ciphertext multi- plication for RNS-CKKS homomorphic encryption”, in IEEE Work. on Sig. Process. Syst., 2024, pp. 136–140

work page 2024

[30] [30]

Better bootstrapping for approxi- mate homomorphic encryption

K. Han and D. Ki, “Better bootstrapping for approxi- mate homomorphic encryption”, inTopics in Cryptol- ogy, San Francisco, CA, USA, 2020, pp. 364–390

work page 2020

[31] [31]

Multi-input ciphertext mul- tiplication for homomorphic encryption

S. Akherati and X. Zhang, “Multi-input ciphertext mul- tiplication for homomorphic encryption”,IEEE Trans. on Circ. and Syst. I: Regular Papers, pp. 1–12, 2026

work page 2026

[32] [32]

Algorithms in HElib

S. Halevi and V . Shoup, “Algorithms in HElib”, in Advances in Cryptology, J. A. Garay and R. Gennaro, Eds., Berlin, Heidelberg, 2014, pp. 554–571

work page 2014

[33] [33]

MAD: Memory- aware design techniques for accelerating fully homo- morphic encryption

R. Agrawal, L. d. Castro, C. Juvekar, A. Chandrakasan, V . Vaikuntanathan, and A. Joshi, “MAD: Memory- aware design techniques for accelerating fully homo- morphic encryption”, inIEEE/ACM Intl. Symp. on Mi- croarchitecture, 2023, pp. 685–697

work page 2023

[34] [34]

Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations

R. Chen and V . K. Prasanna, “Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations”, inIntl. Conf. on Field Program. Logic and App., 2015, pp. 1–8

work page 2015