Triple-Hoisted Baby-Step Giant-Step Linear Transformation over CKKS Homomorphic Encryption and Hardware Accelerator
Pith reviewed 2026-05-20 00:00 UTC · model grok-4.3
The pith
A triple-hoisted baby-step giant-step algorithm reduces the rotations and off-chip memory accesses needed for linear transformations under CKKS homomorphic encryption.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying three levels of hoisting to the baby-step giant-step decomposition of a linear transformation, the number of ciphertext rotations drops sharply; phase partitioning of the resulting schedule then minimizes off-chip DRAM accesses, and an FPGA implementation with an optimized message-permutation circuit realizes these reductions in hardware.
What carries the argument
The triple-hoisted baby-step giant-step decomposition with phase partitioning that separates rotation-heavy stages from memory-bound stages.
If this is right
- Fewer rotations directly lower the time spent on matrix-vector products inside encrypted neural-network layers.
- Reduced off-chip traffic makes hardware accelerators for homomorphic encryption more practical for repeated linear operations.
- The FPGA data path can be reused or scaled when larger decomposition depths are chosen for higher security parameters.
- The same phase-partitioning idea applies to other rotation-heavy primitives such as bootstrapping or convolution.
Where Pith is reading between the lines
- The rotation savings could compound across many layers in a deep encrypted model, amplifying the benefit beyond single-layer measurements.
- Similar hoisting and partitioning may transfer to other homomorphic schemes that also rely on rotation-based linear algebra.
- An ASIC version of the permutation circuit could further increase throughput once the FPGA prototype validates the approach.
- Integration with ciphertext-packing strategies already used in software libraries could produce additional multiplicative gains.
Load-bearing premise
The reported speedups assume that the selected decomposition parameters and phase boundaries introduce no extra overheads that would appear when the method is embedded inside a complete end-to-end encrypted inference workload.
What would settle it
An end-to-end measurement of a full encrypted neural-network inference workload that either reproduces the 2.9x memory-access and 5.8x latency gains or shows that hidden costs erase them.
Figures
read the original abstract
Computations can be directly carried out over ciphertexts using homomorphic encryption (HE), which is indispensable for privacy-preserving cloud computing. Linear transformation is widely used in neural networks, including large language models. However, the implementation of linear transformation over HE requires a large number of ciphertext rotations, which incur significant memory and hardware overhead despite existing simplification techniques. This paper proposes a triple-hoisted baby-step giant-step algorithm that decomposes the baby step further to substantially reduce the number of ciphertext rotations needed for the CKKS HE evaluation of linear transformation. Moreover, to reduce off-chip memory access, which contributes to the majority of the latency, a memory-optimized data path is proposed by partitioning the algorithm into multiple phases. Furthermore, an efficient FPGA-based hardware accelerator with an optimized permutation circuit for message routing is designed for the proposed scheme. For a set of typical parameters, the proposed design reduces the off-chip memory access by 2.9x compared to the best prior design. Synthesized for Xilinx Virtex UltraScale+ devices, the proposed design achieves a 5.8x reduction in computational latency compared with the baseline design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a triple-hoisted baby-step giant-step (BSGS) decomposition for linear transformations under CKKS homomorphic encryption, further partitioned into phases with a memory-optimized data path and implemented via an FPGA accelerator with an optimized permutation circuit. For a set of typical CKKS parameters, synthesis on Xilinx Virtex UltraScale+ reports a 2.9× reduction in off-chip memory access versus the best prior design and a 5.8× reduction in computational latency versus baseline.
Significance. If the reported reductions hold under the stated assumptions, the co-design of algorithmic hoisting and phased hardware partitioning could meaningfully lower the dominant costs of rotations and DRAM traffic in HE linear layers, supporting more efficient privacy-preserving inference. The work supplies concrete synthesis numbers and an explicit hardware architecture, which are positive for reproducibility in the hardware-HE community.
major comments (2)
- [Abstract / Evaluation] Abstract and Evaluation section: the 2.9× off-chip memory and 5.8× latency claims are derived from isolated linear-transform synthesis under 'typical parameters'; the manuscript does not quantify additional rotation or key-switching traffic that would arise when the primitive is embedded in a full inference graph containing bootstraps and inter-layer movements, leaving the practical end-to-end gain unverified.
- [Algorithm / §3] Algorithm description: the triple-hoisting decomposition is presented as reducing rotation count, yet the paper does not supply a closed-form expression or table showing the exact rotation count versus prior BSGS variants for the same matrix dimension and ring parameters, making it difficult to confirm the claimed improvement is not an artifact of parameter selection.
minor comments (2)
- Notation for the hoisting factors and phase boundaries could be made more consistent between the algorithmic pseudocode and the hardware data-path diagram.
- [Results] The baseline design used for the 5.8× latency comparison should be explicitly referenced (e.g., prior work citation and parameter set) in the results tables.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of scope and clarity that we address below. We propose targeted revisions to strengthen the manuscript while maintaining focus on the linear-transform primitive.
read point-by-point responses
-
Referee: [Abstract / Evaluation] Abstract and Evaluation section: the 2.9× off-chip memory and 5.8× latency claims are derived from isolated linear-transform synthesis under 'typical parameters'; the manuscript does not quantify additional rotation or key-switching traffic that would arise when the primitive is embedded in a full inference graph containing bootstraps and inter-layer movements, leaving the practical end-to-end gain unverified.
Authors: We agree that the reported gains are measured for the isolated linear-transformation primitive, which is the dominant cost in many CKKS workloads. A full end-to-end quantification would require selecting a specific neural-network architecture, bootstrap schedule, and inter-layer data movement pattern, none of which are standardized for this primitive. In the revised manuscript we will add a dedicated paragraph in the Evaluation section that (i) explicitly states the scope of the claims, (ii) discusses how the triple-hoisted BSGS primitive composes with bootstrapping and key-switching, and (iii) provides a back-of-the-envelope estimate of additional rotation traffic for a representative ResNet-style workload. We will not claim end-to-end numbers without new experiments. revision: partial
-
Referee: [Algorithm / §3] Algorithm description: the triple-hoisting decomposition is presented as reducing rotation count, yet the paper does not supply a closed-form expression or table showing the exact rotation count versus prior BSGS variants for the same matrix dimension and ring parameters, making it difficult to confirm the claimed improvement is not an artifact of parameter selection.
Authors: Section 3 derives the rotation count for the triple-hoisted decomposition by recursively partitioning the baby-step index set, but we did not tabulate the resulting counts against the classic and double-hoisted BSGS baselines. We will insert a new table (Table X) that lists, for representative matrix dimensions N and ring parameters (N, logQ, etc.), the exact number of rotations required by each variant. A short closed-form expression for the triple-hoisted case will also be added to the text preceding the table. revision: yes
Circularity Check
No circularity: new algorithmic decomposition and hardware synthesis rest on independent design choices
full rationale
The paper proposes a triple-hoisted BSGS decomposition for CKKS linear transforms, a phased memory-optimized datapath, and an FPGA accelerator with permutation circuit. Reported gains (2.9x off-chip access, 5.8x latency) are obtained from synthesis on Xilinx Virtex UltraScale+ for a set of typical parameters and compared against external prior designs. No equations, fitted parameters, or self-citations in the abstract or described structure reduce these results to quantities defined by the authors' own prior work or by construction from the inputs. The derivation chain consists of novel algorithmic partitioning and hardware mapping whose correctness is evaluated externally rather than tautologically.
Axiom & Free-Parameter Ledger
free parameters (1)
- hoisting decomposition factors
axioms (1)
- domain assumption CKKS rotation cost model remains accurate after triple hoisting
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
This paper proposes a triple-hoisted baby-step giant-step algorithm that decomposes the baby step further to substantially reduce the number of ciphertext rotations... partitioning the algorithm into multiple phases.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
For a set of typical parameters, the proposed design reduces the off-chip memory access by 2.9× compared to the best prior design.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
M. Moradi, M. Najafi, Z. Lin, and H. Pan, “Advancing sensor network optimization and data analytics for smart wastewater and stormwater collection systems: An overview”, inPipelines 2025, pp. 463–472
work page 2025
-
[2]
BioDeepHash: Generating consistent templates for se- cure biometric recognition
B. Song, D. Zhao, J. Yan, H. Li, and H. Jiang, “BioDeepHash: Generating consistent templates for se- cure biometric recognition”,IEEE Trans. on Depend- able and Secure Comp., pp. 1–18, 2026
work page 2026
-
[3]
V . V . L. D. Allavarpu, V . S. Naresh, and A. K. Mohan, “Privacy-preserving machine learning techniques based on homomorphic encryption for credit risk analysis”, Electronic Commerce Research, 2026
work page 2026
-
[4]
Ho- momorphic encryption for arithmetic of approximate numbers,
J. H. Cheon, A. Kim, M. Kim, and Y . Song, “Ho- momorphic encryption for arithmetic of approximate numbers,” inProc. of Intl. Conf. on the Theory and Appl. of Cryptol. and Info. Secur ., Cham, Switzerland, 2017, pp. 409–437
work page 2017
-
[5]
(Lev- eled) fully homomorphic encryption without bootstrap- ping,
Z. Brakerski, C. Gentry, and V . Vaikuntanathan, “(Lev- eled) fully homomorphic encryption without bootstrap- ping,” inProc. of Innov. in Theoret. Comp. Sci. Conf., Cambridge, Massachusetts, 2012, pp. 309–325
work page 2012
-
[6]
An improved RNS variant of the BFV homomorphic encryption scheme
S. Halevi, Y . Polyakov, and V . Shoup, “An improved RNS variant of the BFV homomorphic encryption scheme”, inTopics in Cryptology, M. Matsui, Ed., Cham: Springer International Publishing, 2019, pp. 83– 105
work page 2019
-
[7]
Faster packed homomorphic operations and efficient circuit bootstrapping for TFHE,
I. Chillotti, N. Gama, M. Georgieva, and M. Izabach `ene, “Faster packed homomorphic operations and efficient circuit bootstrapping for TFHE,” inAdvances in Cryp- tology, Cham, Switzeland, 2017, pp. 377–408. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS , VOL. X, NO. X, FEBRUARY 2026 11
work page 2017
-
[8]
Polynomial multi- plication architecture with integrated modular reduction for R-LWE cryptosystems,
X. Zhang, Z. Huai, and K. K. Parhi, “Polynomial multi- plication architecture with integrated modular reduction for R-LWE cryptosystems,”Journ. of Sig. Process. Syst., vol. 94, no. 8, pp. 799–809, 2022
work page 2022
-
[9]
An area- efficient, conflict-free, and configurable architecture for accelerating NTT/INTT,
S.-H. Liu, C.-Y . Kuo, Y .-N. Mo, and T. Su, “An area- efficient, conflict-free, and configurable architecture for accelerating NTT/INTT,”IEEE Trans. on V ery Large Scale Integ. (VLSI) Syst., vol. 32, no. 3, pp. 519–529, 2024
work page 2024
-
[10]
Area- efficient number theoretic transform architecture for homomorphic encryption,
P. Duong-Ngoc, S. Kwon, D. Yoo, and H. Lee, “Area- efficient number theoretic transform architecture for homomorphic encryption,”IEEE Trans. on Circ. and Syst. I, vol. 70, no. 3, pp. 1270–1283, 2023
work page 2023
-
[11]
Low-complexity NTT and INTT structures via twiddle shifting
S.-W. Chiu and K. K. Parhi, “Low-complexity NTT and INTT structures via twiddle shifting”, inIEEE Intl. Midwest Symp. on Circ. and Syst., 2025, pp. 444–448
work page 2025
-
[12]
W. Tan, S. W. Chiu, A. Wang, Y . Lao, and K. K. Parhi, “PaReNTT: Low-latency parallel residue number system and NTT-based long polynomial modular multi- plication for homomorphic encryption,”IEEE Trans. on Info. F oren. and Secur ., vol. 19, pp. 1646–1659, 2024
work page 2024
-
[13]
S. Akherati, J. Cai, and X. Zhang, “Efficient generalized integer division and modular reduction architectures for homomorphic encryption”,Journ. of Sig. Process. Syst., vol. 97, no. 2, pp. 157–171, 2025
work page 2025
-
[14]
FPGA-based accelerators of fully pipelined modular multipliers for homomorphic encryption
S. Kim, K. Lee, W. Cho, J. H. Cheon, and R. A. Rutenbar, “FPGA-based accelerators of fully pipelined modular multipliers for homomorphic encryption”, in Intl. Conf. on ReConFigurable Comp. and FPGAs, 2019, pp. 1–8
work page 2019
-
[15]
A full RNS variant of approximate homomorphic encryption,
J. H. Cheon, K. Han, A. Kim, M. Kim, and Y . Song, “A full RNS variant of approximate homomorphic encryption,” inProc. of Select. Areas in Cryptog. Intl. Conf., Springer, 2019, pp. 347–368
work page 2019
-
[16]
Pan- ther: Practical secure two-party neural network infer- ence
J. Feng, Y . Wu, H. Sun, S. Zhang, and D. Liu, “Pan- ther: Practical secure two-party neural network infer- ence”,IEEE Trans. on Inf. F orensics and Sec., vol. 20, pp. 1149–1162, 2025
work page 2025
-
[17]
R. Zhang, Z. Zheng, and W. Bao, “Practical secure inference algorithm for a fine-tuned large language model based on fully homomorphic encryption”,IEEE Trans. on Inf. F orensics and Sec., vol. 21, pp. 17–29, 2026
work page 2026
-
[18]
Efficient bootstrapping for approxi- mate homomorphic encryption with non-sparse keys,
J. P. Bossuat, C. Mouchet, J. Troncoso Pastoriza, and J. P. Hubaux, “Efficient bootstrapping for approxi- mate homomorphic encryption with non-sparse keys,” inAdvances in Cryptology, Cham, Switzerland, 2021, pp. 587–617
work page 2021
-
[19]
GAZELLE: A low latency framework for secure neural network inference,
C. Juvekar, V . Vaikuntanathan, and A. Chandrakasan, “GAZELLE: A low latency framework for secure neural network inference,” inProc. of the USENIX Conf. on Secur . Symp., Baltimore, MD, USA, 2018, pp. 1651– 1668
work page 2018
-
[20]
S. Halevi and V . Shoup, “Bootstrapping for HElib”,J. Cryptol., vol. 34, no. 1, 2021
work page 2021
-
[21]
Faster homomorphic linear transformations in HElib
S. Halevi and V . Shoup, “Faster homomorphic linear transformations in HElib”, inAdvances in Cryptol- ogy, Santa Barbara, CA, USA: Springer-Verlag, 2018, pp. 93–120
work page 2018
-
[22]
APAS: Application-specific accelerators for RLWE-based homomorphic linear transformations
S. Bian, D. E. S. Kundi, K. Hirozawa, W. Liu, and T. Sato, “APAS: Application-specific accelerators for RLWE-based homomorphic linear transformations”, IEEE Trans. on Inf. F orensics and Sec., vol. 16, pp. 4663–4678, 2021
work page 2021
-
[23]
FPGA accelerator for homomorphic en- crypted sparse convolutional neural network inference
Y . Yang, S. R. Kuppannagari, R. Kannan, and V . K. Prasanna, “FPGA accelerator for homomorphic en- crypted sparse convolutional neural network inference”, inIEEE Intl. Symp. on Field-Program. Custom Comp. Machines, 2022, pp. 1–9
work page 2022
-
[24]
CHAM: A customized homomorphic encryption accelerator for fast matrix-vector product
X. Ren, Z. Chen, Z. Gu, Y . Lu, R. Zhong, W.-J. Lu, J. Zhang, Y . Zhang, H. Wu, X. Zheng, H. Liu, T. Chu, C. Hong, C. Wei, D. Niu, and Y . Xie, “CHAM: A customized homomorphic encryption accelerator for fast matrix-vector product”, inACM/IEEE Design Au- tomation Conf., 2023, pp. 1–6
work page 2023
-
[25]
Bandwidth efficient homomorphic encrypted matrix vector multiplication accelerator on FPGA
Y . Yang, S. R. Kuppannagari, R. Kannan, and V . K. Prasanna, “Bandwidth efficient homomorphic encrypted matrix vector multiplication accelerator on FPGA”, in Intl. Conf. on Field-Program. Tech., 2022, pp. 1–9
work page 2022
- [26]
-
[27]
Efficient homomorphic conversion between (Ring) LWE cipher- texts,
H. Chen, W. Dai, M. Kim, and Y . Song, “Efficient homomorphic conversion between (Ring) LWE cipher- texts,” inProc. of Appli. Cryptog. and Network Secur ., Kamakura, Japan, 2021, pp. 460–479
work page 2021
-
[28]
Secure and efficient general matrix multiplication on cloud using homo- morphic encryption
Y . Gao, G. Quan, S. Homsi, et al., “Secure and efficient general matrix multiplication on cloud using homo- morphic encryption”,The Journal of Supercomputing, vol. 80, pp. 26 394–26 434, 2024
work page 2024
-
[29]
Improved ciphertext multi- plication for RNS-CKKS homomorphic encryption
S. Akherati and X. Zhang, “Improved ciphertext multi- plication for RNS-CKKS homomorphic encryption”, in IEEE Work. on Sig. Process. Syst., 2024, pp. 136–140
work page 2024
-
[30]
Better bootstrapping for approxi- mate homomorphic encryption
K. Han and D. Ki, “Better bootstrapping for approxi- mate homomorphic encryption”, inTopics in Cryptol- ogy, San Francisco, CA, USA, 2020, pp. 364–390
work page 2020
-
[31]
Multi-input ciphertext mul- tiplication for homomorphic encryption
S. Akherati and X. Zhang, “Multi-input ciphertext mul- tiplication for homomorphic encryption”,IEEE Trans. on Circ. and Syst. I: Regular Papers, pp. 1–12, 2026
work page 2026
-
[32]
S. Halevi and V . Shoup, “Algorithms in HElib”, in Advances in Cryptology, J. A. Garay and R. Gennaro, Eds., Berlin, Heidelberg, 2014, pp. 554–571
work page 2014
-
[33]
MAD: Memory- aware design techniques for accelerating fully homo- morphic encryption
R. Agrawal, L. d. Castro, C. Juvekar, A. Chandrakasan, V . Vaikuntanathan, and A. Joshi, “MAD: Memory- aware design techniques for accelerating fully homo- morphic encryption”, inIEEE/ACM Intl. Symp. on Mi- croarchitecture, 2023, pp. 685–697
work page 2023
-
[34]
R. Chen and V . K. Prasanna, “Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations”, inIntl. Conf. on Field Program. Logic and App., 2015, pp. 1–8
work page 2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.