pith. sign in

arxiv: 2605.17222 · v1 · pith:CNIZZ6BBnew · submitted 2026-05-17 · 💻 cs.CR

Triple-Hoisted Baby-Step Giant-Step Linear Transformation over CKKS Homomorphic Encryption and Hardware Accelerator

Pith reviewed 2026-05-20 00:00 UTC · model grok-4.3

classification 💻 cs.CR
keywords homomorphic encryptionCKKSlinear transformationbaby-step giant-stepciphertext rotationhardware acceleratorFPGAmemory optimization
0
0 comments X p. Extension
pith:CNIZZ6BB Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{CNIZZ6BB}

Prints a linked pith:CNIZZ6BB badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

A triple-hoisted baby-step giant-step algorithm reduces the rotations and off-chip memory accesses needed for linear transformations under CKKS homomorphic encryption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a further decomposition of the baby steps inside the standard baby-step giant-step method so that fewer ciphertext rotations are required to evaluate a linear transformation on encrypted data. It then partitions the computation into phases that keep most data on-chip and thereby cut the dominant source of hardware latency. An FPGA accelerator with a specialized permutation network is built to route the remaining data efficiently under this scheme. For typical parameter sets the design lowers off-chip memory traffic 2.9 times relative to the best previous implementation and cuts total computational latency 5.8 times relative to an unoptimized baseline.

Core claim

By applying three levels of hoisting to the baby-step giant-step decomposition of a linear transformation, the number of ciphertext rotations drops sharply; phase partitioning of the resulting schedule then minimizes off-chip DRAM accesses, and an FPGA implementation with an optimized message-permutation circuit realizes these reductions in hardware.

What carries the argument

The triple-hoisted baby-step giant-step decomposition with phase partitioning that separates rotation-heavy stages from memory-bound stages.

If this is right

  • Fewer rotations directly lower the time spent on matrix-vector products inside encrypted neural-network layers.
  • Reduced off-chip traffic makes hardware accelerators for homomorphic encryption more practical for repeated linear operations.
  • The FPGA data path can be reused or scaled when larger decomposition depths are chosen for higher security parameters.
  • The same phase-partitioning idea applies to other rotation-heavy primitives such as bootstrapping or convolution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rotation savings could compound across many layers in a deep encrypted model, amplifying the benefit beyond single-layer measurements.
  • Similar hoisting and partitioning may transfer to other homomorphic schemes that also rely on rotation-based linear algebra.
  • An ASIC version of the permutation circuit could further increase throughput once the FPGA prototype validates the approach.
  • Integration with ciphertext-packing strategies already used in software libraries could produce additional multiplicative gains.

Load-bearing premise

The reported speedups assume that the selected decomposition parameters and phase boundaries introduce no extra overheads that would appear when the method is embedded inside a complete end-to-end encrypted inference workload.

What would settle it

An end-to-end measurement of a full encrypted neural-network inference workload that either reproduces the 2.9x memory-access and 5.8x latency gains or shows that hidden costs erase them.

Figures

Figures reproduced from arXiv: 2605.17222 by Sajjad Akherati, Xinmiao Zhang.

Figure 1
Figure 1. Figure 1: The block diagram for the key switching operation in the RNS [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Number of modular multiplications and switching key size of [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Block diagrams of the proposed memory-optimized data path of (a) Lines 1–5 (Phase 1); (b) Lines 6–7 (Phase 2); (c) Lines 8–11 (Phase 3); (d) [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparisons of off-chip memory access (left axis) and number of modular multiplications (right axis) for the diagonal method [ [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Top-level architecture of the proposed accelerator for HE-LT. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Architecture for the PE. capacity for the polynomials, and consequently the number of coefficients in each polynomial that can be processed simultaneously, denoted as dp, is determined. The product of dp and the number of polynomials processed concurrently determines nPE. By analyzing the data flow across the six phases of Algorithm 2, the required number of TAs for the summations in Decompose, ModDown, th… view at source ↗
read the original abstract

Computations can be directly carried out over ciphertexts using homomorphic encryption (HE), which is indispensable for privacy-preserving cloud computing. Linear transformation is widely used in neural networks, including large language models. However, the implementation of linear transformation over HE requires a large number of ciphertext rotations, which incur significant memory and hardware overhead despite existing simplification techniques. This paper proposes a triple-hoisted baby-step giant-step algorithm that decomposes the baby step further to substantially reduce the number of ciphertext rotations needed for the CKKS HE evaluation of linear transformation. Moreover, to reduce off-chip memory access, which contributes to the majority of the latency, a memory-optimized data path is proposed by partitioning the algorithm into multiple phases. Furthermore, an efficient FPGA-based hardware accelerator with an optimized permutation circuit for message routing is designed for the proposed scheme. For a set of typical parameters, the proposed design reduces the off-chip memory access by 2.9x compared to the best prior design. Synthesized for Xilinx Virtex UltraScale+ devices, the proposed design achieves a 5.8x reduction in computational latency compared with the baseline design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a triple-hoisted baby-step giant-step (BSGS) decomposition for linear transformations under CKKS homomorphic encryption, further partitioned into phases with a memory-optimized data path and implemented via an FPGA accelerator with an optimized permutation circuit. For a set of typical CKKS parameters, synthesis on Xilinx Virtex UltraScale+ reports a 2.9× reduction in off-chip memory access versus the best prior design and a 5.8× reduction in computational latency versus baseline.

Significance. If the reported reductions hold under the stated assumptions, the co-design of algorithmic hoisting and phased hardware partitioning could meaningfully lower the dominant costs of rotations and DRAM traffic in HE linear layers, supporting more efficient privacy-preserving inference. The work supplies concrete synthesis numbers and an explicit hardware architecture, which are positive for reproducibility in the hardware-HE community.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the 2.9× off-chip memory and 5.8× latency claims are derived from isolated linear-transform synthesis under 'typical parameters'; the manuscript does not quantify additional rotation or key-switching traffic that would arise when the primitive is embedded in a full inference graph containing bootstraps and inter-layer movements, leaving the practical end-to-end gain unverified.
  2. [Algorithm / §3] Algorithm description: the triple-hoisting decomposition is presented as reducing rotation count, yet the paper does not supply a closed-form expression or table showing the exact rotation count versus prior BSGS variants for the same matrix dimension and ring parameters, making it difficult to confirm the claimed improvement is not an artifact of parameter selection.
minor comments (2)
  1. Notation for the hoisting factors and phase boundaries could be made more consistent between the algorithmic pseudocode and the hardware data-path diagram.
  2. [Results] The baseline design used for the 5.8× latency comparison should be explicitly referenced (e.g., prior work citation and parameter set) in the results tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of scope and clarity that we address below. We propose targeted revisions to strengthen the manuscript while maintaining focus on the linear-transform primitive.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the 2.9× off-chip memory and 5.8× latency claims are derived from isolated linear-transform synthesis under 'typical parameters'; the manuscript does not quantify additional rotation or key-switching traffic that would arise when the primitive is embedded in a full inference graph containing bootstraps and inter-layer movements, leaving the practical end-to-end gain unverified.

    Authors: We agree that the reported gains are measured for the isolated linear-transformation primitive, which is the dominant cost in many CKKS workloads. A full end-to-end quantification would require selecting a specific neural-network architecture, bootstrap schedule, and inter-layer data movement pattern, none of which are standardized for this primitive. In the revised manuscript we will add a dedicated paragraph in the Evaluation section that (i) explicitly states the scope of the claims, (ii) discusses how the triple-hoisted BSGS primitive composes with bootstrapping and key-switching, and (iii) provides a back-of-the-envelope estimate of additional rotation traffic for a representative ResNet-style workload. We will not claim end-to-end numbers without new experiments. revision: partial

  2. Referee: [Algorithm / §3] Algorithm description: the triple-hoisting decomposition is presented as reducing rotation count, yet the paper does not supply a closed-form expression or table showing the exact rotation count versus prior BSGS variants for the same matrix dimension and ring parameters, making it difficult to confirm the claimed improvement is not an artifact of parameter selection.

    Authors: Section 3 derives the rotation count for the triple-hoisted decomposition by recursively partitioning the baby-step index set, but we did not tabulate the resulting counts against the classic and double-hoisted BSGS baselines. We will insert a new table (Table X) that lists, for representative matrix dimensions N and ring parameters (N, logQ, etc.), the exact number of rotations required by each variant. A short closed-form expression for the triple-hoisted case will also be added to the text preceding the table. revision: yes

Circularity Check

0 steps flagged

No circularity: new algorithmic decomposition and hardware synthesis rest on independent design choices

full rationale

The paper proposes a triple-hoisted BSGS decomposition for CKKS linear transforms, a phased memory-optimized datapath, and an FPGA accelerator with permutation circuit. Reported gains (2.9x off-chip access, 5.8x latency) are obtained from synthesis on Xilinx Virtex UltraScale+ for a set of typical parameters and compared against external prior designs. No equations, fitted parameters, or self-citations in the abstract or described structure reduce these results to quantities defined by the authors' own prior work or by construction from the inputs. The derivation chain consists of novel algorithmic partitioning and hardware mapping whose correctness is evaluated externally rather than tautologically.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The performance numbers rest on the existence of a valid CKKS parameter set and on the assumption that rotation costs dominate; no new mathematical axioms or invented entities are introduced beyond standard CKKS ring operations.

free parameters (1)
  • hoisting decomposition factors
    The triple-hoist step sizes are chosen to minimize rotations for the target matrix dimensions; these are free parameters tuned for the reported 'typical parameters'.
axioms (1)
  • domain assumption CKKS rotation cost model remains accurate after triple hoisting
    The abstract implicitly assumes that further decomposition does not introduce new noise or correctness issues that would invalidate the rotation savings.

pith-pipeline@v0.9.0 · 5732 in / 1468 out tokens · 43195 ms · 2026-05-20T00:00:06.000135+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

  1. [1]

    Advancing sensor network optimization and data analytics for smart wastewater and stormwater collection systems: An overview

    M. Moradi, M. Najafi, Z. Lin, and H. Pan, “Advancing sensor network optimization and data analytics for smart wastewater and stormwater collection systems: An overview”, inPipelines 2025, pp. 463–472

  2. [2]

    BioDeepHash: Generating consistent templates for se- cure biometric recognition

    B. Song, D. Zhao, J. Yan, H. Li, and H. Jiang, “BioDeepHash: Generating consistent templates for se- cure biometric recognition”,IEEE Trans. on Depend- able and Secure Comp., pp. 1–18, 2026

  3. [3]

    Privacy-preserving machine learning techniques based on homomorphic encryption for credit risk analysis

    V . V . L. D. Allavarpu, V . S. Naresh, and A. K. Mohan, “Privacy-preserving machine learning techniques based on homomorphic encryption for credit risk analysis”, Electronic Commerce Research, 2026

  4. [4]

    Ho- momorphic encryption for arithmetic of approximate numbers,

    J. H. Cheon, A. Kim, M. Kim, and Y . Song, “Ho- momorphic encryption for arithmetic of approximate numbers,” inProc. of Intl. Conf. on the Theory and Appl. of Cryptol. and Info. Secur ., Cham, Switzerland, 2017, pp. 409–437

  5. [5]

    (Lev- eled) fully homomorphic encryption without bootstrap- ping,

    Z. Brakerski, C. Gentry, and V . Vaikuntanathan, “(Lev- eled) fully homomorphic encryption without bootstrap- ping,” inProc. of Innov. in Theoret. Comp. Sci. Conf., Cambridge, Massachusetts, 2012, pp. 309–325

  6. [6]

    An improved RNS variant of the BFV homomorphic encryption scheme

    S. Halevi, Y . Polyakov, and V . Shoup, “An improved RNS variant of the BFV homomorphic encryption scheme”, inTopics in Cryptology, M. Matsui, Ed., Cham: Springer International Publishing, 2019, pp. 83– 105

  7. [7]

    Faster packed homomorphic operations and efficient circuit bootstrapping for TFHE,

    I. Chillotti, N. Gama, M. Georgieva, and M. Izabach `ene, “Faster packed homomorphic operations and efficient circuit bootstrapping for TFHE,” inAdvances in Cryp- tology, Cham, Switzeland, 2017, pp. 377–408. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS , VOL. X, NO. X, FEBRUARY 2026 11

  8. [8]

    Polynomial multi- plication architecture with integrated modular reduction for R-LWE cryptosystems,

    X. Zhang, Z. Huai, and K. K. Parhi, “Polynomial multi- plication architecture with integrated modular reduction for R-LWE cryptosystems,”Journ. of Sig. Process. Syst., vol. 94, no. 8, pp. 799–809, 2022

  9. [9]

    An area- efficient, conflict-free, and configurable architecture for accelerating NTT/INTT,

    S.-H. Liu, C.-Y . Kuo, Y .-N. Mo, and T. Su, “An area- efficient, conflict-free, and configurable architecture for accelerating NTT/INTT,”IEEE Trans. on V ery Large Scale Integ. (VLSI) Syst., vol. 32, no. 3, pp. 519–529, 2024

  10. [10]

    Area- efficient number theoretic transform architecture for homomorphic encryption,

    P. Duong-Ngoc, S. Kwon, D. Yoo, and H. Lee, “Area- efficient number theoretic transform architecture for homomorphic encryption,”IEEE Trans. on Circ. and Syst. I, vol. 70, no. 3, pp. 1270–1283, 2023

  11. [11]

    Low-complexity NTT and INTT structures via twiddle shifting

    S.-W. Chiu and K. K. Parhi, “Low-complexity NTT and INTT structures via twiddle shifting”, inIEEE Intl. Midwest Symp. on Circ. and Syst., 2025, pp. 444–448

  12. [12]

    PaReNTT: Low-latency parallel residue number system and NTT-based long polynomial modular multi- plication for homomorphic encryption,

    W. Tan, S. W. Chiu, A. Wang, Y . Lao, and K. K. Parhi, “PaReNTT: Low-latency parallel residue number system and NTT-based long polynomial modular multi- plication for homomorphic encryption,”IEEE Trans. on Info. F oren. and Secur ., vol. 19, pp. 1646–1659, 2024

  13. [13]

    Efficient generalized integer division and modular reduction architectures for homomorphic encryption

    S. Akherati, J. Cai, and X. Zhang, “Efficient generalized integer division and modular reduction architectures for homomorphic encryption”,Journ. of Sig. Process. Syst., vol. 97, no. 2, pp. 157–171, 2025

  14. [14]

    FPGA-based accelerators of fully pipelined modular multipliers for homomorphic encryption

    S. Kim, K. Lee, W. Cho, J. H. Cheon, and R. A. Rutenbar, “FPGA-based accelerators of fully pipelined modular multipliers for homomorphic encryption”, in Intl. Conf. on ReConFigurable Comp. and FPGAs, 2019, pp. 1–8

  15. [15]

    A full RNS variant of approximate homomorphic encryption,

    J. H. Cheon, K. Han, A. Kim, M. Kim, and Y . Song, “A full RNS variant of approximate homomorphic encryption,” inProc. of Select. Areas in Cryptog. Intl. Conf., Springer, 2019, pp. 347–368

  16. [16]

    Pan- ther: Practical secure two-party neural network infer- ence

    J. Feng, Y . Wu, H. Sun, S. Zhang, and D. Liu, “Pan- ther: Practical secure two-party neural network infer- ence”,IEEE Trans. on Inf. F orensics and Sec., vol. 20, pp. 1149–1162, 2025

  17. [17]

    Practical secure inference algorithm for a fine-tuned large language model based on fully homomorphic encryption

    R. Zhang, Z. Zheng, and W. Bao, “Practical secure inference algorithm for a fine-tuned large language model based on fully homomorphic encryption”,IEEE Trans. on Inf. F orensics and Sec., vol. 21, pp. 17–29, 2026

  18. [18]

    Efficient bootstrapping for approxi- mate homomorphic encryption with non-sparse keys,

    J. P. Bossuat, C. Mouchet, J. Troncoso Pastoriza, and J. P. Hubaux, “Efficient bootstrapping for approxi- mate homomorphic encryption with non-sparse keys,” inAdvances in Cryptology, Cham, Switzerland, 2021, pp. 587–617

  19. [19]

    GAZELLE: A low latency framework for secure neural network inference,

    C. Juvekar, V . Vaikuntanathan, and A. Chandrakasan, “GAZELLE: A low latency framework for secure neural network inference,” inProc. of the USENIX Conf. on Secur . Symp., Baltimore, MD, USA, 2018, pp. 1651– 1668

  20. [20]

    Bootstrapping for HElib

    S. Halevi and V . Shoup, “Bootstrapping for HElib”,J. Cryptol., vol. 34, no. 1, 2021

  21. [21]

    Faster homomorphic linear transformations in HElib

    S. Halevi and V . Shoup, “Faster homomorphic linear transformations in HElib”, inAdvances in Cryptol- ogy, Santa Barbara, CA, USA: Springer-Verlag, 2018, pp. 93–120

  22. [22]

    APAS: Application-specific accelerators for RLWE-based homomorphic linear transformations

    S. Bian, D. E. S. Kundi, K. Hirozawa, W. Liu, and T. Sato, “APAS: Application-specific accelerators for RLWE-based homomorphic linear transformations”, IEEE Trans. on Inf. F orensics and Sec., vol. 16, pp. 4663–4678, 2021

  23. [23]

    FPGA accelerator for homomorphic en- crypted sparse convolutional neural network inference

    Y . Yang, S. R. Kuppannagari, R. Kannan, and V . K. Prasanna, “FPGA accelerator for homomorphic en- crypted sparse convolutional neural network inference”, inIEEE Intl. Symp. on Field-Program. Custom Comp. Machines, 2022, pp. 1–9

  24. [24]

    CHAM: A customized homomorphic encryption accelerator for fast matrix-vector product

    X. Ren, Z. Chen, Z. Gu, Y . Lu, R. Zhong, W.-J. Lu, J. Zhang, Y . Zhang, H. Wu, X. Zheng, H. Liu, T. Chu, C. Hong, C. Wei, D. Niu, and Y . Xie, “CHAM: A customized homomorphic encryption accelerator for fast matrix-vector product”, inACM/IEEE Design Au- tomation Conf., 2023, pp. 1–6

  25. [25]

    Bandwidth efficient homomorphic encrypted matrix vector multiplication accelerator on FPGA

    Y . Yang, S. R. Kuppannagari, R. Kannan, and V . K. Prasanna, “Bandwidth efficient homomorphic encrypted matrix vector multiplication accelerator on FPGA”, in Intl. Conf. on Field-Program. Tech., 2022, pp. 1–9

  26. [26]

    Z. Xu, R. Kannan, and V . K. Prasanna,F AME: FPGA acceleration of secure matrix multiplication with homomorphic encryption, 2025. arXiv: 2512 . 15515 [cs.AR]. [Online]. Available: https://arxiv.org/abs/ 2512.15515

  27. [27]

    Efficient homomorphic conversion between (Ring) LWE cipher- texts,

    H. Chen, W. Dai, M. Kim, and Y . Song, “Efficient homomorphic conversion between (Ring) LWE cipher- texts,” inProc. of Appli. Cryptog. and Network Secur ., Kamakura, Japan, 2021, pp. 460–479

  28. [28]

    Secure and efficient general matrix multiplication on cloud using homo- morphic encryption

    Y . Gao, G. Quan, S. Homsi, et al., “Secure and efficient general matrix multiplication on cloud using homo- morphic encryption”,The Journal of Supercomputing, vol. 80, pp. 26 394–26 434, 2024

  29. [29]

    Improved ciphertext multi- plication for RNS-CKKS homomorphic encryption

    S. Akherati and X. Zhang, “Improved ciphertext multi- plication for RNS-CKKS homomorphic encryption”, in IEEE Work. on Sig. Process. Syst., 2024, pp. 136–140

  30. [30]

    Better bootstrapping for approxi- mate homomorphic encryption

    K. Han and D. Ki, “Better bootstrapping for approxi- mate homomorphic encryption”, inTopics in Cryptol- ogy, San Francisco, CA, USA, 2020, pp. 364–390

  31. [31]

    Multi-input ciphertext mul- tiplication for homomorphic encryption

    S. Akherati and X. Zhang, “Multi-input ciphertext mul- tiplication for homomorphic encryption”,IEEE Trans. on Circ. and Syst. I: Regular Papers, pp. 1–12, 2026

  32. [32]

    Algorithms in HElib

    S. Halevi and V . Shoup, “Algorithms in HElib”, in Advances in Cryptology, J. A. Garay and R. Gennaro, Eds., Berlin, Heidelberg, 2014, pp. 554–571

  33. [33]

    MAD: Memory- aware design techniques for accelerating fully homo- morphic encryption

    R. Agrawal, L. d. Castro, C. Juvekar, A. Chandrakasan, V . Vaikuntanathan, and A. Joshi, “MAD: Memory- aware design techniques for accelerating fully homo- morphic encryption”, inIEEE/ACM Intl. Symp. on Mi- croarchitecture, 2023, pp. 685–697

  34. [34]

    Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations

    R. Chen and V . K. Prasanna, “Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations”, inIntl. Conf. on Field Program. Logic and App., 2015, pp. 1–8