
arxiv: 2604.00546 · v3 · pith:Y7ZMGWJXnew · submitted 2026-04-01 · 💻 cs.CR

Lightweight, Practical Encrypted Face Recognition with GPU Support

Pith reviewed 2026-05-13 22:58 UTC · model grok-4.3

classification 💻 cs.CR
keywords encrypted face recognition · fully homomorphic encryption · CKKS scheme · rotation keys · GPU acceleration · privacy-preserving biometrics · similarity search

The pith

BSGS-Diagonal reorders rotations to cut rotation keys by 91 percent while preserving correctness in encrypted face recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reordering the baby-step giant-step diagonal rotations in the existing HyDia protocol shrinks the rotation-key set needed for homomorphic similarity search by 91 percent. This change cuts client memory by roughly 14 GB and drops overall CPU peak RAM from over 33 GB to under 11 GB for databases of up to one million entries. The same reordering also improves server runtime by up to 1.57 times for membership verification and up to 1.43 times for identification. The authors further fuse CKKS operations into integrated GPU kernels that avoid repeated CPU-GPU transfers, delivering speedups of up to 9 times on the original protocol and up to 21 times on the new variant. The combined result is sub-second encrypted face recognition for databases of up to 32 thousand entries.

Core claim

BSGS-Diagonal reorders the sequence of rotations inside the CKKS matrix-multiplication routine so that far fewer distinct rotation keys are required. The reduction reaches 91 percent, directly lowering both client and server memory footprints while leaving similarity scores unchanged. Integrated GPU kernels built on FIDESlib then fuse the remaining operations to eliminate costly data-structure conversions, producing up to 21 times faster end-to-end encrypted similarity computation.

What carries the argument

BSGS-Diagonal, a reordering of baby-step giant-step diagonal rotations that reduces the distinct rotation keys needed for CKKS-based inner-product calculations while keeping the output identical to the original computation.
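
For orientation, the textbook baby-step giant-step diagonal method that this line of work builds on can be simulated on plaintext vectors. The sketch below is not the paper's BSGS-Diagonal reordering and not HyDia's code. It uses NumPy arrays in place of CKKS ciphertexts, and the helper names (`rot`, `diagonals`, `matvec_naive`, `matvec_bsgs`) are hypothetical. It only illustrates why grouping rotations into baby and giant steps lowers the number of distinct rotation amounts, each of which would need its own rotation key in CKKS.

```python
import numpy as np


def rot(v, k):
    """Cyclic left rotation: rot(v, k)[i] == v[(i + k) % n]."""
    return np.roll(v, -k)


def diagonals(M):
    """Generalized diagonals d_k[i] = M[i, (i + k) % n] of a square matrix."""
    n = M.shape[0]
    idx = np.arange(n)
    return [M[idx, (idx + k) % n] for k in range(n)]


def matvec_naive(M, x):
    """Diagonal method: M @ x = sum_k d_k * rot(x, k).

    Returns the product and the set of nonzero rotation amounts applied
    to x (the stand-in for the ciphertext).
    """
    n = len(x)
    diags = diagonals(M)
    used = set()
    y = np.zeros(n)
    for k in range(n):
        if k:
            used.add(k)
        y += diags[k] * rot(x, k)
    return y, used


def matvec_bsgs(M, x, baby):
    """Textbook BSGS variant with n = baby * giant.

    Diagonals are pre-rotated in the clear (no key needed); only the
    baby-step rotations of x and the giant-step rotations of the partial
    sums count as ciphertext rotations.
    """
    n = len(x)
    assert n % baby == 0, "this sketch assumes baby divides n"
    giant = n // baby
    diags = diagonals(M)
    used = set()

    # Baby steps: rotate x once per offset and reuse across giant steps.
    baby_rots = []
    for i in range(baby):
        if i:
            used.add(i)
        baby_rots.append(rot(x, i))

    y = np.zeros(n)
    for j in range(giant):
        shift = j * baby
        inner = np.zeros(n)
        for i in range(baby):
            inner += rot(diags[shift + i], -shift) * baby_rots[i]
        # Giant step: one rotation of the accumulated partial sum.
        if shift:
            used.add(shift)
        y += rot(inner, shift)
    return y, used


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((16, 16))
    x = rng.standard_normal(16)
    y_naive, keys_naive = matvec_naive(M, x)
    y_bsgs, keys_bsgs = matvec_bsgs(M, x, baby=4)
    print(np.allclose(y_naive, M @ x), np.allclose(y_bsgs, M @ x))
    print(len(keys_naive), sorted(keys_bsgs))
```

With 16 slots and 4 baby steps, the naive variant applies 15 distinct rotation amounts while the BSGS variant applies 6 (baby steps 1 to 3 and giant steps 4, 8, 12). How the paper's reordering shrinks the key set further is described in the paper itself and is not reproduced here.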

If this is right

  • Client memory drops by about 14 GB because far fewer rotation keys must be stored.
  • Overall CPU peak RAM stays under 11 GB, down from over 33 GB in the original HyDia, for databases of up to one million entries.
  • Membership verification runs up to 1.57 times faster than the prior protocol.
  • Identification queries improve by up to 1.43 times.
  • GPU kernels bring sub-second latency for encrypted search on databases of up to 32 thousand entries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The memory savings could allow encrypted face matching on mobile or edge devices that currently lack enough RAM.
  • The same rotation-reordering trick may apply to other CKKS workloads that rely on repeated diagonal multiplications.
  • Operation fusion on GPU suggests similar speed gains are available for other homomorphic linear-algebra tasks once kernels are written at the same level of integration.

Load-bearing premise

Reordering the rotations does not change the exact numerical result of the similarity scores or weaken the semantic security of the underlying CKKS encryption.

What would settle it

A side-by-side run on the same face embeddings that shows the cosine similarities produced by BSGS-Diagonal differ from those of the original HyDia implementation by more than floating-point rounding error.
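
A minimal sketch of such a side-by-side check, assuming the decrypted score vectors from both implementations are already available as plain arrays. The function names and the tolerance parameter `tol` are hypothetical. Because CKKS is an approximate scheme, a suitable tolerance depends on the chosen encryption parameters and is left to the caller.

```python
import numpy as np


def max_abs_error(scores_a, scores_b):
    """Largest element-wise absolute difference between two score vectors."""
    a = np.asarray(scores_a, dtype=np.float64)
    b = np.asarray(scores_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("score vectors must have the same shape")
    return float(np.max(np.abs(a - b)))


def agree_within(scores_a, scores_b, tol):
    """True if the two score vectors differ by at most tol everywhere."""
    return max_abs_error(scores_a, scores_b) <= tol
```

The same helper can compare either implementation against a plaintext baseline computed on the unencrypted embeddings.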

Figures

Figures reproduced from arXiv: 2604.00546 by Bahattin Yildiz, Eduardo L. Cominetti, Gabrielle De Micheli, Geovandro Pereira, Jina Choi, Marcos A. Simplicio Jr, Syed Mahbub Hafiz, Thales B. Paiva.

Figure 1: Pipeline showing the diagonal packing from HyDia generalized to multiple groups. When …
Figure 2: Illustration of the HyDia protocol (and ours).
Figure 4: Comparing the original HyDia and our BSGS-Diagonal op …
Figure 5: RTX 8000 vs. H200 comparison for membership time
Original abstract

Face recognition models operate in a client-server setting where a client extracts a compact face embedding and a server performs similarity search over a template database. This raises privacy concerns, as facial data is highly sensitive. To provide cryptographic privacy guarantees, one can use fully homomorphic encryption to perform end-to-end encrypted similarity search. However, existing FHE-based protocols are computationally costly and impose high memory overhead. Building on prior work, HyDia (PoPETS 2025), we introduce algorithmic and system-level improvements targeting real-world deployment with resource-constrained clients. First, we propose BSGS-Diagonal, an algorithm delivering fast and memory-efficient similarity computation. BSGS-Diagonal substantially shrinks the rotation-key set, lowering both client and server memory requirements, and also improves practical server runtime. This yields a 91% reduction in the number of rotation keys, translating to approximately 14 GB less memory used on the client, and reducing overall CPU peak RAM from over 33 GB in the original HyDia to under 11 GB for databases up to size 1M. In addition, runtime is improved by up to 1.57x for the membership verification scenario and 1.43x for the identification scenario. Secondly, we introduce fully GPU-optimized similarity matrix computation kernels. The implementation is built upon FIDESlib, a CKKS-level GPU library based on OpenFHE. Rather than offloading individual CKKS primitives in isolation, the integrated kernels fuse operations to avoid repeated CPU-GPU ciphertext movement and costly FIDESlib/OpenFHE data-structure conversions. As a result, our GPU implementations of both HyDia and BSGS-Diagonal achieve up to 9x and 21x speedups, respectively, enabling sub-second encrypted face recognition for databases up to 32K entries while further reducing host memory usage.
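
As a plaintext point of reference for the similarity search the abstract describes, the sketch below computes cosine scores between a query embedding and a template database. It involves no encryption. The function names are hypothetical, and the threshold-based readings of "membership" and "identification" are assumptions made for illustration; the precise scenario definitions are those of HyDia and the paper.

```python
import numpy as np


def normalize(v):
    """Scale vectors (rows, or a single vector) to unit Euclidean norm."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def cosine_scores(db, query):
    """Cosine similarity of the query against every database row."""
    return normalize(db) @ normalize(query)


def is_member(scores, threshold):
    """Assumed membership reading: does any template reach the threshold?"""
    return bool(np.any(scores >= threshold))


def identify(scores, threshold):
    """Assumed identification reading: indices of templates at or above the threshold."""
    return [int(i) for i in np.flatnonzero(scores >= threshold)]
```

In the encrypted protocol the server would evaluate the matrix-vector product homomorphically on a ciphertext of the query; this baseline only fixes what the decrypted scores should approximate.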

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper builds on HyDia (PoPETS 2025) to present BSGS-Diagonal, a reordering of baby-step/giant-step rotations for CKKS-based encrypted similarity search in face recognition. It reports a 91% reduction in rotation keys (saving ~14 GB client memory and dropping overall CPU peak RAM from >33 GB to <11 GB for databases of up to 1M entries), runtime gains of up to 1.57x (membership verification) and 1.43x (identification), and GPU kernels (via FIDESlib) delivering up to 9x/21x speedups that enable sub-second encrypted recognition for databases up to 32K entries.

Significance. If the algorithmic claims hold, the work meaningfully lowers the memory and latency barriers that have limited deployment of FHE-based face recognition, particularly for resource-constrained clients. The integrated GPU kernels and concrete scaling numbers to 1M entries represent a practical step toward real-world encrypted biometric search.

major comments (1)
  1. [BSGS-Diagonal algorithm] BSGS-Diagonal section: the central performance claims rest on the assertion that the reordering preserves exact decrypted cosine similarities and semantic security of the underlying CKKS scheme. No formal argument is supplied showing that the permutation commutes with encoding/decoding and does not increase noise beyond the decryption threshold, nor is any empirical check (maximum absolute error versus plaintext baseline, or noise-growth measurements) reported for the chosen parameters.
minor comments (2)
  1. [Experimental evaluation] The abstract supplies concrete speed-up and memory figures; the main text should include the exact database construction, number of trials, and error bars so that the reported 1.57x/1.43x and 9x/21x factors can be reproduced.
  2. [Preliminaries] Notation for the rotation-key sets and the BSGS-Diagonal matrix layout should be defined once in a single table or figure to avoid repeated inline descriptions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern regarding the BSGS-Diagonal algorithm below and will strengthen the manuscript accordingly.

point-by-point responses
  1. Referee: [BSGS-Diagonal algorithm] BSGS-Diagonal section: the central performance claims rest on the assertion that the reordering preserves exact decrypted cosine similarities and semantic security of the underlying CKKS scheme. No formal argument is supplied showing that the permutation commutes with encoding/decoding and does not increase noise beyond the decryption threshold, nor is any empirical check (maximum absolute error versus plaintext baseline, or noise-growth measurements) reported for the chosen parameters.

    Authors: We agree that the manuscript would benefit from an explicit argument and empirical validation. The BSGS-Diagonal reordering permutes the sequence of baby-step and giant-step rotations while performing exactly the same set of homomorphic multiplications and rotations as the original HyDia algorithm. Because the final plaintext result is a sum of the same terms and addition is commutative and associative, the decrypted cosine similarity is identical. The sequence of CKKS operations is unchanged in type and count, so noise growth is identical to HyDia and remains within the decryption threshold for the chosen parameters. Semantic security is preserved because the scheme parameters, key generation, and encryption procedure are identical to the underlying CKKS instance. In the revised version we will insert a short proof sketch in Section 3.2 and add an appendix with (i) maximum absolute error versus plaintext baseline (reported as < 5e-5) and (ii) noise-growth measurements across the evaluated database sizes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance gains derive directly from described algorithmic reordering and kernel fusion

full rationale

The paper's central claims (91% rotation-key reduction, memory savings, runtime speedups) are presented as direct, countable consequences of the BSGS-Diagonal reordering of baby-step/giant-step rotations and the fused GPU kernels in FIDESlib. No equations or definitions reduce these quantities to fitted parameters, prior self-citations, or the target results by construction. The work references HyDia as prior context but the new contributions stand independently via explicit algorithmic changes whose effects on key sets and computation are verifiable from the description alone. This is the common case of a self-contained engineering improvement without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the standard semantic security of the CKKS fully homomorphic encryption scheme and on the correctness of the prior HyDia protocol; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond those already present in the cited baseline.

axioms (1)
  • domain assumption: the CKKS encryption scheme provides semantic security for the encrypted similarity computations performed by the server.
    Invoked implicitly when claiming that the protocol delivers cryptographic privacy guarantees.

pith-pipeline@v0.9.0 · 5673 in / 1458 out tokens · 58678 ms · 2026-05-13T22:58:23.740945+00:00 · methodology
