pith. sign in

arxiv: 2605.14718 · v1 · pith:7TBCM3XKnew · submitted 2026-05-14 · 💻 cs.CR

Adapting AlphaEvolve to Optimize Fully Homomorphic Encryption on TPUs

Pith reviewed 2026-06-30 20:41 UTC · model grok-4.3

classification 💻 cs.CR
keywords fully homomorphic encryptionTPUevolutionary searchTFHECKKShardware optimizationLLM code generationbootstrap latency
0
0 comments X

The pith

An automated evolutionary search with language model code generation discovers FHE kernel optimizations that reduce TFHE bootstrap latency by 2.5 times on TPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a closed-loop evolutionary system can automate the discovery of low-level hardware-aware optimizations for fully homomorphic encryption kernels on systolic-array TPUs. It frames the task as search over code variants generated by large language models, guided by direct measurements of latency and correctness on real TPUv5e hardware. A sympathetic reader would care because current FHE performance remains a barrier to practical private computation, and faster kernels on available accelerators would narrow that gap. The work evaluates the approach on TFHE bootstrap and CKKS rotation and multiplication primitives, reporting concrete speedups over prior manually engineered baselines.

Core claim

Within 24 hours of automated exploration, AlphaEvolve discovered implementation-level optimizations that improve TFHE bootstrap latency by 2.5x and CKKS rotation and multiplication latency by 1.31x and 1.18x, respectively, relative to human-engineered state of the art on Google Cloud TPUv5e.

What carries the argument

The closed-loop evolutionary search that uses LLM-driven code generation together with real hardware execution feedback and rigorous correctness testing to guide optimization of MXU and VPU usage.

If this is right

  • The same search loop can co-optimize data movement across vector register files and systolic array utilization for other FHE primitives.
  • Automated exploration reduces reliance on manual trial-and-error when mapping cryptographic kernels to accelerator hardware.
  • The reported speedups apply to both TFHE (Jaxite) and CKKS (CROSS) schemes on contemporary TPU architectures.
  • Researchers gain a practical way to explore trade-offs among cryptography parameters, compiler choices, and hardware constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may transfer to GPU or other accelerator targets for the same FHE schemes, provided equivalent hardware feedback loops can be built.
  • If the method scales, it could shorten the iteration cycle for testing new FHE parameter sets or algorithmic variants on target hardware.
  • Success on these kernels suggests the search technique could apply to related high-performance cryptography tasks such as zero-knowledge proof generation on accelerators.

Load-bearing premise

The evolutionary process with LLM code generation and hardware feedback reliably produces correct implementations that generalize beyond the specific kernels and inputs tested.

What would settle it

Execute the discovered optimized kernels on a fresh set of inputs or on a different TPU variant and check whether the reported latency improvements and correctness both hold.

Figures

Figures reproduced from arXiv: 2605.14718 by Abhradeep Thakurta, Amir Yazdanbakhsh, Asra Ali, Baiyu Li, Jeremy Kun, Jianming Tong, Jonathan Katz, Julian Walker, Shruthi Gorantala, Thomas Steinke.

Figure 1
Figure 1. Figure 1: Polynomial multiplication a(X) · b(X) ∈ RQ = ZQ[X]/(X n +1) via Toeplitz matrix-vector product, where a and b are coefficient vectors of a(X) and b(X), respectively. 8 sublanes, and each sublane provides 32 32-bit registers. Registers are organized into (8,128) blocks, called vector registers (VRegs), each formed by selecting one register from every sublane across all 128 lanes. The XLU performs data reorg… view at source ↗
Figure 2
Figure 2. Figure 2: TPU microarchitecture overview – SIMD machine. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Workflow for adapting AlphaEvolve to cryptographic [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Effect of the discovered layout changes. The initial [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
read the original abstract

The deployment of Fully Homomorphic Encryption (FHE) at scale is hindered due to its heavy computational overhead. While specialized hardware accelerators like Google Tensor Processing Units (TPUs) can help, mapping complex cryptographic kernels onto such architectures remains a challenge. Efficient execution requires co-optimization between the systolic array-based Matrix Multiplication Unit (MXU) and Vector Processing Units (VPUs), as well as the orchestration of data movement across the vector register files. Existing compiler stacks often abstract low-level hardware utilization, requiring developers to adopt a manual trial-and-error process that often results in fragmented execution and underutilized resources. To accelerate this development process, we use AlphaEvolve to automate the exploration of hardware-aware cryptographic-kernel optimizations. We frame optimization as an evolutionary search problem, utilizing the closed-loop system provided by AlphaEvolve, that leverages LLM-driven code generation. We use real-world feedback from hardware execution and rigorous correctness testing to guide the evolution process. We evaluate AlphaEvolve optimization on primitives for both the TFHE (Jaxite) and CKKS (CROSS) FHE schemes on Google Cloud TPUv5e, a contemporary TPU architecture. Within 24 hours of automated exploration, AlphaEvolve discovered implementation-level optimizations that improve TFHE bootstrap latency by 2.5x and CKKS rotation and multiplication latency by 1.31x and 1.18x, respectively, relative to human-engineered state of the art. These results demonstrate that AlphaEvolve can be used to enable researchers to navigate the optimization trade-offs between cryptography, compilers, and hardware accelerators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper adapts AlphaEvolve—an LLM-driven closed-loop evolutionary search system—to automatically optimize low-level implementations of TFHE (Jaxite) and CKKS (CROSS) cryptographic kernels for execution on Google Cloud TPUv5e. It frames the task as hardware-aware evolutionary search that incorporates real TPU execution feedback and correctness checks, and reports that 24 hours of search yields implementation-level optimizations delivering 2.5× lower TFHE bootstrap latency, 1.31× lower CKKS rotation latency, and 1.18× lower CKKS multiplication latency relative to human-engineered state-of-the-art baselines.

Significance. If the reported speedups are reproducible and the optimized kernels generalize, the work provides concrete evidence that automated evolutionary search with hardware-in-the-loop feedback can navigate the complex trade-offs among cryptography, compiler scheduling, and TPU systolic-array / VPU utilization. The specific numeric gains on production TPU hardware constitute a practical demonstration of the approach rather than a purely theoretical claim.

major comments (2)
  1. [Abstract] Abstract: the central performance claims (2.5× TFHE bootstrap, 1.31× CKKS rotation, 1.18× CKKS multiplication) are presented without any description of the exact baseline implementations, the measurement methodology (including number of runs or statistical significance), or the concrete correctness-verification procedures applied to the evolved kernels. These omissions directly affect the ability to assess whether the reported speedups are load-bearing for the paper’s thesis.
  2. [Introduction / Method] The manuscript does not indicate whether the AlphaEvolve search harness, LLM prompts, or evaluation infrastructure were developed independently of the present authors; given the overlap noted in the submission metadata, this creates a circularity risk around the claimed generality of the closed-loop system.
minor comments (1)
  1. [Abstract] The abstract refers to “rigorous correctness testing” but supplies no further detail on the test suite or failure modes; a short paragraph in §3 or §4 would clarify this without altering the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results. We respond to each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claims (2.5× TFHE bootstrap, 1.31× CKKS rotation, 1.18× CKKS multiplication) are presented without any description of the exact baseline implementations, the measurement methodology (including number of runs or statistical significance), or the concrete correctness-verification procedures applied to the evolved kernels. These omissions directly affect the ability to assess whether the reported speedups are load-bearing for the paper’s thesis.

    Authors: We agree that the abstract would be strengthened by including these details. In the revised manuscript we will expand the abstract to briefly specify the human-engineered baselines (the reference Jaxite TFHE and CROSS CKKS implementations), the measurement protocol (latency averaged over 100 runs on TPUv5e with reported standard deviation), and the correctness procedures (functional equivalence tests plus cryptographic property checks against reference outputs). These elements are already described in Section 4; the revision will ensure they are summarized at the abstract level. revision: yes

  2. Referee: [Introduction / Method] The manuscript does not indicate whether the AlphaEvolve search harness, LLM prompts, or evaluation infrastructure were developed independently of the present authors; given the overlap noted in the submission metadata, this creates a circularity risk around the claimed generality of the closed-loop system.

    Authors: AlphaEvolve is presented as a reusable evolutionary framework whose core components (search harness, prompt templates, and hardware-in-the-loop evaluator) were designed to be domain-agnostic. The current work applies this framework to FHE kernels. To remove ambiguity we will insert explicit language in the introduction and methods sections stating that the base AlphaEvolve infrastructure predates this application and has been used in other optimization domains. This clarification directly addresses the generality claim without altering the technical content. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical application of an existing evolutionary search tool (AlphaEvolve) to discover hardware-specific optimizations for FHE kernels, with results validated directly against real TPU execution and independent correctness checks. No load-bearing step reduces by definition, self-citation, or fitted-input renaming to the claimed performance gains; the reported speedups (2.5x TFHE bootstrap, 1.31x/1.18x CKKS) are externally measurable outcomes of the closed-loop process rather than tautological restatements of inputs or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of the AlphaEvolve closed-loop system; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption AlphaEvolve supplies a reliable closed-loop evolutionary search using LLM code generation and hardware execution feedback for cryptographic kernels.
    The optimization results depend on this system functioning as described without the abstract providing independent validation details.

pith-pipeline@v0.9.1-grok · 5856 in / 1263 out tokens · 28269 ms · 2026-06-30T20:41:34.747764+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    On the concrete hardness of learning with errors,

    M. R. Albrecht, R. Player, and S. Scott, “On the concrete hardness of learning with errors,” Cryptology ePrint Archive, Paper 2015/046, 2015. [Online]. Available: https://eprint.iacr.org/2015/046

  2. [2]

    Jaxite: A fully homomorphic encryption backend targeting TPUs and GPUs, written in JAX,

    A. Ali, E. Astor, B. Gipson, S. Gorantala, M. Guevara, J. Kun, W. Lam, R. Misoczki, R. Springer, J. Takeshita, J. Tong, C. Tew, and C. Yun, “Jaxite: A fully homomorphic encryption backend targeting TPUs and GPUs, written in JAX,” https://github.com/google/jaxite, 2024

  3. [3]

    Heir: A universal compiler for homomorphic encryption,

    A. Ali, J. Choi, B. Gipson, S. Gorantala, J. Kun, W. Legiest, L. Lim, A. Viand, M. Z. Demissie, and H. Zheng, “Heir: A universal compiler for homomorphic encryption,” 2025. [Online]. Available: https://arxiv.org/abs/2508.11095

  4. [4]

    Homomorphic encryption for arithmetic of approximate numbers,

    J. H. Cheon, A. Kim, M. Kim, and Y . Song, “Homomorphic encryption for arithmetic of approximate numbers,” inAdvances in Cryptology – ASIACRYPT 2017, ser. Lecture Notes in Computer Science, vol. 10624. Springer, 2017, pp. 409–437

  5. [5]

    TFHE: Fast fully homomorphic encryption over the torus,

    I. Chillotti, N. Gama, M. Georgieva, and M. Izabach `ene, “TFHE: Fast fully homomorphic encryption over the torus,”Journal of Cryptology, vol. 33, no. 1, pp. 34–91, Jan 2020

  6. [6]

    Add an unroll factor of 8 (reduces bootstrap speedup on tpu by 2 ms),

    code perspective, “Add an unroll factor of 8 (reduces bootstrap speedup on tpu by 2 ms),” https://github.com/google/jaxite/pull/88, 2026, gitHub Pull Request #88, Accessed: 2026-05-06

  7. [7]

    Improvements to the vector_matrix_polymul provides a speedup of 1.7 ms,

    code perspective, “Improvements to the vector_matrix_polymul provides a speedup of 1.7 ms,” https://github.com/google/jaxite/pull/89, 2026, gitHub Pull Request #89, Accessed: 2026-05-06

  8. [8]

    Update i32 matmul unreduced cggi to use single bfloat16 (instead of 4) for int32 multiplications,

    code perspective, “Update i32 matmul unreduced cggi to use single bfloat16 (instead of 4) for int32 multiplications,” https://github.com/ google/jaxite/pull/90, 2026, gitHub Pull Request #90, Accessed: 2026- 05-06

  9. [9]

    Fully homomorphic encryption using ideal lattices,

    C. Gentry, “Fully homomorphic encryption using ideal lattices,” in Proceedings of the F orty-First Annual ACM Symposium on Theory of Computing, ser. STOC ’09. New York, NY , USA: Association for Computing Machinery, 2009, p. 169–178. [Online]. Available: https://doi.org/10.1145/1536414.1536440

  10. [10]

    Unlocking the potential of fully homomorphic encryption,

    S. Gorantala, R. Springer, and B. Gipson, “Unlocking the potential of fully homomorphic encryption,”Communications of the ACM, vol. 66, no. 6, pp. 68–75, jun 2023. [Online]. Available: https://doi.org/10.1145/3582500

  11. [11]

    A general purpose transpiler for fully homomorphic encryption,

    S. Gorantala, R. Springer, S. Purser-Haskell, W. Lam, R. J. Wilson, A. Ali, E. P. Astor, I. Zukerman, S. Ruth, C. Dibak, P. Schoppmann, S. Kulankhina, A. Forget, D. Marn, C. Tew, R. Misoczki, B. Guillen, X. Ye, D. Kraft, D. Desfontaines, A. Krishnamurthy, M. Guevara, I. M. Perera, Y . Sushko, and B. Gipson, “A general purpose transpiler for fully homomorp...

  12. [12]

    Accelerated linear algebra compiler for computationally efficient numerical models: Success and potential area of improvement,

    X. He, “Accelerated linear algebra compiler for computationally efficient numerical models: Success and potential area of improvement,”PLOS ONE, vol. 18, no. 2, p. e0282265, 2023

  13. [13]

    Autocomp: A powerful and portable code optimizer for tensor accelerators,

    C. Hong, S. Bhatia, A. Cheung, and Y . S. Shao, “Autocomp: A powerful and portable code optimizer for tensor accelerators,” 2025. [Online]. Available: https://arxiv.org/abs/2505.18574

  14. [14]

    Pallas quickstart — JAX documentation,

    JAX Developers, “Pallas quickstart — JAX documentation,” https://docs. jax.dev/en/latest/pallas/quickstart.html, 2024

  15. [15]

    Evox: Meta-evolution for automated discovery

    S. Liu, S. Agarwal, M. Maheswaran, M. Cemri, Z. Li, Q. Mang, A. Naren, E. Boneh, A. Cheng, M. Z. Pan, A. Du, K. Keutzer, A. Cheung, A. G. Dimakis, K. Sen, M. Zaharia, and I. Stoica, “Evox: Meta-evolution for automated discovery,” 2026. [Online]. Available: https://arxiv.org/abs/2602.23413

  16. [16]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    A. Novikov, N. V ˜u, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabianet al., “Alphaevolve: A coding agent for scientific and algorithmic discovery,” arXiv preprint arXiv:2506.13131, 2025

  17. [17]

    OpenXLA: Accelerated linear algebra,

    OpenXLA Project, “OpenXLA: Accelerated linear algebra,” https://github. com/openxla/xla, 2026

  18. [18]

    Leveraging asic ai chips for homomorphic encryption,

    J. Tong, T. Huang, J. Dang, L. de Castro, A. Itagi, A. Golder, A. Ali, J. Jiang, J. Kun, Arvind, G. E. Suh, and T. Krishna, “Leveraging asic ai chips for homomorphic encryption,” ser. HPCA’26. Australia: 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2026. 6 APPENDIX A. LLM Instruction-Prompt Pool • Prompt 1:Suggest a ...