
arxiv: 2601.18511 · v2 · pith:4T4P2Z3Knew · submitted 2026-01-26 · 💻 cs.CR

Scaling up FHE-based Privacy-Preserving ML: Higher Throughput, Longer Inputs for LLama-3-8B

classification: cs.CR
keywords: homomorphic, inference, token, encrypted, input, evaluation, fhe-based, tokens

As large language models (LLMs) become ubiquitous, privacy concerns pertaining to inference keep growing. Fully homomorphic encryption (FHE) has emerged as a primary cryptographic solution for non-interactive confidential LLM inference. However, existing solutions scale poorly with input token length, focusing on small models or input sizes. They also suffer from large outlier values, which strongly impact the evaluation of non-linear layers, leading to heavy polynomial approximation costs. We scale up FHE-based LLM inference in two directions. First, we accelerate FHE-based inference for 128 encrypted tokens. We adopt ML techniques (token prepending and orthogonal rotations) to mitigate outlier impacts on the FHE evaluation of non-linear layers. Separately, we devise a novel polynomial evaluation method for sparsely-packed ciphertexts to speed up our homomorphic SoftMax implementation. We combine these with recent fast homomorphic linear algebra techniques, achieving significantly improved efficiency. Second, we expand the prompt size up to thousands of tokens for contexts where only the final part of the input is sensitive and encrypted. Processing this requires handling standard plaintext-plaintext and ciphertext-ciphertext components, alongside a wide homomorphic computation for a novel plaintext-ciphertext component. To address this, we devise a dedicated homomorphic linear algebra algorithm, building a shallow homomorphic attention circuit that minimizes bootstrapping costs. Based on these ingredients, we present a CKKS-based end-to-end implementation of Llama-3-8B private inference. On 8 NVIDIA RTX PRO 6000 GPUs, 128 encrypted tokens take 20s for summarization and 18s/token for generation (vastly outperforming the SOTA 295s on costlier H100 GPUs). For a heterogeneous 4096-token input (last 128 encrypted), it takes 64s for summarization and 22s/token for generation.
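The three-way split for heterogeneous inputs can be made concrete by counting attention scores by operand type. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it assumes causal attention, assumes the encrypted tokens form the final contiguous part of the prompt (as the abstract says), and assumes that the plaintext-ciphertext component corresponds to encrypted queries attending to plaintext keys. The function name `attention_block_counts` is hypothetical, and no encryption is performed.

```python
def attention_block_counts(n_total, n_encrypted):
    """Count causal attention scores (query i, key j <= i) by operand type.

    Assumption: tokens 0 .. n_total - n_encrypted - 1 are plaintext and the
    last n_encrypted tokens are encrypted. This only counts score entries;
    it does not model any homomorphic computation.
    """
    n_public = n_total - n_encrypted
    return {
        # plaintext queries attending to plaintext keys (lower triangle)
        "plain-plain": n_public * (n_public + 1) // 2,
        # encrypted queries attending to plaintext keys (full rectangle)
        "plain-cipher": n_encrypted * n_public,
        # encrypted queries attending to encrypted keys (lower triangle)
        "cipher-cipher": n_encrypted * (n_encrypted + 1) // 2,
    }


def brute_force_counts(n_total, n_encrypted):
    """Same counts by direct enumeration, as a check for small sizes."""
    n_public = n_total - n_encrypted
    counts = {"plain-plain": 0, "plain-cipher": 0, "cipher-cipher": 0}
    for i in range(n_total):
        for j in range(i + 1):  # causal mask: key index never exceeds query index
            query_encrypted = i >= n_public
            key_encrypted = j >= n_public
            if query_encrypted and key_encrypted:
                counts["cipher-cipher"] += 1
            elif query_encrypted or key_encrypted:
                counts["plain-cipher"] += 1
            else:
                counts["plain-plain"] += 1
    return counts


if __name__ == "__main__":
    # The 4096-token setting from the abstract, with the last 128 tokens encrypted.
    print(attention_block_counts(4096, 128))
```

Under these assumptions, the 4096-token setting has 128 × 3968 mixed entries per attention head against 128 × 129 / 2 purely encrypted ones, which is one way to read the abstract's description of the plaintext-ciphertext component as "wide".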
