arxiv: 2606.17358 · v2 · pith:XG4VK7SMnew · submitted 2026-06-15 · 💻 cs.CR

OTRO: Oblivious Tokenization Path with Square-Root ORAM

Pith reviewed 2026-06-27 02:49 UTC · model grok-4.3

classification 💻 cs.CR
keywords oblivious RAM · square-root ORAM · tokenizer · side-channel leakage · confidential computing · LLM serving · TEE · access pattern

The pith

OTRO makes LLM tokenizers oblivious using a pool of replicated square-root ORAM instances, epoch-based rotation, and chunked KV-cache-aware tokenization, limiting time-to-first-token (TTFT) overhead to at most 4.5%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets side-channel leakage from CPU-side tokenizer table lookups in TEE-protected LLM serving stacks, where prior work has recovered user prompts end to end from tokenizer access patterns on production Intel TDX. It replaces costly tree-based ORAMs with square-root ORAM and takes the usual rebuild penalty off the critical path by maintaining a pool of read-only replicas, rotating through epochs padded with dummy accesses, and overlapping rebuild work with GPU prefill via chunked tokenization. A sympathetic reader would care because this closes the tokenizer gap without the roughly 13× tokenizer slowdown and 10-58% higher TTFT that a drop-in tree-based ORAM such as PathORAM incurs. The reported result keeps TTFT overhead at or below 4.5%, tokenizer latency under 10% of TTFT, and memory overhead below 0.5 GB while reducing observable leakage across model families.

Core claim

OTRO supplies an oblivious tokenization path by maintaining a pool of replicated square-root ORAM instances on the read-only tokenizer table, applying an epoch-based rotation policy that pads each epoch to its boundary with dummy accesses, and performing chunked KV-cache-aware tokenization that overlaps rebuilds with GPU prefill; when implemented in HuggingFace Tokenizers and nano-vLLM inside a TDX CVM on H100, the design bounds TTFT overhead at 4.5%, tokenizer latency below 10% of TTFT, and memory growth below 0.5 GB while lowering leakage.

What carries the argument

Pool of replicated square-root ORAM instances with epoch-based rotation and dummy padding, which decouples accesses from rebuilds by exploiting the tokenizer table's read-only property and KV-cache chunking.
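The mechanism described above can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `SqrtOramReplica` and `ReplicaPool`, the `finish_request` trigger for padding, and the inline rebuild are all hypothetical, and Python's `random` module stands in for a cryptographic shuffle. A real square-root ORAM would also scan its shelter obliviously on every access, which the dictionary lookup here does not do.

```python
import math
import random


class SqrtOramReplica:
    """One shuffled copy of a read-only table, usable for a single epoch."""

    def __init__(self, table, rng):
        self.n = len(table)
        self.epoch_len = max(1, math.isqrt(self.n))  # about sqrt(N) accesses
        # Logical slots 0..n-1 hold real entries; n..n+epoch_len-1 are dummies.
        self.pos = list(range(self.n + self.epoch_len))
        rng.shuffle(self.pos)  # pos[logical] -> physical slot
        self.store = [None] * len(self.pos)
        for logical, value in enumerate(table):
            self.store[self.pos[logical]] = value
        self.shelter = {}  # entries already fetched in this epoch
        self.used = 0      # accesses consumed so far
        self.trace = []    # physical slots touched: the observable pattern

    def access(self, index=None):
        """Read table[index]; index=None performs a pure dummy access."""
        if self.used >= self.epoch_len:
            raise RuntimeError("epoch exhausted; rotate to a fresh replica")
        if index is None or index in self.shelter:
            # Touch a never-used dummy slot so no physical slot repeats.
            phys = self.pos[self.n + self.used]
            value = None if index is None else self.shelter[index]
        else:
            phys = self.pos[index]
            value = self.store[phys]
            self.shelter[index] = value
        self.trace.append(phys)
        self.used += 1
        return value


class ReplicaPool:
    """Hypothetical pool that rotates replicas at epoch boundaries."""

    def __init__(self, table, pool_size=3, seed=None):
        self.table = list(table)
        self.rng = random.Random(seed)  # assumption: not a CSPRNG, demo only
        self.spares = [SqrtOramReplica(self.table, self.rng)
                       for _ in range(pool_size - 1)]
        self.active = SqrtOramReplica(self.table, self.rng)

    def _rotate(self):
        retired = self.active
        # The table never changes, so a "rebuild" is just a fresh shuffle.
        # It runs inline here; the paper overlaps it with GPU prefill.
        self.spares.append(SqrtOramReplica(self.table, self.rng))
        self.active = self.spares.pop(0)
        return retired

    def lookup(self, index):
        if self.active.used == self.active.epoch_len:
            self._rotate()
        return self.active.access(index)

    def finish_request(self):
        """Pad the current epoch to its boundary with dummies, then rotate."""
        while self.active.used < self.active.epoch_len:
            self.active.access(None)
        return self._rotate().trace
```

With a 100-entry table the epoch length is 10, so four lookups followed by `finish_request()` yield a trace of exactly 10 distinct physical slots regardless of which indices were requested. That fixed-length, repeat-free trace is the property the padding is meant to provide in this simplified model.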

If this is right

  • Tokenizer access patterns cease to leak prompt content under the TEE threat model.
  • TTFT increases by at most 4.5% while tokenizer work stays under 10% of total TTFT.
  • Memory footprint grows by less than 0.5 GB across tested model sizes.
  • The same modules integrate directly into existing HuggingFace and vLLM code paths.
  • Leakage reduction holds across multiple model families and sizes without per-model tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The replication technique could extend to any read-only lookup table whose updates are infrequent or centrally managed.
  • Epoch padding length could be tuned per security policy to trade dummy overhead for lower information leakage.
  • The chunking strategy may generalize to other GPU-CPU overlapped stages beyond KV-cache construction.
  • If tokenizer tables ever become writable, a lightweight consistency protocol would be needed before replication remains safe.
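The padding trade-off in the second bullet can be made concrete with a little arithmetic. The helper below is a hypothetical sketch: `padding_cost` and its parameters are illustrative names, and it assumes every request is padded up to the next multiple of the epoch length, so an observer in this model learns only the number of epochs consumed.

```python
def padding_cost(accesses, epoch_len):
    """Return (epochs, dummies, dummy_fraction) when padding to an epoch boundary.

    accesses  -- ORAM accesses the request actually needs (hypothetical input)
    epoch_len -- accesses per epoch, about sqrt(N) in a square-root ORAM
    """
    if epoch_len <= 0:
        raise ValueError("epoch_len must be positive")
    if accesses < 0:
        raise ValueError("accesses must be non-negative")
    dummies = (-accesses) % epoch_len  # distance to the next boundary
    total = accesses + dummies
    epochs = total // epoch_len
    dummy_fraction = dummies / total if total else 0.0
    return epochs, dummies, dummy_fraction
```

For example, a request needing 1000 accesses with an epoch length of 256 is padded with 24 dummies to 1024 accesses (4 epochs), whereas a request needing a single access pays 255 dummies. Longer epochs hide request length more coarsely at the price of more dummy work on short requests.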

Load-bearing premise

The tokenizer table is strictly read-only, so replicated ORAM instances never require consistency maintenance or extra rebuilds.

What would settle it

A side-channel experiment in the same TDX-plus-H100 setup that recovers a non-negligible fraction of prompts from OTRO-protected access patterns, or a timing measurement showing TTFT overhead above 4.5% under the reported model families and batch sizes.

Figures

Figures reproduced from arXiv: 2606.17358 by Daniel Wong, Jonghyun Lee, Mengyuan Li, Murali Annavaram, Rachit Rajat, Yongqin Wang.

Figure 1: Latency of ORAM and GPU-based tokenizers.
Figure 2: SqrtORAM and TreeORAM client-server model.
Figure 3: Threat model of a Confidential LLM Service.
Figure 4: Access-pattern leakage during BPE Tokenization.
Figure 5 illustrates the workflow of the BPE tokenizer, consisting of the encoder step (Figure 5a) and the BPE merge algorithm (Figure 5b). The goal of the tokenizer in this example is to tokenize the input string "tiktoken is a fast BPE tokeniser." The BPE tokenizer in Figure 5a takes the following steps. First, the pre-tokenization step takes our example input string and breaks it down into tokenizer-manageable p…
Figure 6: Overview of OTRO system. TDXRay reports over 90% reconstruction similarity across model families. This result proves that the tokenizer is a demonstrated, highly-vulnerable prompt-reconstruction channel under a realistic host adversary that even survives the TEE protections. Thus, defending against it requires making the tokenizer's table-access pattern independent of the input, which we develop in Section…
Figure 7: Chunked tokenization with asynchronous reshuf…
Figure 8: Normalized TTFT in Llama3.1, Qwen3, Gemma-3, and Phi-3.
Figure 9: (a) Tokenization overhead as a fraction of TTFT
Original abstract

The CPU-side large language model (LLM) tokenizer is a critical security gap in LLM serving through a confidential computing stack with CPU and GPU trusted execution environments (TEEs). Tokenizers convert prompts through table-driven lookups, and the resulting memory access patterns are a powerful source of side-channel leakage. Recent work demonstrates end-to-end recovery of user prompts from tokenizer access patterns on production Intel TDX. However, a drop-in use of the popular tree-based Oblivious RAMs (e.g., PathORAM) to prevent access-pattern leakage introduces $\sim$13$\times$ tokenizer slowdown, resulting in 10-58% higher time-to-first-token (TTFT). In this paper, we present OTRO, an efficient, oblivious tokenization path tailored to latency-critical LLM serving. OTRO relies on square-root ORAM for fast single-access lookups, but avoids its prohibitive $O(N\log^2 N)$ rebuild cost every $\sqrt{N}$ accesses through three key innovations. First, OTRO provides a pool of replicated square-root ORAM instances that utilize the read-only nature of tokenizer table. Second, an epoch-based rotation policy decouples accesses from rebuilds and pads each epoch with dummy accesses to its boundaries, minimizing observable information. Lastly, chunked KV-cache-aware tokenization further overlaps rebuilds with GPU prefill and minimizes the instance count. Implemented as modules in HuggingFace Tokenizers and nano-vLLM, running within a TDX-enabled CVM with an NVIDIA H100 GPU, OTRO limits TTFT overhead to at most 4.5%, keeps tokenizer-induced latency under 10% of total TTFT, and adds less than 0.5 GB of memory overhead while reducing the tokenizer's observable leakage across various model families and sizes.
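The rebuild cost quoted in the abstract can be turned into an amortized per-access figure. The following is a brief worked step that ignores constant factors; the table size $N = 2^{16}$ is an illustrative assumption, not a vocabulary size taken from the paper.

```latex
\[
\frac{O(N \log^2 N)}{\sqrt{N}} \;=\; O\!\left(\sqrt{N}\,\log^2 N\right)
\quad \text{amortized rebuild work per access,}
\]
\[
N = 2^{16}: \qquad \sqrt{N} = 2^{8} = 256, \qquad
\log_2^2 N = 16^2 = 256, \qquad
\sqrt{N}\,\log_2^2 N = 2^{16} = 65{,}536 .
\]
```

In this idealized count and for this particular $N$, the amortized rebuild work per lookup is on the order of $N$ itself, comparable to scanning the whole table, which suggests why taking rebuilds off the critical path matters for latency.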

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents OTRO, an oblivious tokenization scheme for LLM serving in confidential computing (TDX + GPU TEEs) that uses square-root ORAM with a pool of replicated instances, epoch-based rotation plus dummy padding, and chunked KV-cache-aware tokenization to avoid the O(N log²N) rebuild cost. It claims concrete performance bounds (≤4.5% TTFT overhead, tokenizer latency <10% of TTFT, <0.5 GB memory overhead) while reducing observable leakage, with an implementation in HuggingFace Tokenizers and nano-vLLM evaluated on an NVIDIA H100 inside a TDX CVM across model families.

Significance. If the overhead bounds and leakage reduction hold under the stated assumptions, the work is significant for closing a demonstrated side-channel in production LLM tokenizers without prohibitive latency cost. The concrete implementation in real tokenization and serving libraries plus hardware evaluation on H100/TDX is a strength that makes the result directly usable.

major comments (2)
  1. [Abstract] Abstract and design description: the claimed TTFT overhead (≤4.5%), memory (<0.5 GB), and rebuild avoidance all rest on maintaining a pool of replicated square-root ORAM instances without consistency or rebuild costs. This is explicitly justified by “utiliz[ing] the read-only nature of tokenizer table.” No mechanism, analysis, or fallback is provided for vocabulary extension, per-user custom tokenizers, or multi-tenant updates; any write would invalidate the replication premise and the reported bounds.
  2. [Abstract] Abstract: concrete overhead numbers (4.5% TTFT, <10% tokenizer share, <0.5 GB) are stated without reference to evaluation methodology, baselines, number of runs, or error bars. The reader cannot assess whether the numbers reflect post-hoc selection, specific prompt distributions, or measurement artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity on assumptions and evaluation details.

Point-by-point responses
  1. Referee: [Abstract] Abstract and design description: the claimed TTFT overhead (≤4.5%), memory (<0.5 GB), and rebuild avoidance all rest on maintaining a pool of replicated square-root ORAM instances without consistency or rebuild costs. This is explicitly justified by “utiliz[ing] the read-only nature of tokenizer table.” No mechanism, analysis, or fallback is provided for vocabulary extension, per-user custom tokenizers, or multi-tenant updates; any write would invalidate the replication premise and the reported bounds.

    Authors: The design explicitly targets the common case of fixed, read-only tokenizer tables used in production LLM serving for pre-trained models. The replication and epoch-rotation approach is justified only under this assumption, as stated in the manuscript. We agree that no mechanism is provided for dynamic updates, vocabulary extensions, or per-user custom tokenizers, as these would require separate handling (e.g., rebuilds or isolated instances) and fall outside the paper's scope. We will add an explicit statement of this assumption and its implications in the abstract and Section 3 to prevent misinterpretation.

    Revision: partial

  2. Referee: [Abstract] Abstract: concrete overhead numbers (4.5% TTFT, <10% tokenizer share, <0.5 GB) are stated without reference to evaluation methodology, baselines, number of runs, or error bars. The reader cannot assess whether the numbers reflect post-hoc selection, specific prompt distributions, or measurement artifacts.

    Authors: The full evaluation (Section 5) details the methodology: implementation in HuggingFace Tokenizers and nano-vLLM, execution inside TDX CVM with H100 GPU, across multiple model families, using representative prompt distributions from standard benchmarks, with multiple runs and reported averages. However, the abstract presents the headline numbers without cross-references or qualifiers. We will revise the abstract to include a brief pointer to the evaluation setup and ensure the results section explicitly states run counts, error bars, and baseline comparisons to address this presentation concern.

    Revision: yes
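The run counts and error bars requested in the second exchange could be reported along the following lines. This is a hedged sketch only: `summarize_overhead` is a hypothetical helper, it assumes paired baseline and protected TTFT measurements in milliseconds, and it uses only Python's standard `statistics` module. It does not reproduce the paper's evaluation procedure.

```python
import math
import statistics


def summarize_overhead(baseline_ms, protected_ms):
    """Return (mean, sample stdev, standard error) of relative TTFT overhead.

    Each pair (baseline_ms[i], protected_ms[i]) is assumed to come from the
    same prompt and configuration, without and with the oblivious tokenizer.
    """
    if len(baseline_ms) != len(protected_ms) or len(baseline_ms) < 2:
        raise ValueError("need at least two paired runs")
    overheads = [(p - b) / b for b, p in zip(baseline_ms, protected_ms)]
    mean = statistics.mean(overheads)
    stdev = statistics.stdev(overheads)          # sample standard deviation
    sem = stdev / math.sqrt(len(overheads))      # standard error of the mean
    return mean, stdev, sem
```

Reporting the mean together with its standard error and the number of runs would let a reader judge whether a headline bound such as 4.5% sits comfortably above the measured spread or close to it.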

Circularity Check

0 steps flagged

No significant circularity; claims rest on implementation and direct measurement

The paper describes an engineering system (OTRO) that applies square-root ORAM to tokenizers, with three concrete mechanisms (replicated instances, epoch rotation with dummy padding, chunked KV-cache overlap) whose overhead bounds are obtained by running the implementation inside TDX CVMs on H100 GPUs and measuring TTFT, latency, and memory. No equations, fitted parameters, or self-citations are presented as load-bearing derivations; the read-only table premise is an explicit environmental assumption that enables replication but does not create a self-referential loop in any claimed result. The central performance numbers are therefore externally falsifiable benchmarks rather than tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on the domain assumption that the tokenizer table is read-only and that rebuilds can be overlapped with GPU prefill without affecting correctness; no free parameters or invented entities are visible in the abstract.

axioms (2)
  • domain assumption: Tokenizer table is read-only.
    Invoked to justify replication of square-root ORAM instances without consistency overhead.
  • domain assumption: Epoch boundaries can be padded with dummy accesses without leaking timing information beyond the epoch.
    Central to the rotation policy that decouples accesses from rebuilds.
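The overlap premise noted in this ledger (rebuild work hidden behind GPU prefill) can be illustrated with a simple two-stage pipeline timing model. This is a sketch under assumptions: the function names are hypothetical, each chunk's CPU time is taken to include oblivious tokenization plus any replica rebuild, and chunks are processed strictly in order. Real scheduling in a serving stack is likely more involved.

```python
def serial_ttft(cpu_ms, gpu_ms):
    """Time to first token if CPU and GPU stages never overlap."""
    return sum(cpu_ms) + sum(gpu_ms)


def pipelined_ttft(cpu_ms, gpu_ms):
    """Finish time when chunk i's GPU prefill overlaps chunk i+1's CPU work.

    cpu_ms[i] -- CPU time for chunk i (tokenization plus rebuild, assumed)
    gpu_ms[i] -- GPU prefill time for chunk i
    """
    if len(cpu_ms) != len(gpu_ms):
        raise ValueError("one CPU time and one GPU time per chunk")
    cpu_done = 0.0
    gpu_done = 0.0
    for cpu, gpu in zip(cpu_ms, gpu_ms):
        cpu_done += cpu                            # CPU handles chunks back to back
        gpu_done = max(gpu_done, cpu_done) + gpu   # GPU waits for its chunk's tokens
    return gpu_done
```

With three chunks costing 2 ms of CPU work and 5 ms of prefill each (made-up numbers), the serial total is 21 ms while the pipelined total is 17 ms: only the first chunk's CPU time stays on the critical path. When CPU work per chunk exceeds prefill time, the model shows the CPU stage becoming the bottleneck instead.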
