CloakLM: Obfuscating GPU Memory Layout to Mitigate Model Ex-filtration for Serving

Divya Mahajan; Kunal Jain; Seokjin Go

arxiv: 2606.18400 · v1 · pith:7JEZBO5Xnew · submitted 2026-06-16 · 💻 cs.OS · cs.CR

CloakLM: Obfuscating GPU Memory Layout to Mitigate Model Ex-filtration for Serving

Kunal Jain , Seokjin Go , Divya Mahajan This is my paper

Pith reviewed 2026-06-26 21:40 UTC · model grok-4.3

classification 💻 cs.OS cs.CR

keywords model exfiltrationGPU memory obfuscationPCIe snoopingHBM dump attacksinference serving securityLLM protectionmemory layout randomizationsoftware-only defense

0 comments

The pith

CloakLM removes structural regularity from GPU model weights via software memory obfuscation, blocking PCIe and HBM exfiltration while keeping inference unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that model weights in serving systems sit in large contiguous regions that attacks can exploit through passive observation or direct dumps. CloakLM applies three coordinated changes—PCIe traffic shaping, layer-wise weight shuffling, and HBM page remapping—to break that regularity. Authorized inference continues to see a normal virtual layout and runs at near-native speed. Unauthorized observers receive fragmented, incoherent data instead. Because the approach is purely software and integrates with existing stacks, it raises the bar for exfiltration on shared accelerator hardware without new hardware or trust assumptions.

Core claim

CloakLM is a software-only memory-obfuscation framework that combines PCIe traffic shaping, inter- and intra-layer weight shuffling, and physical HBM page remapping. These mechanisms eliminate the contiguous, repeatedly accessed weight regions that Hermes, TunnelS, and co-tenant attacks rely on, while preserving a valid virtual memory view for the inference stack. The result is that passive PCIe traces and HBM dumps no longer contain enough structure to reconstruct model parameters, yet overhead remains negligible on distributed LLaMA and Qwen workloads running under vLLM and PyTorch.

What carries the argument

The three coordinated obfuscation mechanisms—PCIe traffic shaping, inter- and intra-layer weight shuffling, and physical HBM page remapping—that together fragment the physical memory layout seen by attackers while leaving the logical view intact for authorized execution.

If this is right

PCIe snooping attacks lose the dense, ordered weight transfers needed for lossless reconstruction.
HBM dump attacks obtain only remapped and shuffled pages that no longer map back to original model structure.
The same logical inference code and virtual memory layout continue to work without modification.
The defense adds no hardware requirement and integrates directly into existing serving frameworks.
Overhead stays low enough that distributed inference performance remains near native.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layout-obfuscation idea could be applied to other memory-resident artifacts such as intermediate activations or optimizer states.
If attackers develop new pattern-matching techniques that tolerate shuffling, the defense would need an additional layer such as lightweight encryption.
The approach reduces reliance on physical co-location controls or full confidential-computing enclaves for model protection.
A practical test would measure whether any remaining regular access patterns survive after the three mechanisms are applied together.

Load-bearing premise

Existing attacks depend on contiguous and regularly accessed weight regions, and the three obfuscation steps do not create new patterns that attackers can still exploit at acceptable cost.

What would settle it

An experiment that reconstructs a full LLaMA or Qwen model from PCIe traces or HBM dumps collected while CloakLM is active would show the central claim is false.

Figures

Figures reproduced from arXiv: 2606.18400 by Divya Mahajan, Kunal Jain, Seokjin Go.

**Figure 2.** Figure 2: We evaluate three attack surfaces: ○1 A malicious co-tenant or non participating hosts, ○2 A malicious GPU on the same accelerator fabric and, ○3 An adversary with physical access to the infrastructure. The adversary in all three scenarios is able to extract raw byte data of a model resident in HBM or in motion across the network. CloakLM tackles these adversary by break assumptions around memory layout an… view at source ↗

**Figure 3.** Figure 3: LLM frameworks such as vLLM use PyTorch to allocate and manage logical tensors via a CUDA caching allocator. By default, PyTorch reserves large contiguous regions of GPU virtual memory and maps it to contiguous physical pages, creating predictable patterns exploitable by memory-based attacks. In contrast, CloakLM introduces inter-layer and page-level obfuscation, randomizing both virtual-to-physical mappin… view at source ↗

**Figure 4.** Figure 4: PCIe utilization for inference. Left: Bus utilization is relatively low, with the maximum utilization happening when the model weights are transferred during initialization. The peak of 6GBps observed in this case is much lower than the maximum bandwidth of the link, 32GBps. Right: Prior model extraction attacks have taken advantage of this free resource to extract the model quickly [36]. An adversary reco… view at source ↗

**Figure 5.** Figure 5: Iteration-level prefill and decode latency with and without CloakLM. Across all evaluated models and parallelism settings, CloakLM incurs no measurable runtime overhead, with latency distributions closely matching default execution [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Extracting models larger than 64 MB in size is impractical with CloakLM. However, it is possible to extract models as large as 11 GBs with naive methods. parameters are allocated in large, contiguous memory regions and (2) the attacker can recover meaningful virtual-tophysical layout information. Under these conditions, attackers exploit layout regularities to identify layer boundaries and reconstruct p… view at source ↗

**Figure 7.** Figure 7: We evaluate workloads across a range of QPS values and input/output sequence lengths uniformly sampled between 100 and 200 tokens. End-to-end performance remains stable when physical shuffling is performed once every 1,500 iterations. Performance degradation increases proportionally with shuffling frequency. In practice, a single shuffle during initialization is sufficient to obfuscate HBM contents, while … view at source ↗

**Figure 8.** Figure 8: Impact of CloakLM on model initialization and loading latency. PCIe congestion increases loading time proportionally to model size, while virtual shuffling introduces negligible overhead. (a) Normal execution. (b) Model extraction [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: PCIe traffic with interconnect congestion. Both look same and model extraction is severely delayed, from 45 seconds to extract 48GBs to over 8 minutes! Training workloads, by contrast, continuously update parameters, rendering intermediate snapshots transient and requiring fundamentally different threat models; protecting against training-time exfiltration is left for future work. Rather than aiming for c… view at source ↗

read the original abstract

Large foundation models deployed on third-party and shared accelerator infrastructure face a practical risk of model exfiltration that existing defenses do not fully address. In common serving deployments, model providers control the VM or bare-metal serving stack but not the surrounding hardware substrate. The host to GPU interconnect, accelerator fabric, and neighboring infrastructure components remain outside the tenant's trust boundary and have been shown to be exploitable. Hermes demonstrates lossless DNN reconstruction from passive PCIe observation, while TunnelS exfiltrates HBM contents at high throughput via driver-level access without disrupting inference. Co-tenant VMs can further access memory-mapped interfaces or misconfigured RDMA regions without physical co-location. These attacks exploit a common property of ML systems: model weights are stored in large, contiguous, and repeatedly accessed memory regions, making intercepted PCIe transfers and HBM dumps rich enough to reveal model structure and parameters. We present CloakLM, a software-only memory-obfuscation framework that removes this structural regularity without changing the inference stack's logical view of memory. CloakLM combines three mechanisms: PCIe traffic shaping, inter- and intra-layer weight shuffling, and physical HBM page remapping. Authorized execution retains a valid virtual memory layout with negligible overhead, while unauthorized observers see fragmented and semantically incoherent state. CloakLM integrates with vLLM and PyTorch, requires no hardware changes, and complements confidential computing. Evaluation on distributed inference workloads using LLaMA and Qwen models shows near-native performance while significantly increasing resistance to PCIe snooping and HBM dump attacks, making inference-time model exfiltration substantially less practical.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CloakLM bundles three software mechanisms to scramble GPU memory layouts against exfiltration attacks while keeping inference intact, but the real test is whether the claimed low overhead and attack resistance hold up in the experiments.

read the letter

The punchline for this paper is that CloakLM provides a practical software defense against model stealing through memory observation on GPUs by scrambling the physical layout without breaking the serving software's view of the model.

It is new in how it bundles PCIe traffic shaping to alter transfer patterns, shuffling of weights within and across layers, and remapping of HBM pages into one system. This is applied to real models and frameworks, which is more than just proposing one technique.

The paper does well in identifying the common weakness in current ML deployments where weights sit in large contiguous regions that are easy to reconstruct from passive observations or direct dumps. By keeping changes in software and showing integration with popular tools, it lowers the barrier to adoption. It also notes that it works alongside confidential computing rather than replacing it.

Where it is softer is the lack of concrete evaluation data in the high-level description. Claims of near-native performance and significantly increased resistance need the actual measurements, methodology, and any analysis of potential new attack surfaces to be convincing. If the full paper has solid experiments with numbers and comparisons, that would strengthen it a lot. The assumption that the listed attacks can't adapt to the obfuscated patterns is key and should be tested.

This paper is aimed at systems researchers and practitioners dealing with secure inference serving on shared or untrusted hardware. Someone looking for defenses in multi-tenant GPU environments would find it relevant.

It deserves serious peer review because the problem is timely and the proposed solution is implementable. The referee process can help verify the results and suggest improvements on the evaluation.

Recommendation: Yes, send it to referees for a full review.

Referee Report

2 major / 2 minor

Summary. The paper claims that model exfiltration attacks (Hermes via PCIe, TunnelS via HBM dumps, co-tenant RDMA) succeed because weights occupy large contiguous repeatedly-accessed regions; CloakLM counters this with three software mechanisms—PCIe traffic shaping, inter-/intra-layer weight shuffling, and physical HBM page remapping—that destroy structural regularity visible to an adversary while preserving a correct virtual-memory view for the inference stack (vLLM/PyTorch). Authorized execution incurs negligible overhead; unauthorized observers see fragmented, semantically incoherent state. The approach requires no hardware changes and is evaluated on distributed LLaMA and Qwen inference workloads.

Significance. If the quantitative claims hold, the work supplies a practical, software-only complement to confidential computing for third-party GPU serving that directly targets the memory-layout regularity exploited by existing side-channel and dump attacks. The absence of new hardware or changes to the logical inference API is a notable strength.

major comments (2)

[Abstract, Evaluation section] Abstract and §Evaluation: the manuscript asserts 'near-native performance' and 'significantly increasing resistance' on LLaMA/Qwen workloads yet supplies no quantitative metrics (throughput, latency, attack success rate before/after), baselines, error bars, or methodology details. This absence makes the central performance and security claims unverifiable and load-bearing for the contribution.
[Section 3] §3 (mechanisms): the description of how PCIe traffic shaping, weight shuffling, and HBM remapping interact to avoid introducing new observable patterns (e.g., new timing channels or access-frequency artifacts) is stated at a high level; a concrete argument or micro-benchmark showing that the obfuscated layout does not create exploitable regularity is required to support the 'no new patterns' claim.

minor comments (2)

[Section 3] Notation for the three mechanisms is introduced without a summary table relating each mechanism to the attack surface it targets (PCIe vs. HBM vs. co-tenant).
[Implementation section] Integration details with vLLM (e.g., which hooks or memory allocators are modified) are mentioned but not accompanied by a code or configuration snippet.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for verifiable quantitative claims and more detailed analysis of the obfuscation mechanisms. We address each major comment below and will revise the manuscript to strengthen these aspects.

read point-by-point responses

Referee: [Abstract, Evaluation section] Abstract and §Evaluation: the manuscript asserts 'near-native performance' and 'significantly increasing resistance' on LLaMA/Qwen workloads yet supplies no quantitative metrics (throughput, latency, attack success rate before/after), baselines, error bars, or methodology details. This absence makes the central performance and security claims unverifiable and load-bearing for the contribution.

Authors: We agree that the current version of the manuscript does not provide the specific quantitative metrics, baselines, error bars, or methodology details in the abstract and evaluation sections, which renders the performance and security claims difficult to verify. In the revised manuscript, we will add concrete throughput and latency numbers (with native vLLM/PyTorch baselines), attack success rates before/after applying CloakLM, error bars from repeated runs, and full methodology descriptions for the LLaMA and Qwen workloads. This will directly support the claims of near-native performance and increased resistance. revision: yes
Referee: [Section 3] §3 (mechanisms): the description of how PCIe traffic shaping, weight shuffling, and HBM remapping interact to avoid introducing new observable patterns (e.g., new timing channels or access-frequency artifacts) is stated at a high level; a concrete argument or micro-benchmark showing that the obfuscated layout does not create exploitable regularity is required to support the 'no new patterns' claim.

Authors: We acknowledge that §3 currently presents the interaction of the three mechanisms at a high level without concrete evidence against new patterns. In the revision, we will expand this section with a detailed argument on why the combined obfuscations (PCIe shaping + shuffling + remapping) avoid introducing timing channels or access-frequency artifacts, supported by micro-benchmark results measuring observable regularity in the obfuscated state versus the original. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a systems description of three software mechanisms (PCIe traffic shaping, weight shuffling, HBM page remapping) that preserve logical memory view while fragmenting physical layout. No equations, derivations, fitted parameters, or first-principles predictions appear anywhere in the provided text. Claims rest on design properties and empirical evaluation rather than any self-referential reduction. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes a practical software system and introduces no free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5828 in / 1016 out tokens · 30670 ms · 2026-06-26T21:40:54.146941+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 14 canonical work pages

[1]

Anthropic. 2025. Claude Sonnet 4.5.https://www.anthropic.com/ news/claude-sonnet-4-5

2025
[2]

Marcin Chrapek, Marcin Copik, Etienne Mettaz, and Torsten Hoefler
[3]

arXiv:2509.18886 [cs.PF]https://arxiv.org/abs/2509

Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs. arXiv:2509.18886 [cs.PF]https://arxiv.org/abs/2509. 18886

arXiv
[4]

Carbon Credits. 2026. NVIDIA Controls 92% of the GPU Market in 2025 and Reveals Next Gen AI Supercomputer

2026
[5]

Michael Davies, Neal Crago, Karthikeyan Sankaralingam, and Christos Kozyrakis. 2025. LIMINAL: Exploring The Frontiers of LLM Decode Performance. arXiv:2507.14397 [cs.AR]https://arxiv.org/abs/2507. 14397

arXiv 2025
[6]

Zachary DeVito. 2022. A guide to PyTorch’s CUDA Caching Allocator. https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html

2022
[7]

Sankha Baran Dutta, Hoda Naghibijouybari, Arjun Gupta, Nael Abu- Ghazaleh, Andres Marquez, and Kevin Barker. 2023. Spy in the GPU- box: Covert and Side Channel Attacks on Multi-GPU Systems. In Proceedings of the 50th Annual International Symposium on Computer Architecture(Orlando, FL, USA)(ISCA ’23). Association for Computing Machinery, New York, NY, USA,...

arXiv 2023
[8]

Jonah Ekelund, Stefano Markidis, and Ivy Peng. 2025. Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs. In2025 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP). 70–77. doi:10.1109/PDP66500.2025.00019

work page doi:10.1109/pdp66500.2025.00019 2025
[9]

Yansong Gao, Huming Qiu, Zhi Zhang, Binghui Wang, Hua Ma, Al- sharif Abuadbba, Minhui Xue, Anmin Fu, and Surya Nepal. 2024. DeepTheft: Stealing DNN Model Architectures through Power Side Channel. In2024 IEEE Symposium on Security and Privacy (SP). 3311–

2024
[10]

doi:10.1109/SP54263.2024.00250

work page doi:10.1109/sp54263.2024.00250 2024
[11]

Michael Godfrey and Mohammad Zulkernine. 2014. Preventing cache- based side-channel attacks in a cloud environment.IEEE Transac- tions on Cloud Computing2, 4 (2014), 395–408. doi:10.1109/TCC.2014. 2358236

work page doi:10.1109/tcc.2014 2014
[12]

Google. 2025. Gemini.https://gemini.google.com/

2025
[13]

Zhongshu Gu, Enriquillo Valdez, Salman Ahmed, Julian James Stephen, Michael Le, Hani Jamjoom, Shixuan Zhao, and Zhiqiang Lin. 2025. NVIDIA GPU Confidential Computing Demystified. arXiv:2507.02770 [cs.CR]https://arxiv.org/abs/2507.02770

Pith/arXiv arXiv 2025
[14]

Yanan Guo, Zhenkai Zhang, and Jun Yang. 2024. GPU Memory Exploitation for Fun and Profit. In33rd USENIX Security Sympo- sium (USENIX Security 24). USENIX Association, Philadelphia, PA, 4033–4050.https://www.usenix.org/conference/usenixsecurity24/ presentation/guo-yanan

2024
[15]

Red Hat. 2024. CVE-2024-3094.https://access.redhat.com/security/ cve/cve-2024-3094

2024
[16]

Xing Hu, Ling Liang, Shuangchen Li, Lei Deng, Pengfei Zuo, Yu Ji, Xinfeng Xie, Yufei Ding, Chang Liu, Timothy Sherwood, and Yuan Xie. 2020. DeepSniffer: A DNN Model Extraction Framework Based on Learning Architectural Hints. InProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Lan- guages and Operating System...

work page doi:10.1145/3373376.3378460 2020
[17]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gi- anna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.0...

Pith/arXiv arXiv 2023
[18]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie- Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teve...

Pith/arXiv arXiv 2024
[19]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23). As- sociation for Computing Machinery, New Yor...

work page doi:10.1145/3600006.3613165 2023
[20]

Fangfei Liu, Hao Wu, Kenneth Mai, and Ruby B. Lee. 2016. Newcache: Secure Cache Architecture Thwarting Cache Side-Channel Attacks. IEEE Micro36, 5 (2016), 8–16. doi:10.1109/MM.2016.85

work page doi:10.1109/mm.2016.85 2016
[21]

McKinsey and Company. 2025. The next big shifts in AI workloads and hyperscaler strategies.https://www.mckinsey.com/industries/ technology-media-and-telecommunications/our-insights/the-next- big-shifts-in-ai-workloads-and-hyperscaler-strategies

2025
[22]

Meta. 2025. The Llama 4 herd: The beginning of a new era of na- tively multimodal AI innovation.https://ai.meta.com/blog/llama-4- multimodal-intelligence/

2025
[23]

Onur Mutlu and Jeremie S. Kim. 2020. RowHammer: A Retrospective. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems39, 8 (2020), 1555–1571. doi:10.1109/TCAD.2019.2915318

work page doi:10.1109/tcad.2019.2915318 2020
[24]

Hoda Naghibijouybari, Ajaya Neupane, Zhiyun Qian, and Nael Abu- Ghazaleh. 2018. Rendered Insecure: GPU Side Channel Attacks are Practical. InProceedings of the 2018 ACM SIGSAC Conference on Com- puter and Communications Security(Toronto, Canada)(CCS ’18). As- sociation for Computing Machinery, New York, NY, USA, 2139–2153. doi:10.1145/3243734.3243831

work page doi:10.1145/3243734.3243831 2018
[25]

2026.https://www.nvidia.com/en-us/data-center/solutions/ confidential-computing/

NVIDIA. 2026.https://www.nvidia.com/en-us/data-center/solutions/ confidential-computing/

2026
[26]

Zhenghang Ren, Yuxuan Li, Zilong Wang, Xinyang Huang, Wenxue Li, Kaiqiang Xu, Xudong Liao, Yijun Sun, Bowen Liu, Han Tian, Junxue Zhang, Mingfei Wang, Zhizhen Zhong, Guyue Liu, Ying Zhang, and 13 Preprint, 2026 Anonymous Submission Kai Chen. 2025. Enabling efficient GPU communication over multiple NICs with FuseLink. InProceedings of the 19th USENIX Confe...

2026
[27]

Benjamin Rothenberger, Konstantin Taranov, Adrian Perrig, and Torsten Hoefler. 2021. ReDMArk: Bypassing RDMA Security Mech- anisms. In30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 4277–4292.https://www.usenix.org/conference/ usenixsecurity21/presentation/rothenberger

2021
[28]

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

Pith/arXiv arXiv 2025
[29]

Tyler Sorensen and Heidy Khlaaf. 2024. LeftoverLocals: Lis- tening to LLM Responses Through Leaked GPU Local Memory. arXiv:2401.16603 [cs.CR]https://arxiv.org/abs/2401.16603

arXiv 2024
[30]

Mingtian Tan, Junpeng Wan, Zhe Zhou, and Zhou Li. 2021. Invisible Probe: Timing Attacks with PCIe Congestion Side-channel. In2021 IEEE Symposium on Security and Privacy (SP). 322–338. doi:10.1109/ SP40001.2021.00059

arXiv 2021
[31]

Yifan Tan and Zeyu Mi. 2024. Performance Analysis and Optimization of Nvidia H100 Confidential Computing for AI Workloads. In2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). 1426–1432. doi:10.1109/ISPA63168.2024.00192

work page doi:10.1109/ispa63168.2024.00192 2024
[32]

National Telecommunications and Information Administration. [n. d.]. Dual-Use Foundation Models With Widely Available Model Weights Report.https://www.ntia.gov/programs-and-initiatives/artificial- intelligence/open-model-weights-report 14 CloakLM Preprint, 2026

2026
[33]

Junyi Wei, Yicheng Zhang, Zhe Zhou, Zhou Li, and Mohammad Ab- dullah Al Faruque. 2020. Leaky DNN: Stealing Deep-Learning Model Secret with GPU Context-Switching Side-Channel. In2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 125–137. doi:10.1109/DSN48063.2020.00031

work page doi:10.1109/dsn48063.2020.00031 2020
[34]

Zhenyu Wu, Zhang Xu, and Haining Wang. 2012. Whispers in the hyper-space: high-speed covert channel attacks in the cloud. InProceed- ings of the 21st USENIX Conference on Security Symposium(Bellevue, WA)(Security’12). USENIX Association, USA, 9

2012
[35]

Yuval Yarom and Katrina Falkner. 2014. FLUSH+RELOAD: a high resolution, low noise, L3 cache side-channel attack. InProceedings of the 23rd USENIX Conference on Security Symposium(San Diego, CA) (SEC’14). USENIX Association, USA, 719–732

2014
[36]

Reiter, and Thomas Risten- part

Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Risten- part. 2014. Cross-Tenant Side-Channel Attacks in PaaS Clouds. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security(Scottsdale, Arizona, USA)(CCS ’14). As- sociation for Computing Machinery, New York, NY, USA, 990–1003. doi:10.1145/2660267.2660356

work page doi:10.1145/2660267.2660356 2014
[37]

Yicheng Zhang, Ravan Nazaraliyev, Sankha Baran Dutta, Andres Marquez, Kevin Barker, and Nael Abu-Ghazaleh. 2025. NVBleed: Covert and Side-Channel Attacks on NVIDIA Multi-GPU Interconnect. arXiv:2503.17847 [cs.CR]https://arxiv.org/abs/2503.17847

arXiv 2025
[38]

Zhenkai Zhang, Tyler Allen, Fan Yao, Xing Gao, and Rong Ge. 2023. TunneLs for Bootlegging: Fully Reverse-Engineering GPU TLBs for Challenging Isolation Guarantees of NVIDIA MIG. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security(Copenhagen, Denmark)(CCS ’23). Association for Computing Machinery, New York, NY, USA, 960–...

work page doi:10.1145/3576915.3616672 2023
[39]

Fletcher, and Josep Torrellas

Zirui Neil Zhao, Adam Morrison, Christopher W. Fletcher, and Josep Torrellas. 2024. Last-Level Cache Side-Channel Attacks Are Feasible in the Modern Public Cloud. InProceedings of the 29th ACM Interna- tional Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(La Jolla, CA, USA)(ASPLOS ’24). Association for Comput...

work page doi:10.1145/3620665.3640403 2024
[40]

Fletcher, and Josep Torrellas

Zirui Neil Zhao, Adam Morrison, Christopher W. Fletcher, and Josep Torrellas. 2025. From Colocation to Exfiltration: Practical Cache Side- Channel Attacks in the Modern Public Cloud.IEEE Micro45, 4 (2025), 95–102. doi:10.1109/MM.2025.3574715

work page doi:10.1109/mm.2025.3574715 2025
[41]

Yuankun Zhu, Yueqiang Cheng, Husheng Zhou, and Yantao Lu
[42]

In30th USENIX Security Symposium (USENIX Security 21)

Hermes Attack: Steal DNN Models with Lossless Inference Accuracy. In30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 1973–1988.https://www.usenix.org/conference/ usenixsecurity21/presentation/zhu 15

1973

[1] [1]

Anthropic. 2025. Claude Sonnet 4.5.https://www.anthropic.com/ news/claude-sonnet-4-5

2025

[2] [2]

Marcin Chrapek, Marcin Copik, Etienne Mettaz, and Torsten Hoefler

[3] [3]

arXiv:2509.18886 [cs.PF]https://arxiv.org/abs/2509

Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs. arXiv:2509.18886 [cs.PF]https://arxiv.org/abs/2509. 18886

arXiv

[4] [4]

Carbon Credits. 2026. NVIDIA Controls 92% of the GPU Market in 2025 and Reveals Next Gen AI Supercomputer

2026

[5] [5]

Michael Davies, Neal Crago, Karthikeyan Sankaralingam, and Christos Kozyrakis. 2025. LIMINAL: Exploring The Frontiers of LLM Decode Performance. arXiv:2507.14397 [cs.AR]https://arxiv.org/abs/2507. 14397

arXiv 2025

[6] [6]

Zachary DeVito. 2022. A guide to PyTorch’s CUDA Caching Allocator. https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html

2022

[7] [7]

Sankha Baran Dutta, Hoda Naghibijouybari, Arjun Gupta, Nael Abu- Ghazaleh, Andres Marquez, and Kevin Barker. 2023. Spy in the GPU- box: Covert and Side Channel Attacks on Multi-GPU Systems. In Proceedings of the 50th Annual International Symposium on Computer Architecture(Orlando, FL, USA)(ISCA ’23). Association for Computing Machinery, New York, NY, USA,...

arXiv 2023

[8] [8]

Jonah Ekelund, Stefano Markidis, and Ivy Peng. 2025. Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs. In2025 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP). 70–77. doi:10.1109/PDP66500.2025.00019

work page doi:10.1109/pdp66500.2025.00019 2025

[9] [9]

Yansong Gao, Huming Qiu, Zhi Zhang, Binghui Wang, Hua Ma, Al- sharif Abuadbba, Minhui Xue, Anmin Fu, and Surya Nepal. 2024. DeepTheft: Stealing DNN Model Architectures through Power Side Channel. In2024 IEEE Symposium on Security and Privacy (SP). 3311–

2024

[10] [10]

doi:10.1109/SP54263.2024.00250

work page doi:10.1109/sp54263.2024.00250 2024

[11] [11]

Michael Godfrey and Mohammad Zulkernine. 2014. Preventing cache- based side-channel attacks in a cloud environment.IEEE Transac- tions on Cloud Computing2, 4 (2014), 395–408. doi:10.1109/TCC.2014. 2358236

work page doi:10.1109/tcc.2014 2014

[12] [12]

Google. 2025. Gemini.https://gemini.google.com/

2025

[13] [13]

Zhongshu Gu, Enriquillo Valdez, Salman Ahmed, Julian James Stephen, Michael Le, Hani Jamjoom, Shixuan Zhao, and Zhiqiang Lin. 2025. NVIDIA GPU Confidential Computing Demystified. arXiv:2507.02770 [cs.CR]https://arxiv.org/abs/2507.02770

Pith/arXiv arXiv 2025

[14] [14]

Yanan Guo, Zhenkai Zhang, and Jun Yang. 2024. GPU Memory Exploitation for Fun and Profit. In33rd USENIX Security Sympo- sium (USENIX Security 24). USENIX Association, Philadelphia, PA, 4033–4050.https://www.usenix.org/conference/usenixsecurity24/ presentation/guo-yanan

2024

[15] [15]

Red Hat. 2024. CVE-2024-3094.https://access.redhat.com/security/ cve/cve-2024-3094

2024

[16] [16]

Xing Hu, Ling Liang, Shuangchen Li, Lei Deng, Pengfei Zuo, Yu Ji, Xinfeng Xie, Yufei Ding, Chang Liu, Timothy Sherwood, and Yuan Xie. 2020. DeepSniffer: A DNN Model Extraction Framework Based on Learning Architectural Hints. InProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Lan- guages and Operating System...

work page doi:10.1145/3373376.3378460 2020

[17] [17]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gi- anna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.0...

Pith/arXiv arXiv 2023

[18] [18]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie- Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teve...

Pith/arXiv arXiv 2024

[19] [19]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23). As- sociation for Computing Machinery, New Yor...

work page doi:10.1145/3600006.3613165 2023

[20] [20]

Fangfei Liu, Hao Wu, Kenneth Mai, and Ruby B. Lee. 2016. Newcache: Secure Cache Architecture Thwarting Cache Side-Channel Attacks. IEEE Micro36, 5 (2016), 8–16. doi:10.1109/MM.2016.85

work page doi:10.1109/mm.2016.85 2016

[21] [21]

McKinsey and Company. 2025. The next big shifts in AI workloads and hyperscaler strategies.https://www.mckinsey.com/industries/ technology-media-and-telecommunications/our-insights/the-next- big-shifts-in-ai-workloads-and-hyperscaler-strategies

2025

[22] [22]

Meta. 2025. The Llama 4 herd: The beginning of a new era of na- tively multimodal AI innovation.https://ai.meta.com/blog/llama-4- multimodal-intelligence/

2025

[23] [23]

Onur Mutlu and Jeremie S. Kim. 2020. RowHammer: A Retrospective. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems39, 8 (2020), 1555–1571. doi:10.1109/TCAD.2019.2915318

work page doi:10.1109/tcad.2019.2915318 2020

[24] [24]

Hoda Naghibijouybari, Ajaya Neupane, Zhiyun Qian, and Nael Abu- Ghazaleh. 2018. Rendered Insecure: GPU Side Channel Attacks are Practical. InProceedings of the 2018 ACM SIGSAC Conference on Com- puter and Communications Security(Toronto, Canada)(CCS ’18). As- sociation for Computing Machinery, New York, NY, USA, 2139–2153. doi:10.1145/3243734.3243831

work page doi:10.1145/3243734.3243831 2018

[25] [25]

2026.https://www.nvidia.com/en-us/data-center/solutions/ confidential-computing/

NVIDIA. 2026.https://www.nvidia.com/en-us/data-center/solutions/ confidential-computing/

2026

[26] [26]

Zhenghang Ren, Yuxuan Li, Zilong Wang, Xinyang Huang, Wenxue Li, Kaiqiang Xu, Xudong Liao, Yijun Sun, Bowen Liu, Han Tian, Junxue Zhang, Mingfei Wang, Zhizhen Zhong, Guyue Liu, Ying Zhang, and 13 Preprint, 2026 Anonymous Submission Kai Chen. 2025. Enabling efficient GPU communication over multiple NICs with FuseLink. InProceedings of the 19th USENIX Confe...

2026

[27] [27]

Benjamin Rothenberger, Konstantin Taranov, Adrian Perrig, and Torsten Hoefler. 2021. ReDMArk: Bypassing RDMA Security Mech- anisms. In30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 4277–4292.https://www.usenix.org/conference/ usenixsecurity21/presentation/rothenberger

2021

[28] [28]

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

Pith/arXiv arXiv 2025

[29] [29]

Tyler Sorensen and Heidy Khlaaf. 2024. LeftoverLocals: Lis- tening to LLM Responses Through Leaked GPU Local Memory. arXiv:2401.16603 [cs.CR]https://arxiv.org/abs/2401.16603

arXiv 2024

[30] [30]

Mingtian Tan, Junpeng Wan, Zhe Zhou, and Zhou Li. 2021. Invisible Probe: Timing Attacks with PCIe Congestion Side-channel. In2021 IEEE Symposium on Security and Privacy (SP). 322–338. doi:10.1109/ SP40001.2021.00059

arXiv 2021

[31] [31]

Yifan Tan and Zeyu Mi. 2024. Performance Analysis and Optimization of Nvidia H100 Confidential Computing for AI Workloads. In2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). 1426–1432. doi:10.1109/ISPA63168.2024.00192

work page doi:10.1109/ispa63168.2024.00192 2024

[32] [32]

National Telecommunications and Information Administration. [n. d.]. Dual-Use Foundation Models With Widely Available Model Weights Report.https://www.ntia.gov/programs-and-initiatives/artificial- intelligence/open-model-weights-report 14 CloakLM Preprint, 2026

2026

[33] [33]

Junyi Wei, Yicheng Zhang, Zhe Zhou, Zhou Li, and Mohammad Ab- dullah Al Faruque. 2020. Leaky DNN: Stealing Deep-Learning Model Secret with GPU Context-Switching Side-Channel. In2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 125–137. doi:10.1109/DSN48063.2020.00031

work page doi:10.1109/dsn48063.2020.00031 2020

[34] [34]

Zhenyu Wu, Zhang Xu, and Haining Wang. 2012. Whispers in the hyper-space: high-speed covert channel attacks in the cloud. InProceed- ings of the 21st USENIX Conference on Security Symposium(Bellevue, WA)(Security’12). USENIX Association, USA, 9

2012

[35] [35]

Yuval Yarom and Katrina Falkner. 2014. FLUSH+RELOAD: a high resolution, low noise, L3 cache side-channel attack. InProceedings of the 23rd USENIX Conference on Security Symposium(San Diego, CA) (SEC’14). USENIX Association, USA, 719–732

2014

[36] [36]

Reiter, and Thomas Risten- part

Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Risten- part. 2014. Cross-Tenant Side-Channel Attacks in PaaS Clouds. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security(Scottsdale, Arizona, USA)(CCS ’14). As- sociation for Computing Machinery, New York, NY, USA, 990–1003. doi:10.1145/2660267.2660356

work page doi:10.1145/2660267.2660356 2014

[37] [37]

Yicheng Zhang, Ravan Nazaraliyev, Sankha Baran Dutta, Andres Marquez, Kevin Barker, and Nael Abu-Ghazaleh. 2025. NVBleed: Covert and Side-Channel Attacks on NVIDIA Multi-GPU Interconnect. arXiv:2503.17847 [cs.CR]https://arxiv.org/abs/2503.17847

arXiv 2025

[38] [38]

Zhenkai Zhang, Tyler Allen, Fan Yao, Xing Gao, and Rong Ge. 2023. TunneLs for Bootlegging: Fully Reverse-Engineering GPU TLBs for Challenging Isolation Guarantees of NVIDIA MIG. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security(Copenhagen, Denmark)(CCS ’23). Association for Computing Machinery, New York, NY, USA, 960–...

work page doi:10.1145/3576915.3616672 2023

[39] [39]

Fletcher, and Josep Torrellas

Zirui Neil Zhao, Adam Morrison, Christopher W. Fletcher, and Josep Torrellas. 2024. Last-Level Cache Side-Channel Attacks Are Feasible in the Modern Public Cloud. InProceedings of the 29th ACM Interna- tional Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(La Jolla, CA, USA)(ASPLOS ’24). Association for Comput...

work page doi:10.1145/3620665.3640403 2024

[40] [40]

Fletcher, and Josep Torrellas

Zirui Neil Zhao, Adam Morrison, Christopher W. Fletcher, and Josep Torrellas. 2025. From Colocation to Exfiltration: Practical Cache Side- Channel Attacks in the Modern Public Cloud.IEEE Micro45, 4 (2025), 95–102. doi:10.1109/MM.2025.3574715

work page doi:10.1109/mm.2025.3574715 2025

[41] [41]

Yuankun Zhu, Yueqiang Cheng, Husheng Zhou, and Yantao Lu

[42] [42]

In30th USENIX Security Symposium (USENIX Security 21)

Hermes Attack: Steal DNN Models with Lossless Inference Accuracy. In30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 1973–1988.https://www.usenix.org/conference/ usenixsecurity21/presentation/zhu 15

1973