GNStor: Design of GPU-Native High-Performance Remote All-Flash Array

Chen Tian; Chenying Huan; Guoci Chen; Jie Zhang; Junrong Zhu; Mao Bo; Shengwen Liang; Shushu Yi; Wenbo Wu

arxiv: 2606.04908 · v1 · pith:4TGJPEHZnew · submitted 2026-06-03 · 💻 cs.OS

GNStor: Design of GPU-Native High-Performance Remote All-Flash Array

Shushu Yi , Wenbo Wu , Guoci Chen , Junrong Zhu , Shengwen Liang , Mao Bo , Chenying Huan , Chen Tian

show 1 more author

Jie Zhang

This is my paper

Pith reviewed 2026-06-28 03:21 UTC · model grok-4.3

classification 💻 cs.OS

keywords GPU-native storageremote all-flash arrayNVMe over RDMACPU-bypass I/Odecentralized AFA enginehigh-performance storageGPU direct accessSSD firmware integration

0 comments

The pith

GNStor lets GPUs access remote all-flash arrays directly without CPU involvement to reach 3.2 times higher I/O throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that existing GPU-AFA setups incur heavy overhead because the CPU still handles all remote I/O orchestration and AFA functions. By redesigning the stack so the GPU itself starts and manages requests to remote SSDs, the system removes CPU-GPU handoffs and traffic amplification. A sympathetic reader would care because GPU applications increasingly rely on large shared remote storage, and any reduction in I/O friction directly speeds up those workloads. The design keeps required AFA features intact by moving them into the SSDs rather than leaving them on the CPU.

Core claim

GNStor is a GPU-native remote AFA system built around GNoR, a GPU-centric NVMe over RDMA stack that lets GPUs issue I/O requests directly using atomic operations and the SIMT execution model, and deEngine, which decomposes AFA tasks such as access control and metadata persistence into each SSD firmware. This produces a complete CPU-bypass path while preserving AFA semantics, yielding 3.2 times higher I/O throughput and 31.1 percent shorter application execution times than prior AFA systems.

What carries the argument

GNoR, the GPU-centric NVMe over RDMA stack that orchestrates I/O with atomic operations under the SIMT model, together with deEngine, the decentralized AFA engine placed inside SSD firmware.

If this is right

GPU applications gain direct use of remote AFA bandwidth without CPU orchestration costs.
I/O throughput rises by a measured factor of 3.2 compared with current CPU-centric AFA designs.
End-to-end application run times fall by 31.1 percent under the same workloads.
AFA-level guarantees remain available even though the CPU is removed from the I/O path.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same CPU-bypass pattern could be tested on other accelerators that currently route storage through a host CPU.
Clusters built around many GPUs might reduce their total CPU count if storage management moves entirely into the devices.
Further AFA features could be pushed into SSD firmware to test whether additional performance headroom appears.

Load-bearing premise

Essential AFA functions such as access control and metadata persistence can be decomposed and placed inside SSD firmware while keeping correctness and low overhead in a direct GPU-to-SSD path.

What would settle it

A measurement showing that access-control logic moved into SSD firmware either allows unauthorized access or adds latency that cancels the reported throughput gains would disprove the central claim.

Figures

Figures reproduced from arXiv: 2606.04908 by Chen Tian, Chenying Huan, Guoci Chen, Jie Zhang, Junrong Zhu, Mao Bo, Shengwen Liang, Shushu Yi, Wenbo Wu.

**Figure 1.** Figure 1: I/O path comparison of different GPU-AFA systems. provide fault tolerance for long-term data. For instance, the training corpus of LLMs has reached PB scale, which is persistently stored on remote AFA and shared across multiple GPU clients [14, 35, 47, 78]. Moreover, LLM training frequently generates checkpoints (including intermediate model weights and optimizer states), which can amount to tens of TBs … view at source ↗

**Figure 2.** Figure 2: Architectures of AFA and SSD. failures; (4) High performance: These datasets are frequently accessed by GPU during computation, making I/O performance critical. For example, training processes repeatedly load the corpus for computation, while inference workloads demand low-latency access to shared data (e.g., KV cache). 2.2 All-Flash Array and SSD Architecture All-flash array. Due to the stringent storage… view at source ↗

**Figure 3.** Figure 3: Overview of GNStor. Challenge 1). GNoR is built from the ground up to exploit the massive parallelism of GPU architecture. Specifically, GNoR remains non-I/O-critical tasks and data structures (e.g., keep-alive handling and NVMe admin queues) on the CPU, as they are rarely called and have a minor impact on steady-state performance. In contrast, GNoR migrates I/O-critical components (e.g., NVMe I/O queues a… view at source ↗

**Figure 4.** Figure 4: Initialization and I/O procedures of channel. it also notifies the daemon via RPC. The daemon then updates volume permission table on SSDs with customized VOLUME DELETE/CHMOD command and then acknowledges the client. GNStor assumes all clients in the GPU-AFA system are trusted (e.g., in private clusters). Access control is primarily designed to support data sharing with consistency (i.e., accessing corre… view at source ↗

**Figure 6.** Figure 6: Merged FTL mapping table in deEngine. completely eliminates CPU intervention in I/O path, thereby achieving higher performance. Access control. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Illustration of batched I/O. critical metadata can be reliably recovered after system reboot. In GNStor, since AFA-level mapping tables have been integrated into the SSD-resident FTL, they naturally benefit from SSD-internal persistence and recovery mechanisms. This design eliminates the need for additional logging or checkpointing in the AFA layer. When a GPU client recovers or migrates, it can re-establ… view at source ↗

**Figure 9.** Figure 9: Throughput comparison on microbenchmarks. SSD/CPU-core ratio in commercial AFA products [13]. For real-world applications, we choose tensor computing, data pre-processing, graph analytics, and LLM training (cf [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 12.** Figure 12: Scalability test with different numbers of SSDs. 4KB-Rd 4KB-Wr 64KB-Rd 64KB-Wr 0 4 8 1 2 1 6 (a) Ban dwi dth . B a n d wi d t h ( G B / s ) GD GD+deE n gi n e GN Stor 4KB-Rd 4KB-Wr 64KB-Rd 64KB-Wr 0 1 00 200 300 (b) Laten cy. A v g . l a t e n c y ( u s ) [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗

**Figure 13.** Figure 13: Ablation analysis. gains limited benefits in multi-client scenarios, even with extra CPU threads. For example, it achieves only 4.4 GB/s and 4.1 GB/s read and write throughput in 64 KB tests. This can be attributed to its CPU-centric design, in which the CPU-GPU interaction and memory copy overhead become the performance bottleneck. GD is more scalable than Basic, thanks to its peer-to-peer data transfer … view at source ↗

**Figure 11.** Figure 11: Scalability test with different numbers of clients. GD enables data zero-copy between GPU and NIC, avoiding the time-consuming detour to host memory. GNStor further reduces read and write latency by 35.7% and 39.8%, thanks to our fully CPU-bypass design, which paves the shortest I/O path between GPU and AFA. 5.3 Scalability Test Different numbers of clients. We further measure the scalability of different… view at source ↗

**Figure 14.** Figure 14: Tensor computing. Basi c GDGN Stor 0 1 0 20 30 40 50 60 Read Wri te Compu te T i m e ( s ) (b) E xec. time. Basi c GDGN Stor 0 400 800 1 200 1 600 T h r o u g h p u t ( i m a g e s / s ) (a) Th rou gh pu t [PITH_FULL_IMAGE:figures/full_fig_p011_14.png] view at source ↗

**Figure 17.** Figure 17: GPT-2 training. trains 50 steps (Compute), and then writes back the checkpoint (including intermediate model weights and optimizer states, Checkpoint). Training datasets are cached in local host memory for fast loading. As shown in [PITH_FULL_IMAGE:figures/full_fig_p012_17.png] view at source ↗

read the original abstract

GPU has become the leading computing device for a wide range of data-intensive applications, which tightly collaborates with remote all-flash array (AFA) to accommodate ever-expanding datasets, facilitate multi-client data sharing, and guarantee fault tolerance. Although GPU is the center of computation, all I/O processes in existing GPU-AFA systems are still CPU-centric. CPU orchestrates remote I/O requests and executes a centralized AFA engine to take charge of AFA-level functionalities (e.g., access control and metadata persistence). This design disparity suffers from substantial CPU-GPU interaction overhead and I/O traffic amplification, compromising end-to-end I/O performance. In this work, we present \emph{GNStor}, a GPU-native AFA system that enables GPU to directly access remote AFA without CPU intervention in the I/O path, thereby fully exploiting the performance of AFA. Specifically, GNStor first proposes a GPU-centric NVMe over RDMA (NoR) software stack (named \emph{GNoR}), paving a fast path for GPUs to directly initiate NoR I/O requests to SSDs within remote AFA. GNoR employs an atomic-operation-based I/O orchestration design and follows the single-instruction-multiple-thread (SIMT) execution model of GPU, fully exploiting the massive parallelism of GPU architectures. To facilitate essential AFA functionalities in a CPU-bypass I/O path, GNStor further designs \emph{deEngine}, a decentralized AFA engine that seamlessly decomposes and integrates AFA-level tasks into each SSD firmware, thereby achieving efficient AFA access at low cost. Evaluation results show that GNStor achieves 3.2$\times$ higher I/O throughput and reduces application execution time by 31.1\%, compared to state-of-the-art AFA systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GNStor pushes AFA functions into SSD firmware for direct GPU access, but the cross-device coordination in deEngine is the untested piece that determines if the gains are real.

read the letter

The main point is that GNStor removes the CPU from the remote I/O path for GPU workloads by building a GPU-native NVMe-over-RDMA stack and then splitting AFA tasks into per-SSD firmware. GNoR uses atomic operations to match GPU's SIMT model, and deEngine tries to handle access control and metadata persistence without central CPU involvement.

This is a clear departure from the CPU-centric designs that dominate current GPU-AFA systems. The paper correctly identifies the interaction overhead and traffic amplification that come from keeping the AFA engine on the host. The reported 3.2× throughput and 31% reduction in application time would matter if they survive real measurement.

The design is plausible on paper. Moving orchestration to GPU threads and decentralizing the engine follows from the stated problem. The architecture description gives a reader a concrete picture of how the bypass path is supposed to work.

The soft spot is exactly where the stress test points: access control and metadata are inherently multi-device. The abstract does not spell out how firmware-level enforcement serializes updates across SSDs or guarantees persistence after power loss without falling back to the CPU. If that coordination adds latency or requires hidden host involvement, the claimed speedup shrinks. The evaluation numbers are given without setup details, baselines, or error bars in the provided text, so the central claim stays plausible rather than demonstrated.

This is for storage and OS researchers who work on accelerator-centric clusters. A reader who needs to understand GPU-direct storage options will get value from the GNoR and deEngine descriptions even if the numbers need verification.

Send it to peer review. The problem is real, the approach is distinct from prior work, and the implementation questions are answerable with the right experiments.

Referee Report

2 major / 0 minor

Summary. The paper presents GNStor, a GPU-native remote all-flash array (AFA) design that enables direct GPU-initiated I/O to remote SSDs without CPU intervention in the data path. It introduces GNoR, a GPU-centric NVMe over RDMA software stack using atomic operations and SIMT execution, and deEngine, a decentralized AFA engine that decomposes tasks such as access control and metadata persistence into per-SSD firmware. The central claim is that this CPU-bypass architecture delivers 3.2× higher I/O throughput and reduces application execution time by 31.1% versus state-of-the-art AFA systems.

Significance. If the performance numbers and correctness of the decentralized engine hold under realistic multi-client workloads, the work would represent a meaningful shift from CPU-orchestrated to GPU-direct remote storage for data-intensive GPU applications, potentially reducing interaction overhead and traffic amplification in GPU-AFA co-designs.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: The specific claims of 3.2× throughput improvement and 31.1% execution-time reduction are stated without any description of experimental setup, baselines, workloads, hardware configuration, or error bars. This absence makes the central performance result unverifiable from the provided text and is load-bearing for the paper's contribution.
[deEngine design] deEngine design (architecture section): The claim that essential AFA functionalities including access control and metadata persistence can be fully decomposed and integrated into individual SSD firmware while preserving correctness in a CPU-bypass path lacks a concrete mechanism for cross-device atomicity, policy replication, or power-loss persistence. If these cannot be handled without races or hidden CPU fallbacks, the reported throughput gains are at risk.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment point by point below and outline the changes we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and Evaluation section: The specific claims of 3.2× throughput improvement and 31.1% execution-time reduction are stated without any description of experimental setup, baselines, workloads, hardware configuration, or error bars. This absence makes the central performance result unverifiable from the provided text and is load-bearing for the paper's contribution.

Authors: We agree that the abstract would benefit from a brief mention of the experimental context to improve immediate verifiability of the central claims. The full evaluation section provides details on the hardware platform (GPU and remote AFA configuration), baselines (state-of-the-art CPU-centric AFA systems), workloads, and reports results with error bars from repeated runs. In the revised version we will expand the abstract with a concise experimental overview and ensure the evaluation section opens with an explicit setup subsection. revision: yes
Referee: [deEngine design] deEngine design (architecture section): The claim that essential AFA functionalities including access control and metadata persistence can be fully decomposed and integrated into individual SSD firmware while preserving correctness in a CPU-bypass path lacks a concrete mechanism for cross-device atomicity, policy replication, or power-loss persistence. If these cannot be handled without races or hidden CPU fallbacks, the reported throughput gains are at risk.

Authors: We acknowledge that the current architecture description would be strengthened by additional concrete mechanisms. The deEngine design relies on per-SSD firmware extensions coordinated via GPU-initiated RDMA atomics, but explicit details on cross-device atomicity (e.g., distributed locking), policy replication, and power-loss persistence (e.g., journaling) are not fully elaborated. We will add a dedicated subsection in the revised architecture section that specifies these mechanisms, discusses potential race conditions, and clarifies that no hidden CPU fallbacks are used in the data path. revision: yes

Circularity Check

0 steps flagged

No circularity: systems design with external evaluation

full rationale

The paper presents an architectural design (GNoR stack and deEngine decomposition) and reports measured throughput gains from implementation and benchmarking. No equations, fitted parameters, or predictions appear; performance claims rest on described hardware/software changes evaluated against external baselines rather than any self-referential reduction or self-citation chain. The central claim is therefore independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

This is an engineering systems paper introducing new software components with no mathematical free parameters, axioms, or invented physical entities; the design rests on standard assumptions about NVMe, RDMA, and SSD firmware capabilities.

invented entities (2)

GNoR no independent evidence
purpose: GPU-centric NVMe over RDMA software stack for direct GPU I/O initiation
New software stack proposed to enable CPU-bypass path
deEngine no independent evidence
purpose: Decentralized AFA engine decomposed into SSD firmware
New component to handle AFA functionalities without CPU

pith-pipeline@v0.9.1-grok · 5910 in / 1101 out tokens · 78728 ms · 2026-06-28T03:21:32.802334+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

78 extracted references · 5 linked inside Pith

[1]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[2]

Key-ssd: Access- control drive to protect files from ransomware attacks.arXiv preprint arXiv:1904.05012, 2019

Jinwoo Ahn, Donggyu Park, Chang-Gyu Lee, Donghyun Min, Junghee Lee, Sungyong Park, Qian Chen, and Youngjae Kim. Key-ssd: Access- control drive to protect files from ransomware attacks.arXiv preprint arXiv:1904.05012, 2019

Pith/arXiv arXiv 1904
[3]

Epyc™9654.https://www .amd.com/en/products/processors/ server/epyc/4th-generation-9004-and-8004-series/amd-epyc- 9654.html

AMD. Epyc™9654.https://www .amd.com/en/products/processors/ server/epyc/4th-generation-9004-and-8004-series/amd-epyc- 9654.html
[4]

The gap bench- mark suite, 2017

Scott Beamer, Krste Asanović, and David Patterson. The gap bench- mark suite, 2017

2017
[5]

Spin: Seamless operating system integration of peer-to-peer dma be- tween ssds and gpus.ACM Transactions on Computer Systems (TOCS), 36(2):1–26, 2019

Shai Bergman, Tanya Brokhman, Tzachi Cohen, and Mark Silberstein. Spin: Seamless operating system integration of peer-to-peer dma be- tween ssds and gpus.ACM Transactions on Computer Systems (TOCS), 36(2):1–26, 2019

2019
[6]

O’Reilly Media, Inc

John Bloomer.Power programming with RPC. " O’Reilly Media, Inc. ", 1992

1992
[7]

Recom- mender systems: An overview.Ai Magazine, 32(3):13–18, 2011

Robin Burke, Alexander Felfernig, and Mehmet H Göker. Recom- mender systems: An overview.Ai Magazine, 32(3):13–18, 2011

2011
[8]

Nvme over fabrics user space initiator library.https: //github.com/bytedance/libnvmf, 2024

Bytedance. Nvme over fabrics user space initiator library.https: //github.com/bytedance/libnvmf, 2024

2024
[9]

Efficient distributed memory management with rdma and caching

Qingchao Cai, Wentian Guo, Hao Zhang, Divyakant Agrawal, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, Yong Meng Teo, and Sheng Wang. Efficient distributed memory management with rdma and caching. Proceedings of the VLDB Endowment, 11(11):1604–1617, 2018

2018
[10]

Hyq: Hybrid i/o queue architecture for nvme over fabrics to enable high- performance hardware offloading

Yiquan Chen, Jinlong Chen, Yijing Wang, Yi Chen, Zhen Jin, Jiexiong Xu, Guoju Fang, Wenhai Lin, Chengkun Wei, and Wenzhi Chen. Hyq: Hybrid i/o queue architecture for nvme over fabrics to enable high- performance hardware offloading. In2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pages 13–24. IEEE, 2023

2023
[11]

A lightweight, gpu-based software raid system

Matthew L Curry, H Lee Ward, Anthony Skjellum, and Ron Brightwell. A lightweight, gpu-based software raid system. In2010 39th Interna- tional Conference on Parallel Processing, pages 565–572. IEEE, 2010

2010
[12]

Fire-flyer file system.https://github .com/deepseek-ai/3fs, 2026

DeepSeek. Fire-flyer file system.https://github .com/deepseek-ai/3fs, 2026

2026
[13]

Powerstore 500t storage array.https:// www.delltechnologies.com/asset/en-ca/products/storage/technical- support/dell-powerstore-gen2-spec-sheet.pdf, 2023

DELL. Powerstore 500t storage array.https:// www.delltechnologies.com/asset/en-ca/products/storage/technical- support/dell-powerstore-gen2-spec-sheet.pdf, 2023

2023
[14]

A survey of llm datasets: From autoregressive model to ai chatbot.Journal of Computer Science and Technology, 39(3):542–566, 2024

Fei Du, Xin-Jian Ma, Jing-Ru Yang, Yi Liu, Chao-Ran Luo, Xue-Bin Wang, Hai-Ou Jiang, and Xiang Jing. A survey of llm datasets: From autoregressive model to ai chatbot.Journal of Computer Science and Technology, 39(3):542–566, 2024

2024
[15]

Imagenet-100.https://huggingface .co/datasets/clane9/ imagenet-100

Hugging Face. Imagenet-100.https://huggingface .co/datasets/clane9/ imagenet-100
[16]

The recovery manager of the system r database manager.ACM Computing Surveys (CSUR), 13(2):223–242, 1981

Jim Gray, Paul McJones, Mike Blasgen, Bruce Lindsay, Raymond Lorie, Tom Price, Franco Putzolu, and Irving Traiger. The recovery manager of the system r database manager.ACM Computing Surveys (CSUR), 13(2):223–242, 1981

1981
[17]

A novel approach to real-time bilinear interpolation

Kim T Gribbon and Donald G Bailey. A novel approach to real-time bilinear interpolation. InProceedings. DELTA 2004. Second IEEE in- ternational workshop on electronic design, test and applications, pages 126–131. IEEE, 2004

2004
[18]

Intel®xeon®gold 5320 processor.https://www .intel.com/ content/www/us/en/products/sku/215285/intel-xeon-gold-5320- processor-39m-cache-2-20-ghz/specifications.html

Intel. Intel®xeon®gold 5320 processor.https://www .intel.com/ content/www/us/en/products/sku/215285/intel-xeon-gold-5320- processor-39m-cache-2-20-ghz/specifications.html
[19]

Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web

David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. InProceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 654–663, 1997

1997
[20]

On the use of gpus in realizing cost-effective distributed raid

Aleksandr Khasymski, M Mustafa Rafique, Ali R Butt, Sudharshan S Vazhkudai, and Dimitrios S Nikolopoulos. On the use of gpus in realizing cost-effective distributed raid. In2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 469–478. IEEE, 2012

2012
[21]

{D2FS}:{Device-Driven} filesystem garbage collection

Juwon Kim, Seungjae Lee, Joontaek Oh, Dongkun Shin, and Youjip Won. {D2FS}:{Device-Driven} filesystem garbage collection. In23rd USENIX Conference on File and Storage Technologies (FAST 25), pages 337–353, 2025

2025
[22]

{NVMeVirt}: A versatile software- defined virtual {NVMe} device

Sang-Hoon Kim, Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, and Jin-Soo Kim. {NVMeVirt}: A versatile software- defined virtual {NVMe} device. In21st USENIX Conference on File and Storage Technologies (FAST 23), pages 379–394, 2023

2023
[23]

Fessd: A fast encrypted ssd employing on-chip access-control memory.IEEE Computer Architecture Letters, 16(2):115–118, 2017

Junghee Lee, Kalidas Ganesh, Hyuk-Jun Lee, and Youngjae Kim. Fessd: A fast encrypted ssd employing on-chip access-control memory.IEEE Computer Architecture Letters, 16(2):115–118, 2017

2017
[24]

Gpu snapshot: check- point offloading for gpu-dense systems

Kyushick Lee, Michael B Sullivan, Siva Kumar Sastry Hari, Timothy Tsai, Stephen W Keckler, and Mattan Erez. Gpu snapshot: check- point offloading for gpu-dense systems. InProceedings of the ACM International Conference on Supercomputing, pages 171–183, 2019

2019
[25]

{RubbleDB}:{CPU-Efficient} replica- tion with {NVMe-oF}

Haoyu Li, Sheng Jiang, Chen Chen, Ashwini Raina, Xingyu Zhu, Changxu Luo, and Asaf Cidon. {RubbleDB}:{CPU-Efficient} replica- tion with {NVMe-oF}. In2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 689–703, 2023

2023
[26]

Man- aging scalable direct storage accesses for gpus with gofs

Shaobo Li, Yirui Eric Zhou, Yuqi Xue, Yuan Xu, and Jian Huang. Man- aging scalable direct storage accesses for gpus with gofs. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 979–995, 2025

2025
[27]

Cognitive {SSD}: A deep learning engine for{In-Storage} data retrieval

Shengwen Liang, Ying Wang, Youyou Lu, Zhe Yang, Huawei Li, and Xiaowei Li. Cognitive {SSD}: A deep learning engine for{In-Storage} data retrieval. In2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 395–410, 2019

2019
[28]

Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024
[29]

Lmcache: An efficient kv cache layer for enterprise-scale llm inference

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al. Lmcache: An efficient kv cache layer for enterprise-scale llm inference. 13 arXiv preprint arXiv:2510.09665, 2025

arXiv 2025
[30]

Smar- tio: Zero-overhead device sharing through pcie networking.ACM Transactions on Computer Systems (TOCS), 38(1-2):1–78, 2021

Jonas Markussen, Lars Bjørlykke Kristiansen, Pål Halvorsen, Halvor Kielland-Gyrud, Håkon Kvale Stensland, and Carsten Griwodz. Smar- tio: Zero-overhead device sharing through pcie networking.ACM Transactions on Computer Systems (TOCS), 38(1-2):1–78, 2021

2021
[31]

Springer Science & Business Media, 2010

Rino Micheloni, Luca Crippa, and Alessia Marelli.Inside NAND flash memories. Springer Science & Business Media, 2010

2010
[32]

A lightweight infrastructure for graph analytics

Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graph analytics. InProceedings of the twenty-fourth ACM symposium on operating systems principles, pages 456–471, 2013

2013
[33]

A100 tensor core gpu.https://www .nvidia.com/en-us/data- center/a100/

NVIDIA. A100 tensor core gpu.https://www .nvidia.com/en-us/data- center/a100/
[34]

Connectx-7.https://www .nvidia.com/content/dam/en- zz/Solutions/networking/ethernet-adapters/connectx-7-datasheet- Final.pdf, 2021

NVIDIA. Connectx-7.https://www .nvidia.com/content/dam/en- zz/Solutions/networking/ethernet-adapters/connectx-7-datasheet- Final.pdf, 2021

2021
[35]

NVIDIA. Advancing memory and storage architectures for next-gen ai workloads.https://files .futurememorystorage.com/proceedings/2025/ 20250807_OPSW-301-1_Mailthody-2025-08-07-15.14.33.pdf, 2025

2025
[36]

Cuda toolkit.https://developer .nvidia.com/cuda/toolkit, 2025

NVIDIA. Cuda toolkit.https://developer .nvidia.com/cuda/toolkit, 2025

2025
[37]

Doca software framework.https://developer .nvidia.com/ networking/doca, 2025

NVIDIA. Doca software framework.https://developer .nvidia.com/ networking/doca, 2025

2025
[38]

Gpudirect.https://developer.nvidia.com/gpudirect, 2026

NVIDIA. Gpudirect.https://developer.nvidia.com/gpudirect, 2026

2026
[39]

Nvm command set specification.https://nvmexpress .org/ specification/nvm-command-set-specification/, 2025

NVMe. Nvm command set specification.https://nvmexpress .org/ specification/nvm-command-set-specification/, 2025

2025
[40]

Nvm express base specification 2.3.https://nvmexpress .org/ specification/nvm-express-base-specification/, 2025

NVMe. Nvm express base specification 2.3.https://nvmexpress .org/ specification/nvm-express-base-specification/, 2025

2025
[41]

Gpt-2.https://github.com/openai/gpt-2

OpenAI. Gpt-2.https://github.com/openai/gpt-2
[42]

Cuckoo hashing.Journal of Algorithms, 51(2):122–144, 2004

Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing.Journal of Algorithms, 51(2):122–144, 2004

2004
[43]

Multi-gpu graph analytics

Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D Owens. Multi-gpu graph analytics. In2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 479–490. IEEE, 2017

2017
[44]

Optimizing memory-mapped {I/O} for fast storage devices

Anastasios Papagiannis, Giorgos Xanthakis, Giorgos Saloustros, Mano- lis Marazakis, and Angelos Bilas. Optimizing memory-mapped {I/O} for fast storage devices. In2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 813–827, 2020

2020
[45]

Energy-aware gpu-raid scheduling for reducing energy consumption in cloud storage sys- tems

Mehdi Pirahandeh and Deok-Hwan Kim. Energy-aware gpu-raid scheduling for reducing energy consumption in cloud storage sys- tems. InComputer Science and its Applications: Ubiquitous Information Technologies, pages 705–711. Springer, 2015

2015
[46]

Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot. In23rd USENIX conference on file and storage technologies (FAST 25), pages 155–170, 2025

2025
[47]

{GeminiFS}: A com- panion file system for {GPUs }

Shi Qiu, Weinan Liu, Yifan Hu, Jianqin Yan, Zhirong Shen, Xin Yao, Renhai Chen, Gong Zhang, and Yiming Zhang. {GeminiFS}: A com- panion file system for {GPUs }. In23rd USENIX Conference on File and Storage Technologies (FAST 25), pages 221–236, 2025

2025
[48]

Gpu-initiated on-demand high-throughput storage access in the bam system architecture

Zaid Qureshi, Vikram Sharma Mailthody, Isaac Gelado, Seungwon Min, Amna Masood, Jeongmin Park, Jinjun Xiong, Chris J Newburn, Dmitri Vainbrand, I-Hsin Chung, et al. Gpu-initiated on-demand high-throughput storage access in the bam system architecture. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Language...

2023
[49]

Recommender systems.Communica- tions of the ACM, 40(3):56–58, 1997

Paul Resnick and Hal R Varian. Recommender systems.Communica- tions of the ACM, 40(3):56–58, 1997

1997
[50]

Samsung. Power loss protection (plp) protect your data against sudden power loss.https://download .semiconductor.samsung.com/resources/ others/Samsung_SSD_845DC_05_Power_loss_protection_PLP.pdf, 2014

2014
[51]

980pro nvme ssd.https://www .samsung.com/us/ computing/memory-storage/solid-state-drives/980-pro-pcie-4-0- nvme-ssd-1tb-mz-v8p1t0b-am/, 2020

Samsung. 980pro nvme ssd.https://www .samsung.com/us/ computing/memory-storage/solid-state-drives/980-pro-pcie-4-0- nvme-ssd-1tb-mz-v8p1t0b-am/, 2020

2020
[52]

Samsung pm1743.https://semiconductor .samsung.com/ ssd/enterprise-ssd/pm1743/, 2023

Samsung. Samsung pm1743.https://semiconductor .samsung.com/ ssd/enterprise-ssd/pm1743/, 2023

2023
[53]

Distributed graph neural network training: A survey.ACM Computing Surveys, 56(8):1–39, 2024

Yingxia Shao, Hongzheng Li, Xizhi Gu, Hongbo Yin, Yawen Li, Xupeng Miao, Wentao Zhang, Bin Cui, and Lei Chen. Distributed graph neural network training: A survey.ACM Computing Surveys, 56(8):1–39, 2024

2024
[54]

Cam: Asyn- chronous gpu-initiated, cpu-managed ssd management for batching storage access

Ziyu Song, Jie Zhang, Jie Sun, Mo Sun, Zihan Yang, Zheng Zhang, Xuzheng Chen, Fei Wu, Huajin Tang, and Zeke Wang. Cam: Asyn- chronous gpu-initiated, cpu-managed ssd management for batching storage access. In2025 IEEE 41st International Conference on Data Engineering (ICDE), pages 2309–2322. IEEE, 2025

2025
[55]

Storage performance development kit.https://spdk.io, 2025

SPDK. Storage performance development kit.https://spdk.io, 2025

2025
[56]

Selective buddy allocation for scheduling parallel jobs on clusters

Vijay Subramani, Rajkumar Kettimuthu, Srividya Srinivasan, Jeanette Johnston, and P Sadayappan. Selective buddy allocation for scheduling parallel jobs on clusters. InProceedings. IEEE International Conference on Cluster Computing, pages 107–116. IEEE, 2002

2002
[57]

Scalio: Scaling up {DPU-based } {JBOF} key-value store with {NVMe-oF} target offload

Xun Sun, Mingxing Zhang, Yingdi Shan, Kang Chen, Jinlei Jiang, and Yongwei Wu. Scalio: Scaling up {DPU-based } {JBOF} key-value store with {NVMe-oF} target offload. In19th USENIX Symposium on Op- erating Systems Design and Implementation (OSDI 25), pages 449–464, 2025

2025
[58]

Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023
[59]

T-lease: A trusted lease primitive for distributed systems

Bohdan Trach, Rasha Faqeh, Oleksii Oleksenko, Wojciech Ozga, Pramod Bhatotia, and Christof Fetzer. T-lease: A trusted lease primitive for distributed systems. InProceedings of the 11th ACM Symposium on Cloud Computing, pages 387–400, 2020

2020
[60]

The land- scape of gpu-centric communication.arXiv preprint arXiv:2409.09874, 2024

Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, and Ismayil Ismayilov. The land- scape of gpu-centric communication.arXiv preprint arXiv:2409.09874, 2024

Pith/arXiv arXiv 2024
[61]

Lock-free linked lists using compare-and-swap

John D Valois. Lock-free linked lists using compare-and-swap. In Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing, pages 214–222, 1995

1995
[62]

{FineMem}: Breaking the allocation overhead vs

Xiaoyang Wang, Yongkun Li, Kan Wu, Wenzhe Zhu, Yuqi Li, and Yin- long Xu. {FineMem}: Breaking the allocation overhead vs. memory waste dilemma in {Fine-Grained} disaggregated memory manage- ment. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 57–74, 2025

2025
[63]

Gunrock: Gpu graph analytics.ACM Transactions on Parallel Computing (TOPC), 4(1):1–49, 2017

Yangzihao Wang, Yuechao Pan, Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Muhammad Osama, Chenshan Yuan, Weitang Liu, Andy T Riffel, et al. Gunrock: Gpu graph analytics.ACM Transactions on Parallel Computing (TOPC), 4(1):1–49, 2017

2017
[64]

Merlin hugectr: Gpu-accelerated recommender system training and inference

Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G Abel, Xu Guo, Jianbing Dong, et al. Merlin hugectr: Gpu-accelerated recommender system training and inference. InProceedings of the 16th ACM Conference on Recommender Systems, pages 534–537, 2022

2022
[65]

Crush: Controlled, scalable, decentralized placement of replicated data

Sage A Weil, Scott A Brandt, Ethan L Miller, and Carlos Maltzahn. Crush: Controlled, scalable, decentralized placement of replicated data. InProceedings of the 2006 ACM/IEEE conference on Supercomputing, pages 122–es, 2006

2006
[66]

Eliminating storage management over- head of deduplication over ssd arrays through a hardware/software co-design

Yuhong Wen, Xiaogang Zhao, You Zhou, Tong Zhang, Shangjun Yang, Changsheng Xie, and Fei Wu. Eliminating storage management over- head of deduplication over ssd arrays through a hardware/software co-design. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 320–335, 2024

2024
[67]

{D2FQ}:{Device-Direct}fair queueing for{NVMe} {SSDs}

Jiwon Woo, Minwoo Ahn, Gyusun Lee, and Jinkyu Jeong. {D2FQ}:{Device-Direct}fair queueing for{NVMe} {SSDs}. In19th 14 GNStor Arxiv, June 2026, Online USENIX Conference on File and Storage Technologies (FAST 21), pages 403–415, 2021

2026
[68]

Understanding and exploiting the full potential of ssd address remapping.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(11):5112–5125, 2022

Qiulin Wu, You Zhou, Fei Wu, Hong Jiang, Jian Zhou, and Changsheng Xie. Understanding and exploiting the full potential of ssd address remapping.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(11):5112–5125, 2022

2022
[69]

Kintex™ultrascale+™fpgas.https://www .amd.com/ en/products/adaptive-socs-and-fpgas/fpga/kintex-ultrascale- plus.html

XILINX. Kintex™ultrascale+™fpgas.https://www .amd.com/ en/products/adaptive-socs-and-fpgas/fpga/kintex-ultrascale- plus.html
[70]

Perfor- mance characterization of smartnic nvme-over-fabrics target offload- ing

Jiexiong Xu, Yue Qiu, Yiquan Chen, Yijing Wang, Wenhai Lin, Yiquan Lin, Shushu Zhao, Yuqi Liu, Ying Wang, and Wenzhi Chen. Perfor- mance characterization of smartnic nvme-over-fabrics target offload- ing. InProceedings of the 17th ACM International Systems and Storage Conference, pages 14–24, 2024

2024
[71]

On-demand and parallel checkpoint/restore for gpu applications

Yanning Yang, Dong Du, Haitao Song, and Yubin Xia. On-demand and parallel checkpoint/restore for gpu applications. InProceedings of the 2024 ACM Symposium on Cloud Computing, pages 415–433, 2024

2024
[72]

{𝜆-IO}: A unified {IO} stack for computational storage

Zhe Yang, Youyou Lu, Xiaojian Liao, Youmin Chen, Junru Li, Siyu He, and Jiwu Shu. {𝜆-IO}: A unified {IO} stack for computational storage. In21st USENIX Conference on File and Storage Technologies (FAST 23), pages 347–362, 2023

2023
[73]

ScalaAFA: Constructing User-Space All-Flash array engine with holistic designs

Shushu Yi, Xiurui Pan, Qiao Li, Qiang Li, Chenxi Wang, Bo Mao, Myoungsoo Jung, and Jie Zhang. ScalaAFA: Constructing User-Space All-Flash array engine with holistic designs. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 141–156, Santa Clara, CA, July 2024. USENIX Association

2024
[74]

{GPU } {Checkpoint/Restore} made fast and lightweight

Shaoxun Zeng, Tingxu Ren, Jiwu Shu, and Youyou Lu. {GPU } {Checkpoint/Restore} made fast and lightweight. In 24th USENIX Conference on File and Storage Technologies (FAST 26), pages 239–254, 2026

2026
[75]

Nvmmu: A non-volatile memory management unit for heterogeneous gpu-ssd architectures

Jie Zhang, David Donofrio, John Shalf, Mahmut T Kandemir, and Myoungsoo Jung. Nvmmu: A non-volatile memory management unit for heterogeneous gpu-ssd architectures. In2015 International Conference on Parallel Architecture and Compilation (PACT), pages 13–24. IEEE, 2015

2015
[76]

Bytegnn: efficient graph neural network training at large scale.Pro- ceedings of the VLDB Endowment, 15(6):1228–1242, 2022

Chenguang Zheng, Hongzhi Chen, Yuxuan Cheng, Zhezheng Song, Yifan Wu, Changji Li, James Cheng, Hao Yang, and Shuai Zhang. Bytegnn: efficient graph neural network training at large scale.Pro- ceedings of the VLDB Endowment, 15(6):1228–1242, 2022

2022
[77]

{Remap-SSD}: Safely and efficiently exploiting {SSD} address remapping to eliminate duplicate writes

You Zhou, Qiulin Wu, Fei Wu, Hong Jiang, Jian Zhou, and Changsheng Xie. {Remap-SSD}: Safely and efficiently exploiting {SSD} address remapping to eliminate duplicate writes. In19th USENIX Conference on File and Storage Technologies (FAST 21), pages 187–202, 2021

2021
[78]

Toolqa: A dataset for llm question answering with external tools.Ad- vances in Neural Information Processing Systems, 36:50117–50143, 2023

Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm question answering with external tools.Ad- vances in Neural Information Processing Systems, 36:50117–50143, 2023

2023

[1] [1]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[2] [2]

Key-ssd: Access- control drive to protect files from ransomware attacks.arXiv preprint arXiv:1904.05012, 2019

Jinwoo Ahn, Donggyu Park, Chang-Gyu Lee, Donghyun Min, Junghee Lee, Sungyong Park, Qian Chen, and Youngjae Kim. Key-ssd: Access- control drive to protect files from ransomware attacks.arXiv preprint arXiv:1904.05012, 2019

Pith/arXiv arXiv 1904

[3] [3]

Epyc™9654.https://www .amd.com/en/products/processors/ server/epyc/4th-generation-9004-and-8004-series/amd-epyc- 9654.html

AMD. Epyc™9654.https://www .amd.com/en/products/processors/ server/epyc/4th-generation-9004-and-8004-series/amd-epyc- 9654.html

[4] [4]

The gap bench- mark suite, 2017

Scott Beamer, Krste Asanović, and David Patterson. The gap bench- mark suite, 2017

2017

[5] [5]

Spin: Seamless operating system integration of peer-to-peer dma be- tween ssds and gpus.ACM Transactions on Computer Systems (TOCS), 36(2):1–26, 2019

Shai Bergman, Tanya Brokhman, Tzachi Cohen, and Mark Silberstein. Spin: Seamless operating system integration of peer-to-peer dma be- tween ssds and gpus.ACM Transactions on Computer Systems (TOCS), 36(2):1–26, 2019

2019

[6] [6]

O’Reilly Media, Inc

John Bloomer.Power programming with RPC. " O’Reilly Media, Inc. ", 1992

1992

[7] [7]

Recom- mender systems: An overview.Ai Magazine, 32(3):13–18, 2011

Robin Burke, Alexander Felfernig, and Mehmet H Göker. Recom- mender systems: An overview.Ai Magazine, 32(3):13–18, 2011

2011

[8] [8]

Nvme over fabrics user space initiator library.https: //github.com/bytedance/libnvmf, 2024

Bytedance. Nvme over fabrics user space initiator library.https: //github.com/bytedance/libnvmf, 2024

2024

[9] [9]

Efficient distributed memory management with rdma and caching

Qingchao Cai, Wentian Guo, Hao Zhang, Divyakant Agrawal, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, Yong Meng Teo, and Sheng Wang. Efficient distributed memory management with rdma and caching. Proceedings of the VLDB Endowment, 11(11):1604–1617, 2018

2018

[10] [10]

Hyq: Hybrid i/o queue architecture for nvme over fabrics to enable high- performance hardware offloading

Yiquan Chen, Jinlong Chen, Yijing Wang, Yi Chen, Zhen Jin, Jiexiong Xu, Guoju Fang, Wenhai Lin, Chengkun Wei, and Wenzhi Chen. Hyq: Hybrid i/o queue architecture for nvme over fabrics to enable high- performance hardware offloading. In2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pages 13–24. IEEE, 2023

2023

[11] [11]

A lightweight, gpu-based software raid system

Matthew L Curry, H Lee Ward, Anthony Skjellum, and Ron Brightwell. A lightweight, gpu-based software raid system. In2010 39th Interna- tional Conference on Parallel Processing, pages 565–572. IEEE, 2010

2010

[12] [12]

Fire-flyer file system.https://github .com/deepseek-ai/3fs, 2026

DeepSeek. Fire-flyer file system.https://github .com/deepseek-ai/3fs, 2026

2026

[13] [13]

Powerstore 500t storage array.https:// www.delltechnologies.com/asset/en-ca/products/storage/technical- support/dell-powerstore-gen2-spec-sheet.pdf, 2023

DELL. Powerstore 500t storage array.https:// www.delltechnologies.com/asset/en-ca/products/storage/technical- support/dell-powerstore-gen2-spec-sheet.pdf, 2023

2023

[14] [14]

A survey of llm datasets: From autoregressive model to ai chatbot.Journal of Computer Science and Technology, 39(3):542–566, 2024

Fei Du, Xin-Jian Ma, Jing-Ru Yang, Yi Liu, Chao-Ran Luo, Xue-Bin Wang, Hai-Ou Jiang, and Xiang Jing. A survey of llm datasets: From autoregressive model to ai chatbot.Journal of Computer Science and Technology, 39(3):542–566, 2024

2024

[15] [15]

Imagenet-100.https://huggingface .co/datasets/clane9/ imagenet-100

Hugging Face. Imagenet-100.https://huggingface .co/datasets/clane9/ imagenet-100

[16] [16]

The recovery manager of the system r database manager.ACM Computing Surveys (CSUR), 13(2):223–242, 1981

Jim Gray, Paul McJones, Mike Blasgen, Bruce Lindsay, Raymond Lorie, Tom Price, Franco Putzolu, and Irving Traiger. The recovery manager of the system r database manager.ACM Computing Surveys (CSUR), 13(2):223–242, 1981

1981

[17] [17]

A novel approach to real-time bilinear interpolation

Kim T Gribbon and Donald G Bailey. A novel approach to real-time bilinear interpolation. InProceedings. DELTA 2004. Second IEEE in- ternational workshop on electronic design, test and applications, pages 126–131. IEEE, 2004

2004

[18] [18]

Intel®xeon®gold 5320 processor.https://www .intel.com/ content/www/us/en/products/sku/215285/intel-xeon-gold-5320- processor-39m-cache-2-20-ghz/specifications.html

Intel. Intel®xeon®gold 5320 processor.https://www .intel.com/ content/www/us/en/products/sku/215285/intel-xeon-gold-5320- processor-39m-cache-2-20-ghz/specifications.html

[19] [19]

Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web

David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. InProceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 654–663, 1997

1997

[20] [20]

On the use of gpus in realizing cost-effective distributed raid

Aleksandr Khasymski, M Mustafa Rafique, Ali R Butt, Sudharshan S Vazhkudai, and Dimitrios S Nikolopoulos. On the use of gpus in realizing cost-effective distributed raid. In2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 469–478. IEEE, 2012

2012

[21] [21]

{D2FS}:{Device-Driven} filesystem garbage collection

Juwon Kim, Seungjae Lee, Joontaek Oh, Dongkun Shin, and Youjip Won. {D2FS}:{Device-Driven} filesystem garbage collection. In23rd USENIX Conference on File and Storage Technologies (FAST 25), pages 337–353, 2025

2025

[22] [22]

{NVMeVirt}: A versatile software- defined virtual {NVMe} device

Sang-Hoon Kim, Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, and Jin-Soo Kim. {NVMeVirt}: A versatile software- defined virtual {NVMe} device. In21st USENIX Conference on File and Storage Technologies (FAST 23), pages 379–394, 2023

2023

[23] [23]

Fessd: A fast encrypted ssd employing on-chip access-control memory.IEEE Computer Architecture Letters, 16(2):115–118, 2017

Junghee Lee, Kalidas Ganesh, Hyuk-Jun Lee, and Youngjae Kim. Fessd: A fast encrypted ssd employing on-chip access-control memory.IEEE Computer Architecture Letters, 16(2):115–118, 2017

2017

[24] [24]

Gpu snapshot: check- point offloading for gpu-dense systems

Kyushick Lee, Michael B Sullivan, Siva Kumar Sastry Hari, Timothy Tsai, Stephen W Keckler, and Mattan Erez. Gpu snapshot: check- point offloading for gpu-dense systems. InProceedings of the ACM International Conference on Supercomputing, pages 171–183, 2019

2019

[25] [25]

{RubbleDB}:{CPU-Efficient} replica- tion with {NVMe-oF}

Haoyu Li, Sheng Jiang, Chen Chen, Ashwini Raina, Xingyu Zhu, Changxu Luo, and Asaf Cidon. {RubbleDB}:{CPU-Efficient} replica- tion with {NVMe-oF}. In2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 689–703, 2023

2023

[26] [26]

Man- aging scalable direct storage accesses for gpus with gofs

Shaobo Li, Yirui Eric Zhou, Yuqi Xue, Yuan Xu, and Jian Huang. Man- aging scalable direct storage accesses for gpus with gofs. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 979–995, 2025

2025

[27] [27]

Cognitive {SSD}: A deep learning engine for{In-Storage} data retrieval

Shengwen Liang, Ying Wang, Youyou Lu, Zhe Yang, Huawei Li, and Xiaowei Li. Cognitive {SSD}: A deep learning engine for{In-Storage} data retrieval. In2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 395–410, 2019

2019

[28] [28]

Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024

[29] [29]

Lmcache: An efficient kv cache layer for enterprise-scale llm inference

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al. Lmcache: An efficient kv cache layer for enterprise-scale llm inference. 13 arXiv preprint arXiv:2510.09665, 2025

arXiv 2025

[30] [30]

Smar- tio: Zero-overhead device sharing through pcie networking.ACM Transactions on Computer Systems (TOCS), 38(1-2):1–78, 2021

Jonas Markussen, Lars Bjørlykke Kristiansen, Pål Halvorsen, Halvor Kielland-Gyrud, Håkon Kvale Stensland, and Carsten Griwodz. Smar- tio: Zero-overhead device sharing through pcie networking.ACM Transactions on Computer Systems (TOCS), 38(1-2):1–78, 2021

2021

[31] [31]

Springer Science & Business Media, 2010

Rino Micheloni, Luca Crippa, and Alessia Marelli.Inside NAND flash memories. Springer Science & Business Media, 2010

2010

[32] [32]

A lightweight infrastructure for graph analytics

Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graph analytics. InProceedings of the twenty-fourth ACM symposium on operating systems principles, pages 456–471, 2013

2013

[33] [33]

A100 tensor core gpu.https://www .nvidia.com/en-us/data- center/a100/

NVIDIA. A100 tensor core gpu.https://www .nvidia.com/en-us/data- center/a100/

[34] [34]

Connectx-7.https://www .nvidia.com/content/dam/en- zz/Solutions/networking/ethernet-adapters/connectx-7-datasheet- Final.pdf, 2021

NVIDIA. Connectx-7.https://www .nvidia.com/content/dam/en- zz/Solutions/networking/ethernet-adapters/connectx-7-datasheet- Final.pdf, 2021

2021

[35] [35]

NVIDIA. Advancing memory and storage architectures for next-gen ai workloads.https://files .futurememorystorage.com/proceedings/2025/ 20250807_OPSW-301-1_Mailthody-2025-08-07-15.14.33.pdf, 2025

2025

[36] [36]

Cuda toolkit.https://developer .nvidia.com/cuda/toolkit, 2025

NVIDIA. Cuda toolkit.https://developer .nvidia.com/cuda/toolkit, 2025

2025

[37] [37]

Doca software framework.https://developer .nvidia.com/ networking/doca, 2025

NVIDIA. Doca software framework.https://developer .nvidia.com/ networking/doca, 2025

2025

[38] [38]

Gpudirect.https://developer.nvidia.com/gpudirect, 2026

NVIDIA. Gpudirect.https://developer.nvidia.com/gpudirect, 2026

2026

[39] [39]

Nvm command set specification.https://nvmexpress .org/ specification/nvm-command-set-specification/, 2025

NVMe. Nvm command set specification.https://nvmexpress .org/ specification/nvm-command-set-specification/, 2025

2025

[40] [40]

Nvm express base specification 2.3.https://nvmexpress .org/ specification/nvm-express-base-specification/, 2025

NVMe. Nvm express base specification 2.3.https://nvmexpress .org/ specification/nvm-express-base-specification/, 2025

2025

[41] [41]

Gpt-2.https://github.com/openai/gpt-2

OpenAI. Gpt-2.https://github.com/openai/gpt-2

[42] [42]

Cuckoo hashing.Journal of Algorithms, 51(2):122–144, 2004

Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing.Journal of Algorithms, 51(2):122–144, 2004

2004

[43] [43]

Multi-gpu graph analytics

Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D Owens. Multi-gpu graph analytics. In2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 479–490. IEEE, 2017

2017

[44] [44]

Optimizing memory-mapped {I/O} for fast storage devices

Anastasios Papagiannis, Giorgos Xanthakis, Giorgos Saloustros, Mano- lis Marazakis, and Angelos Bilas. Optimizing memory-mapped {I/O} for fast storage devices. In2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 813–827, 2020

2020

[45] [45]

Energy-aware gpu-raid scheduling for reducing energy consumption in cloud storage sys- tems

Mehdi Pirahandeh and Deok-Hwan Kim. Energy-aware gpu-raid scheduling for reducing energy consumption in cloud storage sys- tems. InComputer Science and its Applications: Ubiquitous Information Technologies, pages 705–711. Springer, 2015

2015

[46] [46]

Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot. In23rd USENIX conference on file and storage technologies (FAST 25), pages 155–170, 2025

2025

[47] [47]

{GeminiFS}: A com- panion file system for {GPUs }

Shi Qiu, Weinan Liu, Yifan Hu, Jianqin Yan, Zhirong Shen, Xin Yao, Renhai Chen, Gong Zhang, and Yiming Zhang. {GeminiFS}: A com- panion file system for {GPUs }. In23rd USENIX Conference on File and Storage Technologies (FAST 25), pages 221–236, 2025

2025

[48] [48]

Gpu-initiated on-demand high-throughput storage access in the bam system architecture

Zaid Qureshi, Vikram Sharma Mailthody, Isaac Gelado, Seungwon Min, Amna Masood, Jeongmin Park, Jinjun Xiong, Chris J Newburn, Dmitri Vainbrand, I-Hsin Chung, et al. Gpu-initiated on-demand high-throughput storage access in the bam system architecture. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Language...

2023

[49] [49]

Recommender systems.Communica- tions of the ACM, 40(3):56–58, 1997

Paul Resnick and Hal R Varian. Recommender systems.Communica- tions of the ACM, 40(3):56–58, 1997

1997

[50] [50]

Samsung. Power loss protection (plp) protect your data against sudden power loss.https://download .semiconductor.samsung.com/resources/ others/Samsung_SSD_845DC_05_Power_loss_protection_PLP.pdf, 2014

2014

[51] [51]

980pro nvme ssd.https://www .samsung.com/us/ computing/memory-storage/solid-state-drives/980-pro-pcie-4-0- nvme-ssd-1tb-mz-v8p1t0b-am/, 2020

Samsung. 980pro nvme ssd.https://www .samsung.com/us/ computing/memory-storage/solid-state-drives/980-pro-pcie-4-0- nvme-ssd-1tb-mz-v8p1t0b-am/, 2020

2020

[52] [52]

Samsung pm1743.https://semiconductor .samsung.com/ ssd/enterprise-ssd/pm1743/, 2023

Samsung. Samsung pm1743.https://semiconductor .samsung.com/ ssd/enterprise-ssd/pm1743/, 2023

2023

[53] [53]

Distributed graph neural network training: A survey.ACM Computing Surveys, 56(8):1–39, 2024

Yingxia Shao, Hongzheng Li, Xizhi Gu, Hongbo Yin, Yawen Li, Xupeng Miao, Wentao Zhang, Bin Cui, and Lei Chen. Distributed graph neural network training: A survey.ACM Computing Surveys, 56(8):1–39, 2024

2024

[54] [54]

Cam: Asyn- chronous gpu-initiated, cpu-managed ssd management for batching storage access

Ziyu Song, Jie Zhang, Jie Sun, Mo Sun, Zihan Yang, Zheng Zhang, Xuzheng Chen, Fei Wu, Huajin Tang, and Zeke Wang. Cam: Asyn- chronous gpu-initiated, cpu-managed ssd management for batching storage access. In2025 IEEE 41st International Conference on Data Engineering (ICDE), pages 2309–2322. IEEE, 2025

2025

[55] [55]

Storage performance development kit.https://spdk.io, 2025

SPDK. Storage performance development kit.https://spdk.io, 2025

2025

[56] [56]

Selective buddy allocation for scheduling parallel jobs on clusters

Vijay Subramani, Rajkumar Kettimuthu, Srividya Srinivasan, Jeanette Johnston, and P Sadayappan. Selective buddy allocation for scheduling parallel jobs on clusters. InProceedings. IEEE International Conference on Cluster Computing, pages 107–116. IEEE, 2002

2002

[57] [57]

Scalio: Scaling up {DPU-based } {JBOF} key-value store with {NVMe-oF} target offload

Xun Sun, Mingxing Zhang, Yingdi Shan, Kang Chen, Jinlei Jiang, and Yongwei Wu. Scalio: Scaling up {DPU-based } {JBOF} key-value store with {NVMe-oF} target offload. In19th USENIX Symposium on Op- erating Systems Design and Implementation (OSDI 25), pages 449–464, 2025

2025

[58] [58]

Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023

[59] [59]

T-lease: A trusted lease primitive for distributed systems

Bohdan Trach, Rasha Faqeh, Oleksii Oleksenko, Wojciech Ozga, Pramod Bhatotia, and Christof Fetzer. T-lease: A trusted lease primitive for distributed systems. InProceedings of the 11th ACM Symposium on Cloud Computing, pages 387–400, 2020

2020

[60] [60]

The land- scape of gpu-centric communication.arXiv preprint arXiv:2409.09874, 2024

Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, and Ismayil Ismayilov. The land- scape of gpu-centric communication.arXiv preprint arXiv:2409.09874, 2024

Pith/arXiv arXiv 2024

[61] [61]

Lock-free linked lists using compare-and-swap

John D Valois. Lock-free linked lists using compare-and-swap. In Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing, pages 214–222, 1995

1995

[62] [62]

{FineMem}: Breaking the allocation overhead vs

Xiaoyang Wang, Yongkun Li, Kan Wu, Wenzhe Zhu, Yuqi Li, and Yin- long Xu. {FineMem}: Breaking the allocation overhead vs. memory waste dilemma in {Fine-Grained} disaggregated memory manage- ment. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 57–74, 2025

2025

[63] [63]

Gunrock: Gpu graph analytics.ACM Transactions on Parallel Computing (TOPC), 4(1):1–49, 2017

Yangzihao Wang, Yuechao Pan, Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Muhammad Osama, Chenshan Yuan, Weitang Liu, Andy T Riffel, et al. Gunrock: Gpu graph analytics.ACM Transactions on Parallel Computing (TOPC), 4(1):1–49, 2017

2017

[64] [64]

Merlin hugectr: Gpu-accelerated recommender system training and inference

Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G Abel, Xu Guo, Jianbing Dong, et al. Merlin hugectr: Gpu-accelerated recommender system training and inference. InProceedings of the 16th ACM Conference on Recommender Systems, pages 534–537, 2022

2022

[65] [65]

Crush: Controlled, scalable, decentralized placement of replicated data

Sage A Weil, Scott A Brandt, Ethan L Miller, and Carlos Maltzahn. Crush: Controlled, scalable, decentralized placement of replicated data. InProceedings of the 2006 ACM/IEEE conference on Supercomputing, pages 122–es, 2006

2006

[66] [66]

Eliminating storage management over- head of deduplication over ssd arrays through a hardware/software co-design

Yuhong Wen, Xiaogang Zhao, You Zhou, Tong Zhang, Shangjun Yang, Changsheng Xie, and Fei Wu. Eliminating storage management over- head of deduplication over ssd arrays through a hardware/software co-design. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 320–335, 2024

2024

[67] [67]

{D2FQ}:{Device-Direct}fair queueing for{NVMe} {SSDs}

Jiwon Woo, Minwoo Ahn, Gyusun Lee, and Jinkyu Jeong. {D2FQ}:{Device-Direct}fair queueing for{NVMe} {SSDs}. In19th 14 GNStor Arxiv, June 2026, Online USENIX Conference on File and Storage Technologies (FAST 21), pages 403–415, 2021

2026

[68] [68]

Understanding and exploiting the full potential of ssd address remapping.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(11):5112–5125, 2022

Qiulin Wu, You Zhou, Fei Wu, Hong Jiang, Jian Zhou, and Changsheng Xie. Understanding and exploiting the full potential of ssd address remapping.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(11):5112–5125, 2022

2022

[69] [69]

Kintex™ultrascale+™fpgas.https://www .amd.com/ en/products/adaptive-socs-and-fpgas/fpga/kintex-ultrascale- plus.html

XILINX. Kintex™ultrascale+™fpgas.https://www .amd.com/ en/products/adaptive-socs-and-fpgas/fpga/kintex-ultrascale- plus.html

[70] [70]

Perfor- mance characterization of smartnic nvme-over-fabrics target offload- ing

Jiexiong Xu, Yue Qiu, Yiquan Chen, Yijing Wang, Wenhai Lin, Yiquan Lin, Shushu Zhao, Yuqi Liu, Ying Wang, and Wenzhi Chen. Perfor- mance characterization of smartnic nvme-over-fabrics target offload- ing. InProceedings of the 17th ACM International Systems and Storage Conference, pages 14–24, 2024

2024

[71] [71]

On-demand and parallel checkpoint/restore for gpu applications

Yanning Yang, Dong Du, Haitao Song, and Yubin Xia. On-demand and parallel checkpoint/restore for gpu applications. InProceedings of the 2024 ACM Symposium on Cloud Computing, pages 415–433, 2024

2024

[72] [72]

{𝜆-IO}: A unified {IO} stack for computational storage

Zhe Yang, Youyou Lu, Xiaojian Liao, Youmin Chen, Junru Li, Siyu He, and Jiwu Shu. {𝜆-IO}: A unified {IO} stack for computational storage. In21st USENIX Conference on File and Storage Technologies (FAST 23), pages 347–362, 2023

2023

[73] [73]

ScalaAFA: Constructing User-Space All-Flash array engine with holistic designs

Shushu Yi, Xiurui Pan, Qiao Li, Qiang Li, Chenxi Wang, Bo Mao, Myoungsoo Jung, and Jie Zhang. ScalaAFA: Constructing User-Space All-Flash array engine with holistic designs. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 141–156, Santa Clara, CA, July 2024. USENIX Association

2024

[74] [74]

{GPU } {Checkpoint/Restore} made fast and lightweight

Shaoxun Zeng, Tingxu Ren, Jiwu Shu, and Youyou Lu. {GPU } {Checkpoint/Restore} made fast and lightweight. In 24th USENIX Conference on File and Storage Technologies (FAST 26), pages 239–254, 2026

2026

[75] [75]

Nvmmu: A non-volatile memory management unit for heterogeneous gpu-ssd architectures

Jie Zhang, David Donofrio, John Shalf, Mahmut T Kandemir, and Myoungsoo Jung. Nvmmu: A non-volatile memory management unit for heterogeneous gpu-ssd architectures. In2015 International Conference on Parallel Architecture and Compilation (PACT), pages 13–24. IEEE, 2015

2015

[76] [76]

Bytegnn: efficient graph neural network training at large scale.Pro- ceedings of the VLDB Endowment, 15(6):1228–1242, 2022

Chenguang Zheng, Hongzhi Chen, Yuxuan Cheng, Zhezheng Song, Yifan Wu, Changji Li, James Cheng, Hao Yang, and Shuai Zhang. Bytegnn: efficient graph neural network training at large scale.Pro- ceedings of the VLDB Endowment, 15(6):1228–1242, 2022

2022

[77] [77]

{Remap-SSD}: Safely and efficiently exploiting {SSD} address remapping to eliminate duplicate writes

You Zhou, Qiulin Wu, Fei Wu, Hong Jiang, Jian Zhou, and Changsheng Xie. {Remap-SSD}: Safely and efficiently exploiting {SSD} address remapping to eliminate duplicate writes. In19th USENIX Conference on File and Storage Technologies (FAST 21), pages 187–202, 2021

2021

[78] [78]

Toolqa: A dataset for llm question answering with external tools.Ad- vances in Neural Information Processing Systems, 36:50117–50143, 2023

Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm question answering with external tools.Ad- vances in Neural Information Processing Systems, 36:50117–50143, 2023

2023