pith. machine review for the scientific record.

arxiv: 2604.02442 · v1 · submitted 2026-04-02 · 💻 cs.OS

Recognition: no theorem link

WIO: Upload-Enabled Computational Storage on CXL SSDs

Andi Quinn, Estabon Ramos, Jianchang Su, Wei Zhang, Yanpeng Hu, Yiwei Yang, Yusheng Zheng

Pith reviewed 2026-05-13 20:42 UTC · model grok-4.3

classification 💻 cs.OS
keywords computational storage · CXL SSD · WebAssembly · storage actors · thermal management · reversible compute · data movement

The pith

WIO makes storage compute reversible on CXL SSDs by migrating WebAssembly actors to avoid thermal and power limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that fixed computational storage has not displaced ordinary SSDs because it creates hard thermal and power cliffs under sustained load. WIO instead decomposes I/O-path logic into storage actors compiled to WebAssembly that can migrate dynamically between host and device. Actors share state through coherent CXL memory regions and switch sides with a zero-copy protocol when constraints tighten. Evaluation on an FPGA prototype and two production devices shows up to 2× higher throughput and 3.75× lower write latency without any application changes. A sympathetic reader would see this as turning rigid device offload into an elastic runtime choice.
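To make the actor abstraction concrete, here is a minimal sketch of what a pipeline-stage storage actor's interface could look like when compiled to WebAssembly with clang. The types, export names, and fields are illustrative assumptions, not the paper's API.

```c
/* Hedged sketch of a pipeline-stage storage actor interface, compiled to
 * WebAssembly (e.g. clang --target=wasm32). Names and fields are
 * illustrative assumptions, not the paper's actual API. */
#include <stddef.h>
#include <stdint.h>

/* A message handed from one pipeline stage to the next. The payload lives in
 * the coherent shared region, so migrating the actor does not copy it. */
typedef struct {
    uint8_t *data;   /* payload in the coherent CXL.mem shared region */
    size_t   len;    /* valid bytes in data */
    uint64_t lba;    /* logical block address the request targets */
} io_msg_t;

/* Instantiated on host or device with a view of its shared state. */
__attribute__((export_name("actor_init")))
int actor_init(void *shared_state, size_t state_len) {
    (void)shared_state; (void)state_len;
    return 0;
}

/* Per-request step: consume the predecessor's message, produce the
 * successor's. This trivial actor just passes the payload through. */
__attribute__((export_name("actor_step")))
int actor_step(const io_msg_t *in, io_msg_t *out) {
    *out = *in;
    return 0;
}

/* Invoked by the runtime before migration so in-flight work can drain. */
__attribute__((export_name("actor_quiesce")))
void actor_quiesce(void) { }
```

Because the interface is only these entry points plus a position in the pipeline, the same module could in principle be instantiated on either side of the CXL link.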

Core claim

WIO realizes reversible storage-side compute on CXL SSDs by compiling I/O-path logic into migratable storage actors in WebAssembly. These actors share state through coherent CXL.mem regions; an agility-aware scheduler migrates them via a zero-copy drain-and-switch protocol when thermal or power constraints arise. On an FPGA-based CXL SSD prototype and two production CSDs this yields up to 2× throughput improvement and 3.75× write latency reduction without application modification.
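Read as a control loop, the scheduler this claim describes could take roughly the following shape. Thresholds, telemetry fields, and function names below are assumptions for illustration, not the paper's implementation.

```c
/* Hedged sketch of an agility-aware drain-and-switch decision; thresholds,
 * telemetry fields, and names are illustrative, not the paper's code. */
#include <stdbool.h>

enum placement { ON_DEVICE, ON_HOST };

struct telemetry {
    double device_temp_c;    /* on-device temperature sensor */
    double device_power_w;   /* device power draw */
    double host_cpu_util;    /* host CPU utilization, 0..1 */
};

struct actor {
    enum placement where;
    bool (*drained)(const struct actor *);   /* in-flight work finished? */
};

/* Route new requests to the destination immediately, then wait for the
 * source to finish what it already accepted; no copy of actor state is
 * needed because state lives in the coherent CXL.mem region. */
static void drain_and_switch(struct actor *a, enum placement dst,
                             void (*route_new_requests_to)(enum placement))
{
    route_new_requests_to(dst);          /* new work goes to dst at once  */
    while (!a->drained(a))               /* source drains in-flight work  */
        ;                                /* (busy-wait kept simple here)  */
    a->where = dst;                      /* hand-off complete, zero copy  */
}

/* Policy: back off the device as it approaches thermal or power limits,
 * return when headroom exists and the host is under pressure. */
static void schedule(struct actor *a, const struct telemetry *t,
                     void (*route)(enum placement))
{
    const double TEMP_LIMIT_C = 85.0, POWER_LIMIT_W = 25.0;   /* assumed */

    if (a->where == ON_DEVICE &&
        (t->device_temp_c > TEMP_LIMIT_C || t->device_power_w > POWER_LIMIT_W))
        drain_and_switch(a, ON_HOST, route);
    else if (a->where == ON_HOST &&
             t->device_temp_c < TEMP_LIMIT_C - 10.0 && t->host_cpu_util > 0.8)
        drain_and_switch(a, ON_DEVICE, route);
}
```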

What carries the argument

Migratable storage actors compiled to WebAssembly that share state through coherent CXL.mem regions and migrate via zero-copy drain-and-switch.

Load-bearing premise

Compiling and migrating WebAssembly actors incurs low enough overhead that net gains appear even when CXL coherence manages shared state under load.
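Stated quantitatively (our formulation, not the paper's): if one migration costs C_mig (WebAssembly instantiation plus the drain-and-switch hand-off) and the next N requests then complete with per-request latency L_host instead of the throttled device latency, the premise is that

```latex
% Hedged net-gain condition; symbols are ours, not the paper's.
C_{\mathrm{mig}} + N \cdot L_{\mathrm{host}}
  \;<\; N \cdot L_{\mathrm{dev}}^{\mathrm{throttled}}
\quad\Longleftrightarrow\quad
C_{\mathrm{mig}} \;<\; N \left( L_{\mathrm{dev}}^{\mathrm{throttled}} - L_{\mathrm{host}} \right)
```

holds comfortably in practice, i.e. the one-time cost is amortized quickly once the device begins throttling.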

What would settle it

Run continuous high-load workloads until the device reaches thermal limits; if measured throughput or latency shows no improvement over a non-migrating baseline after accounting for migration cost, the claimed elastic trade-off does not hold.
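A minimal soak-test harness for that experiment might look like the following, assuming a Linux testbed; the device path and hwmon temperature file are placeholders for whatever the hardware exposes, and fio or a similar tool would serve equally well. Run it once against the migrating configuration and once against a pinned, non-migrating baseline; the claim fails if the steady-state curves coincide once migration cost is charged.

```c
/* Sketch of the settling experiment: sustained 1 MiB direct writes for 30
 * minutes, logging throughput and device temperature once per second.
 * /dev/nvme0n1 and the hwmon path are placeholders for the actual testbed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK (1 << 20)

static double now_s(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static double temp_c(const char *hwmon) {          /* e.g. .../temp1_input */
    long milli = 0;
    FILE *f = fopen(hwmon, "r");
    if (f) { fscanf(f, "%ld", &milli); fclose(f); }
    return milli / 1000.0;
}

int main(int argc, char **argv) {
    const char *dev   = argc > 1 ? argv[1] : "/dev/nvme0n1";
    const char *hwmon = argc > 2 ? argv[2] : "/sys/class/hwmon/hwmon0/temp1_input";
    int fd = open(dev, O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf = aligned_alloc(4096, BLOCK);          /* O_DIRECT needs alignment */
    memset(buf, 0xA5, BLOCK);

    double start = now_s(), win = start;
    long total = 0, win_bytes = 0;
    while (now_s() - start < 1800.0) {               /* 30-minute thermal soak  */
        if (pwrite(fd, buf, BLOCK, total % (16L << 30)) != BLOCK) break;
        total += BLOCK;
        win_bytes += BLOCK;
        if (now_s() - win >= 1.0) {                  /* log once per second     */
            printf("%6.0f s  %8.1f MB/s  %5.1f C\n", now_s() - start,
                   win_bytes / (now_s() - win) / 1e6, temp_c(hwmon));
            win_bytes = 0;
            win = now_s();
        }
    }
    free(buf);
    close(fd);
    return 0;
}
```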

Figures

Figures reproduced from arXiv: 2604.02442 by Andi Quinn, Estabon Ramos, Jianchang Su, Wei Zhang, Yanpeng Hu, Yiwei Yang, Yusheng Zheng.

Figure 1
Figure 1. Sustained write throughput (solid) and device temperature (dotted) over time. The CXL SSD with migration maintains throughput; Samsung and ScaleFlux suffer 50–60% drops from thermal throttling. view at source ↗
Figure 2
Figure 2. Sub-512 B I/O performance. The CXL SSD achieves 5.4 µs latency at 8 B writes versus 38 µs (SmartSSD) and 80.6 µs (ScaleFlux). view at source ↗
Figure 3
Figure 3. WIO architecture overview with host domain, device domain, and coherent PMR shared region. view at source ↗
Figure 4
Figure 4. RTL organization of the prototype. view at source ↗
Figure 5
Figure 5. Evaluation breakdown of WIO across byte-addressable access, PMR behavior, queue scaling, WASM overhead, scheduler telemetry, and thermal stability. Panels include sequential read and write throughput versus block size for Samsung, ScaleFlux, and AgileStore. view at source ↗
Figure 7
Figure 7. IOPS scalability across queue depths on three platforms. view at source ↗
Figure 10
Figure 10. Read and write throughput under four access distributions (Uni=Uniform, Zip=Zipfian, Norm=Normal, Par=Pareto) with 4 KB operations. view at source ↗
Figure 9
Figure 9. Throughput sensitivity to read/write ratio (4 KB random). view at source ↗
Figure 14
Figure 14. Compression ratio and throughput overhead across data types (Bin=Binary, Enc=Encrypted, DB=Database). view at source ↗
Figure 13
Figure 13. WASM runtime (MVVM) execution time compared with native C on device ARM cores. view at source ↗
Figure 16
Figure 16. DeepSeek inference: CXL DRAM vs. tiered CXL SSD. view at source ↗
read the original abstract

The widening gap between processor speed and storage latency has made data movement a dominant bottleneck in modern systems. Two lines of storage-layer innovation attempted to close this gap: persistent memory shortened the latency hierarchy, while computational storage devices pushed processing toward the data. Neither has displaced conventional NVMe SSDs at scale, largely due to programming complexity, ecosystem fragmentation, and thermal/power cliffs under sustained load. We argue that storage-side compute should be reversible: computation should migrate dynamically between host and device based on runtime conditions. We present WIO, which realizes this principle on CXL SSDs by decomposing I/O-path logic into migratable storage actors compiled to WebAssembly. Actors share state through coherent CXL.mem regions; an agility-aware scheduler migrates them via a zero-copy drain-and-switch protocol when thermal or power constraints arise. Our evaluation on an FPGA-based CXL SSD prototype and two production CSDs shows that WIO turns hard thermal cliffs into elastic trade-offs, achieving up to 2× throughput improvement and 3.75× write latency reduction without application modification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces WIO, a system realizing reversible computational storage on CXL SSDs. I/O-path logic is decomposed into migratable WebAssembly storage actors that share state through coherent CXL.mem regions; an agility-aware scheduler migrates actors via a zero-copy drain-and-switch protocol when thermal or power constraints arise. Evaluation on an FPGA-based CXL SSD prototype and two production CSDs reports up to 2× throughput improvement and 3.75× write latency reduction without application changes.

Significance. If the empirical results hold with complete controls, the work could meaningfully advance computational storage by converting hard thermal/power limits into elastic trade-offs. The combination of WebAssembly portability with CXL coherence for zero-copy actor migration is a concrete step beyond prior persistent-memory and CSD approaches, and the prototype results on both FPGA and production hardware provide a useful existence proof.

major comments (3)
  1. [Evaluation] Evaluation section: the reported 2× throughput and 3.75× latency gains are presented without explicit baselines, workload definitions, number of runs, error bars, or exclusion criteria. These omissions prevent verification that the gains are attributable to the migration mechanism rather than workload artifacts or measurement choices.
  2. [Migration Protocol] Migration and overhead analysis: the zero-copy drain-and-switch protocol is claimed to incur negligible cost, yet no measurements of WebAssembly compilation/JIT time, state serialization cost, or end-to-end migration latency are reported. Without these numbers the net-gain claim cannot be assessed against the skeptic concern that overheads may offset benefits.
  3. [CXL Coherence] CXL coherence evaluation: the assumption that coherent CXL.mem regions support shared actor state without consistency or performance penalties is central, but no coherence miss rates, traffic volume, or scaling behavior with actor count are provided. This leaves the weakest assumption untested.
minor comments (1)
  1. [Abstract] The abstract introduces 'storage actors' and 'agility-aware scheduler' without a brief inline definition; a one-sentence gloss in the abstract or first paragraph of the introduction would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in experimental detail that limit verifiability. We address each point below and will revise the manuscript to supply the missing information and measurements.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported 2× throughput and 3.75× latency gains are presented without explicit baselines, workload definitions, number of runs, error bars, or exclusion criteria. These omissions prevent verification that the gains are attributable to the migration mechanism rather than workload artifacts or measurement choices.

    Authors: We agree the current evaluation section is insufficiently detailed. In the revised manuscript we will add explicit baselines (host-only NVMe I/O and CSD execution without migration), precise workload definitions (fio configurations with block sizes, queue depths, and access patterns), the number of runs (minimum five per configuration), error bars showing standard deviation, and exclusion criteria (discarding the first 30 seconds of each run for warm-up). These additions will demonstrate that the reported gains arise from the agility-aware migration under thermal and power constraints rather than measurement artifacts. revision: yes

  2. Referee: [Migration Protocol] Migration and overhead analysis: the zero-copy drain-and-switch protocol is claimed to incur negligible cost, yet no measurements of WebAssembly compilation/JIT time, state serialization cost, or end-to-end migration latency are reported. Without these numbers the net-gain claim cannot be assessed against the skeptic concern that overheads may offset benefits.

    Authors: The protocol avoids serialization by relying on coherent CXL.mem state sharing and a drain-and-switch hand-off; however, we acknowledge the absence of direct overhead numbers. We will add a dedicated subsection reporting measured end-to-end migration latency (sub-10 µs on the FPGA prototype for representative actors), amortized WebAssembly JIT cost, and confirmation that these overheads remain negligible relative to the observed latency reductions. This will allow readers to verify the net benefit. revision: yes

  3. Referee: [CXL Coherence] CXL coherence evaluation: the assumption that coherent CXL.mem regions support shared actor state without consistency or performance penalties is central, but no coherence miss rates, traffic volume, or scaling behavior with actor count are provided. This leaves the weakest assumption untested.

    Authors: We agree that direct coherence measurements would strengthen the central claim. The revised manuscript will include coherence traffic volume and miss-rate data collected from the FPGA prototype for the actor counts used in the experiments (typically 4–8 actors). We will also discuss observed scaling behavior and note that the CXL.mem protocol itself enforces consistency with no additional software penalties in our workloads. Full scaling to dozens of actors remains future work and will be acknowledged as such. revision: yes
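Following up on response 2 above: the promised end-to-end migration latency is straightforward to pin down by timestamping the hand-off itself. A minimal harness sketch follows, assuming a runtime hook that performs one drain-and-switch; the hook here is a stub standing in for whatever the real runtime exposes.

```c
/* Sketch of the migration-latency measurement promised in response 2:
 * timestamp repeated drain-and-switch hand-offs and report average and
 * worst case. The hook is a stub; in the real system it would re-route new
 * requests to the destination and wait for the source to drain. */
#include <stdio.h>
#include <time.h>

static void drain_and_switch_current_actor(void) {
    /* placeholder for the runtime's actual hand-off entry point */
}

static double elapsed_us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

int main(void) {
    const int reps = 1000;
    double total = 0.0, worst = 0.0;

    for (int i = 0; i < reps; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        drain_and_switch_current_actor();        /* migrate one direction     */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = elapsed_us(t0, t1);
        total += us;
        if (us > worst) worst = us;
        drain_and_switch_current_actor();        /* migrate back for next rep */
    }
    printf("drain-and-switch latency: avg %.1f us, worst %.1f us (%d reps)\n",
           total / reps, worst, reps);
    return 0;
}
```

Reporting the full distribution rather than only the claimed sub-10 µs figure would answer the referee's concern directly.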

Circularity Check

0 steps flagged

No circularity: empirical system evaluation without derivations or fitted predictions

full rationale

The paper presents a system design for reversible computational storage using migratable WebAssembly actors on CXL SSDs, with an agility-aware scheduler and zero-copy migration protocol. All central claims (2× throughput, 3.75× write latency reduction) are supported by direct measurements on an FPGA prototype and production CSDs rather than any equations, parameter fitting, or self-referential derivations. No load-bearing steps reduce to self-citation chains, ansatzes, or renamed known results; the work is self-contained as an engineering prototype evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on domain assumptions about CXL coherence and WebAssembly efficiency plus two new entities introduced without independent falsifiable evidence beyond the reported prototype runs.

axioms (2)
  • domain assumption CXL.mem coherence reliably supports shared state between host and device actors
    Invoked to enable zero-copy state sharing; appears in the description of actor state management.
  • domain assumption WebAssembly compilation and execution on storage devices incurs acceptable overhead
    Required for the actor model to be practical; central to the reversible migration claim.
invented entities (2)
  • storage actors no independent evidence
    purpose: Decomposable, migratable units of I/O-path logic compiled to WebAssembly
    New abstraction introduced to realize reversible compute; no independent evidence outside the prototype.
  • agility-aware scheduler no independent evidence
    purpose: Runtime component that triggers zero-copy migration under thermal or power constraints
    New scheduler logic required for dynamic migration; no independent evidence outside the prototype.

pith-pipeline@v0.9.0 · 5511 in / 1427 out tokens · 48351 ms · 2026-05-13T20:42:08.763426+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 2 internal anchors

  1. [1]

    Acharya, M

    A. Acharya, M. Uysal, and J. Saltz. Active disks: Programming model, algorithms and evaluation. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 81–91. ACM, 1998

  2. [2]

    Agache, M

    A. Agache, M. Brooker, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and D.-M. Popa. Firecracker: Lightweight virtualization for serverless applications. In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 419–434. USENIX Association, 2020

  3. [3]

    G. A. Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. MIT Press, Cambridge, MA, 1986

  4. [4]

    Al Maruf, H

    H. Al Maruf, H. Wang, A. Dhakal, M. Chowdhury, and D. Yeung. TPP: Transparent page placement for CXL-enabled tiered-memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 742–755. ACM, 2023

  5. [5]

    AMD, 2024

    AMD. SmartSSD Computational Storage Drive Installation and User Guide. AMD, 2024. Document ID: UG1382

  6. [6]

    J. Axboe. Efficient IO with io_uring. https://kernel.dk/io_uring.pdf,

  7. [7]

    Linux kernel asynchronous I/O framework

  8. [8]

    Barbalace and J

    A. Barbalace and J. Do. Computational storage: Where are we today? In Proceedings of the 11th Conference on Innovative Data Systems Research (CIDR), 2021

  9. [9]

    L. A. Barroso, J. Clidaras, and U. Hölzle. The datacenter as a computer: Designing warehouse-scale machines. In Synthesis Lectures on Computer Architecture, volume 9, pages 1–189. Morgan & Claypool Publishers, 2014

  10. [10]

    Baumann, P

    A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 29–44. ACM, 2009

  11. [11]

    Wasmtime: A fast and secure runtime for WebAssembly. https://wasmtime.dev/, 2023

    Bytecode Alliance. Wasmtime: A fast and secure runtime for WebAssembly. https://wasmtime.dev/, 2023. Production-ready WebAssembly runtime

  12. [12]

    CCIX Base Specification 1.0

    CCIX Consortium. CCIX Base Specification 1.0. Technical report, CCIX Consortium, 2019. Cache Coherent Interconnect for Accelerators

  13. [13]

    Compute Express Link (CXL) Specification 2.0

    CXL Consortium. Compute Express Link (CXL) Specification 2.0. Technical report, CXL Consortium, 2020. Available at: https://www.computeexpresslink.org/

  14. [14]

    Compute Express Link (CXL) Specification 3.0

    CXL Consortium. Compute Express Link (CXL) Specification 3.0. Technical report, CXL Consortium, 2022. Adds multi-headed and fabric-attached device support. Available at: https://www.computeexpresslink.org/

  15. [15]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024

  16. [16]

    J. Do, S. Sengupta, and S. Swanson. Newport: Enabling fast storage system research with computational storage devices. In Proceedings of the 2020 USENIX Annual Technical Conference (ATC), pages 863–876. USENIX Association, 2020

  17. [17]

    J. J. Dongarra. Performance of various computers using LINPACK benchmark. Technical report, University of Tennessee, 2003. Accessed 2024-01-15

  18. [18]

    Z. Gao, Y. Shan, Y. Hu, and Y. Zhang. Clio: A hardware-software co-designed disaggregated memory system. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 417–433. ACM, 2022

  19. [19]

    Gen-Z Core Specification 1.0

    Gen-Z Consortium. Gen-Z Core Specification 1.0. Technical report, Gen-Z Consortium, 2018. Memory-semantic fabric specification, later absorbed into CXL ecosystem

  20. [20]

    Ghose, A

    S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, and O. Mutlu. Processing-in-memory: A workload-driven perspective. In IBM Journal of Research and Development, volume 63, pages 3:1–3:19. IBM, 2019

  21. [21]

    D. Gouk, S. Lee, M. Kwon, and M. Jung. DirectCXL: Kernel-Bypass for Efficient CXL Memory Access. In Proceedings of the 2023 USENIX Conference on File and Storage Technologies (FAST), pages 287–300. USENIX Association, 2023

  22. [22]

    A. Haas, A. Rossberg, D. L. Schuff, B. L. Titzer, M. Holman, D. Gohman, L. Wagner, A. Zakai, and J. Bastien. Bringing the web up to speed with WebAssembly. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 185–200. ACM, 2017

  23. [23]

    Agilex™7 fpga and soc fpga i-series, 2025

    Intel & Altera. Agilex™ 7 FPGA and SoC FPGA I-Series, 2025

  24. [24]

    Intel Optane Business Update

    Intel Corporation. Intel Optane Business Update. Intel Newsroom,

  25. [25]

    Intel announces wind-down of Optane memory business

  26. [26]

    Intel Corporation, 2023

    Intel Corporation. Intel Xeon Gold 5418Y Processor (45M Cache, 2.00 GHz) Specifications. Intel Corporation, 2023. 4th Gen Intel Xeon Scalable Processor, Sapphire Rapids architecture

  27. [27]

    Storage Performance Development Kit (SPDK)

    Intel Corporation. Storage Performance Development Kit (SPDK). https://spdk.io/, 2023. User-space NVMe driver framework

  28. [28]

    Izraelevitz, J

    J. Izraelevitz, J. Yang, L. Zhang, and J. Kim. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module. Technical report, Non-Volatile Systems Laboratory, UC San Diego, 2019. arXiv:1903.05714

  29. [29]

    Jangda, B

    A. Jangda, B. Powers, E. D. Berger, and A. Guha. Not so fast: Analyzing the performance of WebAssembly vs. native code. In Proceedings of the 2019 USENIX Annual Technical Conference (ATC), pages 107–120. USENIX Association, 2019

  30. [30]

    S.-W. Jun, M. Liu, S. Lee, J. Hicks, J. Ankcorn, M. King, S. Xu, and Arvind. BlueDBM: An appliance for big data analytics. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pages 1–13. ACM, 2015

  31. [31]

    J. Kim, J. Park, and M. Jung. From Block to Byte: Rethinking Storage Interfaces with CXL. arXiv preprint arXiv:2506.15613, 2025. Demonstrates 10.9x performance improvement over PCIe BAR-mapped devices

  32. [32]

    EDSFF E3 Form Factor Overview Tech Brief

    KIOXIA Corporation. EDSFF E3 Form Factor Overview Tech Brief. Technical report, KIOXIA Corporation, 2023

  33. [33]

    I. Kuon, R. Tessier, and J. Rose. FPGA architecture: Survey and challenges. Foundations and Trends in Electronic Design Automation, 2(2):135–253, 2008. Reports FPGA power consumption 5-20x higher than ASIC implementations

  34. [34]

    H. Li, D. S. Berger, L. Hsu, D. Ernst, P. Zardoshti, S. Novakovic, M. Shah, S. Rajadnya, S. Lee, I. Agarwal, M. D. Hill, A. R. Lebeck, S. K. Reinhardt, and D. A. Wood. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2023

  35. [35]

    J. Li, M. Pavlovic, M. Tripunitara, and D. Vucinic. Catalina: In-Storage Processing for Graph Analytics. In Proceedings of the 2023 ACM SIGMOD International Conference on Management of Data, pages 1–16. ACM, 2023

  36. [36]

    Li, H.-F

    S.-P. Li, H.-F. Yang, and M.-H. Tsai. CXL-SSD: Rethinking Storage Devices as Memory Expanders. In Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), pages 1–14. ACM, 2023

  37. [37]

    F. X. Lin, Z. Wang, R. LiKamWa, and L. Zhong. K2: A mobile operating system for heterogeneous coherence domains. In ACM SIGPLAN Notices, volume 49, pages 285–300. ACM, 2014

  38. [38]

    M. Liu, S. Peter, A. Krishnamurthy, and P. M. Phothilimthana. Offloading distributed applications onto SmartNICs using iPipe. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM), pages 318–333. ACM, 2019

  39. [39]

    McCanne and V

    S. McCanne and V. Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In Proceedings of the USENIX Winter 1993 Conference, pages 259–270. USENIX Association, 1993

  40. [40]

    SQL Server 2019: Persistent Memory Support for Memory-Optimized Databases

    Microsoft Corporation. SQL Server 2019: Persistent Memory Support for Memory-Optimized Databases. Technical report, Microsoft, 2019. Technical documentation on PMem integration

  41. [41]

    DoubleJIT-VM: A dual-JIT virtual machine framework. https://github.com/Multi-V-VM/DoubleJIT-VM, 2025

    Multi-V-VM Contributors. DoubleJIT-VM: A dual-JIT virtual machine framework. https://github.com/Multi-V-VM/DoubleJIT-VM, 2025. Accessed: 2025-10-30

  42. [42]

    E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. Helios: Heterogeneous multiprocessing with satellite kernels. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 221–234. ACM, 2009

  43. [43]

    NVM Express Base Specification, Revision 2.1

    NVM Express, Inc. NVM Express Base Specification, Revision 2.1. Technical report, NVM Express, Inc., August 2024

  44. [44]

    Datacenter NVMe SSD Specification v2.0r21

    Open Compute Project. Datacenter NVMe SSD Specification v2.0r21. Technical report, Open Compute Project, 2023

  45. [45]

    Raybuck, T

    A. Raybuck, T. Stampp, W. Zhang, T. C. Mowry, A. Kennell, N. Tiwari, K. Schwan, L. Chen, and U. Ramachandran. Hemem: Scalable tiered memory management for big data applications and real nvm. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP), pages 347–363. ACM, 2021

  46. [46]

    Riedel, C

    E. Riedel, C. Faloutsos, G. A. Gibson, and D. Nagle. Active disks for large-scale data processing. Computer, 34(6):68–74, 2001

  47. [47]

    Z. Ruan, T. He, and J. Cong. INSIDER: Designing in-storage computing system for emerging high-performance drive. In Proceedings of the 2019 USENIX Annual Technical Conference (ATC), pages 379–394. USENIX Association, 2019

  48. [48]

    Samsung Demonstrates Memory-Semantic SSD at Flash Memory Summit

    Samsung Electronics. Samsung Demonstrates Memory-Semantic SSD at Flash Memory Summit. Flash Memory Summit 2022, 2022. CXL.mem front-end with PM9A3 backend demonstration

  49. [49]

    Sanfilippo and P

    S. Sanfilippo and P. Noordhuis. Redis: Remote dictionary server. https://redis.io/documentation, 2023. Accessed 2024-01-15

  50. [50]

    ScaleFlux, Inc., 2024

    ScaleFlux, Inc. ScaleFlux CSD Computational Storage Drive Product Brief. ScaleFlux, Inc., 2024. Describes commercial ASIC-based computational storage drive lineup

  51. [51]

    Y. Shan, Y. Huang, Y. Chen, and Y. Zhang. LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 69–87. USENIX Association, 2018

  52. [52]

    Shelton, M

    B. Shelton, M. Kiefer, K. Turowski, R. Robinson, P. Olivier, A. Barbalace, and B. Ravindran. Popcorn Linux: enabling efficient inter-core communication in a Linux-based multikernel operating system. In Proceedings of the Linux Symposium, 2013

  53. [53]

    Computational Storage Architecture and Programming Model

    Storage Networking Industry Association. Computational Storage Architecture and Programming Model. Technical report, SNIA, 2019. Technical Working Group specification

  54. [54]

    Sturm, N

    J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580. IEEE, 2012

  55. [55]

    Y. Sun, Y. Yuan, Z. Yu, R. Kuper, C. Song, J. Huang, H. Ji, S. Agarwal, J. Lou, I. Jeong, R. Wang, J. H. Kim, B. Rao, B. Tao, T. Shinde, M. Alian, N. S. Kim, J. Haj-Yihia, P. Nair, C.-L. Lim, and M. Jung. Demystifying CXL memory with genuine CXL-ready systems and devices. In Proceedings of the 56th IEEE/ACM International Symposium on Microarchitecture...

  56. [56]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  57. [57]

    Intel Optane DC Persistent Memory in VMware vSphere

    VMware Inc. Intel Optane DC Persistent Memory in VMware vSphere. Technical report, VMware, 2020. Performance analysis and deployment guide

  58. [58]

    Willhalm, N

    T. Willhalm, N. Popovici, C. Booss, and P. Kumar. Let Us Be Persistent: HANA Adoption of Non-Volatile Memory. Proceedings of the VLDB Endowment, 16(11):2761–2774, 2023

  59. [59]

    Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee. Nimble page management for tiered memory systems. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 331–345. ACM, 2019

  60. [60]

    J. Yang, J. Kim, M. Hoseinzadeh, J. Izraelevitz, and S. Swanson. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST), pages 169–182. USENIX Association, 2020

  61. [61]

    X. Yang, Y. Zhang, H. Chen, F. Wang, C. Gao, J. Li, and G. Fan. Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases. In Companion of the 2025 International Conference on Management of Data (SIGMOD), pages 1–16. ACM, 2025. Best Paper Award. First scalable CXL memory pooling for production cloud databases

  62. [62]

    Y. Yang, A. Hu, Y. Zheng, B. Zhao, X. Zhang, D. Xiang, K. Chu, W. Zhang, and A. Quinn. Mvvm: Deploy your ai agents-securely, efficiently, everywhere, 2025

  63. [63]

    Zhong, V

    Y. Zhong, V. Gogte, G. Petri, M. K. Qureshi, B. Lucia, and B. Falsafi. Managing Memory Tiers with CXL in Virtualized Environments. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 1–18. USENIX Association, 2024. Microsoft's CXL memory tiering system achieving 32-67% performance improvement

  64. [64]

    Zhong, H

    Y. Zhong, H. Li, Y. J. Wu, I. Zarkadas, J. Tao, E. Mesterhazy, M. Makris, J. Yang, A. Tai, J. Nieh, and A. Cidon. XRP: In-kernel storage functions with eBPF. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 375–393. USENIX Association, 2022

  65. [65]

    Y. Zhu, M. Ghobadi, V. Singla, S. Ratnasamy, and S. Shenker. Disaggregated Memory Architectures for Data Centers: Opportunities and Challenges. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 99–114. USENIX Association, 2017