pith. machine review for the scientific record.

arxiv: 2604.02442 · v1 · submitted 2026-04-02 · 💻 cs.OS

Recognition: no theorem link

WIO: Upload-Enabled Computational Storage on CXL SSDs

Andi Quinn, Estabon Ramos, Jianchang Su, Wei Zhang, Yanpeng Hu, Yiwei Yang, Yusheng Zheng

Pith reviewed 2026-05-13 20:42 UTC · model grok-4.3

classification 💻 cs.OS
keywords computational storage · CXL SSD · WebAssembly · storage actors · thermal management · reversible compute · data movement

The pith

WIO makes storage compute reversible on CXL SSDs by migrating WebAssembly actors to avoid thermal and power limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that fixed computational storage has not displaced ordinary SSDs because it creates hard thermal and power cliffs under sustained load. WIO instead decomposes I/O-path logic into storage actors compiled to WebAssembly that can migrate dynamically between host and device. Actors share state through coherent CXL memory regions and switch sides with a zero-copy protocol when constraints tighten. Evaluation on an FPGA prototype and two production devices shows up to 2× higher throughput and 3.75× lower write latency without any application changes. A sympathetic reader would see this as turning rigid device offload into an elastic runtime choice.
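To make the actor abstraction concrete, here is a minimal sketch of what a pipeline-stage storage actor's interface could look like when compiled to WebAssembly with clang. The types, export names, and fields are illustrative assumptions, not the paper's API.

```c
/* Hedged sketch of a pipeline-stage storage actor interface, compiled to
 * WebAssembly (e.g. clang --target=wasm32). Names and fields are
 * illustrative assumptions, not the paper's actual API. */
#include <stddef.h>
#include <stdint.h>

/* A message handed from one pipeline stage to the next. The payload lives in
 * the coherent shared region, so migrating the actor does not copy it. */
typedef struct {
    uint8_t *data;   /* payload in the coherent CXL.mem shared region */
    size_t   len;    /* valid bytes in data */
    uint64_t lba;    /* logical block address the request targets */
} io_msg_t;

/* Instantiated on host or device with a view of its shared state. */
__attribute__((export_name("actor_init")))
int actor_init(void *shared_state, size_t state_len) {
    (void)shared_state; (void)state_len;
    return 0;
}

/* Per-request step: consume the predecessor's message, produce the
 * successor's. This trivial actor just passes the payload through. */
__attribute__((export_name("actor_step")))
int actor_step(const io_msg_t *in, io_msg_t *out) {
    *out = *in;
    return 0;
}

/* Invoked by the runtime before migration so in-flight work can drain. */
__attribute__((export_name("actor_quiesce")))
void actor_quiesce(void) { }
```

Because the interface is only these entry points plus a position in the pipeline, the same module could in principle be instantiated on either side of the CXL link.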

Core claim

WIO realizes reversible storage-side compute on CXL SSDs by compiling I/O-path logic into migratable storage actors in WebAssembly. These actors share state through coherent CXL.mem regions; an agility-aware scheduler migrates them via a zero-copy drain-and-switch protocol when thermal or power constraints arise. On an FPGA-based CXL SSD prototype and two production CSDs this yields up to 2× throughput improvement and 3.75× write latency reduction without application modification.
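Read as a control loop, the scheduler this claim describes could take roughly the following shape. Thresholds, telemetry fields, and function names below are assumptions for illustration, not the paper's implementation.

```c
/* Hedged sketch of an agility-aware drain-and-switch decision; thresholds,
 * telemetry fields, and names are illustrative, not the paper's code. */
#include <stdbool.h>

enum placement { ON_DEVICE, ON_HOST };

struct telemetry {
    double device_temp_c;    /* on-device temperature sensor */
    double device_power_w;   /* device power draw */
    double host_cpu_util;    /* host CPU utilization, 0..1 */
};

struct actor {
    enum placement where;
    bool (*drained)(const struct actor *);   /* in-flight work finished? */
};

/* Route new requests to the destination immediately, then wait for the
 * source to finish what it already accepted; no copy of actor state is
 * needed because state lives in the coherent CXL.mem region. */
static void drain_and_switch(struct actor *a, enum placement dst,
                             void (*route_new_requests_to)(enum placement))
{
    route_new_requests_to(dst);          /* new work goes to dst at once  */
    while (!a->drained(a))               /* source drains in-flight work  */
        ;                                /* (busy-wait kept simple here)  */
    a->where = dst;                      /* hand-off complete, zero copy  */
}

/* Policy: back off the device as it approaches thermal or power limits,
 * return when headroom exists and the host is under pressure. */
static void schedule(struct actor *a, const struct telemetry *t,
                     void (*route)(enum placement))
{
    const double TEMP_LIMIT_C = 85.0, POWER_LIMIT_W = 25.0;   /* assumed */

    if (a->where == ON_DEVICE &&
        (t->device_temp_c > TEMP_LIMIT_C || t->device_power_w > POWER_LIMIT_W))
        drain_and_switch(a, ON_HOST, route);
    else if (a->where == ON_HOST &&
             t->device_temp_c < TEMP_LIMIT_C - 10.0 && t->host_cpu_util > 0.8)
        drain_and_switch(a, ON_DEVICE, route);
}
```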

What carries the argument

Migratable storage actors compiled to WebAssembly that share state through coherent CXL.mem regions and migrate via zero-copy drain-and-switch.

Load-bearing premise

Compiling and migrating WebAssembly actors incurs low enough overhead that net gains appear even when CXL coherence manages shared state under load.
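Stated quantitatively (our formulation, not the paper's): if one migration costs C_mig (WebAssembly instantiation plus the drain-and-switch hand-off) and the next N requests then complete with per-request latency L_host instead of the throttled device latency, the premise is that

```latex
% Hedged net-gain condition; symbols are ours, not the paper's.
C_{\mathrm{mig}} + N \cdot L_{\mathrm{host}}
  \;<\; N \cdot L_{\mathrm{dev}}^{\mathrm{throttled}}
\quad\Longleftrightarrow\quad
C_{\mathrm{mig}} \;<\; N \left( L_{\mathrm{dev}}^{\mathrm{throttled}} - L_{\mathrm{host}} \right)
```

holds comfortably in practice, i.e. the one-time cost is amortized quickly once the device begins throttling.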

What would settle it

Run continuous high-load workloads until the device reaches thermal limits; if measured throughput or latency shows no improvement over a non-migrating baseline after accounting for migration cost, the claimed elastic trade-off does not hold.
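A minimal soak-test harness for that experiment might look like the following, assuming a Linux testbed; the device path and hwmon temperature file are placeholders for whatever the hardware exposes, and fio or a similar tool would serve equally well. Run it once against the migrating configuration and once against a pinned, non-migrating baseline; the claim fails if the steady-state curves coincide once migration cost is charged.

```c
/* Sketch of the settling experiment: sustained 1 MiB direct writes for 30
 * minutes, logging throughput and device temperature once per second.
 * /dev/nvme0n1 and the hwmon path are placeholders for the actual testbed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK (1 << 20)

static double now_s(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static double temp_c(const char *hwmon) {          /* e.g. .../temp1_input */
    long milli = 0;
    FILE *f = fopen(hwmon, "r");
    if (f) { fscanf(f, "%ld", &milli); fclose(f); }
    return milli / 1000.0;
}

int main(int argc, char **argv) {
    const char *dev   = argc > 1 ? argv[1] : "/dev/nvme0n1";
    const char *hwmon = argc > 2 ? argv[2] : "/sys/class/hwmon/hwmon0/temp1_input";
    int fd = open(dev, O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf = aligned_alloc(4096, BLOCK);          /* O_DIRECT needs alignment */
    memset(buf, 0xA5, BLOCK);

    double start = now_s(), win = start;
    long total = 0, win_bytes = 0;
    while (now_s() - start < 1800.0) {               /* 30-minute thermal soak  */
        if (pwrite(fd, buf, BLOCK, total % (16L << 30)) != BLOCK) break;
        total += BLOCK;
        win_bytes += BLOCK;
        if (now_s() - win >= 1.0) {                  /* log once per second     */
            printf("%6.0f s  %8.1f MB/s  %5.1f C\n", now_s() - start,
                   win_bytes / (now_s() - win) / 1e6, temp_c(hwmon));
            win_bytes = 0;
            win = now_s();
        }
    }
    free(buf);
    close(fd);
    return 0;
}
```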

Figures

Figures reproduced from arXiv: 2604.02442 by Andi Quinn, Estabon Ramos, Jianchang Su, Wei Zhang, Yanpeng Hu, Yiwei Yang, Yusheng Zheng.

Figure 1
Figure 1. Sustained write throughput (solid) and device temperature (dotted) over time. The CXL SSD with migration maintains throughput; Samsung and ScaleFlux suffer 50–60% drops from thermal throttling. view at source ↗
Figure 2
Figure 2. Sub-512 B I/O performance. The CXL SSD achieves 5.4 µs latency at 8 B writes versus 38 µs (SmartSSD) and 80.6 µs (ScaleFlux). view at source ↗
Figure 3
Figure 3. WIO architecture overview with host domain, device domain, and coherent PMR shared region. view at source ↗
Figure 4
Figure 4. RTL organization of the prototype. view at source ↗
Figure 5
Figure 5. Evaluation breakdown of WIO across byte-addressable access, PMR behavior, queue scaling, WASM overhead, scheduler telemetry, and thermal stability. Panels include sequential read and write throughput versus block size for Samsung, ScaleFlux, and AgileStore. view at source ↗
Figure 7
Figure 7. IOPS scalability across queue depths on three platforms. view at source ↗
Figure 10
Figure 10. Read and write throughput under four access distributions (Uni=Uniform, Zip=Zipfian, Norm=Normal, Par=Pareto) with 4 KB operations. view at source ↗
Figure 9
Figure 9. Throughput sensitivity to read/write ratio (4 KB random). view at source ↗
Figure 14
Figure 14. Compression ratio and throughput overhead across data types (Bin=Binary, Enc=Encrypted, DB=Database). view at source ↗
Figure 13
Figure 13. WASM runtime (MVVM) execution time compared with native C on device ARM cores. view at source ↗
Figure 16
Figure 16. DeepSeek inference: CXL DRAM vs. tiered CXL SSD. view at source ↗
read the original abstract

The widening gap between processor speed and storage latency has made data movement a dominant bottleneck in modern systems. Two lines of storage-layer innovation attempted to close this gap: persistent memory shortened the latency hierarchy, while computational storage devices pushed processing toward the data. Neither has displaced conventional NVMe SSDs at scale, largely due to programming complexity, ecosystem fragmentation, and thermal/power cliffs under sustained load. We argue that storage-side compute should be reversible: computation should migrate dynamically between host and device based on runtime conditions. We present WIO, which realizes this principle on CXL SSDs by decomposing I/O-path logic into migratable storage actors compiled to WebAssembly. Actors share state through coherent CXL.mem regions; an agility-aware scheduler migrates them via a zero-copy drain-and-switch protocol when thermal or power constraints arise. Our evaluation on an FPGA-based CXL SSD prototype and two production CSDs shows that WIO turns hard thermal cliffs into elastic trade-offs, achieving up to 2× throughput improvement and 3.75× write latency reduction without application modification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces WIO, a system realizing reversible computational storage on CXL SSDs. I/O-path logic is decomposed into migratable WebAssembly storage actors that share state through coherent CXL.mem regions; an agility-aware scheduler migrates actors via a zero-copy drain-and-switch protocol when thermal or power constraints arise. Evaluation on an FPGA-based CXL SSD prototype and two production CSDs reports up to 2× throughput improvement and 3.75× write latency reduction without application changes.

Significance. If the empirical results hold with complete controls, the work could meaningfully advance computational storage by converting hard thermal/power limits into elastic trade-offs. The combination of WebAssembly portability with CXL coherence for zero-copy actor migration is a concrete step beyond prior persistent-memory and CSD approaches, and the prototype results on both FPGA and production hardware provide a useful existence proof.

major comments (3)
  1. [Evaluation] Evaluation section: the reported 2× throughput and 3.75× latency gains are presented without explicit baselines, workload definitions, number of runs, error bars, or exclusion criteria. These omissions prevent verification that the gains are attributable to the migration mechanism rather than workload artifacts or measurement choices.
  2. [Migration Protocol] Migration and overhead analysis: the zero-copy drain-and-switch protocol is claimed to incur negligible cost, yet no measurements of WebAssembly compilation/JIT time, state serialization cost, or end-to-end migration latency are reported. Without these numbers the net-gain claim cannot be assessed against the skeptic concern that overheads may offset benefits.
  3. [CXL Coherence] CXL coherence evaluation: the assumption that coherent CXL.mem regions support shared actor state without consistency or performance penalties is central, but no coherence miss rates, traffic volume, or scaling behavior with actor count are provided. This leaves the weakest assumption untested.
minor comments (1)
  1. [Abstract] The abstract introduces 'storage actors' and 'agility-aware scheduler' without a brief inline definition; a one-sentence gloss in the abstract or first paragraph of the introduction would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in experimental detail that limit verifiability. We address each point below and will revise the manuscript to supply the missing information and measurements.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported 2× throughput and 3.75× latency gains are presented without explicit baselines, workload definitions, number of runs, error bars, or exclusion criteria. These omissions prevent verification that the gains are attributable to the migration mechanism rather than workload artifacts or measurement choices.

    Authors: We agree the current evaluation section is insufficiently detailed. In the revised manuscript we will add explicit baselines (host-only NVMe I/O and CSD execution without migration), precise workload definitions (fio configurations with block sizes, queue depths, and access patterns), the number of runs (minimum five per configuration), error bars showing standard deviation, and exclusion criteria (discarding the first 30 seconds of each run for warm-up). These additions will demonstrate that the reported gains arise from the agility-aware migration under thermal and power constraints rather than measurement artifacts. revision: yes

  2. Referee: [Migration Protocol] Migration and overhead analysis: the zero-copy drain-and-switch protocol is claimed to incur negligible cost, yet no measurements of WebAssembly compilation/JIT time, state serialization cost, or end-to-end migration latency are reported. Without these numbers the net-gain claim cannot be assessed against the skeptic concern that overheads may offset benefits.

    Authors: The protocol avoids serialization by relying on coherent CXL.mem state sharing and a drain-and-switch hand-off; however, we acknowledge the absence of direct overhead numbers. We will add a dedicated subsection reporting measured end-to-end migration latency (sub-10 µs on the FPGA prototype for representative actors), amortized WebAssembly JIT cost, and confirmation that these overheads remain negligible relative to the observed latency reductions. This will allow readers to verify the net benefit. revision: yes

  3. Referee: [CXL Coherence] CXL coherence evaluation: the assumption that coherent CXL.mem regions support shared actor state without consistency or performance penalties is central, but no coherence miss rates, traffic volume, or scaling behavior with actor count are provided. This leaves the weakest assumption untested.

    Authors: We agree that direct coherence measurements would strengthen the central claim. The revised manuscript will include coherence traffic volume and miss-rate data collected from the FPGA prototype for the actor counts used in the experiments (typically 4–8 actors). We will also discuss observed scaling behavior and note that the CXL.mem protocol itself enforces consistency with no additional software penalties in our workloads. Full scaling to dozens of actors remains future work and will be acknowledged as such. revision: yes
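Following up on response 2 above: the promised end-to-end migration latency is straightforward to pin down by timestamping the hand-off itself. A minimal harness sketch follows, assuming a runtime hook that performs one drain-and-switch; the hook here is a stub standing in for whatever the real runtime exposes.

```c
/* Sketch of the migration-latency measurement promised in response 2:
 * timestamp repeated drain-and-switch hand-offs and report average and
 * worst case. The hook is a stub; in the real system it would re-route new
 * requests to the destination and wait for the source to drain. */
#include <stdio.h>
#include <time.h>

static void drain_and_switch_current_actor(void) {
    /* placeholder for the runtime's actual hand-off entry point */
}

static double elapsed_us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

int main(void) {
    const int reps = 1000;
    double total = 0.0, worst = 0.0;

    for (int i = 0; i < reps; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        drain_and_switch_current_actor();        /* migrate one direction     */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = elapsed_us(t0, t1);
        total += us;
        if (us > worst) worst = us;
        drain_and_switch_current_actor();        /* migrate back for next rep */
    }
    printf("drain-and-switch latency: avg %.1f us, worst %.1f us (%d reps)\n",
           total / reps, worst, reps);
    return 0;
}
```

Reporting the full distribution rather than only the claimed sub-10 µs figure would answer the referee's concern directly.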

Circularity Check

0 steps flagged

No circularity: empirical system evaluation without derivations or fitted predictions

full rationale

The paper presents a system design for reversible computational storage using migratable WebAssembly actors on CXL SSDs, with an agility-aware scheduler and zero-copy migration protocol. All central claims (2× throughput, 3.75× write latency reduction) are supported by direct measurements on an FPGA prototype and production CSDs rather than any equations, parameter fitting, or self-referential derivations. No load-bearing steps reduce to self-citation chains, ansatzes, or renamed known results; the work is self-contained as an engineering prototype evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on domain assumptions about CXL coherence and WebAssembly efficiency plus two new entities introduced without independent falsifiable evidence beyond the reported prototype runs.

axioms (2)
  • domain assumption CXL.mem coherence reliably supports shared state between host and device actors
    Invoked to enable zero-copy state sharing; appears in the description of actor state management.
  • domain assumption WebAssembly compilation and execution on storage devices incurs acceptable overhead
    Required for the actor model to be practical; central to the reversible migration claim.
invented entities (2)
  • storage actors no independent evidence
    purpose: Decomposable, migratable units of I/O-path logic compiled to WebAssembly
    New abstraction introduced to realize reversible compute; no independent evidence outside the prototype.
  • agility-aware scheduler no independent evidence
    purpose: Runtime component that triggers zero-copy migration under thermal or power constraints
    New scheduler logic required for dynamic migration; no independent evidence outside the prototype.

pith-pipeline@v0.9.0 · 5511 in / 1427 out tokens · 48351 ms · 2026-05-13T20:42:08.763426+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 2 internal anchors

  1. [1]

    Acharya, M

    A. Acharya, M. Uysal, and J. Saltz. Active disks: Programming model, algorithms and evaluation. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 81–91. ACM, 1998

  2. [2]

    Agache, M

    A. Agache, M. Brooker, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and D.-M. Popa. Firecracker: Lightweight virtualization for serverless applications. In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 419–434. USENIX Association, 2020

  3. [3]

    G. A. Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. MIT Press, Cambridge, MA, 1986

  4. [4]

    Al Maruf, H

    H. Al Maruf, H. Wang, A. Dhakal, M. Chowdhury, and D. Yeung. TPP: Transparent page placement for CXL-enabled tiered-memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 742–755. ACM, 2023

  5. [5]

    AMD, 2024

    AMD. SmartSSD Computational Storage Drive Installation and User Guide. AMD, 2024. Document ID: UG1382

  6. [6]

    J. Axboe. Efficient IO with io_uring. https://kernel.dk/io_uring.pdf,

  7. [7]

    Linux kernel asynchronous I/O framework

  8. [8]

    Barbalace and J

    A. Barbalace and J. Do. Computational storage: Where are we today? In Proceedings of the 11th Conference on Innovative Data Systems Research (CIDR), 2021

  9. [9]

    L. A. Barroso, J. Clidaras, and U. Hölzle. The datacenter as a computer: Designing warehouse-scale machines. In Synthesis Lectures on Computer Architecture, volume 9, pages 1–189. Morgan & Claypool Publishers, 2014

  10. [10]

    Baumann, P

    A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 29–44. ACM, 2009

  11. [11]

    Wasmtime: A fast and secure runtime for WebAssembly. https://wasmtime.dev/, 2023

    Bytecode Alliance. Wasmtime: A fast and secure runtime for WebAssembly. https://wasmtime.dev/, 2023. Production-ready WebAssembly runtime

  12. [12]

    CCIX Base Specification 1.0

    CCIX Consortium. CCIX Base Specification 1.0. Technical report, CCIX Consortium, 2019. Cache Coherent Interconnect for Accelerators

  13. [13]

    Compute Express Link (CXL) Specification 2.0

    CXL Consortium. Compute Express Link (CXL) Specification 2.0. Technical report, CXL Consortium, 2020. Available at: https://www.computeexpresslink.org/

  14. [14]

    Compute Express Link (CXL) Specification 3.0

    CXL Consortium. Compute Express Link (CXL) Specification 3.0. Technical report, CXL Consortium, 2022. Adds multi-headed and fabric-attached device support. Available at: https://www.computeexpresslink.org/

  15. [15]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024

  16. [16]

    J. Do, S. Sengupta, and S. Swanson. Newport: Enabling fast storage system research with computational storage devices. In Proceedings of the 2020 USENIX Annual Technical Conference (ATC), pages 863–876. USENIX Association, 2020

  17. [17]

    J. J. Dongarra. Performance of various computers using LINPACK benchmark. Technical report, University of Tennessee, 2003. Accessed 2024-01-15

  18. [18]

    Z. Gao, Y. Shan, Y. Hu, and Y. Zhang. Clio: A hardware-software co-designed disaggregated memory system. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 417–433. ACM, 2022

  19. [19]

    Gen-Z Core Specification 1.0

    Gen-Z Consortium. Gen-Z Core Specification 1.0. Technical report, Gen-Z Consortium, 2018. Memory-semantic fabric specification, later absorbed into CXL ecosystem

  20. [20]

    Ghose, A

    S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, and O. Mutlu. Processing-in-memory: A workload-driven perspective. In IBM Journal of Research and Development, volume 63, pages 3:1–3:19. IBM, 2019

  21. [21]

    D. Gouk, S. Lee, M. Kwon, and M. Jung. DirectCXL: Kernel-Bypass for Efficient CXL Memory Access. In Proceedings of the 2023 USENIX Conference on File and Storage Technologies (FAST), pages 287–300. USENIX Association, 2023

  22. [22]

    A. Haas, A. Rossberg, D. L. Schuff, B. L. Titzer, M. Holman, D. Gohman, L. Wagner, A. Zakai, and J. Bastien. Bringing the web up to speed with WebAssembly. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 185–200. ACM, 2017

  23. [23]

    Agilex™7 fpga and soc fpga i-series, 2025

    Intel & Altera. Agilex™ 7 FPGA and SoC FPGA I-Series, 2025

  24. [24]

    Intel Optane Business Update

    Intel Corporation. Intel Optane Business Update. Intel Newsroom,

  25. [25]

    Intel announces wind-down of Optane memory business

  26. [26]

    Intel Corporation, 2023

    Intel Corporation. Intel Xeon Gold 5418Y Processor (45M Cache, 2.00 GHz) Specifications. Intel Corporation, 2023. 4th Gen Intel Xeon Scalable Processor, Sapphire Rapids architecture

  27. [27]

    Storage Performance Development Kit (SPDK)

    Intel Corporation. Storage Performance Development Kit (SPDK). https://spdk.io/, 2023. User-space NVMe driver framework

  28. [28]

    Izraelevitz, J

    J. Izraelevitz, J. Yang, L. Zhang, and J. Kim. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module. Technical report, Non-Volatile Systems Laboratory, UC San Diego, 2019. arXiv:1903.05714

  29. [29]

    Jangda, B

    A. Jangda, B. Powers, E. D. Berger, and A. Guha. Not so fast: Analyzing the performance of WebAssembly vs. native code. In Proceedings of the 2019 USENIX Annual Technical Conference (ATC), pages 107–120. USENIX Association, 2019

  30. [30]

    S.-W. Jun, M. Liu, S. Lee, J. Hicks, J. Ankcorn, M. King, S. Xu, and Arvind. BlueDBM: An appliance for big data analytics. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pages 1–13. ACM, 2015

  31. [31]

    J. Kim, J. Park, and M. Jung. From Block to Byte: Rethinking Storage Interfaces with CXL. arXiv preprint arXiv:2506.15613, 2025. Demonstrates 10.9x performance improvement over PCIe BAR-mapped devices

  32. [32]

    EDSFF E3 Form Factor Overview Tech Brief

    KIOXIA Corporation. EDSFF E3 Form Factor Overview Tech Brief. Technical report, KIOXIA Corporation, 2023

  33. [33]

    I. Kuon, R. Tessier, and J. Rose. FPGA architecture: Survey and challenges. Foundations and Trends in Electronic Design Automation, 2(2):135–253, 2008. Reports FPGA power consumption 5-20x higher than ASIC implementations

  34. [34]

    H. Li, D. S. Berger, L. Hsu, D. Ernst, P. Zardoshti, S. Novakovic, M. Shah, S. Rajadnya, S. Lee, I. Agarwal, M. D. Hill, A. R. Lebeck, S. K. Reinhardt, and D. A. Wood. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2023

  35. [35]

    J. Li, M. Pavlovic, M. Tripunitara, and D. Vucinic. Catalina: In-Storage Processing for Graph Analytics. In Proceedings of the 2023 ACM SIGMOD International Conference on Management of Data, pages 1–16. ACM, 2023

  36. [36]

    Li, H.-F

    S.-P. Li, H.-F. Yang, and M.-H. Tsai. CXL-SSD: Rethinking Storage Devices as Memory Expanders. In Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), pages 1–14. ACM, 2023

  37. [37]

    F. X. Lin, Z. Wang, R. LiKamWa, and L. Zhong. K2: A mobile operating system for heterogeneous coherence domains. In ACM SIGPLAN Notices, volume 49, pages 285–300. ACM, 2014

  38. [38]

    M. Liu, S. Peter, A. Krishnamurthy, and P. M. Phothilimthana. Offloading distributed applications onto SmartNICs using iPipe. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM), pages 318–333. ACM, 2019

  39. [39]

    McCanne and V

    S. McCanne and V. Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In Proceedings of the USENIX Winter 1993 Conference, pages 259–270. USENIX Association, 1993

  40. [40]

    SQL Server 2019: Persistent Memory Support for Memory-Optimized Databases

    Microsoft Corporation. SQL Server 2019: Persistent Memory Support for Memory-Optimized Databases. Technical report, Microsoft, 2019. Technical documentation on PMem integration

  41. [41]

    DoubleJIT-VM: A dual-JIT virtual machine framework. https://github.com/Multi-V-VM/DoubleJIT-VM, 2025

    Multi-V-VM Contributors. DoubleJIT-VM: A dual-JIT virtual machine framework. https://github.com/Multi-V-VM/DoubleJIT-VM, 2025. Accessed: 2025-10-30

  42. [42]

    E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. Helios: Heterogeneous multiprocessing with satellite kernels. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 221–234. ACM, 2009

  43. [43]

    NVM Express Base Specification, Revision 2.1

    NVM Express, Inc. NVM Express Base Specification, Revision 2.1. Technical report, NVM Express, Inc., August 2024

  44. [44]

    Datacenter NVMe SSD Specification v2.0r21

    Open Compute Project. Datacenter NVMe SSD Specification v2.0r21. Technical report, Open Compute Project, 2023

  45. [45]

    Raybuck, T

    A. Raybuck, T. Stampp, W. Zhang, T. C. Mowry, A. Kennell, N. Tiwari, K. Schwan, L. Chen, and U. Ramachandran. Hemem: Scalable tiered memory management for big data applications and real nvm. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP), pages 347–363. ACM, 2021

  46. [46]

    Riedel, C

    E. Riedel, C. Faloutsos, G. A. Gibson, and D. Nagle. Active disks for large-scale data processing. Computer, 34(6):68–74, 2001

  47. [47]

    Z. Ruan, T. He, and J. Cong. INSIDER: Designing in-storage computing system for emerging high-performance drive. In Proceedings of the 2019 USENIX Annual Technical Conference (ATC), pages 379–394. USENIX Association, 2019

  48. [48]

    Samsung Demonstrates Memory-Semantic SSD at Flash Memory Summit

    Samsung Electronics. Samsung Demonstrates Memory-Semantic SSD at Flash Memory Summit. Flash Memory Summit 2022, 2022. CXL.mem front-end with PM9A3 backend demonstration

  49. [49]

    Sanfilippo and P

    S. Sanfilippo and P. Noordhuis. Redis: Remote dictionary server. https://redis.io/documentation, 2023. Accessed 2024-01-15

  50. [50]

    ScaleFlux, Inc., 2024

    ScaleFlux, Inc. ScaleFlux CSD Computational Storage Drive Product Brief. ScaleFlux, Inc., 2024. Describes commercial ASIC-based computational storage drive lineup

  51. [51]

    Y. Shan, Y. Huang, Y. Chen, and Y. Zhang. LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 69–87. USENIX Association, 2018

  52. [52]

    Shelton, M

    B. Shelton, M. Kiefer, K. Turowski, R. Robinson, P. Olivier, A. Barbalace, and B. Ravindran. Popcorn Linux: enabling efficient inter-core communication in a Linux-based multikernel operating system. In Proceedings of the Linux Symposium, 2013

  53. [53]

    Computational Storage Architecture and Programming Model

    Storage Networking Industry Association. Computational Storage Architecture and Programming Model. Technical report, SNIA, 2019. Technical Working Group specification

  54. [54]

    Sturm, N

    J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580. IEEE, 2012

  55. [55]

    Y. Sun, Y. Yuan, Z. Yu, R. Kuper, C. Song, J. Huang, H. Ji, S. Agarwal, J. Lou, I. Jeong, R. Wang, J. H. Kim, B. Rao, B. Tao, T. Shinde, M. Alian, N. S. Kim, J. Haj-Yihia, P. Nair, C.-L. Lim, and M. Jung. Demystifying CXL memory with genuine CXL-ready systems and devices. In Proceedings of the 56th IEEE/ACM International Symposium on Microarchitecture...

  56. [56]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  57. [57]

    Intel Optane DC Persistent Memory in VMware vSphere

    VMware Inc. Intel Optane DC Persistent Memory in VMware vSphere. Technical report, VMware, 2020. Performance analysis and deployment guide

  58. [58]

    Willhalm, N

    T. Willhalm, N. Popovici, C. Booss, and P. Kumar. Let Us Be Persistent: HANA Adoption of Non-Volatile Memory. Proceedings of the VLDB Endowment, 16(11):2761–2774, 2023

  59. [59]

    Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee. Nimble page management for tiered memory systems. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 331–345. ACM, 2019

  60. [60]

    J. Yang, J. Kim, M. Hoseinzadeh, J. Izraelevitz, and S. Swanson. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST), pages 169–182. USENIX Association, 2020

  61. [61]

    X. Yang, Y. Zhang, H. Chen, F. Wang, C. Gao, J. Li, and G. Fan. Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases. In Companion of the 2025 International Conference on Management of Data (SIGMOD), pages 1–16. ACM, 2025. Best Paper Award. First scalable CXL memory pooling for production cloud databases

  62. [62]

    Y. Yang, A. Hu, Y. Zheng, B. Zhao, X. Zhang, D. Xiang, K. Chu, W. Zhang, and A. Quinn. Mvvm: Deploy your ai agents-securely, efficiently, everywhere, 2025

  63. [63]

    Zhong, V

    Y. Zhong, V. Gogte, G. Petri, M. K. Qureshi, B. Lucia, and B. Falsafi. Managing Memory Tiers with CXL in Virtualized Environments. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 1–18. USENIX Association, 2024. Microsoft's CXL memory tiering system achieving 32-67% performance improvement

  64. [64]

    Zhong, H

    Y. Zhong, H. Li, Y. J. Wu, I. Zarkadas, J. Tao, E. Mesterhazy, M. Makris, J. Yang, A. Tai, J. Nieh, and A. Cidon. XRP: In-kernel storage functions with eBPF. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 375–393. USENIX Association, 2022

  65. [65]

    Y. Zhu, M. Ghobadi, V. Singla, S. Ratnasamy, and S. Shenker. Disaggregated Memory Architectures for Data Centers: Opportunities and Challenges. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 99–114. USENIX Association, 2017