pith. machine review for the scientific record.

arxiv: 2604.18120 · v1 · submitted 2026-04-20 · 💻 cs.OS · cs.AR · cs.ET · cs.SE

Recognition: unknown

Proxics: an efficient programming model for far memory accelerators

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 03:13 UTC · model grok-4.3

classification 💻 cs.OS · cs.AR · cs.ET · cs.SE
keywords near-data processing · programming model · far memory · OS abstractions · NDP accelerators · processes · IPC channels · disaggregated memory

The pith

Near-data processing accelerators can be programmed with familiar processes and pipe-like channels, provided those abstractions are made lightweight through compilation and interconnect protocols.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that NDP devices for far memory can be programmed with standard OS abstractions of virtual processors and inter-process communication channels. A direct port of classical processes and shared-buffer IPC would be too heavy for limited accelerator cores and would defeat the purpose of cutting memory bandwidth. Instead the authors implement these abstractions efficiently by exploiting compilation passes and hardware interconnect protocols. They demonstrate the result on real hardware across bulk memory operations, in-memory databases and graph workloads, reporting gains over CPU-only code and stressing that low-latency CPU-to-accelerator links are essential.

Core claim

We propose Proxics, a programming model for NDP devices based on virtual processors and IPC channels like Unix pipes. These abstractions are realized in a lightweight manner by leveraging compilation and interconnect protocols rather than traditional heavyweight mechanisms or high-bandwidth shared buffers. On a real hardware platform the model supports applications with varied memory access patterns and delivers benefits beyond CPU-only execution while showing the critical role of efficient, low-latency communication between host CPU and NDP accelerator.

What carries the argument

The Proxics model of virtual processors and low-overhead IPC channels, realized through compilation and hardware interconnect protocols.
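
To make the shape of that model concrete, here is a minimal, self-contained sketch in C. The paper's actual API is not reproduced in this review; the channel type and the ndp_write/ndp_read helpers are invented names, and the "pipe" is an ordinary in-process buffer rather than a compiled, interconnect-backed channel. The point is only the division of labor the pith describes: the host sends a small request descriptor, a process on the near-memory side walks the data where it lives, and a single scalar crosses back over the channel.

    /* Hypothetical sketch only: names and types are invented for illustration,
     * not taken from the paper.  The "pipe" is a plain in-process buffer. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {                 /* toy single-message "pipe" (hypothetical) */
        unsigned char buf[64];
        size_t len;
    } ndp_chan_t;

    static void ndp_write(ndp_chan_t *c, const void *p, size_t n) {
        memcpy(c->buf, p, n);
        c->len = n;
    }

    static void ndp_read(ndp_chan_t *c, void *p, size_t n) {
        memcpy(p, c->buf, n);
        c->len = 0;
    }

    /* Would run as a spawned process on a near-memory core: sum a 4-byte column. */
    static void column_sum(ndp_chan_t *in, ndp_chan_t *out) {
        struct { const uint32_t *col; size_t n; } req;
        ndp_read(in, &req, sizeof req);
        uint64_t sum = 0;
        for (size_t i = 0; i < req.n; i++)
            sum += req.col[i];               /* in the model, these accesses stay near memory */
        ndp_write(out, &sum, sizeof sum);    /* only 8 bytes travel back over the channel */
    }

    int main(void) {
        static uint32_t col[1024];
        for (size_t i = 0; i < 1024; i++)
            col[i] = (uint32_t)i;
        ndp_chan_t to_dev = {0}, from_dev = {0};
        struct { const uint32_t *col; size_t n; } req = { col, 1024 };
        ndp_write(&to_dev, &req, sizeof req);   /* host sends a small descriptor */
        column_sum(&to_dev, &from_dev);         /* stand-in for the NDP-side process */
        uint64_t sum = 0;
        ndp_read(&from_dev, &sum, sizeof sum);
        printf("column sum = %llu\n", (unsigned long long)sum);
        return 0;
    }

In the real system the channel endpoints and the spawned process would be lowered by the compiler onto the accelerator and the interconnect protocol; here column_sum is simply called in-process so the example runs anywhere.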

If this is right

  • Applications with memory-intensive patterns can be offloaded to NDP using familiar process and channel code rather than specialized languages.
  • Bandwidth demand between host and far memory drops for bulk operations, database queries and graph traversals.
  • Low-latency CPU-NDP communication channels become a first-order requirement for overall system performance.
  • The same abstractions remain portable across different NDP hardware designs that expose suitable protocols.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be applied to other disaggregated memory fabrics if they provide comparable protocol support for lightweight channels.
  • Static compilation passes could be extended to automatically choose between CPU and NDP execution for individual code regions based on access patterns.
  • Existing operating systems could incorporate Proxics as a new device type, allowing unmodified user code to target NDP without explicit accelerator APIs.
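
One way to read the second bullet above: a placement pass needs some rule for deciding, region by region, whether offloading saves interconnect traffic. The sketch below invents such a rule; neither the profile fields nor the thresholds come from the paper, and a real pass would derive them from static analysis or profiling.

    /* Hedged sketch of a per-region placement heuristic (not from the paper). */
    #include <stdbool.h>
    #include <stddef.h>

    struct region_profile {
        size_t bytes_touched;   /* far-memory data the region reads or writes       */
        size_t bytes_returned;  /* result size handed back to the calling code      */
        double reuse_ratio;     /* fraction of accesses hitting recently used lines */
        bool   random_access;   /* pointer chasing or hash probes vs. streaming     */
    };

    /* Returns true when offloading the region to an NDP core is predicted to save
     * interconnect traffic.  Thresholds are illustrative, not measured. */
    static bool place_on_ndp(const struct region_profile *p) {
        bool little_reuse  = p->reuse_ratio < 0.25;
        bool big_reduction = p->bytes_touched > 16 * p->bytes_returned;
        return big_reduction && (little_reuse || p->random_access);
    }

A compiler could then emit either the plain CPU loop or the NDP spawn for each region, leaving the source program unchanged.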

Load-bearing premise

The target NDP hardware supplies interconnect protocols and compilation support that keep process and channel abstractions lightweight instead of forcing heavy overheads or shared buffers.

What would settle it

If measurements on the real hardware platform show that Proxics incurs higher latency, greater bandwidth use, or no performance advantage over CPU-only execution for the tested bulk, database and graph workloads, the efficiency claim would be refuted.

Figures

Figures reproduced from arXiv: 2604.18120 by Jasmin Schult, Niels Pressel, Pengcheng Xu, Roman Meier, Timothy Roscoe, Zikai Liu.

Figure 1
Figure 1: Proxics abstractions for running software on the accelerator, without a clear corresponding OS abstraction [5]. Each device is therefore programmed differently, and with few reusable concepts other than vendor-specific low-level programming. In contrast, Proxics provides abstractions which are both portable and efficient. More specifically, our requirements in designing Proxics were as follows: Firstly, … view at source ↗
Figure 2
Figure 2: Proxics prototype. This structure is similar to systems based on CXL [16] or the Cache Coherent Interconnect for Accelerators (CCIX) [11]. We use Enzian rather than existing NDP accelerators because it affords us greater flexibility in designing our programming model, in particular when it comes to the communication between the CPU and the CPs. The Enzian Coherence Interface (ECI) supports CXL 3.0-likes… view at source ↗
Figure 3
Figure 3: Throughput and latency of pipes. section 5, the system prototype still delivers considerable benefits across a range of applications; scaling up compute in the MCC would only improve this further. 4.2.2 Message passing using pipes. As described in section 4.1, Proxics implements pipes between the CPU and MCC using cache line transactions. To evaluate the throughput and latency, we implement a minimal mes… view at source ↗
Figure 4
Figure 4: Spawn time. MCC, resulting in only 126 MB/s. With more cache lines, the CPU is not blocked: it can write to other cache lines in the meantime. With 8 cache lines, the overhead is amortized and the single CPU core reaches almost the max throughput. In contrast, through I/O registers, the throughput saturates at 19.0 MB/s for a single thread and at 28.4 MB/s for multiple threads. The magnitude difference wit… view at source ↗
Figure 6
Figure 6: shows that when the working set fits entirely into the CPU’s L2 cache, the CPU is about 20× faster than the CP, but this effect disappears for far memory as soon as the table is larger: the CP dominates performance here, although the CPU remains about 3× faster if it only accesses local memory. This shows the computationally weak MCC using Proxics abstractions can outperform a CPU core when randomly … view at source ↗
Figure 7
Figure 7: Single 4B column sum throughput accessing far memory. A completely synchronous CP implementation with no parallelism can outperform a much more performant CPU core when the workload has little inherent locality and the working set exceeds the CPU cache. 5.3 In-memory database operators. This experiment shows that memory-intensive but regular in-memory database operators benefit from offloading in Proxics b… view at source ↗
Figure 8
Figure 8: Range filter throughput vs. selectivity. Sum of a single column. view at source ↗
Figure 10
Figure 10: Pseudocode for PageRank. Orange indicates the CPU-only code. Blue for the collaborative CPU-MCC code. view at source ↗
Figure 13
Figure 13: kron26 speedup with varying numbers of CPU cores and MCCs. Overall, we conclude that when the CPU is memory-bound, Proxics mitigates the memory bottleneck and saves data movement over the interconnect. Furthermore, as discussed in section 3, Proxics allows flexible scheduling between CPU cores and MCCs. To demonstrate that, we execute kron26 with varying numbers of CPU cores and MCCs. view at source ↗
Figure 12
Figure 12: Data transferred in PageRank. view at source ↗
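
The Figure 3 and Figure 4 snippets describe why a single cache-line channel throttles the sender (it must wait for each line to be drained) and why a handful of lines amortizes the cost. The ring below illustrates that amortization with ordinary shared memory and C11 atomics; it is an expository stand-in, not the paper's ECI cache-line transaction mechanism, and the 56-byte payload size is just one plausible way to leave room for a flag within a 64-byte line.

    /* Expository stand-in for multi-cache-line message passing (not the paper's code). */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define SLOTS 8                            /* the snippet reports ~8 lines amortize the cost */

    struct slot {
        _Alignas(64) unsigned char data[56];   /* payload sized to share one 64-byte line */
        _Atomic uint8_t full;                  /* 1 = written and not yet consumed        */
    };

    struct ring {
        struct slot s[SLOTS];
        size_t head;                           /* producer cursor */
        size_t tail;                           /* consumer cursor */
    };

    /* Producer: returns 0 (would block) only when all SLOTS messages are in flight. */
    static int ring_send(struct ring *r, const void *msg, size_t len) {
        struct slot *sl = &r->s[r->head % SLOTS];
        if (atomic_load_explicit(&sl->full, memory_order_acquire))
            return 0;
        memcpy(sl->data, msg, len < sizeof sl->data ? len : sizeof sl->data);
        atomic_store_explicit(&sl->full, 1, memory_order_release);
        r->head++;
        return 1;
    }

    /* Consumer: drains slots in order, freeing each one for reuse by the producer. */
    static int ring_recv(struct ring *r, void *msg, size_t len) {
        struct slot *sl = &r->s[r->tail % SLOTS];
        if (!atomic_load_explicit(&sl->full, memory_order_acquire))
            return 0;
        memcpy(msg, sl->data, len < sizeof sl->data ? len : sizeof sl->data);
        atomic_store_explicit(&sl->full, 0, memory_order_release);
        r->tail++;
        return 1;
    }

With SLOTS set to 1 the sender stalls whenever the receiver has not yet drained the previous message; with 8 slots it keeps writing ahead, which mirrors the throughput jump the Figure 4 snippet reports.
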
read the original abstract

The use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such accelerators are appearing, but there lack clean, portable OS abstractions for programming them. We propose a programming model for NDP devices based on familiar OS abstractions: virtual processors (processes) and inter-process communication channels (like Unix pipes). While appealing from a user perspective, a naive implementation of such abstractions is inappropriate for NDP accelerators: the paucity of processing power in some hardware designs makes classical processes overly heavyweight, and IPC based on shared buffers makes no sense in a system designed to reduce memory bandwidth. Accordingly, we show how to implement these abstractions in a lightweight and efficient manner by exploiting compilation and interconnect protocols. We demonstrate them with a real hardware platform runing applications with a range of memory access patterns, including bulk memory operations, in-memory databases and graph applications. Crucially, we show not only the benefits over CPU-only implementations, but also the critical importance of efficient, low-latency communication channels between CPU and NDP accelerators, a feature largely neglected in existing proposals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Proxics, a programming model for near-data processing (NDP) accelerators in disaggregated/far-memory systems such as CXL pools. It uses familiar OS abstractions—virtual processors (processes) and pipe-like inter-process communication channels—and shows how to realize them in a lightweight manner by exploiting compilation passes and interconnect protocols rather than classical heavyweight process implementations or shared-buffer IPC. The model is demonstrated on a real hardware platform running applications with diverse memory access patterns (bulk operations, in-memory databases, graph processing), with results showing benefits over CPU-only baselines and highlighting the importance of low-latency CPU-NDP channels.

Significance. If the lightweight realization holds, the work provides a valuable contribution by supplying portable, programmer-friendly abstractions for NDP hardware, potentially lowering the barrier to using far-memory accelerators. The real-hardware evaluation across multiple workload patterns and the explicit focus on efficient CPU-NDP communication (often neglected in prior NDP proposals) are clear strengths. This could inform OS and runtime design for emerging disaggregated systems.

major comments (2)
  1. [Evaluation section] Real-hardware demonstration: The central claim that the process and pipe abstractions can be implemented with low overhead relies on hardware-specific compilation and interconnect features. The paper should add quantitative overhead measurements (e.g., context-switch or channel latency numbers) and an explicit discussion of which features are assumed to be present in other NDP designs (such as CXL-based pools) to substantiate portability beyond the single demonstrated platform.
  2. [Implementation section] Lightweight realization: The argument that naive processes and shared-buffer IPC are inappropriate for NDP is load-bearing, yet the manuscript provides limited detail on how the compilation passes avoid reverting to heavyweight costs or high-bandwidth buffers. A concrete breakdown of the resulting instruction counts or memory traffic for the pipe abstraction would strengthen the efficiency claim.

minor comments (2)
  1. [Abstract] 'runing' should be 'running'; 'there lack clean' should be 'there is a lack of clean'.
  2. [Abstract] The abstract states that benefits are shown but gives no quantitative results or error bars; the full evaluation section should report these for every performance claim to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and constructive feedback on our work. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions or evaluation.

read point-by-point responses
  1. Referee: [Evaluation section] Real-hardware demonstration: The central claim that the process and pipe abstractions can be implemented with low overhead relies on hardware-specific compilation and interconnect features. The paper should add quantitative overhead measurements (e.g., context-switch or channel latency numbers) and an explicit discussion of which features are assumed to be present in other NDP designs (such as CXL-based pools) to substantiate portability beyond the single demonstrated platform.

    Authors: We agree that explicit quantitative overhead numbers and a clearer portability discussion would improve the evaluation. In the revised manuscript, we have added direct measurements of pipe channel latency (sub-microsecond on the platform) and lightweight context-switch costs, obtained via cycle-accurate instrumentation on the real hardware. We have also expanded the discussion in the Evaluation and Discussion sections to enumerate the assumed interconnect features (low-latency message passing and compiler-visible address spaces) and note how these map to emerging CXL-based NDP pools, while acknowledging that full cross-platform empirical validation would require additional hardware access. revision: partial

  2. Referee: [Implementation section] Lightweight realization: The argument that naive processes and shared-buffer IPC are inappropriate for NDP is load-bearing, yet the manuscript provides limited detail on how the compilation passes avoid reverting to heavyweight costs or high-bandwidth buffers. A concrete breakdown of the resulting instruction counts or memory traffic for the pipe abstraction would strengthen the efficiency claim.

    Authors: We accept that a more granular breakdown would strengthen the efficiency argument. The revised Implementation section now includes a concrete analysis: the compilation passes reduce pipe send/receive to 12-18 instructions with zero additional memory traffic beyond the payload (by using direct interconnect messages instead of shared buffers), compared to hundreds of instructions and multiple cache-line transfers for a naive shared-memory implementation. This is supported by both static instruction counts from the compiler output and dynamic memory-traffic traces from the hardware. revision: yes

Circularity Check

0 steps flagged

No circularity: design proposal grounded in external hardware without self-referential derivations

full rationale

The paper is a systems design proposal for NDP programming abstractions (virtual processors and pipe-like IPC) implemented via compilation and interconnect protocols, demonstrated on real hardware. No equations, fitted parameters, predictions, or derivation chains exist that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on external hardware capabilities and empirical demonstration rather than internal redefinition or renaming of known results. This is the normal case of a self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about hardware support for lightweight process scheduling and protocol-based IPC rather than shared memory; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption: NDP hardware provides interconnect protocols that can replace shared-buffer IPC without incurring high bandwidth costs.
    Invoked when arguing that classical IPC is inappropriate and must be replaced by protocol-based channels.
  • domain assumption: Compilation techniques can sufficiently optimize away the overhead of virtual-processor abstractions on resource-constrained NDP cores.
    Required for the claim that processes can be made lightweight rather than heavyweight.

pith-pipeline@v0.9.0 · 5530 in / 1388 out tokens · 30007 ms · 2026-05-10T03:13:34.369602+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

89 extracted references · 54 canonical work pages · 2 internal anchors

  1. [1]

    2024.UltraScale Architecture-Based FPGAs Memory IP v1.4 LogiCORE IP Product Guide

    Advanced Micro Devices, Inc. 2024.UltraScale Architecture-Based FPGAs Memory IP v1.4 LogiCORE IP Product Guide. Technical Report PG150. 955 pages.https://docs.amd.com/r/en-US/pg150-ultrascale- memory-ip

  2. [2]

    2025.MicroBlaze V Processor Reference Guide

    Advanced Micro Devices, Inc. 2025.MicroBlaze V Processor Reference Guide. Technical Report UG1629. 152 pages.https://docs.amd.com/r/ en-US/ug1629-microblaze-v-user-guide

  3. [3]

    Minseon Ahn, Thomas Willhalm, Norman May, Donghun Lee, Suprasad Mutalik Desai, Daniel Booss, Jungmin Kim, Navneet Singh, Daniel Ritter, and Oliver Rebholz. 2024. An Examination of CXL Memory Use Cases for In-Memory Database Management Systems Using SAP HANA.Proc. VLDB Endow.17, 12 (Aug. 2024), 3827–3840. doi:10.14778/3685800.3685809

  4. [4]

    2025.Astera Labs Leo CXL Smart Memory Controllers Portfolio Brief

    Astera Labs. 2025.Astera Labs Leo CXL Smart Memory Controllers Portfolio Brief. Technical Report

  5. [5]

    Antonio Barbalace, Anthony Iliopoulos, Holm Rauchfuss, and Goetz Brasche. 2017. It’s Time to Think About an Operating System for Near Data Processing Architectures. InProceedings of the 16th Workshop on Hot Topics in Operating Systems(Whistler, BC, Canada)(HotOS ’17). Association for Computing Machinery, New York, NY, USA, 56–61. doi:10.1145/3102980.3102990

  6. [6]

    Andrew Baumann, Jonathan Appavoo, Orran Krieger, and Timothy Roscoe. 2019. A fork() in the road. InProceedings of the Workshop on Hot Topics in Operating Systems(Bertinoro, Italy)(HotOS ’19). Association for Computing Machinery, New York, NY, USA, 14–22. doi:10.1145/ 3317550.3321435

  7. [7]

    Scott Beamer, Krste Asanović, and David Patterson. 2017. The GAP Benchmark Suite. doi:10.48550/arXiv.1508.03619arXiv:1508.03619 [cs]

  8. [8]

    2025.Introducing Compute Express Link (CXL) 4.0

    Tony Benavides and Mahesh Wagh. 2025.Introducing Compute Express Link (CXL) 4.0. Technical Report.https://computeexpresslink.org/wp- content/uploads/2025/11/CXL_4.0-White-Paper_FINAL.pdf

  9. [9]

    Octopus: Enhancing CXL Memory Pods via Sparse Topology

    Daniel S. Berger, Yuhong Zhong, Fiodar Kazhamiaka, Pantea Zardoshti, Shuwei Teng, Mark D. Hill, and Rodrigo Fonseca. 2025. Octopus: Scalable Low-Cost CXL Memory Pooling. doi:10.48550/arXiv.2501. 09020arXiv:2501.09020 [cs]

  10. [10]

    2017.Cavium ThunderX CN88XX, Pass 2 Hardware Refer- ence Manual (Version 2.7P)

    Cavium, Inc. 2017.Cavium ThunderX CN88XX, Pass 2 Hardware Refer- ence Manual (Version 2.7P). Technical Report CN88XX-HM-2.7P. 1936 pages

  11. [11]

    2019.CCIX Base Specification Revision 1.0a Version 1.0 for Evaluation

    CCIX Consortium, Inc. 2019.CCIX Base Specification Revision 1.0a Version 1.0 for Evaluation. Technical Report. 346 pages

  12. [12]

    Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. 2015. One Trillion Edges: Graph Processing at Facebook-scale.Proc. VLDB Endow.8, 12 (Aug. 2015), 1804–1815. doi:10.14778/2824032.2824077

  13. [13]

    Anita Choudhary, Mahesh Chandra Govil, Girdhari Singh, Lalit K. Awasthi, Emmanuel S. Pilli, and Divya Kapil. 2017. A Critical Survey of Live Virtual Machine Migration Techniques.J. Cloud Comput.6, 1 (Dec. 2017), 92:1–92:41

  14. [14]

    Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. 2005. Live Migration of Virtual Machines. InProceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2 (NSDI’05). USENIX Association, USA, 273–286

  15. [15]

    David Cock, Abishek Ramdas, Daniel Schwyn, Michael Giardino, Adam Turowski, Zhenhao He, Nora Hossle, Dario Korolija, Melissa Liccia- rdello, Kristina Martsenko, Reto Achermann, Gustavo Alonso, and Timothy Roscoe. 2022. Enzian: An Open, General, CPU/FPGA Plat- form for Systems Software Research. InProceedings of the 27th ACM International Conference on Arc...

  16. [16]

    2023.Compute Ex- press Link Specification Revision 3.1

    Compute Express Link Consortium, Inc. 2023.Compute Ex- press Link Specification Revision 3.1. Technical Report. 1166 pages.https://computeexpresslink.org/wp-content/uploads/2024/ 02/CXL-3.1-Specification.pdf

  17. [17]

    Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, and Dimitrios Skarlatos. 2025. LithOS: An Operating System for Efficient Machine Learning on GPUs. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP ’25). Association for Computing Machinery, New...

  18. [18]

    Elyse Ge Hylander. 2025. Azure Delivers the First Cloud VM with Intel Xeon 6 and CXL Memory - Now in Private Preview. https://techcommunity.microsoft.com/blog/sapapplications/azure- delivers-the-first-cloud-vm-with-intel-xeon-6-and-cxl-memory--- now-in-priv/4470067

  19. [19]

    Mohammad Ewais and Paul Chow. 2023. Disaggregated Memory in the Datacenter: A Survey.IEEE Access11 (2023), 20688–20712. doi:10.1109/ACCESS.2023.3250407

  20. [20]

    Mingyu Gao, Grant Ayers, and Christos Kozyrakis. 2015. Practical Near- Data Processing for In-Memory Analytics Frameworks. In2015 Inter- national Conference on Parallel Architecture and Compilation (PACT). 113–124. doi:10.1109/PACT.2015.22

  21. [21]

    Gemini Team. 2025. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL]https://arxiv.org/abs/2312.11805

  22. [22]

    2019.The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption

    Saugata Ghose, Kevin Hsieh, Amirali Boroumand, Rachata Ausavarungnirun, and Onur Mutlu. 2019.The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption. Springer International Publishing, Cham, 133–194. doi:10.1007/978-3-319-90385-9_5

  23. [23]

    Ellis Giles and Peter Varman. 2025. ACID Support for Compute eXpress Link Memory Transactions. InProceedings of the SC ’24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W ’24). IEEE Press, Atlanta, GA, USA, 982–

  24. [24]

    doi:10.1109/SCW63240.2024.00138

  25. [25]

    Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Gian- noula, Geraldo F. Oliveira, and Onur Mutlu. 2022. Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System.IEEE Access10 (2022), 52565–52608. doi:10.1109/ACCESS.2022.3174101

  26. [26]

    Google. 2026. C4 machine series.https://docs.cloud.google.com/ compute/docs/general-purpose-machines#c4_series

  27. [27]

    Google. 2026. C4A machine series.https://docs.cloud.google.com/ compute/docs/general-purpose-machines#c4a_series 13

  28. [28]

    Hyungkyu Ham, Jeongmin Hong, Geonwoo Park, Yunseon Shin, Okkyun Woo, Wonhyuk Yang, Jinhoon Bae, Eunhyeok Park, Hyo- jin Sung, Euicheol Lim, and Gwangsun Kim. 2024. Low-Overhead General-Purpose Near-Data Processing in CXL Memory Expanders. In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). 594–611. doi:10.1109/MICRO61859.2024.00051

  29. [29]

    Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-Scale Preemption for Concurrent GPU-accelerated DNN Inferences. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 539– 558.https://www.usenix.org/conference/osdi22/presentation/han

  30. [30]

    Yongjun He, Jiacheng Lu, and Tianzheng Wang. 2020. CoroBase: Coroutine-Oriented Main-Memory Database Engine.Proc. VLDB En- dow.14, 3 (Nov. 2020), 431–444. doi:10.14778/3430915.3430932

  31. [31]

     Hokyoon Lee. 2025. Unlocking the Memory-Centric Computing System through CXL-based Processing-near-Memory Module: CMM-DC

  32. [32]

    Wenqin Huangfu, Krishna T. Malladi, Andrew Chang, and Yuan Xie

  33. [33]

    Hermes: Accelerating long-latency load requests via perceptron-based off-chip load prediction,

    BEACON: Scalable Near-Data-Processing Accelerators for Genome Analysis near Memory Pool with the CXL Support. In2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). 727–743. doi:10.1109/MICRO56248.2022.00057

  34. [34]

    Junhyeok Jang, Hanjin Choi, Hanyeoreum Bae, Seungjun Lee, Miryeong Kwon, and Myoungsoo Jung. 2023. CXL-ANNS: Software- Hardware Collaborative Memory Disaggregation and Computation for Billion-Scale Approximate Nearest Neighbor Search. In2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 585–600.https://www.usenix.org/...

  35. [35]

    Insoon Jo, Duck-Ho Bae, Andre S. Yoon, Jeong-Uk Kang, Sangyeun Cho, Daniel D. G. Lee, and Jaeheon Jeong. 2016. YourSQL: A High- Performance Database System Leveraging in-Storage Computing.Proc. VLDB Endow.9, 12 (Aug. 2016), 924–935. doi:10.14778/2994509.2994512

  36. [36]

    Aditya K Kamath and Simon Peter. 2024. (MC)2: Lazy MemCopy at the Memory Controller. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 1112–1128. doi:10.1109/ ISCA59077.2024.00084

  37. [37]

    Onur Kocberber, Babak Falsafi, and Boris Grot. 2015. Asynchronous Memory Access Chaining.Proc. VLDB Endow.9, 4 (Dec. 2015), 252–263. doi:10.14778/2856318.2856321

  38. [38]

    H. Kopetz and G. Bauer. 2003. The time-triggered architecture.Proc. IEEE91, 1 (2003), 112–126. doi:10.1109/JPROC.2002.805821

  39. [39]

    Dario Korolija, Dimitrios Koutsoukos, Kimberly Keeton, Konstantin Taranov, Dejan Milojičić, and Gustavo Alonso. 2021. Farview: Disag- gregated Memory with Operator Off-loading for Database Engines. doi:10.48550/arXiv.2106.07102arXiv:2106.07102 [cs]

  40. [40]

    Dario Korolija, Timothy Roscoe, and Gustavo Alonso. 2020. Do OS Abstractions Make Sense on FPGAs?. InProceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20). USENIX Association, USA, 991–1010

  41. [41]

    Ronny Krashinsky, Olivier Giroux, Stephen Jones, Nick Stam, and Srid- har Ramaswamy. 2020. NVIDIA Ampere Architecture In-Depth.https: //developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/

  42. [42]

    Rossbach, and Emmett Witchel

    Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2017. Ingens: Huge Page Support for the OS and Hypervisor.SIGOPS Oper. Syst. Rev.51, 1 (Sept. 2017), 83–93. doi:10.1145/3139645.3139659

  43. [43]

    I.-Ting Lee, Bao-Kai Wang, Liang-Chi Chen, Wen Sheng Lim, Da-Wei Chang, Yu-Ming Chang, and Chieng-Chung Ho. 2025. PIM or CXL- PIM? Understanding Architectural Trade-offs Through Large-Scale Benchmarking. doi:10.48550/arXiv.2511.14400arXiv:2511.14400 [cs]

  44. [44]

    Alberto Lerner and Gustavo Alonso. 2024. CXL and the Return of Scale-Up Database Engines.Proc. VLDB Endow.17, 10 (June 2024), 2568–2575. doi:10.14778/3675034.3675047

  45. [45]

    Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Falout- sos, and Zoubin Ghahramani. 2010. Kronecker graphs: an approach to modeling networks.Journal of Machine Learning Research11, 2 (2010)

  46. [46]

    Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zar- doshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, and Ricardo Bian- chini. 2023. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. InProceedings of the 28th ACM International Conference on Architectural Support for P...

  47. [47]

    Hongfu Li, Qian Tao, Song Yu, Shufeng Gong, Yanfeng Zhang, Feng Yao, Wenyuan Yu, Ge Yu, and Jingren Zhou. 2024. GastCoCo: Graph Storage and Coroutine-Based Prefetch Co-Design for Dynamic Graph Processing.Proc. VLDB Endow.17, 13 (Sept. 2024), 4827–4839. doi:10. 14778/3704965.3704986

  48. [48]

    Luyang Li, Heng Pan, Xinchen Wan, Kai Lv, Zilong Wang, Qian Zhao, Feng Ning, Qingsong Ning, Shideng Zhang, Zhenyu Li, Layong Luo, and Gaogang Xie. 2025. Harmonia: A Unified Framework for Het- erogeneous FPGA Acceleration in the Cloud. InProceedings of the 30th ACM International Conference on Architectural Support for Pro- gramming Languages and Operating ...

  49. [49]

    Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S. Berger, Marie Nguyen, Xun Jian, Sam H. Noh, and Huaicheng Li. 2025. System- atic CXL Memory Characterization and Performance Analysis at Scale. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Rotterdam, Netherlands)(AS...

  50. [50]

    Zikai Liu, Jasmin Schult, Pengcheng Xu, and Timothy Roscoe. 2025. Mainframe-Style Channel Controllers for Modern Disaggregated Mem- ory Systems. InProceedings of the 16th ACM SIGOPS Asia-Pacific Work- shop on Systems (APSys ’25). Association for Computing Machinery, New York, NY, USA, 82–90. doi:10.1145/3725783.3764403

  51. [51]

    Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan Berry. 2007. Challenges in Parallel Graph Processing.Parallel Processing Letters17, 01 (2007), 5–20. doi:10.1142/S0129626407002843 arXiv:https://doi.org/10.1142/S0129626407002843

  52. [52]

    Pregel: a system for large-scale graph processing,

    Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehn- ert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-Scale Graph Processing. InProceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIG- MOD ’10). Association for Computing Machinery, New York, NY, USA, 135–146. doi:10.1145/1...

  53. [53]

    Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowd- hury, Shobhit Kanaujia, and Prakash Chauhan. 2023. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Ope...

  54. [54]

    2024.Marvell Structera A 2504 Memory-Expansion Con- troller

    Marvell. 2024.Marvell Structera A 2504 Memory-Expansion Con- troller. Technical Report Marvell_Structera_A MV-SLA25041 _PB. 3 pages.https://www.marvell.com/content/dam/marvell/en/public- collateral/assets/marvell-structera-a-2504-near-memory- accelerator-product-brief.pdf

  55. [55]

    2024.Marvell Structera X 2504 Memory-Expansion Controller

    Marvell. 2024.Marvell Structera X 2504 Memory-Expansion Controller. Technical Report. 2 pages.https://www.marvell.com/content/ dam/marvell/en/public-collateral/assets/marvell-structera-x-2504- memory-expansion-controller-product-brief.pdf 14

  56. [56]

    Friedemann Mattern. 1989. Global quiescence detection based on credit distribution and recovery.Inf. Process. Lett.30, 4 (Feb. 1989), 195–200. doi:10.1016/0020-0190(89)90212-3

  57. [57]

    Micron. 2023. Flexible Memory Expansion for Data-Intensive Work- loads.https://www.micron.com/products/memory/cxl-memory

  58. [58]

    Montage Technology. 2026. CXL Memory eXpander Controller (MXC). https://www.montage-tech.com/MXC

  59. [59]

    2014.Grappa: A Latency-Tolerant Run- time for Large-Scale Irregular Applications

    Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2014.Grappa: A Latency-Tolerant Run- time for Large-Scale Irregular Applications. Technical Report UW-CSE- 14-02-01. University of Washington.https://sampa.cs.washington. edu/new/papers/grappa-tr-2014-02.pdf

  60. [60]

    Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2015. Latency-Tolerant Soft- ware Distributed Shared Memory. In2015 USENIX Annual Techni- cal Conference (USENIX ATC 15). USENIX Association, Santa Clara, CA, 291–305.https://www.usenix.org/conference/atc15/technical- session/presentation/nelson

  61. [61]

    Kelvin K. W. Ng, Henri Maxime Demoulin, and Vincent Liu. 2023. Paella: Low-latency Model Serving with Software-defined GPU Sched- uling. InProceedings of the 29th Symposium on Operating Systems Prin- ciples (SOSP ’23). Association for Computing Machinery, New York, NY, USA, 595–610. doi:10.1145/3600006.3613163

  62. [62]

    NVIDIA. 2026. CUDA Programming Guide.https://docs.nvidia.com/ cuda/cuda-programming-guide/

  63. [63]

    1999.The PageRank Citation Ranking: Bringing Order to the Web

    Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999.The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. Stanford InfoLab / Stanford InfoLab.http: //ilpubs.stanford.edu:8090/422/

  64. [64]

    Ashish Panwar, Sorav Bansal, and K. Gopinath. 2019. HawkEye: Ef- ficient Fine-grained OS Support for Huge Pages. InProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19). As- sociation for Computing Machinery, New York, NY, USA, 347–360. doi:10.1145/3297858.3304064

  65. [65]

    Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee

  66. [66]

    Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways?

     Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways?. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). Association for Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/2830772.2830773

  67. [67]

    Georgios Psaropoulos, Thomas Legler, Norman May, and Anastasia Ailamaki. 2017. Interleaving with Coroutines: A Practical Approach for Robust Index Joins.Proc. VLDB Endow.11, 2 (Oct. 2017), 230–242. doi:10.14778/3149193.3149202

  68. [68]

    Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 201...

  69. [69]

    2023.CCKit: FPGA Acceleration in Symmetric Coherent Heterogeneous Platforms

    Abishek Ramdas. 2023.CCKit: FPGA Acceleration in Symmetric Coherent Heterogeneous Platforms. Doctoral Thesis. ETH Zurich. doi:10.3929/ethz-b-000642567

  70. [70]

    Benjamin Ramhorst, Dario Korolija, Maximilian Jakob Heer, Jonas Dann, Luhao Liu, and Gustavo Alonso. 2025. Coyote v2: Raising the Level of Abstraction for Data Center FPGAs. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP ’25). Association for Computing Machinery, New York, NY, USA, 639–654. doi:10.1145/3731569.3764845

  71. [71]

    Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel. 2011. PTask: operating system abstractions to manage GPUs as compute devices. InProceedings of the Twenty-Third ACM Symposium on Operating Systems Principles(Cascais, Portugal) (SOSP ’11). Association for Computing Machinery, New York, NY, USA, 233–248. doi:10.1145/2...

  72. [72]

    Samsung. 2022. Samsung Electronics Introduces Industry’s First 512GB CXL Memory Module.https://news.samsung.com/global/samsung- electronics-introduces-industrys-first-512gb-cxl-memory-module

  73. [73]

    Samsung. 2024. CXL Memory Module Box CMM-B. https://semiconductor.samsung.com/news-events/tech-blog/cxl- memory-module-box-cmm-b

  74. [74]

    Joonseop Sim, Soohong Ahn, Taeyoung Ahn, Seungyong Lee, Myunghyun Rhee, Jooyoung Kim, Kwangsik Shin, Donguk Moon, Euiseok Kim, and Kyoung Park. 2022. Computational cxl-memory so- lution for accelerating memory-intensive applications.IEEE Computer Architecture Letters22, 1 (2022), 5–8

  75. [75]

    Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Chihun Song, Jinghan Huang, Houxiang Ji, Siddharth Agarwal, Jiaqi Lou, Ipoom Jeong, Ren Wang, Jung Ho Ahn, Tianyin Xu, and Nam Sung Kim. 2023. Demysti- fying CXL Memory with Genuine CXL-Ready Systems and Devices. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’23)....

  76. [76]

    Yupeng Tang, Ping Zhou, Wenhui Zhang, Henry Hu, Qirui Yang, Hao Xiang, Tongping Liu, Jiaxin Shan, Ruoyun Huang, Cheng Zhao, Cheng Chen, Hui Zhang, Fei Liu, Shuai Zhang, Xiaoning Ding, and Jianjun Chen. 2024. Exploring Performance and Cost Optimization with ASIC- Based CXL Memory. InProceedings of the Nineteenth European Confer- ence on Computer Systems (E...

  77. [77]

    Dufy Teguia, Jiaxuan Chen, Stella Bitchebe, Oana Balmau, and Alain Tchana. 2024. vPIM: Processing-in-Memory Virtualization. InProceed- ings of the 25th International Middleware Conference (Middleware ’24). Association for Computing Machinery, New York, NY, USA, 417–430. doi:10.1145/3652892.3700782

  78. [78]

    Chuck Thacker. 2010. Beehive: A many-core computer for FP- GAs (v5).https://web.mit.edu/6.173/www/currentsemester/handouts/ BeehiveV5.pdf

  79. [79]

    Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an interme- diate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages(Phoenix, AZ, USA) (MAPL 2019). Association for Computing Machinery, New York, NY, USA, 10–19. doi:10.1145/3315508.3329973

  80. [80]

    Lukas Vogel, Daniel Ritter, Danica Porobic, Pinar Tözün, Tianzheng Wang, and Alberto Lerner. 2023. Data Pipes: Declarative Control over Data Movement. InConference on Innovative Data Systems Research

Showing first 80 references.