pith. machine review for the scientific record.

arxiv: 2604.18120 · v1 · submitted 2026-04-20 · 💻 cs.OS · cs.AR · cs.ET · cs.SE

Recognition: unknown

Proxics: an efficient programming model for far memory accelerators

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 03:13 UTC · model grok-4.3

classification 💻 cs.OS · cs.AR · cs.ET · cs.SE
keywords near-data processing · programming model · far memory · OS abstractions · NDP accelerators · processes · IPC channels · disaggregated memory

The pith

Near-data processing accelerators can be programmed with familiar processes and pipe-like channels, provided those abstractions are made lightweight through compilation and interconnect protocols.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that NDP devices for far memory can be programmed with standard OS abstractions of virtual processors and inter-process communication channels. A direct port of classical processes and shared-buffer IPC would be too heavy for limited accelerator cores and would defeat the purpose of cutting memory bandwidth. Instead the authors implement these abstractions efficiently by exploiting compilation passes and hardware interconnect protocols. They demonstrate the result on real hardware across bulk memory operations, in-memory databases and graph workloads, reporting gains over CPU-only code and stressing that low-latency CPU-to-accelerator links are essential.

Core claim

We propose Proxics, a programming model for NDP devices based on virtual processors and IPC channels like Unix pipes. These abstractions are realized in a lightweight manner by leveraging compilation and interconnect protocols rather than traditional heavyweight mechanisms or high-bandwidth shared buffers. On a real hardware platform the model supports applications with varied memory access patterns and delivers benefits beyond CPU-only execution while showing the critical role of efficient, low-latency communication between host CPU and NDP accelerator.

What carries the argument

The Proxics model of virtual processors and low-overhead IPC channels, realized through compilation and hardware interconnect protocols.
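
To make the shape of that model concrete, here is a minimal, self-contained sketch in C. The paper's actual API is not reproduced in this review; the channel type and the ndp_write/ndp_read helpers are invented names, and the "pipe" is an ordinary in-process buffer rather than a compiled, interconnect-backed channel. The point is only the division of labor the pith describes: the host sends a small request descriptor, a process on the near-memory side walks the data where it lives, and a single scalar crosses back over the channel.

    /* Hypothetical sketch only: names and types are invented for illustration,
     * not taken from the paper.  The "pipe" is a plain in-process buffer. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {                 /* toy single-message "pipe" (hypothetical) */
        unsigned char buf[64];
        size_t len;
    } ndp_chan_t;

    static void ndp_write(ndp_chan_t *c, const void *p, size_t n) {
        memcpy(c->buf, p, n);
        c->len = n;
    }

    static void ndp_read(ndp_chan_t *c, void *p, size_t n) {
        memcpy(p, c->buf, n);
        c->len = 0;
    }

    /* Would run as a spawned process on a near-memory core: sum a 4-byte column. */
    static void column_sum(ndp_chan_t *in, ndp_chan_t *out) {
        struct { const uint32_t *col; size_t n; } req;
        ndp_read(in, &req, sizeof req);
        uint64_t sum = 0;
        for (size_t i = 0; i < req.n; i++)
            sum += req.col[i];               /* in the model, these accesses stay near memory */
        ndp_write(out, &sum, sizeof sum);    /* only 8 bytes travel back over the channel */
    }

    int main(void) {
        static uint32_t col[1024];
        for (size_t i = 0; i < 1024; i++)
            col[i] = (uint32_t)i;
        ndp_chan_t to_dev = {0}, from_dev = {0};
        struct { const uint32_t *col; size_t n; } req = { col, 1024 };
        ndp_write(&to_dev, &req, sizeof req);   /* host sends a small descriptor */
        column_sum(&to_dev, &from_dev);         /* stand-in for the NDP-side process */
        uint64_t sum = 0;
        ndp_read(&from_dev, &sum, sizeof sum);
        printf("column sum = %llu\n", (unsigned long long)sum);
        return 0;
    }

In the real system the channel endpoints and the spawned process would be lowered by the compiler onto the accelerator and the interconnect protocol; here column_sum is simply called in-process so the example runs anywhere.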

If this is right

  • Applications with memory-intensive patterns can be offloaded to NDP using familiar process and channel code rather than specialized languages.
  • Bandwidth demand between host and far memory drops for bulk operations, database queries and graph traversals.
  • Low-latency CPU-NDP communication channels become a first-order requirement for overall system performance.
  • The same abstractions remain portable across different NDP hardware designs that expose suitable protocols.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be applied to other disaggregated memory fabrics if they provide comparable protocol support for lightweight channels.
  • Static compilation passes could be extended to automatically choose between CPU and NDP execution for individual code regions based on access patterns.
  • Existing operating systems could incorporate Proxics as a new device type, allowing unmodified user code to target NDP without explicit accelerator APIs.
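
One way to read the second bullet above: a placement pass needs some rule for deciding, region by region, whether offloading saves interconnect traffic. The sketch below invents such a rule; neither the profile fields nor the thresholds come from the paper, and a real pass would derive them from static analysis or profiling.

    /* Hedged sketch of a per-region placement heuristic (not from the paper). */
    #include <stdbool.h>
    #include <stddef.h>

    struct region_profile {
        size_t bytes_touched;   /* far-memory data the region reads or writes       */
        size_t bytes_returned;  /* result size handed back to the calling code      */
        double reuse_ratio;     /* fraction of accesses hitting recently used lines */
        bool   random_access;   /* pointer chasing or hash probes vs. streaming     */
    };

    /* Returns true when offloading the region to an NDP core is predicted to save
     * interconnect traffic.  Thresholds are illustrative, not measured. */
    static bool place_on_ndp(const struct region_profile *p) {
        bool little_reuse  = p->reuse_ratio < 0.25;
        bool big_reduction = p->bytes_touched > 16 * p->bytes_returned;
        return big_reduction && (little_reuse || p->random_access);
    }

A compiler could then emit either the plain CPU loop or the NDP spawn for each region, leaving the source program unchanged.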

Load-bearing premise

The target NDP hardware supplies interconnect protocols and compilation support that keep process and channel abstractions lightweight instead of forcing heavy overheads or shared buffers.

What would settle it

If measurements on the real hardware platform show that Proxics incurs higher latency, greater bandwidth use, or no performance advantage over CPU-only execution for the tested bulk, database and graph workloads, the efficiency claim would be refuted.

Figures

Figures reproduced from arXiv: 2604.18120 by Jasmin Schult, Niels Pressel, Pengcheng Xu, Roman Meier, Timothy Roscoe, Zikai Liu.

Figure 1
Figure 1: Proxics abstractions for running software on the accelerator, without a clear corresponding OS abstraction [5]. Each device is therefore programmed differently, and with few reusable concepts other than vendor-specific low-level programming. In contrast, Proxics provides abstractions which are both portable and efficient. More specifically, our requirements in designing Proxics were as follows: Firstly, … view at source ↗
Figure 2
Figure 2: Proxics prototype. This structure is similar to systems based on CXL [16] or the Cache Coherent Interconnect for Accelerators (CCIX) [11]. We use Enzian rather than existing NDP accelerators because it affords us greater flexibility in designing our programming model, in particular when it comes to the communication between the CPU and the CPs. The Enzian Coherence Interface (ECI) supports CXL 3.0-likes… view at source ↗
Figure 3
Figure 3: Throughput and latency of pipes. section 5, the system prototype still delivers considerable benefits across a range of applications; scaling up compute in the MCC would only improve this further. 4.2.2 Message passing using pipes. As described in section 4.1, Proxics implements pipes between the CPU and MCC using cache line transactions. To evaluate the throughput and latency, we implement a minimal mes… view at source ↗
Figure 4
Figure 4: Spawn time. MCC, resulting in only 126 MB/s. With more cache lines, the CPU is not blocked: it can write to other cache lines in the meantime. With 8 cache lines, the overhead is amortized and the single CPU core reaches almost the max throughput. In contrast, through I/O registers, the throughput saturates at 19.0 MB/s for a single thread and at 28.4 MB/s for multiple threads. The magnitude difference wit… view at source ↗
Figure 6
Figure 6: shows that when the working set fits entirely into the CPU’s L2 cache, the CPU is about 20× faster than the CP, but this effect disappears for far memory as soon as the table is larger: the CP dominates performance here, although the CPU remains about 3× faster if it only accesses local memory. This shows the computationally weak MCC using Proxics abstractions can outperform a CPU core when randomly … view at source ↗
Figure 7
Figure 7: Single 4B column sum throughput accessing far memory. A completely synchronous CP implementation with no parallelism can outperform a much more performant CPU core when the workload has little inherent locality and the working set exceeds the CPU cache. 5.3 In-memory database operators. This experiment shows that memory-intensive but regular in-memory database operators benefit from offloading in Proxics b… view at source ↗
Figure 8
Figure 8: Range filter throughput vs. selectivity. Sum of a single column. view at source ↗
Figure 10
Figure 10: Pseudocode for PageRank. Orange indicates the CPU-only code. Blue for the collaborative CPU-MCC code. view at source ↗
Figure 13
Figure 13: kron26 speedup with varying numbers of CPU cores and MCCs. Overall, we conclude that when the CPU is memory-bound, Proxics mitigates the memory bottleneck and saves data movement over the interconnect. Furthermore, as discussed in section 3, Proxics allows flexible scheduling between CPU cores and MCCs. To demonstrate that, we execute kron26 with varying numbers of CPU cores and MCCs. view at source ↗
Figure 12
Figure 12: Data transferred in PageRank. view at source ↗
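
The Figure 3 and Figure 4 snippets describe why a single cache-line channel throttles the sender (it must wait for each line to be drained) and why a handful of lines amortizes the cost. The ring below illustrates that amortization with ordinary shared memory and C11 atomics; it is an expository stand-in, not the paper's ECI cache-line transaction mechanism, and the 56-byte payload size is just one plausible way to leave room for a flag within a 64-byte line.

    /* Expository stand-in for multi-cache-line message passing (not the paper's code). */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define SLOTS 8                            /* the snippet reports ~8 lines amortize the cost */

    struct slot {
        _Alignas(64) unsigned char data[56];   /* payload sized to share one 64-byte line */
        _Atomic uint8_t full;                  /* 1 = written and not yet consumed        */
    };

    struct ring {
        struct slot s[SLOTS];
        size_t head;                           /* producer cursor */
        size_t tail;                           /* consumer cursor */
    };

    /* Producer: returns 0 (would block) only when all SLOTS messages are in flight. */
    static int ring_send(struct ring *r, const void *msg, size_t len) {
        struct slot *sl = &r->s[r->head % SLOTS];
        if (atomic_load_explicit(&sl->full, memory_order_acquire))
            return 0;
        memcpy(sl->data, msg, len < sizeof sl->data ? len : sizeof sl->data);
        atomic_store_explicit(&sl->full, 1, memory_order_release);
        r->head++;
        return 1;
    }

    /* Consumer: drains slots in order, freeing each one for reuse by the producer. */
    static int ring_recv(struct ring *r, void *msg, size_t len) {
        struct slot *sl = &r->s[r->tail % SLOTS];
        if (!atomic_load_explicit(&sl->full, memory_order_acquire))
            return 0;
        memcpy(msg, sl->data, len < sizeof sl->data ? len : sizeof sl->data);
        atomic_store_explicit(&sl->full, 0, memory_order_release);
        r->tail++;
        return 1;
    }

With SLOTS set to 1 the sender stalls whenever the receiver has not yet drained the previous message; with 8 slots it keeps writing ahead, which mirrors the throughput jump the Figure 4 snippet reports.
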
read the original abstract

The use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such accelerators are appearing, but there lack clean, portable OS abstractions for programming them. We propose a programming model for NDP devices based on familiar OS abstractions: virtual processors (processes) and inter-process communication channels (like Unix pipes). While appealing from a user perspective, a naive implementation of such abstractions is inappropriate for NDP accelerators: the paucity of processing power in some hardware designs makes classical processes overly heavyweight, and IPC based on shared buffers makes no sense in a system designed to reduce memory bandwidth. Accordingly, we show how to implement these abstractions in a lightweight and efficient manner by exploiting compilation and interconnect protocols. We demonstrate them with a real hardware platform runing applications with a range of memory access patterns, including bulk memory operations, in-memory databases and graph applications. Crucially, we show not only the benefits over CPU-only implementations, but also the critical importance of efficient, low-latency communication channels between CPU and NDP accelerators, a feature largely neglected in existing proposals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Proxics, a programming model for near-data processing (NDP) accelerators in disaggregated/far-memory systems such as CXL pools. It uses familiar OS abstractions—virtual processors (processes) and pipe-like inter-process communication channels—and shows how to realize them in a lightweight manner by exploiting compilation passes and interconnect protocols rather than classical heavyweight process implementations or shared-buffer IPC. The model is demonstrated on a real hardware platform running applications with diverse memory access patterns (bulk operations, in-memory databases, graph processing), with results showing benefits over CPU-only baselines and highlighting the importance of low-latency CPU-NDP channels.

Significance. If the lightweight realization holds, the work provides a valuable contribution by supplying portable, programmer-friendly abstractions for NDP hardware, potentially lowering the barrier to using far-memory accelerators. The real-hardware evaluation across multiple workload patterns and the explicit focus on efficient CPU-NDP communication (often neglected in prior NDP proposals) are clear strengths. This could inform OS and runtime design for emerging disaggregated systems.

major comments (2)
  1. [Evaluation section] Real-hardware demonstration: The central claim that the process and pipe abstractions can be implemented with low overhead relies on hardware-specific compilation and interconnect features. The paper should add quantitative overhead measurements (e.g., context-switch or channel latency numbers) and an explicit discussion of which features are assumed to be present in other NDP designs (such as CXL-based pools) to substantiate portability beyond the single demonstrated platform.
  2. [Implementation section] Lightweight realization: The argument that naive processes and shared-buffer IPC are inappropriate for NDP is load-bearing, yet the manuscript provides limited detail on how the compilation passes avoid reverting to heavyweight costs or high-bandwidth buffers. A concrete breakdown of the resulting instruction counts or memory traffic for the pipe abstraction would strengthen the efficiency claim.

minor comments (2)
  1. [Abstract] 'runing' should be 'running'; 'there lack clean' should be 'there is a lack of clean'.
  2. [Abstract] The abstract states that benefits are shown but gives no quantitative results or error bars; the full evaluation section should report these for every performance claim to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and constructive feedback on our work. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions or evaluation.

read point-by-point responses
  1. Referee: [Evaluation section] Real-hardware demonstration: The central claim that the process and pipe abstractions can be implemented with low overhead relies on hardware-specific compilation and interconnect features. The paper should add quantitative overhead measurements (e.g., context-switch or channel latency numbers) and an explicit discussion of which features are assumed to be present in other NDP designs (such as CXL-based pools) to substantiate portability beyond the single demonstrated platform.

    Authors: We agree that explicit quantitative overhead numbers and a clearer portability discussion would improve the evaluation. In the revised manuscript, we have added direct measurements of pipe channel latency (sub-microsecond on the platform) and lightweight context-switch costs, obtained via cycle-accurate instrumentation on the real hardware. We have also expanded the discussion in the Evaluation and Discussion sections to enumerate the assumed interconnect features (low-latency message passing and compiler-visible address spaces) and note how these map to emerging CXL-based NDP pools, while acknowledging that full cross-platform empirical validation would require additional hardware access. revision: partial

  2. Referee: [Implementation section] Lightweight realization: The argument that naive processes and shared-buffer IPC are inappropriate for NDP is load-bearing, yet the manuscript provides limited detail on how the compilation passes avoid reverting to heavyweight costs or high-bandwidth buffers. A concrete breakdown of the resulting instruction counts or memory traffic for the pipe abstraction would strengthen the efficiency claim.

    Authors: We accept that a more granular breakdown would strengthen the efficiency argument. The revised Implementation section now includes a concrete analysis: the compilation passes reduce pipe send/receive to 12-18 instructions with zero additional memory traffic beyond the payload (by using direct interconnect messages instead of shared buffers), compared to hundreds of instructions and multiple cache-line transfers for a naive shared-memory implementation. This is supported by both static instruction counts from the compiler output and dynamic memory-traffic traces from the hardware. revision: yes

Circularity Check

0 steps flagged

No circularity: design proposal grounded in external hardware without self-referential derivations

full rationale

The paper is a systems design proposal for NDP programming abstractions (virtual processors and pipe-like IPC) implemented via compilation and interconnect protocols, demonstrated on real hardware. No equations, fitted parameters, predictions, or derivation chains exist that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on external hardware capabilities and empirical demonstration rather than internal redefinition or renaming of known results. This is the normal case of a self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about hardware support for lightweight process scheduling and protocol-based IPC rather than shared memory; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption: NDP hardware provides interconnect protocols that can replace shared-buffer IPC without incurring high bandwidth costs.
    Invoked when arguing that classical IPC is inappropriate and must be replaced by protocol-based channels.
  • domain assumption: Compilation techniques can sufficiently optimize away the overhead of virtual-processor abstractions on resource-constrained NDP cores.
    Required for the claim that processes can be made lightweight rather than heavyweight.

pith-pipeline@v0.9.0 · 5530 in / 1388 out tokens · 30007 ms · 2026-05-10T03:13:34.369602+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

89 extracted references · 54 canonical work pages · 2 internal anchors

  1. [1]

    2024.UltraScale Architecture-Based FPGAs Memory IP v1.4 LogiCORE IP Product Guide

    Advanced Micro Devices, Inc. 2024.UltraScale Architecture-Based FPGAs Memory IP v1.4 LogiCORE IP Product Guide. Technical Report PG150. 955 pages.https://docs.amd.com/r/en-US/pg150-ultrascale- memory-ip

  2. [2]

    2025.MicroBlaze V Processor Reference Guide

    Advanced Micro Devices, Inc. 2025.MicroBlaze V Processor Reference Guide. Technical Report UG1629. 152 pages.https://docs.amd.com/r/ en-US/ug1629-microblaze-v-user-guide

  3. [3]

    Minseon Ahn, Thomas Willhalm, Norman May, Donghun Lee, Suprasad Mutalik Desai, Daniel Booss, Jungmin Kim, Navneet Singh, Daniel Ritter, and Oliver Rebholz. 2024. An Examination of CXL Memory Use Cases for In-Memory Database Management Systems Using SAP HANA.Proc. VLDB Endow.17, 12 (Aug. 2024), 3827–3840. doi:10.14778/3685800.3685809

  4. [4]

    2025.Astera Labs Leo CXL Smart Memory Controllers Portfolio Brief

    Astera Labs. 2025.Astera Labs Leo CXL Smart Memory Controllers Portfolio Brief. Technical Report

  5. [5]

    Antonio Barbalace, Anthony Iliopoulos, Holm Rauchfuss, and Goetz Brasche. 2017. It’s Time to Think About an Operating System for Near Data Processing Architectures. InProceedings of the 16th Workshop on Hot Topics in Operating Systems(Whistler, BC, Canada)(HotOS ’17). Association for Computing Machinery, New York, NY, USA, 56–61. doi:10.1145/3102980.3102990

  6. [6]

    Andrew Baumann, Jonathan Appavoo, Orran Krieger, and Timothy Roscoe. 2019. A fork() in the road. InProceedings of the Workshop on Hot Topics in Operating Systems(Bertinoro, Italy)(HotOS ’19). Association for Computing Machinery, New York, NY, USA, 14–22. doi:10.1145/ 3317550.3321435

  7. [7]

    Scott Beamer, Krste Asanović, and David Patterson. 2017. The GAP Benchmark Suite. doi:10.48550/arXiv.1508.03619arXiv:1508.03619 [cs]

  8. [8]

    2025.Introducing Compute Express Link (CXL) 4.0

    Tony Benavides and Mahesh Wagh. 2025.Introducing Compute Express Link (CXL) 4.0. Technical Report.https://computeexpresslink.org/wp- content/uploads/2025/11/CXL_4.0-White-Paper_FINAL.pdf

  9. [9]

    Octopus: Enhancing CXL Memory Pods via Sparse Topology

    Daniel S. Berger, Yuhong Zhong, Fiodar Kazhamiaka, Pantea Zardoshti, Shuwei Teng, Mark D. Hill, and Rodrigo Fonseca. 2025. Octopus: Scalable Low-Cost CXL Memory Pooling. doi:10.48550/arXiv.2501. 09020arXiv:2501.09020 [cs]

  10. [10]

    2017.Cavium ThunderX CN88XX, Pass 2 Hardware Refer- ence Manual (Version 2.7P)

    Cavium, Inc. 2017.Cavium ThunderX CN88XX, Pass 2 Hardware Refer- ence Manual (Version 2.7P). Technical Report CN88XX-HM-2.7P. 1936 pages

  11. [11]

    2019.CCIX Base Specification Revision 1.0a Version 1.0 for Evaluation

    CCIX Consortium, Inc. 2019.CCIX Base Specification Revision 1.0a Version 1.0 for Evaluation. Technical Report. 346 pages

  12. [12]

    Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. 2015. One Trillion Edges: Graph Processing at Facebook-scale.Proc. VLDB Endow.8, 12 (Aug. 2015), 1804–1815. doi:10.14778/2824032.2824077

  13. [13]

    Anita Choudhary, Mahesh Chandra Govil, Girdhari Singh, Lalit K. Awasthi, Emmanuel S. Pilli, and Divya Kapil. 2017. A Critical Survey of Live Virtual Machine Migration Techniques.J. Cloud Comput.6, 1 (Dec. 2017), 92:1–92:41

  14. [14]

    Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. 2005. Live Migration of Virtual Machines. InProceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2 (NSDI’05). USENIX Association, USA, 273–286

  15. [15]

    David Cock, Abishek Ramdas, Daniel Schwyn, Michael Giardino, Adam Turowski, Zhenhao He, Nora Hossle, Dario Korolija, Melissa Liccia- rdello, Kristina Martsenko, Reto Achermann, Gustavo Alonso, and Timothy Roscoe. 2022. Enzian: An Open, General, CPU/FPGA Plat- form for Systems Software Research. InProceedings of the 27th ACM International Conference on Arc...

  16. [16]

    2023.Compute Ex- press Link Specification Revision 3.1

    Compute Express Link Consortium, Inc. 2023.Compute Ex- press Link Specification Revision 3.1. Technical Report. 1166 pages.https://computeexpresslink.org/wp-content/uploads/2024/ 02/CXL-3.1-Specification.pdf

  17. [17]

    Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, and Dimitrios Skarlatos. 2025. LithOS: An Operating System for Efficient Machine Learning on GPUs. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP ’25). Association for Computing Machinery, New...

  18. [18]

    Elyse Ge Hylander. 2025. Azure Delivers the First Cloud VM with Intel Xeon 6 and CXL Memory - Now in Private Preview. https://techcommunity.microsoft.com/blog/sapapplications/azure- delivers-the-first-cloud-vm-with-intel-xeon-6-and-cxl-memory--- now-in-priv/4470067

  19. [19]

    Mohammad Ewais and Paul Chow. 2023. Disaggregated Memory in the Datacenter: A Survey.IEEE Access11 (2023), 20688–20712. doi:10.1109/ACCESS.2023.3250407

  20. [20]

    Mingyu Gao, Grant Ayers, and Christos Kozyrakis. 2015. Practical Near- Data Processing for In-Memory Analytics Frameworks. In2015 Inter- national Conference on Parallel Architecture and Compilation (PACT). 113–124. doi:10.1109/PACT.2015.22

  21. [21]

    Gemini Team. 2025. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL]https://arxiv.org/abs/2312.11805

  22. [22]

    2019.The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption

    Saugata Ghose, Kevin Hsieh, Amirali Boroumand, Rachata Ausavarungnirun, and Onur Mutlu. 2019.The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption. Springer International Publishing, Cham, 133–194. doi:10.1007/978-3-319-90385-9_5

  23. [23]

    Ellis Giles and Peter Varman. 2025. ACID Support for Compute eXpress Link Memory Transactions. InProceedings of the SC ’24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W ’24). IEEE Press, Atlanta, GA, USA, 982–

  24. [24]

    doi:10.1109/SCW63240.2024.00138

  25. [25]

    Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Gian- noula, Geraldo F. Oliveira, and Onur Mutlu. 2022. Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System.IEEE Access10 (2022), 52565–52608. doi:10.1109/ACCESS.2022.3174101

  26. [26]

    Google. 2026. C4 machine series.https://docs.cloud.google.com/ compute/docs/general-purpose-machines#c4_series

  27. [27]

    Google. 2026. C4A machine series.https://docs.cloud.google.com/ compute/docs/general-purpose-machines#c4a_series 13

  28. [28]

    Hyungkyu Ham, Jeongmin Hong, Geonwoo Park, Yunseon Shin, Okkyun Woo, Wonhyuk Yang, Jinhoon Bae, Eunhyeok Park, Hyo- jin Sung, Euicheol Lim, and Gwangsun Kim. 2024. Low-Overhead General-Purpose Near-Data Processing in CXL Memory Expanders. In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). 594–611. doi:10.1109/MICRO61859.2024.00051

  29. [29]

    Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-Scale Preemption for Concurrent GPU-accelerated DNN Inferences. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 539– 558.https://www.usenix.org/conference/osdi22/presentation/han

  30. [30]

    Yongjun He, Jiacheng Lu, and Tianzheng Wang. 2020. CoroBase: Coroutine-Oriented Main-Memory Database Engine.Proc. VLDB En- dow.14, 3 (Nov. 2020), 431–444. doi:10.14778/3430915.3430932

  31. [31]

     Hokyoon Lee. 2025. Unlocking the Memory-Centric Computing System through CXL-based Processing-near-Memory Module: CMM-DC

  32. [32]

    Wenqin Huangfu, Krishna T. Malladi, Andrew Chang, and Yuan Xie

  33. [33]

    Hermes: Accelerating long-latency load requests via perceptron-based off-chip load prediction,

    BEACON: Scalable Near-Data-Processing Accelerators for Genome Analysis near Memory Pool with the CXL Support. In2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). 727–743. doi:10.1109/MICRO56248.2022.00057

  34. [34]

    Junhyeok Jang, Hanjin Choi, Hanyeoreum Bae, Seungjun Lee, Miryeong Kwon, and Myoungsoo Jung. 2023. CXL-ANNS: Software- Hardware Collaborative Memory Disaggregation and Computation for Billion-Scale Approximate Nearest Neighbor Search. In2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 585–600.https://www.usenix.org/...

  35. [35]

    Insoon Jo, Duck-Ho Bae, Andre S. Yoon, Jeong-Uk Kang, Sangyeun Cho, Daniel D. G. Lee, and Jaeheon Jeong. 2016. YourSQL: A High- Performance Database System Leveraging in-Storage Computing.Proc. VLDB Endow.9, 12 (Aug. 2016), 924–935. doi:10.14778/2994509.2994512

  36. [36]

    Aditya K Kamath and Simon Peter. 2024. (MC)2: Lazy MemCopy at the Memory Controller. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 1112–1128. doi:10.1109/ ISCA59077.2024.00084

  37. [37]

    Onur Kocberber, Babak Falsafi, and Boris Grot. 2015. Asynchronous Memory Access Chaining.Proc. VLDB Endow.9, 4 (Dec. 2015), 252–263. doi:10.14778/2856318.2856321

  38. [38]

    H. Kopetz and G. Bauer. 2003. The time-triggered architecture.Proc. IEEE91, 1 (2003), 112–126. doi:10.1109/JPROC.2002.805821

  39. [39]

    Dario Korolija, Dimitrios Koutsoukos, Kimberly Keeton, Konstantin Taranov, Dejan Milojičić, and Gustavo Alonso. 2021. Farview: Disag- gregated Memory with Operator Off-loading for Database Engines. doi:10.48550/arXiv.2106.07102arXiv:2106.07102 [cs]

  40. [40]

    Dario Korolija, Timothy Roscoe, and Gustavo Alonso. 2020. Do OS Abstractions Make Sense on FPGAs?. InProceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20). USENIX Association, USA, 991–1010

  41. [41]

    Ronny Krashinsky, Olivier Giroux, Stephen Jones, Nick Stam, and Srid- har Ramaswamy. 2020. NVIDIA Ampere Architecture In-Depth.https: //developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/

  42. [42]

    Rossbach, and Emmett Witchel

    Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2017. Ingens: Huge Page Support for the OS and Hypervisor.SIGOPS Oper. Syst. Rev.51, 1 (Sept. 2017), 83–93. doi:10.1145/3139645.3139659

  43. [43]

    I.-Ting Lee, Bao-Kai Wang, Liang-Chi Chen, Wen Sheng Lim, Da-Wei Chang, Yu-Ming Chang, and Chieng-Chung Ho. 2025. PIM or CXL- PIM? Understanding Architectural Trade-offs Through Large-Scale Benchmarking. doi:10.48550/arXiv.2511.14400arXiv:2511.14400 [cs]

  44. [44]

    Alberto Lerner and Gustavo Alonso. 2024. CXL and the Return of Scale-Up Database Engines.Proc. VLDB Endow.17, 10 (June 2024), 2568–2575. doi:10.14778/3675034.3675047

  45. [45]

    Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Falout- sos, and Zoubin Ghahramani. 2010. Kronecker graphs: an approach to modeling networks.Journal of Machine Learning Research11, 2 (2010)

  46. [46]

    Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zar- doshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, and Ricardo Bian- chini. 2023. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. InProceedings of the 28th ACM International Conference on Architectural Support for P...

  47. [47]

    Hongfu Li, Qian Tao, Song Yu, Shufeng Gong, Yanfeng Zhang, Feng Yao, Wenyuan Yu, Ge Yu, and Jingren Zhou. 2024. GastCoCo: Graph Storage and Coroutine-Based Prefetch Co-Design for Dynamic Graph Processing.Proc. VLDB Endow.17, 13 (Sept. 2024), 4827–4839. doi:10. 14778/3704965.3704986

  48. [48]

    Luyang Li, Heng Pan, Xinchen Wan, Kai Lv, Zilong Wang, Qian Zhao, Feng Ning, Qingsong Ning, Shideng Zhang, Zhenyu Li, Layong Luo, and Gaogang Xie. 2025. Harmonia: A Unified Framework for Het- erogeneous FPGA Acceleration in the Cloud. InProceedings of the 30th ACM International Conference on Architectural Support for Pro- gramming Languages and Operating ...

  49. [49]

    Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S. Berger, Marie Nguyen, Xun Jian, Sam H. Noh, and Huaicheng Li. 2025. System- atic CXL Memory Characterization and Performance Analysis at Scale. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Rotterdam, Netherlands)(AS...

  50. [50]

    Zikai Liu, Jasmin Schult, Pengcheng Xu, and Timothy Roscoe. 2025. Mainframe-Style Channel Controllers for Modern Disaggregated Mem- ory Systems. InProceedings of the 16th ACM SIGOPS Asia-Pacific Work- shop on Systems (APSys ’25). Association for Computing Machinery, New York, NY, USA, 82–90. doi:10.1145/3725783.3764403

  51. [51]

    Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan Berry. 2007. Challenges in Parallel Graph Processing.Parallel Processing Letters17, 01 (2007), 5–20. doi:10.1142/S0129626407002843 arXiv:https://doi.org/10.1142/S0129626407002843

  52. [52]

    Pregel: a system for large-scale graph processing,

    Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehn- ert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-Scale Graph Processing. InProceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIG- MOD ’10). Association for Computing Machinery, New York, NY, USA, 135–146. doi:10.1145/1...

  53. [53]

    Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowd- hury, Shobhit Kanaujia, and Prakash Chauhan. 2023. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Ope...

  54. [54]

    2024.Marvell Structera A 2504 Memory-Expansion Con- troller

    Marvell. 2024.Marvell Structera A 2504 Memory-Expansion Con- troller. Technical Report Marvell_Structera_A MV-SLA25041 _PB. 3 pages.https://www.marvell.com/content/dam/marvell/en/public- collateral/assets/marvell-structera-a-2504-near-memory- accelerator-product-brief.pdf

  55. [55]

    2024.Marvell Structera X 2504 Memory-Expansion Controller

    Marvell. 2024.Marvell Structera X 2504 Memory-Expansion Controller. Technical Report. 2 pages.https://www.marvell.com/content/ dam/marvell/en/public-collateral/assets/marvell-structera-x-2504- memory-expansion-controller-product-brief.pdf 14

  56. [56]

    Friedemann Mattern. 1989. Global quiescence detection based on credit distribution and recovery.Inf. Process. Lett.30, 4 (Feb. 1989), 195–200. doi:10.1016/0020-0190(89)90212-3

  57. [57]

    Micron. 2023. Flexible Memory Expansion for Data-Intensive Work- loads.https://www.micron.com/products/memory/cxl-memory

  58. [58]

    Montage Technology. 2026. CXL Memory eXpander Controller (MXC). https://www.montage-tech.com/MXC

  59. [59]

    2014.Grappa: A Latency-Tolerant Run- time for Large-Scale Irregular Applications

    Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2014.Grappa: A Latency-Tolerant Run- time for Large-Scale Irregular Applications. Technical Report UW-CSE- 14-02-01. University of Washington.https://sampa.cs.washington. edu/new/papers/grappa-tr-2014-02.pdf

  60. [60]

    Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2015. Latency-Tolerant Soft- ware Distributed Shared Memory. In2015 USENIX Annual Techni- cal Conference (USENIX ATC 15). USENIX Association, Santa Clara, CA, 291–305.https://www.usenix.org/conference/atc15/technical- session/presentation/nelson

  61. [61]

    Kelvin K. W. Ng, Henri Maxime Demoulin, and Vincent Liu. 2023. Paella: Low-latency Model Serving with Software-defined GPU Sched- uling. InProceedings of the 29th Symposium on Operating Systems Prin- ciples (SOSP ’23). Association for Computing Machinery, New York, NY, USA, 595–610. doi:10.1145/3600006.3613163

  62. [62]

    NVIDIA. 2026. CUDA Programming Guide.https://docs.nvidia.com/ cuda/cuda-programming-guide/

  63. [63]

    1999.The PageRank Citation Ranking: Bringing Order to the Web

    Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999.The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. Stanford InfoLab / Stanford InfoLab.http: //ilpubs.stanford.edu:8090/422/

  64. [64]

    Ashish Panwar, Sorav Bansal, and K. Gopinath. 2019. HawkEye: Ef- ficient Fine-grained OS Support for Huge Pages. InProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19). As- sociation for Computing Machinery, New York, NY, USA, 347–360. doi:10.1145/3297858.3304064

  65. [65]

    Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee

  66. [66]

    Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways?

     Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways?. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). Association for Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/2830772.2830773

  67. [67]

    Georgios Psaropoulos, Thomas Legler, Norman May, and Anastasia Ailamaki. 2017. Interleaving with Coroutines: A Practical Approach for Robust Index Joins.Proc. VLDB Endow.11, 2 (Oct. 2017), 230–242. doi:10.14778/3149193.3149202

  68. [68]

    Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 201...

  69. [69]

    2023.CCKit: FPGA Acceleration in Symmetric Coherent Heterogeneous Platforms

    Abishek Ramdas. 2023.CCKit: FPGA Acceleration in Symmetric Coherent Heterogeneous Platforms. Doctoral Thesis. ETH Zurich. doi:10.3929/ethz-b-000642567

  70. [70]

    Benjamin Ramhorst, Dario Korolija, Maximilian Jakob Heer, Jonas Dann, Luhao Liu, and Gustavo Alonso. 2025. Coyote v2: Raising the Level of Abstraction for Data Center FPGAs. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP ’25). Association for Computing Machinery, New York, NY, USA, 639–654. doi:10.1145/3731569.3764845

  71. [71]

    Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel. 2011. PTask: operating system abstractions to manage GPUs as compute devices. InProceedings of the Twenty-Third ACM Symposium on Operating Systems Principles(Cascais, Portugal) (SOSP ’11). Association for Computing Machinery, New York, NY, USA, 233–248. doi:10.1145/2...

  72. [72]

    Samsung. 2022. Samsung Electronics Introduces Industry’s First 512GB CXL Memory Module.https://news.samsung.com/global/samsung- electronics-introduces-industrys-first-512gb-cxl-memory-module

  73. [73]

    Samsung. 2024. CXL Memory Module Box CMM-B. https://semiconductor.samsung.com/news-events/tech-blog/cxl- memory-module-box-cmm-b

  74. [74]

    Joonseop Sim, Soohong Ahn, Taeyoung Ahn, Seungyong Lee, Myunghyun Rhee, Jooyoung Kim, Kwangsik Shin, Donguk Moon, Euiseok Kim, and Kyoung Park. 2022. Computational cxl-memory so- lution for accelerating memory-intensive applications.IEEE Computer Architecture Letters22, 1 (2022), 5–8

  75. [75]

    Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Chihun Song, Jinghan Huang, Houxiang Ji, Siddharth Agarwal, Jiaqi Lou, Ipoom Jeong, Ren Wang, Jung Ho Ahn, Tianyin Xu, and Nam Sung Kim. 2023. Demysti- fying CXL Memory with Genuine CXL-Ready Systems and Devices. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’23)....

  76. [76]

    Yupeng Tang, Ping Zhou, Wenhui Zhang, Henry Hu, Qirui Yang, Hao Xiang, Tongping Liu, Jiaxin Shan, Ruoyun Huang, Cheng Zhao, Cheng Chen, Hui Zhang, Fei Liu, Shuai Zhang, Xiaoning Ding, and Jianjun Chen. 2024. Exploring Performance and Cost Optimization with ASIC- Based CXL Memory. InProceedings of the Nineteenth European Confer- ence on Computer Systems (E...

  77. [77]

    Dufy Teguia, Jiaxuan Chen, Stella Bitchebe, Oana Balmau, and Alain Tchana. 2024. vPIM: Processing-in-Memory Virtualization. InProceed- ings of the 25th International Middleware Conference (Middleware ’24). Association for Computing Machinery, New York, NY, USA, 417–430. doi:10.1145/3652892.3700782

  78. [78]

    Chuck Thacker. 2010. Beehive: A many-core computer for FP- GAs (v5).https://web.mit.edu/6.173/www/currentsemester/handouts/ BeehiveV5.pdf

  79. [79]

    Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an interme- diate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages(Phoenix, AZ, USA) (MAPL 2019). Association for Computing Machinery, New York, NY, USA, 10–19. doi:10.1145/3315508.3329973

  80. [80]

    Lukas Vogel, Daniel Ritter, Danica Porobic, Pinar Tözün, Tianzheng Wang, and Alberto Lerner. 2023. Data Pipes: Declarative Control over Data Movement. InConference on Innovative Data Systems Research

Showing first 80 references.