Proxics: an efficient programming model for far memory accelerators
Pith reviewed 2026-05-10 03:13 UTC · model grok-4.3
The pith
Near-data processing accelerators can expose familiar processes and pipe-like channels, provided those abstractions are made lightweight through compilation and interconnect protocols.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose Proxics, a programming model for NDP devices based on virtual processors and IPC channels like Unix pipes. These abstractions are realized in a lightweight manner by leveraging compilation and interconnect protocols rather than traditional heavyweight mechanisms or high-bandwidth shared buffers. On a real hardware platform, the model supports applications with varied memory access patterns and delivers performance benefits over CPU-only execution, while demonstrating the critical role of efficient, low-latency communication between the host CPU and the NDP accelerator.
What carries the argument
The Proxics model of virtual processors and low-overhead IPC channels, realized through compilation and hardware interconnect protocols.
If this is right
- Applications with memory-intensive patterns can be offloaded to NDP using familiar process and channel code rather than specialized languages.
- Bandwidth demand between host and far memory drops for bulk operations, database queries and graph traversals.
- Low-latency CPU-NDP communication channels become a first-order requirement for overall system performance.
- The same abstractions remain portable across different NDP hardware designs that expose suitable protocols.
Where Pith is reading between the lines
- The approach could be applied to other disaggregated memory fabrics if they provide comparable protocol support for lightweight channels.
- Static compilation passes could be extended to automatically choose between CPU and NDP execution for individual code regions based on access patterns.
- Existing operating systems could incorporate Proxics as a new device type, allowing unmodified user code to target NDP without explicit accelerator APIs.
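The second extrapolation, compiler-driven CPU-vs-NDP placement by access pattern, could be prototyped as a profile-guided heuristic. A minimal sketch, with an invented locality score and threshold that are not taken from the paper:

```python
def placement_hint(addresses, cache_line=64, locality_threshold=0.5):
    """Toy dispatch heuristic: regions whose consecutive accesses mostly stay
    within one cache line look CPU-friendly; scattered accesses are candidates
    for NDP offload. The scoring and threshold are illustrative guesses."""
    if len(addresses) < 2:
        return "cpu"
    local_hits = sum(
        1 for prev, cur in zip(addresses, addresses[1:])
        if abs(cur - prev) < cache_line
    )
    locality = local_hits / (len(addresses) - 1)
    return "cpu" if locality >= locality_threshold else "ndp"

# Sequential scan: high locality, keep on CPU.
print(placement_hint([i * 8 for i in range(100)]))               # cpu
# Pointer-chasing-style pattern: poor locality, offload to NDP.
print(placement_hint([(i * 7919) % 100000 for i in range(100)])) # ndp
```

A real compiler pass would reason statically over loop nests rather than over a recorded trace, but the decision boundary it needs is of this kind.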
Load-bearing premise
The target NDP hardware supplies interconnect protocols and compilation support that keep process and channel abstractions lightweight instead of forcing heavy overheads or shared buffers.
What would settle it
If measurements on the real hardware platform show that Proxics incurs higher latency, greater bandwidth use, or no performance advantage over CPU-only execution for the tested bulk, database and graph workloads, the efficiency claim would be refuted.
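The latency half of that test is straightforward to frame as a microbenchmark. The sketch below measures the round-trip time of an ordinary host OS channel using Python's multiprocessing; a refutation experiment would report the analogous number for the CPU-NDP channel (no Proxics-specific figures are assumed here):

```python
import time
from multiprocessing import Process, Pipe

def echo(conn, n):
    """Echo worker: bounces each message straight back to the sender."""
    for _ in range(n):
        conn.send(conn.recv())
    conn.close()

def round_trip_latency(n=1000):
    """Median round-trip time, in microseconds, over an OS-level channel."""
    parent, child = Pipe()
    p = Process(target=echo, args=(child, n))
    p.start()
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        parent.send(b"x")
        parent.recv()
        samples.append(time.perf_counter() - t0)
    p.join()
    samples.sort()
    return samples[n // 2] * 1e6

if __name__ == "__main__":
    print(f"median round trip: {round_trip_latency():.1f} us")
```

Comparing this host-side baseline against the hardware channel's measured latency is exactly the kind of evidence that would settle the efficiency claim either way.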
Original abstract
The use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such accelerators are appearing, but there lack clean, portable OS abstractions for programming them. We propose a programming model for NDP devices based on familiar OS abstractions: virtual processors (processes) and inter-process communication channels (like Unix pipes). While appealing from a user perspective, a naive implementation of such abstractions is inappropriate for NDP accelerators: the paucity of processing power in some hardware designs makes classical processes overly heavyweight, and IPC based on shared buffers makes no sense in a system designed to reduce memory bandwidth. Accordingly, we show how to implement these abstractions in a lightweight and efficient manner by exploiting compilation and interconnect protocols. We demonstrate them with a real hardware platform runing applications with a range of memory access patterns, including bulk memory operations, in-memory databases and graph applications. Crucially, we show not only the benefits over CPU-only implementations, but also the critical importance of efficient, low-latency communication channels between CPU and NDP accelerators, a feature largely neglected in existing proposals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Proxics, a programming model for near-data processing (NDP) accelerators in disaggregated/far-memory systems such as CXL pools. It uses familiar OS abstractions—virtual processors (processes) and pipe-like inter-process communication channels—and shows how to realize them in a lightweight manner by exploiting compilation passes and interconnect protocols rather than classical heavyweight process implementations or shared-buffer IPC. The model is demonstrated on a real hardware platform running applications with diverse memory access patterns (bulk operations, in-memory databases, graph processing), with results showing benefits over CPU-only baselines and highlighting the importance of low-latency CPU-NDP channels.
Significance. If the lightweight realization holds, the work provides a valuable contribution by supplying portable, programmer-friendly abstractions for NDP hardware, potentially lowering the barrier to using far-memory accelerators. The real-hardware evaluation across multiple workload patterns and the explicit focus on efficient CPU-NDP communication (often neglected in prior NDP proposals) are clear strengths. This could inform OS and runtime design for emerging disaggregated systems.
major comments (2)
- [Evaluation section] Real-hardware demonstration: The central claim that the process and pipe abstractions can be implemented with low overhead relies on hardware-specific compilation and interconnect features. The paper should add quantitative overhead measurements (e.g., context-switch or channel latency numbers) and an explicit discussion of which features are assumed to be present in other NDP designs (such as CXL-based pools) to substantiate portability beyond the single demonstrated platform.
- [Implementation section] Lightweight realization: The argument that naive processes and shared-buffer IPC are inappropriate for NDP is load-bearing, yet the manuscript provides limited detail on how the compilation passes avoid reverting to heavyweight costs or high-bandwidth buffers. A concrete breakdown of the resulting instruction counts or memory traffic for the pipe abstraction would strengthen the efficiency claim.
minor comments (2)
- [Abstract] 'runing' should be 'running'; 'there lack clean' should be 'there is a lack of clean'.
- [Abstract] The abstract states benefits are shown but provides no quantitative results or error bars; the full evaluation section should ensure all performance claims include these for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive summary and constructive feedback on our work. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions or evaluation.
Point-by-point responses
Referee: [Evaluation section] Real-hardware demonstration: The central claim that the process and pipe abstractions can be implemented with low overhead relies on hardware-specific compilation and interconnect features. The paper should add quantitative overhead measurements (e.g., context-switch or channel latency numbers) and an explicit discussion of which features are assumed to be present in other NDP designs (such as CXL-based pools) to substantiate portability beyond the single demonstrated platform.
Authors: We agree that explicit quantitative overhead numbers and a clearer portability discussion would improve the evaluation. In the revised manuscript, we have added direct measurements of pipe channel latency (sub-microsecond on the platform) and lightweight context-switch costs, obtained via cycle-accurate instrumentation on the real hardware. We have also expanded the discussion in the Evaluation and Discussion sections to enumerate the assumed interconnect features (low-latency message passing and compiler-visible address spaces) and note how these map to emerging CXL-based NDP pools, while acknowledging that full cross-platform empirical validation would require additional hardware access. revision: partial
Referee: [Implementation section] Lightweight realization: The argument that naive processes and shared-buffer IPC are inappropriate for NDP is load-bearing, yet the manuscript provides limited detail on how the compilation passes avoid reverting to heavyweight costs or high-bandwidth buffers. A concrete breakdown of the resulting instruction counts or memory traffic for the pipe abstraction would strengthen the efficiency claim.
Authors: We accept that a more granular breakdown would strengthen the efficiency argument. The revised Implementation section now includes a concrete analysis: the compilation passes reduce pipe send/receive to 12-18 instructions with zero additional memory traffic beyond the payload (by using direct interconnect messages instead of shared buffers), compared to hundreds of instructions and multiple cache-line transfers for a naive shared-memory implementation. This is supported by both static instruction counts from the compiler output and dynamic memory-traffic traces from the hardware. revision: yes
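The rebuttal's figures can be turned into a back-of-envelope traffic model. The sketch below compares bytes moved per pipe operation under the two designs it describes; the number of control-metadata cache lines touched by the naive shared-buffer path (`naive_lines`) is an illustrative assumption, not a figure from the paper:

```python
def pipe_traffic_bytes(payload, cache_line=64, naive_lines=4):
    """Bytes crossing the interconnect per pipe send/receive.

    - message-based (the rebuttal's claim): only the payload moves,
      carried directly in interconnect messages.
    - shared-buffer (the naive design): the payload is rounded up to
      whole cache lines, plus `naive_lines` metadata lines (head/tail
      pointers, flags) bounced between the two sides -- an assumed count.
    """
    message_based = payload
    payload_lines = -(-payload // cache_line)   # ceiling division
    shared_buffer = (payload_lines + naive_lines) * cache_line
    return message_based, shared_buffer

msg, shared = pipe_traffic_bytes(payload=16)
print(msg, shared)   # 16 320
```

Even under these generous assumptions for the shared-buffer path, a small payload incurs an order-of-magnitude traffic penalty, which is consistent with the rebuttal's "zero additional memory traffic beyond the payload" claim for the message-based design.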
Circularity Check
No circularity: design proposal grounded in external hardware without self-referential derivations
Full rationale
The paper is a systems design proposal for NDP programming abstractions (virtual processors and pipe-like IPC) implemented via compilation and interconnect protocols, demonstrated on real hardware. No equations, fitted parameters, predictions, or derivation chains exist that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on external hardware capabilities and empirical demonstration rather than internal redefinition or renaming of known results. This is the normal case of a self-contained engineering contribution.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption NDP hardware provides interconnect protocols that can replace shared-buffer IPC without incurring high bandwidth costs.
- domain assumption Compilation techniques can sufficiently optimize away the overhead of virtual-processor abstractions on resource-constrained NDP cores.