pith. machine review for the scientific record.

arxiv: 2604.15128 · v1 · submitted 2026-04-16 · 💻 cs.AR · cs.DC · cs.NI

Recognition: unknown

SCENIC: Stream Computation-Enhanced SmartNIC

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:43 UTC · model grok-4.3

classification 💻 cs.AR · cs.DC · cs.NI
keywords SmartNIC · stream computation · RDMA offload · programmable datapath · datacenter networking · TCP/IP stack · collective communication · network-to-GPU

The pith

SCENIC turns the SmartNIC datapath into a first-class stream computation substrate to match commercial 200G performance while adding programmability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Datacenter SmartNICs currently force a choice between commercial platforms that deliver high bandwidth and easy integration but limited customization, and research designs that often fall short on speed or compatibility. SCENIC closes this gap by treating the network datapath itself as a programmable stream computation substrate. It combines 200G offloaded TCP/IP and RDMA stacks with on-datapath Stream Compute Units for processing and embedded ARM cores for control, plus direct GPU and SSD access. The design stays fully compatible with existing Linux network and RDMA interfaces, so applications see no change yet can invoke custom offloads. A sympathetic reader would care because the result shows that meaningful programmability and control can be added without the latency or bandwidth penalties that usually accompany research prototypes.

Core claim

SCENIC implements a 200G network datapath over offloaded TCP/IP and RDMA stacks, together with a fallback path for arbitrary traffic. On top of this logic it places Stream Compute Units for data processing and embedded ARM cores for flexible control-path manipulation, while exposing native Linux network and RDMA verb interfaces so that the programmable elements remain transparent to unmodified applications.
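"Transparent to unmodified applications" means code like the following ordinary Linux socket program, which contains nothing SCENIC-specific, is exactly what the design claims to serve unchanged (loopback sketch; a real deployment would simply bind the NIC's address):

```python
# An unmodified socket application: no SCENIC-specific calls anywhere.
# The paper's claim is that such code runs over SCENIC's offloaded or
# fallback paths as-is; this sketch only shows the vanilla API surface.
import socket
import threading

def echo_server(sock: socket.socket) -> None:
    conn, _ = sock.accept()
    with conn:
        conn.sendall(conn.recv(1024))  # echo the payload back

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # loopback here; a deployment binds the NIC address
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=echo_server, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"ping")
reply = client.recv(1024)
client.close()
t.join()
server.close()
print(reply)  # b'ping'
```

The same argument applies on the RDMA side: SCENIC registers as an ibv_device, so existing IB Verbs applications discover and use it through the standard libibverbs calls.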

What carries the argument

Stream Compute Units (SCUs) placed directly on the network datapath, paired with embedded ARM cores and shared hardware/software abstractions that allow tight co-design of infrastructure and applications.

Load-bearing premise

Adding Stream Compute Units and ARM cores to the high-speed datapath can deliver useful programmability and control without creating latency or bandwidth penalties that would keep the device from matching commercial SmartNIC performance.
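To see why this premise is demanding, a back-of-envelope calculation (my arithmetic, not from the paper) gives the per-packet time budget any on-datapath compute unit must meet to sustain 200 Gb/s:

```python
# Per-packet processing budget at line rate (illustrative arithmetic only).
LINE_RATE_BPS = 200e9  # 200 Gb/s

def packet_budget_ns(frame_bytes: int, rate_bps: float = LINE_RATE_BPS) -> float:
    """Time one frame occupies the wire: the budget for any per-packet work."""
    return frame_bytes * 8 / rate_bps * 1e9

budget_1500 = packet_budget_ns(1500)  # 60.0 ns for a full-size Ethernet frame
budget_64 = packet_budget_ns(64)      # 2.56 ns for a minimum-size frame
print(f"{budget_1500:.2f} ns, {budget_64:.2f} ns")
```

At 60 ns per full-size frame, and far less for small packets, any added stage that cannot process a frame within its wire time becomes the bottleneck, which is why the SCUs must sit in the pipeline rather than on a detour through the ARM cores.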

What would settle it

A direct benchmark comparison in which SCENIC's measured latency or sustained bandwidth on standard RDMA or TCP workloads falls measurably below a commercial 200G SmartNIC, or in which the described use cases (offloaded collectives or network-to-GPU partitioning) cannot run at line rate.
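For concreteness, the network-to-GPU partitioning use case amounts to the following routing step, here modeled in software (the hash function and partition count are illustrative choices, not SCENIC's; the paper performs this step in datapath hardware):

```python
# Software model of hash-based data partitioning: route each record to a
# partition (e.g. one GPU buffer) by key hash. Hash and partition count
# are illustrative assumptions, not taken from the paper.
import zlib

def partition(records: list[bytes], n_partitions: int) -> list[list[bytes]]:
    """Assign every record to exactly one of n_partitions buckets."""
    out: list[list[bytes]] = [[] for _ in range(n_partitions)]
    for rec in records:
        out[zlib.crc32(rec) % n_partitions].append(rec)
    return out

parts = partition([b"alpha", b"beta", b"gamma", b"delta"], 4)
assert sum(len(p) for p in parts) == 4  # every record lands in exactly one bucket
```

The line-rate question is whether this per-record hash-and-route step keeps up with incoming traffic when executed on the datapath instead of the host CPU.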

Figures

Figures reproduced from arXiv: 2604.15128 by Benjamin Ramhorst, Gustavo Alonso, Heejae Kim, Jin-Soo Kim, Jonas Dann, Luhao Liu, Maximilian Jakob Heer.

Figure 1. Overview of SCENIC with two example offloads: hash-based network-to-GPU data partitioning (Section 9.2) and hybrid flow monitoring (Section 6.2).
Figure 2. Programmable congestion control in SCENIC.
Figure 3. DMA packet forwarding to the network driver.
Figure 4. Performance evaluation of the fallback path. Left: ping latency. Right: iperf3 throughput. Hybrid refers to Mellanox-to-SCENIC communication.
Figure 5. RDMA performance benchmark in a fully switched datacenter network.
Figure 6. Throughput of SCENIC to GPU with RDMA READs and RDMA WRITEs.
Figure 8. Time series of bandwidth sharing scaling up to four parallel flows performing 128 KiB RDMA READs through separated SCUs.
Figure 9. Comparison of BROADCAST and GATHER collectives on SCENIC with OpenMPI on a commercial NIC.
Figure 10. Performance of hash partitioning on the CPU (B: Baseline, 1 and 16 threads) and offloaded with SCENIC.
read the original abstract

Although modern, AI-centric datacenters heavily rely on SmartNICs, existing devices impose a hard trade-off. Commercial SmartNICs provide high bandwidth and easy software integration, but offer limited support for customization and data processing offload. In contrast, research SmartNICs often suffer from low bandwidth, limited functionality, and poor software compatibility -- to the point that many are not actual NICs in a technical sense. This gap can be closed by treating the NIC datapath as a first-class stream computation substrate with shared hardware/software abstractions for a tight co-design of infrastructure and applications. To demonstrate this, we introduce SCENIC, an open-source datacenter SmartNIC. SCENIC implements a 200G network datapath over offloaded TCP/IP and RDMA stacks, as well as a fallback path for processing arbitrary network traffic. On top of the network logic, SCENIC combines on-datapath Stream Compute Units (SCUs) for data processing and embedded ARM cores for flexible control path manipulation with direct access to GPUs and SSDs. SCENIC is fully integrated with the OS, exposing native Linux network and RDMA verb interfaces, making the programmable datapath transparent to existing applications while enabling control of, e.g., user-defined offloads and programmable congestion control. SCENIC's performance matches commercial platforms, and we show its versatility through several use cases such as offloaded collective communication and network-to-GPU hash-based data partitioning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SCENIC, an open-source datacenter SmartNIC that implements a 200G network datapath over offloaded TCP/IP and RDMA stacks together with a fallback path for arbitrary traffic. It augments the datapath with on-datapath Stream Compute Units (SCUs) for data processing and embedded ARM cores for control, while providing direct GPU/SSD access and full integration with the Linux OS and RDMA verb interfaces. The central claims are that this design achieves performance parity with commercial platforms and enables versatile use cases such as offloaded collective communication and network-to-GPU hash-based data partitioning without sacrificing bandwidth or latency.

Significance. If the zero-overhead integration of SCUs and ARM cores is demonstrated, the work would meaningfully close the gap between high-bandwidth commercial SmartNICs and customizable research platforms by delivering a fully OS-compatible, programmable datapath. The open-source prototype, native Linux/RDMA compatibility, and multiple concrete use cases are positive attributes that could support broader adoption and further co-design research.

major comments (2)
  1. [Abstract] Abstract: The claim that 'SCENIC's performance matches commercial platforms' is load-bearing for the contribution yet is presented without any quantitative throughput, latency, or bandwidth measurements, error bars, baseline comparisons (e.g., to BlueField-class devices), or ablation data isolating the impact of the added SCUs and ARM cores versus pure bypass paths.
  2. [Use cases] Use-case demonstrations: The offloaded collective communication and hash-based GPU partitioning examples are described at a high level but lack performance numbers, comparisons against non-programmable baselines, or stress-test results on the programmable paths and fallback route under realistic 200G workloads, leaving the 'no-penalty' assumption unverified.
minor comments (1)
  1. [Abstract] Abstract: The acronym 'SCU' is introduced without a short definition or pointer to the stream-computation literature that motivates the hardware abstraction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of SCENIC's potential impact. We address each major comment below with point-by-point responses and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'SCENIC's performance matches commercial platforms' is load-bearing for the contribution yet is presented without any quantitative throughput, latency, or bandwidth measurements, error bars, baseline comparisons (e.g., to BlueField-class devices), or ablation data isolating the impact of the added SCUs and ARM cores versus pure bypass paths.

    Authors: We agree that the abstract would be strengthened by including a concise summary of the key quantitative results that appear in the evaluation section. The manuscript already contains throughput and latency measurements at 200 Gbps, direct comparisons to BlueField-class devices, error bars from repeated runs, and ablations isolating SCU/ARM overhead on the bypass path. We will revise the abstract to incorporate these supporting metrics and comparisons so the performance claim is substantiated at the abstract level as well. revision: yes

  2. Referee: [Use cases] Use-case demonstrations: The offloaded collective communication and hash-based GPU partitioning examples are described at a high level but lack performance numbers, comparisons against non-programmable baselines, or stress-test results on the programmable paths and fallback route under realistic 200G workloads, leaving the 'no-penalty' assumption unverified.

    Authors: The quantitative results for both use cases, including comparisons to non-programmable baselines and stress tests under full 200 Gbps load on programmable and fallback paths, are presented in the evaluation section. We acknowledge that the use-case descriptions themselves remain somewhat high-level and do not sufficiently cross-reference or restate these numbers. We will revise the use-case subsections to explicitly include the relevant performance figures, baseline comparisons, and workload results, thereby directly verifying the no-penalty claim in context. revision: yes

Circularity Check

0 steps flagged

No circularity detected in SCENIC implementation claims

full rationale

The paper presents an implemented hardware/software prototype for a 200G SmartNIC with on-datapath SCUs and ARM cores, offloaded TCP/IP/RDMA stacks, and OS integration. Central claims of matching commercial performance and versatility in use cases rest on the physical prototype, fallback paths, and demonstrated applications rather than any equations, fitted parameters, or derivations that reduce to their own inputs by construction. No self-definitional steps, uniqueness theorems, or ansatz smuggling via self-citation appear in the provided text; the design is described as a co-design substrate without load-bearing self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard domain assumptions about the feasibility of 200G networking hardware and RDMA offload; no free parameters, invented physical entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption Existing hardware primitives for 200G networking and RDMA offload are sufficient to support the described performance and integration.
    The claims of matching commercial performance presuppose that the underlying silicon and protocol stacks behave as expected when augmented with SCUs and ARM cores.

pith-pipeline@v0.9.0 · 5580 in / 1438 out tokens · 36844 ms · 2026-05-10T09:43:46.936431+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

101 extracted references · 46 canonical work pages · 1 internal anchor

  1. [1]

    Andersen, Peter Bailis, Magdalena Balazinska, Philip A

    Daniel Abadi, Anastasia Ailamaki, David G. Andersen, Peter Bailis, Magdalena Balazinska, Philip A. Bernstein, Peter Boncz, Surajit Chaud- huri, Alvin Cheung, AnHai Doan, Luna Dong, Michael J. Franklin, Juliana Freire, Alon Y. Halevy, Joseph M. Hellerstein, Stratos Idreos, Donald Kossmann, Tim Kraska, Sailesh Krishnamurthy, Volker Markl, Sergey Melnik, Tov...

  2. [2]

    Advanced Micro Devices, Inc. 2024. AMD Pensando Software-in-Silicon Development Kit (SSDK). https://www.amd.com/content/dam/amd/ en/documents/pensando-technical-docs/product-briefs/pensando- ssdk-product-brief.pdf

  3. [3]

    Papailiopoulos

    Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, and Dim- itris S. Papailiopoulos. 2022. On the Utility of Gradient Compression in Distributed Training Systems. (2022). https://proceedings.mlsys.or g/paper_files/paper/2022/hash/773862fcc2e29f650d68960ba5bd1101- Abstract.html

  4. [4]

    Olasupo Ajayi and Ryan Grant. 2025. A Chronological Analysis of the Evolution of SmartNICs. CoRR abs/2512.04054 (2025). arXiv:2512.04054 doi:10.48550/ARXIV.2512.04054

  5. [5]

    Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. 2008. A scalable, commodity data center network architecture. InProceedings of the ACM SIGCOMM 2008 Conference on Data Communication (Seattle, WA, USA) (SIGCOMM ’08). Association for Computing Machinery, New York, NY, USA, 63–74. doi: 10.1145/1402958.1402967

  6. [6]

    Amazon Web Services. 2022. The Components of the Nitro System (The Security Design of the A WS Nitro System Whitepaper). Technical Report. Amazon Web Services. https://docs.aws.amazon.com/whitepapers/la test/security-design-of-aws-nitro-system/the-components-of-the- nitro-system.html Accessed: 2026-04-15

  7. [7]

    AMD. 2025. DMA/Bridge Subsystem for PCI Express Product Guide (PG195). https://docs.amd.com/r/en-US/pg195-pcie-dma

  8. [8]

    AMD. 2025. Versal Adaptive SoC 600G Channelized Multirate Ethernet Subsystem (DCMAC) LogiCORE IP Product Guide (PG369). https: //docs.amd.com/r/en-US/pg369-dcmac/Introduction

  9. [9]

    AMD. 2025. Versal Adaptive SoC CPM DMA and Bridge Mode for PCI Express v3.4. https://docs.amd.com/r/en-US/pg347-cpm-dma- bridge?tocId=oTd_ZrdYcOWw7fqmc3hb9g

  10. [10]

    AMD. 2025. Vitis Networking P4. https://docs.amd.com/r/en- US/ug1308-vitis-p4-user-guide

  11. [11]

    AMD Pensando. 2022. AMD Pensando Elba DPU (DSC-200) Product Overview. https://www.amd.com/en/products/data-processing- units/pensando.html

  12. [12]

    AMD/Xilinx. 2021. OpenNIC: An Open-Source NIC Shell for Alveo FPGAs. GitHub. https://github.com/Xilinx/open-nic

  13. [13]

    Kyle Aubrey and Farshad Ghodsian. 2026. Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform. NVIDIA Technical Blog. https://developer.nvidia.com/blo g/inside-nvidia-groq-3-lpx-the-low-latency-inference-accelerator- for-the-nvidia-vera-rubin-platform/ Accessed: 2026-03-28

  14. [14]

    John Bachan, Kaiming Ouyang, Misbah Mubarak, Thomas Gillis, Bruce Chang, Devendar Bureddy, Giuseppe Congiu, Keith Caton, Kyle Aubrey, and Xiaofan Li. 2025. Enabling Fast Inference and Resilient Training with NCCL 2.27. https://developer.nvidia.com/blog/enabling- fast-inference-and-resilient-training-with-nccl-2-27/

  15. [15]

    Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin, Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert G. Greenberg, Manish Gupta, Randy Haagens, Matthew Hendel...

  16. [16]

    Tommaso Bonato, Abdul Kabbani, Ahmad Ghalayini, Anup Agarwal, Daniele De Sensi, Rong Pan, Costin Raiciu, Mark Handley, Mihai Brodschi, Timo Schneider, Nils Blach, Daniel Santos Ferreira Alves, and Torsten Hoefler. 2026. SMaRTT: Sender-based Marked Rapidly- adapting Trimmed & Timed Transport. (2026). arXiv:2404.01630 [cs.NI] https://arxiv.org/abs/2404.01630

  17. [17]

    Broadcom. 2019. Broadcom Stingray PS225 Dual-Port 25GbE PCIe Ethernet SmartNIC Data Sheet. https://www.broadcom.com/compa ny/news/product-releases/53106

  18. [18]

    Marco Spaziani Brunella, Giacomo Belocchi, Marco Bonola, Salva- tore Pontarelli, Giuseppe Siracusano, Giuseppe Bianchi, Aniello Cam- marano, Alessandro Palumbo, Luca Petrucci, and Roberto Bifulco. 2020. hXDP: Efficient Software Packet Processing on FPGA NICs. In 14th USENIX Symposium on Operating Systems Design and Implementation, 12 OSDI 2020, Virtual Ev...

  19. [19]

    https://www.usenix.org/conference/osdi20/presentation/brunella

  20. [20]

    Xuzheng Chen, Jie Zhang, Ting Fu, Yifan Shen, Shu Ma, Kun Qian, Lingjun Zhu, Chao Shi, Yin Zhang, Ming Liu, and Zeke Wang. 2024. Demystifying Datapath Accelerator Enhanced Off-path SmartNIC. In 32nd IEEE International Conference on Network Protocols, ICNP 2024, Charleroi, Belgium, October 28-31, 2024 . IEEE, 1–12. doi: 10.1109/ICNP 61940.2024.10858560

  21. [21]

    Alibaba Cloud Community. 2022. A Detailed Explanation about Al- ibaba Cloud CIPU. https://www.alibabacloud.com/blog/a-detailed- explanation-about-alibaba-cloud-cipu_599183

  22. [22]

    Dan Daly, Jakub Kicinski, and Willem de Bruijn. 2023. OCP NIC Core Features Specification, Version 1.0. Technical Specification. Open Compute Project (OCP). https://www.opencompute.org/document s/ocp-server-nic-core-features-specification-ocp-spec-format-1-1- pdf Accessed: 2026-03-23

  23. [23]

    Jonas Dann and Gustavo Alonso. 2026. Should I Hide My Duck in the Lake? CoRR abs/2602.18775 (2026). doi:10.48550/ARXIV.2602.18775

  24. [24]

    Jonas Dann, Royden Wagner, Daniel Ritter, Christian Faerber, and Holger Fröning. 2022. PipeJSON: Parsing JSON at Line Speed on FPGAs. In International Conference on Management of Data, DaMoN 2022, Philadelphia, PA, USA, 13 June 2022 , Spyros Blanas and Norman May (Eds.). ACM, 3:1–3:7. doi:10.1145/3533737.3535094

  25. [25]

    Tristan Döring, Henning Stubbe, and Kilian Holzinger. 2021. Smart- NICs: Current Trends in Research and Industry . Technical Report NET- 2021-05-1. Chair of Network Architectures and Services, Department of Informatics, Technical University of Munich. https://www.net.in.t um.de/fileadmin/TUM/NET/NET-2021-05-1/NET-2021-05-1_05.pdf

  26. [26]

    Kfoury, Jose Gomez, and Jorge Crichigno

    Sergio Elizalde, Ali AlSabeh, Ali Mazloum, Samia Choueiri, Elie F. Kfoury, Jose Gomez, and Jorge Crichigno. 2025. A survey on security applications with SmartNICs: Taxonomy, implementations, challenges, and future trends. J. Netw. Comput. Appl. 242 (2025), 104257. doi:10.1 016/J.JNCA.2025.104257

  27. [27]

    Caulfield, Eric S

    Daniel Firestone, Andrew Putnam, Sambrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian M. Caulfield, Eric S. Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Sil...

  28. [28]

    In 15th USENIX Symposium on Networked Systems Design and Im- plementation, NSDI 2018, Renton, W A, USA, April 9-11, 2018 , Sujata Banerjee and Srinivasan Seshan (Eds.)

    Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Im- plementation, NSDI 2018, Renton, W A, USA, April 9-11, 2018 , Sujata Banerjee and Srinivasan Seshan (Eds.). USENIX Association, 51–66. https://www.usenix.org/conference/nsdi18/presentation/firestone

  29. [29]

    Snoeren, George Porter, and George Papen

    Alex Forencich, Alex C. Snoeren, George Porter, and George Papen

  30. [30]

    In 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2020, Fayetteville, AR, USA, May 3-6, 2020

    Corundum: An Open-Source 100-Gbps Nic. In 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2020, Fayetteville, AR, USA, May 3-6, 2020. IEEE, 38–46. doi:10.1109/FCCM48280.2020.00015

  31. [31]

    fpgasystems. [n. d.]. GitHub - fpgasystems/fpga-network-stack: Scal- able Network Stack for FPGAs (TCP/IP, RoCEv2). https://github.com /fpgasystems/fpga-network-stack

  32. [32]

    Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. 2024. RDMA over Ethernet for Distributed Training at Meta Scale. In Proceedings of the ACM SIGCOMM 2024 Conference, ACM S...

  33. [33]

    Anqi Guo, Yuchen Hao, Xiteng Yao, Shining Yang, Jianyu Huang, Tony (Tong) Geng, and Martin Herbordt. 2025. SmartNIC-GPU- CPU Heterogeneous System for Large Machine Learning Model with Software-Hardware Codesign. In Proceedings of the 39th ACM Interna- tional Conference on Supercomputing (ICS ’25). Association for Com- puting Machinery, New York, NY, USA, ...

  34. [34]

    Zhenhao He, Dario Korolija, and Gustavo Alonso. 2021. EasyNet: 100 Gbps Network for HLS. In 31st International Conference on Field- Programmable Logic and Applications, FPL 2021, Dresden, Germany, August 30 - Sept. 3, 2021 . IEEE, 197–203. doi: 10.1109/FPL53798.2021.00 040

  35. [35]

    Zhenhao He, Dario Korolija, Yu Zhu, Benjamin Ramhorst, Tristan Laan, Lucian Petrica, Michaela Blott, and Gustavo Alonso. 2024. ACCL+: an FPGA-Based Collective Engine for Distributed Applications. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024 , Ada Gavrilovska and Douglas B. Terry...

  36. [36]

    Maximilian Jakob Heer, Benjamin Ramhorst, Yu Zhu, Luhao Liu, Zhiyi Hu, Jonas Dann, and Gustavo Alonso. 2025. RoCE BALBOA: Service- enhanced Data Center RDMA for SmartNICs. arXiv:2507.20412 doi: 10 .48550/ARXIV.2507.20412

  37. [37]

    Underwood, Robert Alver- son, Mark Griswold, Vahid Tabatabaee, Mohan Kalkunte, Suren- dra Anubolu, Siyuan Shen, Moray McLaren, Abdul Kabbani, and Steve Scott

    Torsten Hoefler, Duncan Roweth, Keith D. Underwood, Robert Alver- son, Mark Griswold, Vahid Tabatabaee, Mohan Kalkunte, Suren- dra Anubolu, Siyuan Shen, Moray McLaren, Abdul Kabbani, and Steve Scott. 2023. Data Center Ethernet and Remote Direct Mem- ory Access: Issues at Hyperscale. Computer 56, 7 (2023), 67–77. doi:10.1109/MC.2023.3261184

  38. [38]

    Underwood, Cedell Alexander, Bob Alverson, Paul Bottorff, Adrian M

    Torsten Hoefler, Karen Schramm, Eric Spada, Keith D. Underwood, Cedell Alexander, Bob Alverson, Paul Bottorff, Adrian M. Caulfield, Mark Handley, Cathy Huang, Costin Raiciu, Abdul Kabbani, Eugene Opsasnick, Rong Pan, Adee Ran, and Rip Sohan. 2025. Ultra Ethernet’s Design Principles and Architectural Innovations. arXiv:2508.08906 doi:10.48550/ARXIV.2508.08906

  39. [39]

    Hongjing Huang, Jie Zhang, Xuzheng Chen, Ziyu Song, Jiajun Qin, and Zeke Wang. 2025. SwCC: Software-Programmable and Per-Packet Con- gestion Control in RDMA Engine. In Proceedings of the 2025 USENIX Annual Technical Conference, USENIX ATC 2025, Boston, MA, USA, July 7-9, 2025, Deniz Altinbüken and Ryan Stutsman (Eds.). USENIX Asso- ciation, 1243–1260. htt...

  40. [40]

    IEEE. 2010. IEEE Standard for Information technology–Local and met- ropolitan area networks–Specific requirements–Part 3: CSMA/CD Access Method and Physical Layer Specifications Amendment 4: Media Access Control Parameters, Physical Layers, and Manage- ment Parameters for 40 Gb/s and 100 Gb/s Operation. 457 pages. doi:10.1109/IEEESTD.2010.5501740

  41. [41]

    IEEE. 2011. IEEE Standard for Local and metropolitan area networks– Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks–Amendment 17: Priority-based Flow Control. 40 pages. doi:10.1109/IEEESTD.2011.6032693

  42. [42]

    IEEE. 2017. IEEE Standard for Ethernet - Amendment 10: Media Access Control Parameters, Physical Layers, and Management Parameters for 200 Gb/s and 400 Gb/s Operation. 416 pages. doi: 10.1109/IEEESTD.20 17.8207825

  43. [43]

    Intel. 2022. Intel Infrastructure Processing Unit (Intel IPU) E2000. https://www.intel.com/content/www/us/en/products/details/netwo rk-io/ipu.html

  44. [44]

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

  45. [45]

    Marko Kabic, Bowen Wu, Jonas Dann, and Gustavo Alonso. 2025. Powerful GPUs or Fast Interconnects: Analyzing Relational Workloads on Modern GPUs. Proc. VLDB Endow. 18, 11 (2025), 4350–4363. doi: 10 .14778/3749646.3749698

  46. [46]

    Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ran- ganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Pro- filing a warehouse-scale computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (Portland, Oregon) (ISCA ’15). Association for Computing Machinery, New York, NY, USA, 158–169. doi: 10.11...

  47. [47]

    Kfoury, Samia Choueiri, Ali Mazloum, Ali AlSabeh, Jose Gomez, and Jorge Crichigno

    Elie F. Kfoury, Samia Choueiri, Ali Mazloum, Ali AlSabeh, Jose Gomez, and Jorge Crichigno. 2024. A Comprehensive Survey on SmartNICs: Architectures, Development Models, Applications, and Research Di- rections. IEEE Access 12 (2024), 107297–107336. doi:10.1109/ACCESS.2 024.3437203

  48. [48]

    Dario Korolija, Timothy Roscoe, and Gustavo Alonso. 2020. Do OS abstractions make sense on FPGAs?. In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4-6, 2020. USENIX Association, 991–1010. https://www.us enix.org/conference/osdi20/presentation/roscoe

  49. [49]

    Hämäläinen

    Sakari Lahti and Timo D. Hämäläinen. 2025. High-Level Synthesis for FPGAs - A Hardware Engineer’s Perspective. IEEE Access 13 (2025), 28574–28593. doi: 10.1109/ACCESS.2025.3540320

  50. [50]

    Bojie Li, Kun Tan, Layong Larry Luo, Yanqing Peng, Renqian Luo, Ningyi Xu, Yongqiang Xiong, and Peng Cheng. 2016. ClickNP: Highly flexible and High-performance Network Processing with Reconfig- urable Hardware. In Proceedings of the ACM SIGCOMM 2016 Confer- ence, Florianopolis, Brazil, August 22-26, 2016 , Marinho P. Barcellos, Jon Crowcroft, Amin Vahdat,...

  51. [51]

    Jiayong Li, Jonas Dann, Zhenhao He, Gustavo Alonso, Sai Rahul Cha- lamalasetti, Dejan Milojicic, Lance Evans, Alex Veprinsky, and Runbin Shi. 2026. StreamDedup: Distributed In-line Deduplication for Disag- gregated Storage. ACM Trans. Reconfigurable Technol. Syst. (March 2026). doi: 10.1145/3799896

  52. [52]

    Junru Li, Youyou Lu, Qing Wang, Jiazhen Lin, Zhe Yang, and Jiwu Shu

  53. [53]

    In 2022 USENIX Annual Technical Conference (USENIX ATC 22)

    AlNiCo: SmartNIC-accelerated Contention-aware Request Sched- uling for Transaction Processing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 951–

  54. [54]

    https://www.usenix.org/conference/atc22/presentation/li-junru

  55. [55]

    Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Al- izadeh, and Minlan Yu. 2019. HPCC: high precision congestion control. In Proceedings of the ACM Special Interest Group on Data Communica- tion, SIGCOMM 2019, Beijing, China, August 19-23, 2019 , Jianping Wu and Wendy Hall (Eds.). ...

  56. [56]

    Will Lin, Yizhou Shan, Ryan Kosta, Arvind Krishnamurthy, and Yiying Zhang. 2024. SuperNIC: An FPGA-Based, Cloud-Oriented SmartNIC. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2024, Monterey, CA, USA, March 3-5, 2024, Zhiru Zhang and Andrew Putnam (Eds.). ACM, 130–141. doi:10.1145/3626202.3637564

  57. [57]

    Linux RDMA. 2024. perftest – RDMA Performance Tests. https: //github.com/linux-rdma/perftest. Accessed: 04/15/2026

  58. [58]

    Junyi Liu, Aleksandar Dragojević, Shane Fleming, Antonios Katsarakis, Dario Korolija, Igor Zablotchi, Ho-Cheung Ng, Anuj Kalia, and Miguel Castro. 2024. Honeycomb: Ordered Key-Value Store Acceleration on an FPGA-Based SmartNIC. IEEE Trans. Comput. 73, 3 (2024), 857–871. doi:10.1109/TC.2023.3345173

  59. [59]

    Ming Liu, Tianyi Cui, Henry Schuh, Arvind Krishnamurthy, Simon Peter, and Karan Gupta. 2019. Offloading distributed applications onto smartNICs using iPipe. InProceedings of the ACM Special Interest Group on Data Communication, SIGCOMM 2019, Beijing, China, August 19-23, 2019, Jianping Wu and Wendy Hall (Eds.). ACM, 318–333. doi:10.1145/ 3341302.3342079

  60. [60]

    Rui Ma, Evangelos Georganas, Alexander Heinecke, Sergey Gribok, Andrew Boutros, and Eriko Nurvitadhi. 2022. FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems. IEEE Computer Architecture Letters 21, 2 (2022), 49–52. doi: 10.1109/LCA.2022.3189207

  61. [61]

    MangoBoost. 2025. Mango BoostX™ Programmable DPUs. https://cdn.sanity.io/files/hx87iaks/production/ce5454fc6af423cd241b57843750527b05d29811.pdf. Accessed on 04/15/2026

  62. [62]

    YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park. 2020. AccelTCP: Accelerating Network Applications with Stateful TCP Offloading. In 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020, Santa Clara, CA, USA, February 25-27, 2020, Ranjita Bhagwan and George Porter (Eds.). USENIX Association, 77–92. htt...

  63. [63]

    Javier Moya, Matthias Gabathuler, Mario Ruiz, and Gustavo Alonso. fpgasystems/hacc: ETHZ-HACC. Zenodo. doi:10.5281/zenodo.8340448

  65. [65]

    NVIDIA. [n. d.]. GPUDirect RDMA and GPUDirect Storage — NVIDIA GPU Operator. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/25.3.1/gpu-operator-rdma.html#gpudirect-rdma-and-gpudirect-storage

  66. [66]

    NVIDIA. 2023. NVIDIA BlueField-3 DPU Data Sheet. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-3-dpu.pdf

  67. [67]

    NVIDIA Corporation. 2024. NVIDIA DOCA SDK. https://developer.nvidia.com/networking/doca Version 2.6.0, Accessed: 2026-03-24

  68. [68]

    Oracle. 2025. Oracle Unveils Next-Generation Oracle Cloud Infrastructure Zettascale10 Cluster for AI. Oracle Corporation. https://www.oracle.com/news/announcement/ai-world-oracle-unveils-next-generation-oci-zettascale10-cluster-for-ai-2025-10-14/ Retrieved March 25, 2026

  69. [69]

    Sourav Panda, Yixiao Feng, Sameer G Kulkarni, K. K. Ramakrishnan, Nick Duffield, and Laxmi N. Bhuyan. 2021. SmartWatch: accurate traffic analysis and flow-state tracking for intrusion prevention using SmartNICs. In Proceedings of the 17th International Conference on Emerging Networking EXperiments and Technologies (Virtual Event, Germany) (CoNEXT ’21). ...

  70. [70]

    Charles Papon. 2016. SpinalHDL Documentation. https://spinalhdl.github.io/SpinalDoc-RTD/master/SpinalHDL/Introduction/SpinalHDL.html. Accessed: 2025-04-15

  71. [71]

    Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair. 2023. Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware. In IEEE International Symposium on Workload Characterization, IISWC 2023, Ghent, Belgium, October 1-3, 2023. IEEE, 140–153. doi:10.1109/IISWC59245.2023.00026

  72. [72]

    Salvatore Pontarelli, Roberto Bifulco, Marco Bonola, Carmelo Cascone, Marco Spaziani Brunella, Valerio Bruschi, Davide Sanvito, Giuseppe Siracusano, Antonio Capone, Michio Honda, and Felipe Huici. 2019. FlowBlaze: Stateful Packet Processing in Hardware. In 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, Boston, MA, Februar...

  73. [73]

    Benjamin Ramhorst, Dario Korolija, Maximilian Jakob Heer, Jonas Dann, Luhao Liu, and Gustavo Alonso. 2025. Coyote v2: Raising the Level of Abstraction for Data Center FPGAs. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, SOSP 2025, Lotte Hotel World, Seoul, Republic of Korea, October 13-16, 2025, Youjip Won, Youngjin K...

  74. [74]

    Mario Ruiz, David Sidler, Gustavo Sutter, Gustavo Alonso, and Sergio López-Buedo. 2019. Limago: An FPGA-Based Open-Source 100 GbE TCP/IP Stack. In 29th International Conference on Field Programmable Logic and Applications, FPL 2019, Barcelona, Spain, September 8-12, 2019, Ioannis Sourdis, Christos-Savvas Bouganis, Carlos Álvarez, Leonel Antonio Toledo Día...

  75. [75]

    Rob Rydberg, Madison N. Emas, John Demme, Ana Ibarra, Kara Kagi, Brandon Klouchek, Abhijeet Lawande, Todd Massengill, David J. Powers, and Andrew Putnam. 2026. Hyperscale FPGA Engineering Systems at Microsoft. In Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2026, Seaside, CA, USA, February 22-24, 20...

  76. [76]

    Leah Shalev, Hani Ayoub, Nafea Bshara, and Erez Sabbag. 2020. A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC. IEEE Micro 40, 6 (2020), 67–73. doi:10.1109/MM.2020.3016891

  77. [77]

    Rajath Shashidhara, Tim Stamler, Antoine Kaufmann, and Simon Peter. 2022. FlexTOE: Flexible TCP Offload with Fine-Grained Parallelism. In 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, Renton, WA, USA, April 4-6, 2022, Amar Phanishayee and Vyas Sekar (Eds.). USENIX Association, 87–102. https://www.usenix.org/conference/nsdi22/presentation/shashidhara

  79. [79]

    David Sidler, Zeke Wang, Monica Chiosa, Amit Kulkarni, and Gustavo Alonso. 2020. StRoM: smart remote memory. In EuroSys ’20: Fifteenth EuroSys Conference 2020, Heraklion, Greece, April 27-30, 2020, Angelos Bilas, Kostas Magoutis, Evangelos P. Markatos, Dejan Kostic, and Margo I. Seltzer (Eds.). ACM, 29:1–29:16. doi:10.1145/3342195.3387519

  80. [80]

    Arjun Singhvi, Nandita Dukkipati, Prashant Chandra, Hassan M. G. Wassel, Naveen Kr. Sharma, Anthony Rebello, Henry Schuh, Praveen Kumar, Behnam Montazeri, Neelesh Bansod, Sarin Thomas, Inho Cho, Hyojeong Lee Seibert, Baijun Wu, Rui Yang, Yuliang Li, Kai Huang, Qianwen Yin, Abhishek Agarwal, Srinivas Vaduvatha, Weihuang Wang, Masoud Moshref, Tao Ji, Davi...

Showing first 80 references.