pith. machine review for the scientific record.

arxiv: 2604.15128 · v1 · submitted 2026-04-16 · 💻 cs.AR · cs.DC · cs.NI

Recognition: unknown

SCENIC: Stream Computation-Enhanced SmartNIC

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:43 UTC · model grok-4.3

classification 💻 cs.AR · cs.DC · cs.NI
keywords SmartNIC · stream computation · RDMA offload · programmable datapath · datacenter networking · TCP/IP stack · collective communication · network-to-GPU

The pith

SCENIC turns the SmartNIC datapath into a first-class stream computation substrate to match commercial 200G performance while adding programmability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Datacenter SmartNICs currently force a choice between commercial platforms that deliver high bandwidth and easy integration but limited customization, and research designs that often fall short on speed or compatibility. SCENIC closes this gap by treating the network datapath itself as a programmable stream computation substrate. It combines 200G offloaded TCP/IP and RDMA stacks with on-datapath Stream Compute Units for processing and embedded ARM cores for control, plus direct GPU and SSD access. The design stays fully compatible with existing Linux network and RDMA interfaces, so applications see no change yet can invoke custom offloads. A sympathetic reader would care because the result shows that meaningful programmability and control can be added without the latency or bandwidth penalties that usually accompany research prototypes.

Core claim

SCENIC implements a 200G network datapath over offloaded TCP/IP and RDMA stacks, together with a fallback path for arbitrary traffic. On top of this logic it places Stream Compute Units for data processing and embedded ARM cores for flexible control-path manipulation, while exposing native Linux network and RDMA verb interfaces so that the programmable elements remain transparent to unmodified applications.
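"Transparent to unmodified applications" means code like the following ordinary Linux socket program, which contains nothing SCENIC-specific, is exactly what the design claims to serve unchanged (loopback sketch; a real deployment would simply bind the NIC's address):

```python
# An unmodified socket application: no SCENIC-specific calls anywhere.
# The paper's claim is that such code runs over SCENIC's offloaded or
# fallback paths as-is; this sketch only shows the vanilla API surface.
import socket
import threading

def echo_server(sock: socket.socket) -> None:
    conn, _ = sock.accept()
    with conn:
        conn.sendall(conn.recv(1024))  # echo the payload back

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # loopback here; a deployment binds the NIC address
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=echo_server, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"ping")
reply = client.recv(1024)
client.close()
t.join()
server.close()
print(reply)  # b'ping'
```

The same argument applies on the RDMA side: SCENIC registers as an ibv_device, so existing IB Verbs applications discover and use it through the standard libibverbs calls.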

What carries the argument

Stream Compute Units (SCUs) placed directly on the network datapath, paired with embedded ARM cores and shared hardware/software abstractions that allow tight co-design of infrastructure and applications.

Load-bearing premise

Adding Stream Compute Units and ARM cores to the high-speed datapath can deliver useful programmability and control without creating latency or bandwidth penalties that would keep the device from matching commercial SmartNIC performance.
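To see why this premise is demanding, a back-of-envelope calculation (my arithmetic, not from the paper) gives the per-packet time budget any on-datapath compute unit must meet to sustain 200 Gb/s:

```python
# Per-packet processing budget at line rate (illustrative arithmetic only).
LINE_RATE_BPS = 200e9  # 200 Gb/s

def packet_budget_ns(frame_bytes: int, rate_bps: float = LINE_RATE_BPS) -> float:
    """Time one frame occupies the wire: the budget for any per-packet work."""
    return frame_bytes * 8 / rate_bps * 1e9

budget_1500 = packet_budget_ns(1500)  # 60.0 ns for a full-size Ethernet frame
budget_64 = packet_budget_ns(64)      # 2.56 ns for a minimum-size frame
print(f"{budget_1500:.2f} ns, {budget_64:.2f} ns")
```

At 60 ns per full-size frame, and far less for small packets, any added stage that cannot process a frame within its wire time becomes the bottleneck, which is why the SCUs must sit in the pipeline rather than on a detour through the ARM cores.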

What would settle it

A direct benchmark comparison in which SCENIC's measured latency or sustained bandwidth on standard RDMA or TCP workloads falls measurably below a commercial 200G SmartNIC, or in which the described use cases (offloaded collectives or network-to-GPU partitioning) cannot run at line rate.
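For concreteness, the network-to-GPU partitioning use case amounts to the following routing step, here modeled in software (the hash function and partition count are illustrative choices, not SCENIC's; the paper performs this step in datapath hardware):

```python
# Software model of hash-based data partitioning: route each record to a
# partition (e.g. one GPU buffer) by key hash. Hash and partition count
# are illustrative assumptions, not taken from the paper.
import zlib

def partition(records: list[bytes], n_partitions: int) -> list[list[bytes]]:
    """Assign every record to exactly one of n_partitions buckets."""
    out: list[list[bytes]] = [[] for _ in range(n_partitions)]
    for rec in records:
        out[zlib.crc32(rec) % n_partitions].append(rec)
    return out

parts = partition([b"alpha", b"beta", b"gamma", b"delta"], 4)
assert sum(len(p) for p in parts) == 4  # every record lands in exactly one bucket
```

The line-rate question is whether this per-record hash-and-route step keeps up with incoming traffic when executed on the datapath instead of the host CPU.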

Figures

Figures reproduced from arXiv: 2604.15128 by Benjamin Ramhorst, Gustavo Alonso, Heejae Kim, Jin-Soo Kim, Jonas Dann, Luhao Liu, Maximilian Jakob Heer.

Figure 1. Overview of SCENIC with two example offloads: hash-based network-to-GPU data partitioning (Section 9.2) and hybrid flow monitoring (Section 6.2).
Figure 2. Programmable congestion control in SCENIC.
Figure 3. DMA packet forwarding to the network driver.
Figure 4. Performance evaluation of the fallback path. Left: ping latency. Right: iperf3 throughput. Hybrid refers to Mellanox-to-SCENIC communication.
Figure 5. RDMA performance benchmark in a fully switched datacenter network.
Figure 6. Throughput of SCENIC to GPU with RDMA READs and RDMA WRITEs.
Figure 8. Time series of bandwidth sharing scaling up to four parallel flows performing 128 KiB RDMA READs through separated SCUs.
Figure 9. Comparison of BROADCAST and GATHER collectives on SCENIC with OpenMPI on a commercial NIC.
Figure 10. Performance of hash partitioning on the CPU (B: Baseline, 1 and 16 threads) and offloaded with SCENIC.
read the original abstract

Although modern, AI-centric datacenters heavily rely on SmartNICs, existing devices impose a hard trade-off. Commercial SmartNICs provide high bandwidth and easy software integration, but offer limited support for customization and data processing offload. In contrast, research SmartNICs often suffer from low bandwidth, limited functionality, and poor software compatibility -- to the point that many are not actual NICs in a technical sense. This gap can be closed by treating the NIC datapath as a first-class stream computation substrate with shared hardware/software abstractions for a tight co-design of infrastructure and applications. To demonstrate this, we introduce SCENIC, an open-source datacenter SmartNIC. SCENIC implements a 200G network datapath over offloaded TCP/IP and RDMA stacks, as well as a fallback path for processing arbitrary network traffic. On top of the network logic, SCENIC combines on-datapath Stream Compute Units (SCUs) for data processing and embedded ARM cores for flexible control path manipulation with direct access to GPUs and SSDs. SCENIC is fully integrated with the OS, exposing native Linux network and RDMA verb interfaces, making the programmable datapath transparent to existing applications while enabling control of, e.g., user-defined offloads and programmable congestion control. SCENIC's performance matches commercial platforms, and we show its versatility through several use cases such as offloaded collective communication and network-to-GPU hash-based data partitioning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SCENIC, an open-source datacenter SmartNIC that implements a 200G network datapath over offloaded TCP/IP and RDMA stacks together with a fallback path for arbitrary traffic. It augments the datapath with on-datapath Stream Compute Units (SCUs) for data processing and embedded ARM cores for control, while providing direct GPU/SSD access and full integration with the Linux OS and RDMA verb interfaces. The central claims are that this design achieves performance parity with commercial platforms and enables versatile use cases such as offloaded collective communication and network-to-GPU hash-based data partitioning without sacrificing bandwidth or latency.

Significance. If the zero-overhead integration of SCUs and ARM cores is demonstrated, the work would meaningfully close the gap between high-bandwidth commercial SmartNICs and customizable research platforms by delivering a fully OS-compatible, programmable datapath. The open-source prototype, native Linux/RDMA compatibility, and multiple concrete use cases are positive attributes that could support broader adoption and further co-design research.

major comments (2)
  1. [Abstract] Abstract: The claim that 'SCENIC's performance matches commercial platforms' is load-bearing for the contribution yet is presented without any quantitative throughput, latency, or bandwidth measurements, error bars, baseline comparisons (e.g., to BlueField-class devices), or ablation data isolating the impact of the added SCUs and ARM cores versus pure bypass paths.
  2. [Use cases] Use-case demonstrations: The offloaded collective communication and hash-based GPU partitioning examples are described at a high level but lack performance numbers, comparisons against non-programmable baselines, or stress-test results on the programmable paths and fallback route under realistic 200G workloads, leaving the 'no-penalty' assumption unverified.
minor comments (1)
  1. [Abstract] Abstract: The acronym 'SCU' is introduced without a short definition or pointer to the stream-computation literature that motivates the hardware abstraction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of SCENIC's potential impact. We address each major comment below with point-by-point responses and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'SCENIC's performance matches commercial platforms' is load-bearing for the contribution yet is presented without any quantitative throughput, latency, or bandwidth measurements, error bars, baseline comparisons (e.g., to BlueField-class devices), or ablation data isolating the impact of the added SCUs and ARM cores versus pure bypass paths.

    Authors: We agree that the abstract would be strengthened by including a concise summary of the key quantitative results that appear in the evaluation section. The manuscript already contains throughput and latency measurements at 200 Gbps, direct comparisons to BlueField-class devices, error bars from repeated runs, and ablations isolating SCU/ARM overhead on the bypass path. We will revise the abstract to incorporate these supporting metrics and comparisons so the performance claim is substantiated at the abstract level as well. revision: yes

  2. Referee: [Use cases] Use-case demonstrations: The offloaded collective communication and hash-based GPU partitioning examples are described at a high level but lack performance numbers, comparisons against non-programmable baselines, or stress-test results on the programmable paths and fallback route under realistic 200G workloads, leaving the 'no-penalty' assumption unverified.

    Authors: The quantitative results for both use cases, including comparisons to non-programmable baselines and stress tests under full 200 Gbps load on programmable and fallback paths, are presented in the evaluation section. We acknowledge that the use-case descriptions themselves remain somewhat high-level and do not sufficiently cross-reference or restate these numbers. We will revise the use-case subsections to explicitly include the relevant performance figures, baseline comparisons, and workload results, thereby directly verifying the no-penalty claim in context. revision: yes

Circularity Check

0 steps flagged

No circularity detected in SCENIC implementation claims

full rationale

The paper presents an implemented hardware/software prototype for a 200G SmartNIC with on-datapath SCUs and ARM cores, offloaded TCP/IP/RDMA stacks, and OS integration. Central claims of matching commercial performance and versatility in use cases rest on the physical prototype, fallback paths, and demonstrated applications rather than any equations, fitted parameters, or derivations that reduce to their own inputs by construction. No self-definitional steps, uniqueness theorems, or ansatz smuggling via self-citation appear in the provided text; the design is described as a co-design substrate without load-bearing self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard domain assumptions about the feasibility of 200G networking hardware and RDMA offload; no free parameters, invented physical entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption Existing hardware primitives for 200G networking and RDMA offload are sufficient to support the described performance and integration.
    The claims of matching commercial performance presuppose that the underlying silicon and protocol stacks behave as expected when augmented with SCUs and ARM cores.

pith-pipeline@v0.9.0 · 5580 in / 1438 out tokens · 36844 ms · 2026-05-10T09:43:46.936431+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

101 extracted references · 46 canonical work pages · 1 internal anchor

  1. [1]

    Andersen, Peter Bailis, Magdalena Balazinska, Philip A

    Daniel Abadi, Anastasia Ailamaki, David G. Andersen, Peter Bailis, Magdalena Balazinska, Philip A. Bernstein, Peter Boncz, Surajit Chaud- huri, Alvin Cheung, AnHai Doan, Luna Dong, Michael J. Franklin, Juliana Freire, Alon Y. Halevy, Joseph M. Hellerstein, Stratos Idreos, Donald Kossmann, Tim Kraska, Sailesh Krishnamurthy, Volker Markl, Sergey Melnik, Tov...

  2. [2]

    Advanced Micro Devices, Inc. 2024. AMD Pensando Software-in-Silicon Development Kit (SSDK). https://www.amd.com/content/dam/amd/ en/documents/pensando-technical-docs/product-briefs/pensando- ssdk-product-brief.pdf

  3. [3]

    Papailiopoulos

    Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, and Dim- itris S. Papailiopoulos. 2022. On the Utility of Gradient Compression in Distributed Training Systems. (2022). https://proceedings.mlsys.or g/paper_files/paper/2022/hash/773862fcc2e29f650d68960ba5bd1101- Abstract.html

  4. [4]

    Olasupo Ajayi and Ryan Grant. 2025. A Chronological Analysis of the Evolution of SmartNICs. CoRR abs/2512.04054 (2025). arXiv:2512.04054 doi:10.48550/ARXIV.2512.04054

  5. [5]

    Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. 2008. A scalable, commodity data center network architecture. InProceedings of the ACM SIGCOMM 2008 Conference on Data Communication (Seattle, WA, USA) (SIGCOMM ’08). Association for Computing Machinery, New York, NY, USA, 63–74. doi: 10.1145/1402958.1402967

  6. [6]

    Amazon Web Services. 2022. The Components of the Nitro System (The Security Design of the A WS Nitro System Whitepaper). Technical Report. Amazon Web Services. https://docs.aws.amazon.com/whitepapers/la test/security-design-of-aws-nitro-system/the-components-of-the- nitro-system.html Accessed: 2026-04-15

  7. [7]

    AMD. 2025. DMA/Bridge Subsystem for PCI Express Product Guide (PG195). https://docs.amd.com/r/en-US/pg195-pcie-dma

  8. [8]

    AMD. 2025. Versal Adaptive SoC 600G Channelized Multirate Ethernet Subsystem (DCMAC) LogiCORE IP Product Guide (PG369). https: //docs.amd.com/r/en-US/pg369-dcmac/Introduction

  9. [9]

    AMD. 2025. Versal Adaptive SoC CPM DMA and Bridge Mode for PCI Express v3.4. https://docs.amd.com/r/en-US/pg347-cpm-dma- bridge?tocId=oTd_ZrdYcOWw7fqmc3hb9g

  10. [10]

    AMD. 2025. Vitis Networking P4. https://docs.amd.com/r/en- US/ug1308-vitis-p4-user-guide

  11. [11]

    AMD Pensando. 2022. AMD Pensando Elba DPU (DSC-200) Product Overview. https://www.amd.com/en/products/data-processing- units/pensando.html

  12. [12]

    AMD/Xilinx. 2021. OpenNIC: An Open-Source NIC Shell for Alveo FPGAs. GitHub. https://github.com/Xilinx/open-nic

  13. [13]

    Kyle Aubrey and Farshad Ghodsian. 2026. Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform. NVIDIA Technical Blog. https://developer.nvidia.com/blo g/inside-nvidia-groq-3-lpx-the-low-latency-inference-accelerator- for-the-nvidia-vera-rubin-platform/ Accessed: 2026-03-28

  14. [14]

    John Bachan, Kaiming Ouyang, Misbah Mubarak, Thomas Gillis, Bruce Chang, Devendar Bureddy, Giuseppe Congiu, Keith Caton, Kyle Aubrey, and Xiaofan Li. 2025. Enabling Fast Inference and Resilient Training with NCCL 2.27. https://developer.nvidia.com/blog/enabling- fast-inference-and-resilient-training-with-nccl-2-27/

  15. [15]

    Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin, Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert G. Greenberg, Manish Gupta, Randy Haagens, Matthew Hendel...

  16. [16]

    Tommaso Bonato, Abdul Kabbani, Ahmad Ghalayini, Anup Agarwal, Daniele De Sensi, Rong Pan, Costin Raiciu, Mark Handley, Mihai Brodschi, Timo Schneider, Nils Blach, Daniel Santos Ferreira Alves, and Torsten Hoefler. 2026. SMaRTT: Sender-based Marked Rapidly- adapting Trimmed & Timed Transport. (2026). arXiv:2404.01630 [cs.NI] https://arxiv.org/abs/2404.01630

  17. [17]

    Broadcom. 2019. Broadcom Stingray PS225 Dual-Port 25GbE PCIe Ethernet SmartNIC Data Sheet. https://www.broadcom.com/compa ny/news/product-releases/53106

  18. [18]

    Marco Spaziani Brunella, Giacomo Belocchi, Marco Bonola, Salva- tore Pontarelli, Giuseppe Siracusano, Giuseppe Bianchi, Aniello Cam- marano, Alessandro Palumbo, Luca Petrucci, and Roberto Bifulco. 2020. hXDP: Efficient Software Packet Processing on FPGA NICs. In 14th USENIX Symposium on Operating Systems Design and Implementation, 12 OSDI 2020, Virtual Ev...

  19. [19]

    https://www.usenix.org/conference/osdi20/presentation/brunella

  20. [20]

    Xuzheng Chen, Jie Zhang, Ting Fu, Yifan Shen, Shu Ma, Kun Qian, Lingjun Zhu, Chao Shi, Yin Zhang, Ming Liu, and Zeke Wang. 2024. Demystifying Datapath Accelerator Enhanced Off-path SmartNIC. In 32nd IEEE International Conference on Network Protocols, ICNP 2024, Charleroi, Belgium, October 28-31, 2024 . IEEE, 1–12. doi: 10.1109/ICNP 61940.2024.10858560

  21. [21]

    Alibaba Cloud Community. 2022. A Detailed Explanation about Al- ibaba Cloud CIPU. https://www.alibabacloud.com/blog/a-detailed- explanation-about-alibaba-cloud-cipu_599183

  22. [22]

    Dan Daly, Jakub Kicinski, and Willem de Bruijn. 2023. OCP NIC Core Features Specification, Version 1.0. Technical Specification. Open Compute Project (OCP). https://www.opencompute.org/document s/ocp-server-nic-core-features-specification-ocp-spec-format-1-1- pdf Accessed: 2026-03-23

  23. [23]

    Jonas Dann and Gustavo Alonso. 2026. Should I Hide My Duck in the Lake? CoRR abs/2602.18775 (2026). doi:10.48550/ARXIV.2602.18775

  24. [24]

    Jonas Dann, Royden Wagner, Daniel Ritter, Christian Faerber, and Holger Fröning. 2022. PipeJSON: Parsing JSON at Line Speed on FPGAs. In International Conference on Management of Data, DaMoN 2022, Philadelphia, PA, USA, 13 June 2022 , Spyros Blanas and Norman May (Eds.). ACM, 3:1–3:7. doi:10.1145/3533737.3535094

  25. [25]

    Tristan Döring, Henning Stubbe, and Kilian Holzinger. 2021. Smart- NICs: Current Trends in Research and Industry . Technical Report NET- 2021-05-1. Chair of Network Architectures and Services, Department of Informatics, Technical University of Munich. https://www.net.in.t um.de/fileadmin/TUM/NET/NET-2021-05-1/NET-2021-05-1_05.pdf

  26. [26]

    Kfoury, Jose Gomez, and Jorge Crichigno

    Sergio Elizalde, Ali AlSabeh, Ali Mazloum, Samia Choueiri, Elie F. Kfoury, Jose Gomez, and Jorge Crichigno. 2025. A survey on security applications with SmartNICs: Taxonomy, implementations, challenges, and future trends. J. Netw. Comput. Appl. 242 (2025), 104257. doi:10.1 016/J.JNCA.2025.104257

  27. [27]

    Caulfield, Eric S

    Daniel Firestone, Andrew Putnam, Sambrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian M. Caulfield, Eric S. Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Sil...

  28. [28]

    In 15th USENIX Symposium on Networked Systems Design and Im- plementation, NSDI 2018, Renton, W A, USA, April 9-11, 2018 , Sujata Banerjee and Srinivasan Seshan (Eds.)

    Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Im- plementation, NSDI 2018, Renton, W A, USA, April 9-11, 2018 , Sujata Banerjee and Srinivasan Seshan (Eds.). USENIX Association, 51–66. https://www.usenix.org/conference/nsdi18/presentation/firestone

  29. [29]

    Snoeren, George Porter, and George Papen

    Alex Forencich, Alex C. Snoeren, George Porter, and George Papen

  30. [30]

    In 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2020, Fayetteville, AR, USA, May 3-6, 2020

    Corundum: An Open-Source 100-Gbps Nic. In 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2020, Fayetteville, AR, USA, May 3-6, 2020. IEEE, 38–46. doi:10.1109/FCCM48280.2020.00015

  31. [31]

    fpgasystems. [n. d.]. GitHub - fpgasystems/fpga-network-stack: Scal- able Network Stack for FPGAs (TCP/IP, RoCEv2). https://github.com /fpgasystems/fpga-network-stack

  32. [32]

    Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. 2024. RDMA over Ethernet for Distributed Training at Meta Scale. In Proceedings of the ACM SIGCOMM 2024 Conference, ACM S...

  33. [33]

    Anqi Guo, Yuchen Hao, Xiteng Yao, Shining Yang, Jianyu Huang, Tony (Tong) Geng, and Martin Herbordt. 2025. SmartNIC-GPU- CPU Heterogeneous System for Large Machine Learning Model with Software-Hardware Codesign. In Proceedings of the 39th ACM Interna- tional Conference on Supercomputing (ICS ’25). Association for Com- puting Machinery, New York, NY, USA, ...

  34. [34]

    Zhenhao He, Dario Korolija, and Gustavo Alonso. 2021. EasyNet: 100 Gbps Network for HLS. In 31st International Conference on Field- Programmable Logic and Applications, FPL 2021, Dresden, Germany, August 30 - Sept. 3, 2021 . IEEE, 197–203. doi: 10.1109/FPL53798.2021.00 040

  35. [35]

    Zhenhao He, Dario Korolija, Yu Zhu, Benjamin Ramhorst, Tristan Laan, Lucian Petrica, Michaela Blott, and Gustavo Alonso. 2024. ACCL+: an FPGA-Based Collective Engine for Distributed Applications. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024 , Ada Gavrilovska and Douglas B. Terry...

  36. [36]

    Maximilian Jakob Heer, Benjamin Ramhorst, Yu Zhu, Luhao Liu, Zhiyi Hu, Jonas Dann, and Gustavo Alonso. 2025. RoCE BALBOA: Service- enhanced Data Center RDMA for SmartNICs. arXiv:2507.20412 doi: 10 .48550/ARXIV.2507.20412

  37. [37]

    Underwood, Robert Alver- son, Mark Griswold, Vahid Tabatabaee, Mohan Kalkunte, Suren- dra Anubolu, Siyuan Shen, Moray McLaren, Abdul Kabbani, and Steve Scott

    Torsten Hoefler, Duncan Roweth, Keith D. Underwood, Robert Alver- son, Mark Griswold, Vahid Tabatabaee, Mohan Kalkunte, Suren- dra Anubolu, Siyuan Shen, Moray McLaren, Abdul Kabbani, and Steve Scott. 2023. Data Center Ethernet and Remote Direct Mem- ory Access: Issues at Hyperscale. Computer 56, 7 (2023), 67–77. doi:10.1109/MC.2023.3261184

  38. [38]

    Underwood, Cedell Alexander, Bob Alverson, Paul Bottorff, Adrian M

    Torsten Hoefler, Karen Schramm, Eric Spada, Keith D. Underwood, Cedell Alexander, Bob Alverson, Paul Bottorff, Adrian M. Caulfield, Mark Handley, Cathy Huang, Costin Raiciu, Abdul Kabbani, Eugene Opsasnick, Rong Pan, Adee Ran, and Rip Sohan. 2025. Ultra Ethernet’s Design Principles and Architectural Innovations. arXiv:2508.08906 doi:10.48550/ARXIV.2508.08906

  39. [39]

    Hongjing Huang, Jie Zhang, Xuzheng Chen, Ziyu Song, Jiajun Qin, and Zeke Wang. 2025. SwCC: Software-Programmable and Per-Packet Con- gestion Control in RDMA Engine. In Proceedings of the 2025 USENIX Annual Technical Conference, USENIX ATC 2025, Boston, MA, USA, July 7-9, 2025, Deniz Altinbüken and Ryan Stutsman (Eds.). USENIX Asso- ciation, 1243–1260. htt...

  40. [40]

    IEEE. 2010. IEEE Standard for Information technology–Local and met- ropolitan area networks–Specific requirements–Part 3: CSMA/CD Access Method and Physical Layer Specifications Amendment 4: Media Access Control Parameters, Physical Layers, and Manage- ment Parameters for 40 Gb/s and 100 Gb/s Operation. 457 pages. doi:10.1109/IEEESTD.2010.5501740

  41. [41]

    IEEE. 2011. IEEE Standard for Local and metropolitan area networks– Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks–Amendment 17: Priority-based Flow Control. 40 pages. doi:10.1109/IEEESTD.2011.6032693

  42. [42]

    IEEE. 2017. IEEE Standard for Ethernet - Amendment 10: Media Access Control Parameters, Physical Layers, and Management Parameters for 200 Gb/s and 400 Gb/s Operation. 416 pages. doi: 10.1109/IEEESTD.20 17.8207825

  43. [43]

    Intel. 2022. Intel Infrastructure Processing Unit (Intel IPU) E2000. https://www.intel.com/content/www/us/en/products/details/netwo rk-io/ipu.html

  44. [44]

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

  45. [45]

    Marko Kabic, Bowen Wu, Jonas Dann, and Gustavo Alonso. 2025. Powerful GPUs or Fast Interconnects: Analyzing Relational Workloads on Modern GPUs. Proc. VLDB Endow. 18, 11 (2025), 4350–4363. doi: 10 .14778/3749646.3749698

  46. [46]

    Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ran- ganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Pro- filing a warehouse-scale computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (Portland, Oregon) (ISCA ’15). Association for Computing Machinery, New York, NY, USA, 158–169. doi: 10.11...

  47. [47]

    Kfoury, Samia Choueiri, Ali Mazloum, Ali AlSabeh, Jose Gomez, and Jorge Crichigno

    Elie F. Kfoury, Samia Choueiri, Ali Mazloum, Ali AlSabeh, Jose Gomez, and Jorge Crichigno. 2024. A Comprehensive Survey on SmartNICs: Architectures, Development Models, Applications, and Research Di- rections. IEEE Access 12 (2024), 107297–107336. doi:10.1109/ACCESS.2 024.3437203

  48. [48]

    Dario Korolija, Timothy Roscoe, and Gustavo Alonso. 2020. Do OS abstractions make sense on FPGAs?. In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4-6, 2020. USENIX Association, 991–1010. https://www.us enix.org/conference/osdi20/presentation/roscoe

  49. [49]

    Hämäläinen

    Sakari Lahti and Timo D. Hämäläinen. 2025. High-Level Synthesis for FPGAs - A Hardware Engineer’s Perspective. IEEE Access 13 (2025), 28574–28593. doi: 10.1109/ACCESS.2025.3540320

  50. [50]

    Bojie Li, Kun Tan, Layong Larry Luo, Yanqing Peng, Renqian Luo, Ningyi Xu, Yongqiang Xiong, and Peng Cheng. 2016. ClickNP: Highly flexible and High-performance Network Processing with Reconfig- urable Hardware. In Proceedings of the ACM SIGCOMM 2016 Confer- ence, Florianopolis, Brazil, August 22-26, 2016 , Marinho P. Barcellos, Jon Crowcroft, Amin Vahdat,...

  51. [51]

    Jiayong Li, Jonas Dann, Zhenhao He, Gustavo Alonso, Sai Rahul Cha- lamalasetti, Dejan Milojicic, Lance Evans, Alex Veprinsky, and Runbin Shi. 2026. StreamDedup: Distributed In-line Deduplication for Disag- gregated Storage. ACM Trans. Reconfigurable Technol. Syst. (March 2026). doi: 10.1145/3799896

  52. [52]

    Junru Li, Youyou Lu, Qing Wang, Jiazhen Lin, Zhe Yang, and Jiwu Shu

  53. [53]

    In 2022 USENIX Annual Technical Conference (USENIX ATC 22)

    AlNiCo: SmartNIC-accelerated Contention-aware Request Sched- uling for Transaction Processing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 951–

  54. [54]

    https://www.usenix.org/conference/atc22/presentation/li-junru

  55. [55]

    Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Al- izadeh, and Minlan Yu. 2019. HPCC: high precision congestion control. In Proceedings of the ACM Special Interest Group on Data Communica- tion, SIGCOMM 2019, Beijing, China, August 19-23, 2019 , Jianping Wu and Wendy Hall (Eds.). ...

  56. [56]

    Will Lin, Yizhou Shan, Ryan Kosta, Arvind Krishnamurthy, and Yiying Zhang. 2024. SuperNIC: An FPGA-Based, Cloud-Oriented SmartNIC. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2024, Monterey, CA, USA, March 3-5, 2024, Zhiru Zhang and Andrew Putnam (Eds.). ACM, 130–141. doi:10.1145/3626202.3637564

  57. [57]

    Linux RDMA. 2024. perftest – RDMA Performance Tests. https: //github.com/linux-rdma/perftest. Accessed: 04/15/2026

  58. [58]

    Junyi Liu, Aleksandar Dragojević, Shane Fleming, Antonios Katsarakis, Dario Korolija, Igor Zablotchi, Ho-Cheung Ng, Anuj Kalia, and Miguel Castro. 2024. Honeycomb: Ordered Key-Value Store Acceleration on an FPGA-Based SmartNIC. IEEE Trans. Comput. 73, 3 (2024), 857–871. doi:10.1109/TC.2023.3345173

  59. [59]

    Ming Liu, Tianyi Cui, Henry Schuh, Arvind Krishnamurthy, Simon Peter, and Karan Gupta. 2019. Offloading distributed applications onto smartNICs using iPipe. InProceedings of the ACM Special Interest Group on Data Communication, SIGCOMM 2019, Beijing, China, August 19-23, 2019, Jianping Wu and Wendy Hall (Eds.). ACM, 318–333. doi:10.1145/ 3341302.3342079

  60. [60]

    Rui Ma, Evangelos Georganas, Alexander Heinecke, Sergey Gribok, Andrew Boutros, and Eriko Nurvitadhi. 2022. FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems. IEEE Computer Architecture Letters 21, 2 (2022), 49–52. doi: 10.1109/LCA.2022.3189207

  61. [61]

    MangoBoost. 2025. Mango BoostX™ Programmable DPUs. https://cdn.sanity.io/files/hx87iaks/production/ce5454fc6af423cd241b57843750527b05d29811.pdf. Accessed on 04/15/2026

  62. [62]

    YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park. 2020. AccelTCP: Accelerating Network Applications with Stateful TCP Offloading. In 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020, Santa Clara, CA, USA, February 25-27, 2020, Ranjita Bhagwan and George Porter (Eds.). USENIX Association, 77–92. htt...

  63. [63]

    Javier Moya, Matthias Gabathuler, Mario Ruiz, and Gustavo Alonso. fpgasystems/hacc: ETHZ-HACC. Zenodo. doi:10.5281/zenodo.8340448

  65. [65]

    NVIDIA. [n. d.]. GPUDirect RDMA and GPUDirect Storage — NVIDIA GPU Operator. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/25.3.1/gpu-operator-rdma.html#gpudirect-rdma-and-gpudirect-storage

  66. [66]

    NVIDIA. 2023. NVIDIA BlueField-3 DPU Data Sheet. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-3-dpu.pdf

  67. [67]

    NVIDIA Corporation. 2024. NVIDIA DOCA SDK. https://developer.nvidia.com/networking/doca Version 2.6.0, Accessed: 2026-03-24

  68. [68]

    Oracle. 2025. Oracle Unveils Next-Generation Oracle Cloud Infrastructure Zettascale10 Cluster for AI. Oracle Corporation. https://www.oracle.com/news/announcement/ai-world-oracle-unveils-next-generation-oci-zettascale10-cluster-for-ai-2025-10-14/ Retrieved March 25, 2026

  69. [69]

    Sourav Panda, Yixiao Feng, Sameer G Kulkarni, K. K. Ramakrishnan, Nick Duffield, and Laxmi N. Bhuyan. 2021. SmartWatch: accurate traffic analysis and flow-state tracking for intrusion prevention using SmartNICs. In Proceedings of the 17th International Conference on Emerging Networking EXperiments and Technologies (Virtual Event, Germany) (CoNEXT ’21). ...

  70. [70]

    Charles Papon. 2016. SpinalHDL Documentation. https://spinalhdl.github.io/SpinalDoc-RTD/master/SpinalHDL/Introduction/SpinalHDL.html. Accessed: 2025-04-15

  71. [71]

    Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair. 2023. Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware. In IEEE International Symposium on Workload Characterization, IISWC 2023, Ghent, Belgium, October 1-3, 2023. IEEE, 140–153. doi:10.1109/IISWC59245.2023.00026

  72. [72]

    Salvatore Pontarelli, Roberto Bifulco, Marco Bonola, Carmelo Cascone, Marco Spaziani Brunella, Valerio Bruschi, Davide Sanvito, Giuseppe Siracusano, Antonio Capone, Michio Honda, and Felipe Huici. 2019. FlowBlaze: Stateful Packet Processing in Hardware. In 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, Boston, MA, Februar...

  73. [73]

    Benjamin Ramhorst, Dario Korolija, Maximilian Jakob Heer, Jonas Dann, Luhao Liu, and Gustavo Alonso. 2025. Coyote v2: Raising the Level of Abstraction for Data Center FPGAs. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, SOSP 2025, Lotte Hotel World, Seoul, Republic of Korea, October 13-16, 2025, Youjip Won, Youngjin K...

  74. [74]

    Mario Ruiz, David Sidler, Gustavo Sutter, Gustavo Alonso, and Sergio López-Buedo. 2019. Limago: An FPGA-Based Open-Source 100 GbE TCP/IP Stack. In 29th International Conference on Field Programmable Logic and Applications, FPL 2019, Barcelona, Spain, September 8-12, 2019, Ioannis Sourdis, Christos-Savvas Bouganis, Carlos Álvarez, Leonel Antonio Toledo Día...

  75. [75]

    Rob Rydberg, Madison N. Emas, John Demme, Ana Ibarra, Kara Kagi, Brandon Klouchek, Abhijeet Lawande, Todd Massengill, David J. Powers, and Andrew Putnam. 2026. Hyperscale FPGA Engineering Systems at Microsoft. In Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2026, Seaside, CA, USA, February 22-24, 20...

  76. [76]

    Leah Shalev, Hani Ayoub, Nafea Bshara, and Erez Sabbag. 2020. A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC. IEEE Micro 40, 6 (2020), 67–73. doi:10.1109/MM.2020.3016891

  77. [77]

    Rajath Shashidhara, Tim Stamler, Antoine Kaufmann, and Simon Peter. 2022. FlexTOE: Flexible TCP Offload with Fine-Grained Parallelism. In 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, Renton, WA, USA, April 4-6, 2022, Amar Phanishayee and Vyas Sekar (Eds.). USENIX Association, 87–102. https://www.usenix.org/conference/nsdi22/presentation/shashidhara

  79. [79]

    David Sidler, Zeke Wang, Monica Chiosa, Amit Kulkarni, and Gustavo Alonso. 2020. StRoM: smart remote memory. In EuroSys ’20: Fifteenth EuroSys Conference 2020, Heraklion, Greece, April 27-30, 2020, Angelos Bilas, Kostas Magoutis, Evangelos P. Markatos, Dejan Kostic, and Margo I. Seltzer (Eds.). ACM, 29:1–29:16. doi:10.1145/3342195.3387519

  80. [80]

    Arjun Singhvi, Nandita Dukkipati, Prashant Chandra, Hassan M. G. Wassel, Naveen Kr. Sharma, Anthony Rebello, Henry Schuh, Praveen Kumar, Behnam Montazeri, Neelesh Bansod, Sarin Thomas, Inho Cho, Hyojeong Lee Seibert, Baijun Wu, Rui Yang, Yuliang Li, Kai Huang, Qianwen Yin, Abhishek Agarwal, Srinivas Vaduvatha, Weihuang Wang, Masoud Moshref, Tao Ji, Davi...

Showing first 80 references.