OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

Bojie Li

arxiv: 2605.28717 · v1 · pith:JDK75LSLnew · submitted 2026-05-27 · 💻 cs.AI · cs.AR · cs.NI


Pith reviewed 2026-06-29 12:03 UTC · model grok-4.3

classification 💻 cs.AI cs.AR cs.NI

keywords OpenURMA, Unified Bus, RDMA, RoCE, remote memory access, datacenter networking, FPGA implementation, latency


The pith

OpenURMA's clean-room implementation of the Unified Bus protocol achieves ~500 ns end-to-end latency on 64-byte remote fetches, 4.37 times lower than a matched RoCE baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenURMA as the first open implementation of Huawei's Unified Bus (UB) specification for datacenter RDMA. UB decouples per-application endpoint state from per-host transport state and routes remote accesses through native CPU load/store to an on-chip controller instead of Queue Pair abstractions. This design change is shown to eliminate per-connection state bloat and multiple PCIe traversals. The work realizes UB at three tiers—synthesisable RTL on Alveo U50, cycle-level SystemC, and gem5—each paired with an OpenRoCE baseline for direct comparison. On the canonical 64-byte LOAD/READ operation the UB path records ~500 ns latency, 2.80 times higher throughput, and ~14 percent LUT occupancy.

Core claim

The central claim is that a faithful three-tier open realization of the public UB specification delivers a load/store remote-fetch path with ~500 ns end-to-end latency on the canonical 64-byte operation, 4.37 times below the matched OpenRoCE baseline of 2186 ns, while sustaining 2.80 times higher throughput and occupying only ~14 percent of a U50's LUTs.
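As a quick consistency check on the figures quoted in this claim, the sketch below recomputes the UB latency implied by the 2186 ns baseline and the 4.37× ratio. Only those two numbers come from the text; the function name is illustrative and not part of OpenURMA.

```python
def implied_latency_ns(baseline_ns: float, times_lower: float) -> float:
    """Latency implied by a baseline latency and a 'times lower' ratio."""
    return baseline_ns / times_lower


# 2186 ns (OpenRoCE baseline) and 4.37x are the values quoted in the claim.
ub_ns = implied_latency_ns(2186.0, 4.37)
print(f"implied UB latency: {ub_ns:.1f} ns")  # prints "implied UB latency: 500.2 ns"
```

The result agrees with the rounded "~500 ns" figure, so the latency and the ratio are mutually consistent.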

What carries the argument

The three-tier OpenURMA stack (synthesisable RTL on Alveo U50, cycle-level two-node SystemC simulator, gem5 full-system scaffold) that implements UB transport and transaction layers and is compared against a matched OpenRoCEv2 RC baseline.

If this is right

Connection context grows additively with the number of applications and remote hosts rather than multiplicatively, avoiding the hundreds of megabytes of per-host state that the Queue Pair model needs at 1024-application fanout.
Ordering guarantees become opt-in instead of mandatory for every operation.
Remote memory is reached via a single on-chip-bus controller load/store rather than a four-traversal PCIe round trip.
The measured resource footprint of 14 percent LUTs leaves headroom for additional on-NIC functions in the same silicon budget.
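The additive-versus-multiplicative point in the first item can be made concrete with a minimal sketch that counts context entries only. It assumes one Queue Pair context per (application, remote-endpoint) pair for RoCE, and one per-application entry plus one per-remote-host entry for UB. The function names are hypothetical, and no per-entry byte sizes are assumed, so the ratios below are entry-count ratios, not the byte ratios the paper reports.

```python
def roce_entries(apps: int, remotes: int) -> int:
    """One Queue Pair context per (application, remote-endpoint) pair."""
    return apps * remotes


def ub_entries(apps: int, remotes: int) -> int:
    """Per-application endpoint entries plus per-remote-host transport entries."""
    return apps + remotes


# Symmetric fanouts; 1024 is the fanout quoted in the abstract.
for n in (16, 128, 1024):
    qp, ub = roce_entries(n, n), ub_entries(n, n)
    print(f"N=M={n:5d}  RoCE entries={qp:8d}  UB entries={ub:5d}  ratio={qp / ub:.0f}x")
```

At N = M = 1024 the sketch yields 1,048,576 entries against 2,048. Any byte-level gap additionally depends on how large each kind of entry is, which this summary does not state.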

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

An open reference implementation allows other groups to test UB variants or port the design to different FPGA or ASIC targets without access to closed silicon.
The latency reduction suggests that similar abstraction changes could be explored for non-Huawei RDMA stacks if the spec remains public.
The gem5 scaffold provides a full-system model that could be extended to study interactions between UB and host OS or application runtimes.
Low LUT usage implies UB could be integrated into smaller or lower-cost network devices than current RoCE NICs.

Load-bearing premise

The three-tier OpenURMA implementation correctly and faithfully realizes the public UB specification without hidden optimizations or deviations that would not be present in a production closed-silicon realization.

What would settle it

Independent synthesis and cycle-accurate measurement of the released RTL on the same U50 platform yielding latency or throughput numbers materially different from the reported 500 ns / 2.80 times figures would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2605.28717 by Bojie Li.

**Figure 1.** The three architectural moves and their dependencies. RoCEv2 RC (top) puts the NIC behind PCIe, holds one Queue Pair per…

**Figure 2.** Architectural comparison. RoCE puts the NIC behind PCIe; it holds one Queue Pair per (application, remote-endpoint) pair,…

**Figure 3.** State models compared. RoCE binds one Queue Pair to every (application, remote-host) pair, so per-NIC state grows as…

**Figure 4.** Per-operation data path for a small synchronous read. The traditional work-queue-driven path (top) traverses four PCIe…

**Figure 5.** OpenURMA’s NIC as a ClickNP element graph. The TX path (top) flows from CPU doorbell to wire; the RX path (bottom)…

**Figure 6.** The pipeline carries two reorder buffers serving disjoint correctness contracts. Packet-sequence reordering at the transport…

**Figure 7.** Target-side SEND dispatch. RoCE demultiplexes through a shared completion queue plus an application event loop; UB’s…

**Figure 8.** Submission path per stack. RoCE traversals between CPU and NIC (dashed) sit on PCIe; UB carries the same hand-offs…

**Figure 9.** Post-route LUT budget by architectural role for both…

**Figure 10.** Per-element post-route LUT, sorted descending. All…

**Figure 11.** Raw NIC-pipeline microbenchmarks. (a) Per-stage cycle contribution, cumulative 24 cy at the wire. (b) Sustained WR rate…

**Figure 12.** Modeled RoCE-DMA RDMA WRITE latency (curves) vs published ConnectX-7 ranges (bands).

[Body text captured with the caption:] … to 3,855×. The residual spec fields the MVP elides do not explain the gap; the (N+M) vs (N·M) split does. At (1024, 1024) that gap straddles the boundary between fits-in-on-chip-SRAM and spill-to-host-DRAM for a typical NIC…

**Figure 13.** Per-NIC connection state vs endpoint count…

**Figure 16.** Per-op latency under K-Jetty contention on one TP Channel. Linear PSN-allocator scaling; UB beats per-QP RoCE until K≈255.

[Plot text captured with the caption: x-axis "Cluster size N (all-to-all)", ticks 2^1 to 2^6; y-axis "Mean per-op latency (ns)", ticks 1000 to 4000; series UB §8.3 LD/ST, UB §8.4 URMA WR, RoCE BF, RoCE DMA; annotation "RoCE QP cache spill (N² > 512)".]

**Figure 17.** Cluster-scale per-op latency vs node count. RoCE…

**Figure 18.** Total connection-setup time at symmetric…

**Figure 19.** Per-operation latency (CDF). All four NIC stacks on the three workloads; link-delay…

**Figure 20.** Op-rate vs in-flight depth on pointer-chase, link…

**Figure 21.** End-to-end latency vs one-way link delay on pointer…

**Figure 24.** UB LD/ST latency under three cache policies (write…

**Figure 23.** Per-verb mean latency comparison. UB (UB…

**Figure 25.** Page-swap baseline comparison. (a) Per-op latency CDF on a 64-K-key Zipfian read workload at…

**Figure 26.** End-to-end latency vs payload size on bulk-read.

**Figure 28.** Per-op latency CDFs under jitter. UB’s tail (…

**Figure 29.** READ vs WRITE latency per stack. RoCE READ…

**Figure 30.** Operating envelope: median (left) and p99 (right) latency vs sustained throughput per stack, open-loop Poisson arrivals.

**Figure 31.** Standalone TLM two-node throughput envelope.

**Figure 32.** gem5 FS-mode sustained polled goodput vs back-to…

**Figure 33.** YCSB-A throughput (left) and p50 latency (right) vs concurrency across the four stacks.

**Figure 34.** Ordering cost in isolation. (a) Cycles from comple…

**Figure 35.** Latency (left) and throughput (right) vs strict-order fraction. UB scales linearly with mix; RoCE is flat (always-on strict…

**Figure 37.** Dual-NIC gem5-FS run after the OpenRoCE codec…

**Figure 40.** Per-WR mean latency vs N across the three CQE paths: all are per-access-overhead-bound, not amortisable setup. The ioctl floor is ∼23× the MMIO floor; ppoll is another ∼2.3× above ioctl.

[Plot text captured with the caption: x-axis "cumulative cycles consumed (1 cycle = 1 ns @ 1 GHz)", ticks 0 to 5000; bars ethdec and jsched with values 5403 and 1803; title "expTier2_atomic_gem5 (total cum_cycles=7206)".]

**Figure 41.** Per-SC-module cycle decomposition during a full…

**Figure 42.** Per-WR mean latency vs WRITE payload at N=16 through paths (a) and (b). Flat within ±1 ns from 8 B to 4 KB on both paths: the per-access overhead dominates the per-payload cost. In-context ConnectX-7 comparison. The ioctl path’s 484 ns is 3.1–3.7× below Mellanox’s published 1500–1800 ns 8 B RDMA WRITE on ConnectX-7 [48, 22], and the UB-spec §8.3 proxy (3–6 ns) two orders lower, from the same gem5 stack…

**Figure 43.** OpenURMA gem5-FS per-WR latency (path-(a) polled MMIO) vs WireLoopback link delay, post-Tier-2. With the SC pipeline cycle count and the wire delay folded back into the CPU’s view (NICTopologySC::pending_wire_delay_), per-WR latency tracks base + 5 × link delay (1644 ns at 0 ns delay → 26.6 µs at 5 µs delay): the 5× slope reflects the wire round-trip (request out, TAACK back) plus the intermediate decode…

**Figure 44.** Goodput (left) and p99 tail (right) vs loss rate. Go-Back-N amplifies single-packet losses into 32-packet flights; the…

**Figure 45.** C-AQM vs DCQCN controller dynamics: congestion-window trajectory (left) and steady-state utilisation (right). Parameters…

**Figure 46.** Per-host fabric state vs coherence-domain / peer…

**Figure 48.** Per-coherent-write latency vs cluster size…

**Figure 49.** Multi-rack distance sweep: per-coherent-write la…
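Two of the captions above carry arithmetic that can be checked directly. The sketch below evaluates the linear model stated in the Figure 43 caption (base + 5 × link delay, with a 1644 ns base) and the 3.1–3.7× range stated in the Figure 42 caption (484 ns against 1500–1800 ns). All constants are taken from those captions; the function name is illustrative, and values at delays other than 0 and 5 µs are model outputs, not measurements.

```python
BASE_NS = 1644.0  # per-WR latency at 0 ns link delay (Figure 43 caption)
SLOPE = 5.0       # link-delay multiplier (Figure 43 caption)


def per_wr_latency_ns(link_delay_ns: float) -> float:
    """Linear model from the Figure 43 caption: base + 5 x link delay."""
    return BASE_NS + SLOPE * link_delay_ns


# The caption quotes the two endpoints: 0 ns and 5 us of one-way link delay.
for delay_ns in (0.0, 5000.0):
    print(f"link delay {delay_ns:6.0f} ns -> {per_wr_latency_ns(delay_ns) / 1000:.1f} us")

# Figure 42 caption: the 484 ns ioctl path vs the published 1500-1800 ns range.
low_ratio, high_ratio = 1500.0 / 484.0, 1800.0 / 484.0
print(f"ratio range: {low_ratio:.1f}x to {high_ratio:.1f}x")
```

The model gives 26,644 ns at a 5 µs link delay, matching the quoted 26.6 µs, and the ratio range rounds to 3.1× to 3.7×.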

Original abstract

Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state for every (application, remote-endpoint) pair - hundreds of megabytes at 1024-application fanout - and pays a four-traversal PCIe round trip on a 64-byte operation, inflating latency an order of magnitude beyond the wire. Both follow from the Queue Pair over PCIe abstraction RDMA inherits from InfiniBand. Huawei's Unified Bus (UB), a public 2025 specification, changes the abstraction: it decouples per-application endpoint state from per-host transport state so connection context grows additively, exposes ordering as opt-in, and reaches remote memory through native CPU load/store to an on-chip-bus controller. UB ships in Huawei's closed Ascend 950 silicon. OpenURMA is the first clean-room open implementation of UB's transport and transaction layers, realised at three tiers - synthesisable RTL on Alveo U50, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold - each with a matched OpenRoCE (RoCEv2 RC) baseline. The contribution is the implementation, harness, and controlled comparison closed silicon does not admit. On the canonical 64-byte remote fetch - LOAD on UB-spec Sec.8.3, READ on RoCEv2 RC - UB's load/store path delivers ~500 ns end-to-end, 4.37x below the matched baseline (2186 ns), sustains 2.80x higher throughput, and fits in ~14% of a U50's LUTs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OpenURMA gives the first open multi-tier UB implementation with matched RoCE baselines and reports big latency/throughput gains, but the numbers rest on unverified fidelity to the public spec.


The main things to know are that this is the first clean-room open implementation of the 2025 UB spec across RTL on U50, SystemC, and gem5, each with a matched OpenRoCE baseline, and it claims 4.37x lower latency (~500 ns vs 2186 ns) and 2.8x higher throughput on 64-byte remote loads while using 14% LUTs.

The paper does well by making a controlled comparison possible at all. Closed Ascend silicon does not allow that, so the multi-tier harness and side-by-side numbers are genuinely new and let readers see what the decoupled state and opt-in ordering actually buy.

The soft spot is verification. The gains depend on the three tiers faithfully reproducing the spec's load/store path and state decoupling without shortcuts like reduced PCIe traversals or idealized ordering. No machine-checked correspondence, external test vectors, or third-party audit is described, so self-reported matching is the sole evidence. Workload details, error bars, and code artifacts are also absent from the provided description.

This paper is for systems and architecture researchers working on datacenter interconnects and open hardware alternatives to RDMA. A reader building AI or HPC clusters would get value from the implementation approach and the comparison, provided the numbers hold up.

It deserves peer review because the contribution is concrete and the topic is timely, even though referees will need to press on reproducibility and verification.

Referee Report

1 major / 1 minor

Summary. The manuscript presents OpenURMA as the first clean-room open implementation of Huawei's public Unified Bus (UB) specification, realized in three tiers (synthesizable RTL on Alveo U50 FPGA, cycle-level SystemC simulator, and gem5 full-system scaffold) with matched OpenRoCEv2 RC baselines. It reports that on the canonical 64-byte remote fetch (LOAD per UB-spec Sec.8.3 vs. READ on RoCE), the UB load/store path achieves ~500 ns end-to-end latency (4.37× below the 2186 ns baseline), 2.80× higher throughput, and occupies ~14% of U50 LUTs. The contribution centers on the implementation, harness, and controlled comparison that closed silicon does not permit.

Significance. If the three-tier implementations faithfully realize the public UB specification without hidden deviations, the work supplies the first reproducible open platform for studying UB's decoupled state, opt-in ordering, and native load/store path against conventional RDMA. The multi-tier design (RTL + SystemC + gem5) is a concrete strength that enables different fidelity levels and controlled experiments. This is valuable because UB currently exists only in closed Ascend 950 silicon.

major comments (1)

[Abstract and evaluation section] Abstract and § on evaluation (performance numbers): the central claims of 500 ns latency, 4.37× improvement, and 2.80× throughput rest on the three-tier OpenURMA exactly reproducing UB-spec Sec.8.3 behavior (decoupled state, opt-in ordering, native load/store without reduced PCIe traversals or idealized shortcuts). The manuscript supplies no machine-checked correspondence, external test vectors, third-party audit, or workload descriptions to confirm fidelity; self-reported matching is the sole evidence. This is load-bearing for the comparison to the matched OpenRoCE baseline.

minor comments (1)

[Abstract] The abstract states performance numbers but omits workload descriptions, error bars, or measurement methodology; these details should be added for reproducibility even if moved to an appendix.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for recognizing the multi-tier design as a strength. We address the major comment on implementation fidelity point by point below.


Referee: [Abstract and evaluation section] Abstract and § on evaluation (performance numbers): the central claims of 500 ns latency, 4.37× improvement, and 2.80× throughput rest on the three-tier OpenURMA exactly reproducing UB-spec Sec.8.3 behavior (decoupled state, opt-in ordering, native load/store without reduced PCIe traversals or idealized shortcuts). The manuscript supplies no machine-checked correspondence, external test vectors, third-party audit, or workload descriptions to confirm fidelity; self-reported matching is the sole evidence. This is load-bearing for the comparison to the matched OpenRoCE baseline.

Authors: We agree that fidelity to UB-spec Sec.8.3 is load-bearing for the reported latency, throughput, and comparison results. The three tiers were developed as a clean-room implementation strictly following the public specification, with explicit attention to decoupled per-application state, opt-in ordering, and the native load/store path without idealized shortcuts or reduced PCIe traversals. The SystemC model is cycle-level, the gem5 scaffold is full-system, and the RTL is synthesizable on the Alveo U50; the OpenRoCEv2 RC baseline was realized in identical environments for controlled comparison. That said, the manuscript provides no machine-checked correspondence, external test vectors, or third-party audit. We will revise the evaluation section to add explicit workload descriptions, sample test vectors with their mapping to specification sections, and additional validation details to make the fidelity evidence more transparent. revision: partial

Circularity Check

0 steps flagged

Implementation and measurement paper with no derivation chain or predictions


The manuscript describes a clean-room open implementation of the public UB specification realized in three tiers (RTL on U50, SystemC simulator, gem5 scaffold) and reports measured latency/throughput numbers against a matched OpenRoCE baseline. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text; the central claims are direct empirical outcomes of running the implemented hardware and simulators. Because there is no load-bearing derivation step that could reduce to its own inputs by construction, the paper is self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work rests on standard hardware description languages, cycle-accurate simulation, and the public UB specification.

pith-pipeline@v0.9.1-grok · 5830 in / 1213 out tokens · 25649 ms · 2026-06-29T12:03:02.275078+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Effectively prefetching remote memory with Leap

Hasan Al Maruf and Mosharaf Chowdhury. Effectively prefetching remote memory with Leap. In Proc. USENIX ATC, 2020. Far-memory prefetch heuristic; cited in §8.2 as an example of software-side swap optimisation

2020
[2]

Can far memory improve job throughput?

Emmanuel Amaro, Christopher Branner-Augmon, Zhihong Luo, Amy Ousterhout, Marcos K. Aguilera, Aurojit Panda, Sylvia Ratnasamy, and Scott Shenker. Can far memory improve job throughput? In Proc. EuroSys, 2020. Introduces Fastswap; reports ∼1 µs kernel-side overhead and batched-prefetch swap-in, the basis of the second swap profile in §8.2

2020
[3]

Enabling programmable transport protocols in high-speed NICs

Mina Tahmasbi Arashloo, Alexey Lavrov, Manya Ghobadi, Jennifer Rexford, David Walker, and David Wentzlaff. Enabling programmable transport protocols in high-speed NICs. In Proc. USENIX NSDI, 2020

2020
[4]

Benchmarking cloud serving systems with YCSB

Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proc. ACM SoCC, 2010. The Yahoo! Cloud Serving Benchmark; we use the YCSB-A 50/50 Get-Put Zipfian workload in §8.3

2010
[5]

Compute Express Link (CXL) Specification 3.1

CXL Consortium. Compute Express Link (CXL) Specification 3.1. https://www.computeexpresslink.org/, 2024

2024
[6]

FaRM: Fast remote memory

Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. FaRM: Fast remote memory. In Proc. USENIX NSDI, 2014

2014
[7]

NICA: An infrastructure for inline acceleration of network applications

Haggai Eran, Lior Zeno, Maroun Tork, Gabi Malka, and Mark Silberstein. NICA: An infrastructure for inline acceleration of network applications. In Proc. USENIX ATC, 2019

2019
[8]

Azure Accelerated Networking: SmartNICs in the public cloud

Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, et al. Azure Accelerated Networking: SmartNICs in the public cloud. In Proc. USENIX NSDI, 2018

2018
[9]

RDMA over Ethernet for distributed training at Meta scale

Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. RDMA over Ethernet for distributed training at Meta scale. In Proc. ACM SIGCOMM, 2024

2024
[10]

Dan Gibson, Hema Hariharan, Eric Lance, Moray McLaren, Behnam Montazeri, Arjun Singh, Stephen Wang, Hassan M. G. Wassel, Zhehua Wu, Sunghwan Yoo, Raghuraman Balasubramanian, Prashant Chandra, Michael Cutforth, Peter Cuy, David Decotigny, Rakesh Gautam, Alex Iriza, Milo M. K. Martin, Rick Roy, Zuowei Shen, Ming Tan, Ye Tang, Monica Wong-Chan, Joe Zbici...

2022
[11]

Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. Efficient memory disaggregation with Infiniswap. In Proc. USENIX NSDI, 2017. Kernel-side overhead of 3–5 µs on the swap-in path is the parameter referenced in §8.2

2017
[12]

Clio: A hardware-software co-designed disaggregated memory system

Zhiyuan Guo, Yizhou Shan, Xuhao Luo, Yutong Huang, and Yiying Zhang. Clio: A hardware-software co-designed disaggregated memory system. In Proc. ACM ASPLOS, 2022

2022
[13]

A time scaling theory for multi-layer electronic systems

Tingbo He. A time scaling theory for multi-layer electronic systems. ChinaXiv, May 2026. chinarxiv-202605.00224. Perspective from Huawei Semiconductor: τ scaling as successor to geometric Moore’s-Law scaling; positions Unified Bus as the system-layer τ reduction mechanism with end-to-end remote-access latency from ∼10s of µs (TCP/IP-class) to ∼100 ns

2026
[14]

SwCC: Software-programmable and per-packet congestion control in RDMA engine

Hongjing Huang, Jie Zhang, Xuzheng Chen, Ziyu Song, Jiajun Qin, and Zeke Wang. SwCC: Software-programmable and per-packet congestion control in RDMA engine. In Proc. USENIX ATC, 2025

2025
[15]

Fast and scalable selective retransmission for RDMA

Peihao Huang, Guo Chen, Xin Zhang, Can Liu, Hongyu Wang, Huijun Shen, Ying Bian, Yuanwei Lu, Zhenyuan Ruan, Bojie Li, Jiansong Zhang, Yongfeng Liu, and Zhigang Chen. Fast and scalable selective retransmission for RDMA. In Proc. IEEE INFOCOM, 2025

2025
[16]

LEFT: Lightweight and fast packet reordering for RDMA

Peihao Huang, Xin Zhang, Zhigang Chen, Can Liu, and Guo Chen. LEFT: Lightweight and fast packet reordering for RDMA. In Proc. APNet, 2024

2024
[17]

UB-base-specification 2.0.1

Huawei Technologies. UB-base-specification 2.0.1. https://www.unifiedbus.org/,
[18]

Unified Bus consortium specification, available from the consortium’s documentation portal
[19]

Ascend 950 NPU architecture white paper

Huawei Technologies. Ascend 950 NPU architecture white paper. Huawei vendor white paper, May 2026. Architectural disclosure for the Ascend 950PR and 950DT NPUs; first publicly documented silicon implementing the Unified Bus spec, with URMA (asynchronous Write/Read/Send/Atomic via Jetty) and UB Memory (synchronous Load/Store + AtomicStore/Load/Swap/CAS) ...

2026
[20]

NanoTransport: A low-latency, programmable transport layer for NICs

Stephen Ibanez, Alex Mallery, Serhat Arslan, Theo Jepsen, Muhammad Shahbaz, Nick McKeown, and Changhoon Kim. NanoTransport: A low-latency, programmable transport layer for NICs. In Proc. ACM SOSR, 2021

2021
[21]

FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs

Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proc. USENIX OSDI, 2016

2016
[22]

Datacenter RPCs can be general and fast

Anuj Kalia, Michael Kaminsky, and David G. Andersen. Datacenter RPCs can be general and fast. In Proc. USENIX NSDI, 2019

2019
[23]

TAS: TCP acceleration as an OS service

Antoine Kaufmann, Tim Stamler, Simon Peter, Naveen Kr. Sharma, Arvind Krishnamurthy, and Thomas Anderson. TAS: TCP acceleration as an OS service. In Proc. EuroSys, 2019. Reports detailed PCIe-class transaction latency decompositions used as parameter references in §5

2019
[24]

Understanding RDMA microarchitecture resources for performance isolation

Xinhao Kong, Jingrong Chen, Wei Bai, Yechen Xu, Mahmoud Elhaddad, Shachar Raindel, Jitendra Padhye, Alvin R. Lebeck, and Danyang Zhuo. Understanding RDMA microarchitecture resources for performance isolation. In Proc. USENIX NSDI, 2023

2023
[25]

Collie: Finding performance anomalies in RDMA subsystems

Xinhao Kong, Yibo Zhu, Huaping Zhou, Zhuo Jiang, Jianxi Ye, Chuanxiong Guo, and Danyang Zhuo. Collie: Finding performance anomalies in RDMA subsystems. In Proc. USENIX NSDI, 2022

2022
[26]

Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan M. G. Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Michael Ryan, David Wetherall, and Amin Vahdat. Swift: Delay is simple and effective for congestion control in the datacenter. In Proc. ACM SIGCOMM, 2020

2020
[27]

STrack: A reliable multipath transport for AI/ML clusters

Yanfang Le, Rong Pan, Peter Newman, Jeremias Blendin, Abdul Kabbani, Vipin Jain, Raghava Sivaramu, and Francis Matus. STrack: A reliable multipath transport for AI/ML clusters. arXiv:2407.15266, 2024

2024
[28]

OpenClickNP: a clean-room reimplementation of ClickNP on Alveo U50

Bojie Li. OpenClickNP: a clean-room reimplementation of ClickNP on Alveo U50. https://github.com/bojieli/OpenClickNP, 2025–2026

2025
[29]

SocksDirect: Datacenter sockets can be fast and compatible

Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Datacenter sockets can be fast and compatible. In Proc. ACM SIGCOMM, 2019

2019
[30]

KV-Direct: High-performance in-memory key-value store with programmable NIC

Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, and Lintao Zhang. KV-Direct: High-performance in-memory key-value store with programmable NIC. In Proc. ACM SOSP, 2017

2017
[31]

ClickNP: Highly flexible and high-performance network processing with reconfigurable hardware

Bojie Li, Kun Tan, Layong Larry Luo, Yanqing Peng, Renqian Luo, Ningyi Xu, Yongqiang Xiong, Peng Cheng, and Enhong Chen. ClickNP: Highly flexible and high-performance network processing with reconfigurable hardware. In Proc. ACM SIGCOMM, 2016

2016
[32]

FastWake: Revisiting host network stack for interrupt-mode RDMA

Bojie Li, Zhilong Xiang, Xiang Wang, Hongru Jonathan Zhou, and Kun Tan. FastWake: Revisiting host network stack for interrupt-mode RDMA. In Proc. APNet, 2023

2023
[33]

1Pipe: Scalable total order communication in data center networks

Bojie Li, Gefei Zuo, Wei Bai, and Lintao Zhang. 1Pipe: Scalable total order communication in data center networks. In Proc. ACM SIGCOMM, 2021

2021
[34]

Flor: An open high performance RDMA framework over heterogeneous RNICs

Qiang Li, Yixiao Gao, Xiaoliang Wang, Haonan Qiu, Yanfang Le, Derui Liu, Qiao Xiang, Fei Feng, Peng Zhang, Bo Li, Jianbo Dong, Lingbo Tang, Hongqiang Harry Liu, Shaozong Liu, Weijie Li, Rui Miao, Yaohui Wu, Zhiwu Wu, Chao Han, Lei Yan, Zheng Cao, Zhongjie Wu, Chen Tian, Guihai Chen, Dennis Cai, Jinbo Wu, Jiaji Zhu, Jiesheng Wu, and Jiwu Shu. Flor: An open...

2023
[35]

Revisiting RDMA reliability for lossy fabrics

Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, Bowen Liu, Junxue Zhang, Kai Chen, and Bingyang Liu. Revisiting RDMA reliability for lossy fabrics. In Proc. ACM SIGCOMM, 2025. Best Student Paper, Honorable Mention

2025
[36]

HPCC: High precision congestion control

Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. HPCC: High precision congestion control. In Proc. ACM SIGCOMM, 2019

2019
[37]

Fastsocket: An almost drop-in replacement for the Linux socket interface for High-Performance Networking

Xiaofeng Lin, Yu Chen, Xiaodong Li, et al. Fastsocket: An almost drop-in replacement for the Linux socket interface for High-Performance Networking. In Proc. USENIX ATC, 2017

2017
[38]

Harmonic: Hardware-assisted RDMA performance isolation for public clouds

Jiaqi Lou, Xinhao Kong, Jinghan Huang, Wei Bai, Nam Sung Kim, and Danyang Zhuo. Harmonic: Hardware-assisted RDMA performance isolation for public clouds. In Proc. USENIX NSDI, 2024

2024
[39]

The gem5 simulator: Version 20.0+

Jason Lowe-Power et al. The gem5 simulator: Version 20.0+. arXiv:2007.03152, 2020. Open-source cycle-level micro-architecture simulator with SystemC TLM 2.0 interoperability bridge; v24.0.0.1 is used as the future-work substrate for full-system integration of libopenurma_sc.a

2020
[40]

Memory efficient loss recovery for hardware-based transport in datacenter

Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. Memory efficient loss recovery for hardware-based transport in datacenter. In Proc. APNet, 2017

2017
[41]

Multi-Path transport for RDMA in datacenters

Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. Multi-Path transport for RDMA in datacenters. In Proc. USENIX NSDI, 2018

2018
[42]

Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C. Evans, Steve Gribble, Nicholas Kidd, Roman Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Erik Rubow, Michael Ryan, Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, and Amin Vahdat. Snap: A ...

2019
[43]

TIMELY: RTT-based congestion control for the datacenter

Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. TIMELY: RTT-based congestion control for the datacenter. In Proc. ACM SIGCOMM, 2015

2015
[44]

Revisiting network support for RDMA

Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker. Revisiting network support for RDMA. In Proc. ACM SIGCOMM, 2018

2018
[45]

NVLink: A high-bandwidth inter-GPU interconnect

NVIDIA Corporation. NVLink: A high-bandwidth inter-GPU interconnect. Vendor whitepaper, 2014–

2014
[46]

Successive generations of the NVLink fabric are described in the NVIDIA whitepaper series
[47]

NVIDIA BlueField-3 DPU datasheet

NVIDIA Corporation. NVIDIA BlueField-3 DPU datasheet. NVIDIA Networking product brief, 2023. Available from NVIDIA’s data-processing-unit product page

2023
[48]

NVIDIA Spectrum-X: Adaptive routing and telemetry-based congestion control for AI networks

NVIDIA Networking. NVIDIA Spectrum-X: Adaptive routing and telemetry-based congestion control for AI networks. NVIDIA technical brief, 2024. Vendor description of multi-path adaptive-routing delivery over Spectrum-4 / BlueField-3 NICs; the closest commercially deployed point of comparison to UB’s TPG multi-path scheme

2024
[49]

Hermit: Low-latency, high-throughput, and transparent remote memory via feedback-directed asynchrony

Yifan Qiao, Chenxi Wang, Zhenyuan Ruan, Adam Belay, Qingda Lu, Yiying Zhang, Miryung Kim, and Guoqing Harry Xu. Hermit: Low-latency, high-throughput, and transparent remote memory via feedback-directed asynchrony. In Proc. USENIX NSDI, 2023. Asynchronous remote-memory swap with feedback-directed I/O; cited in §8.2 for the same workload regime as Infiniswa...

2023
[50]

Designing high-performance, low-latency multi-cluster communication on modern InfiniBand networks

Sebastian Ramos and Torsten Hoefler. Designing high-performance, low-latency multi-cluster communication on modern InfiniBand networks. In Proc. ACM HPDC, 2023. Reports ConnectX-7 PCIe round-trip latencies in the ∼300–500 ns range; we use this as the parameterised PCIe RTT in §5

2023
[51]

StRoM: Smart remote memory

David Sidler, Zeke Wang, Monica Chiosa, Amit Kulkarni, and Gustavo Alonso. StRoM: Smart remote memory. In Proc. EuroSys, 2020

2020
[52]

1RMA: Re-envisioning remote memory access for multi-tenant datacenters

Arjun Singhvi, Aditya Akella, Dan Gibson, Thomas F. Wenisch, Monica Wong-Chan, Sean Clark, Milo M. K. Martin, Moray McLaren, Prashant Chandra, Rob Cauble, et al. 1RMA: Re-envisioning remote memory access for multi-tenant datacenters. In Proc. ACM SIGCOMM, 2020

2020
[53]

Arjun Singhvi, Nandita Dukkipati, Prashant Chandra, Hassan M. G. Wassel, Naveen Kr. Sharma, Anthony Rebello, Henry Schuh, Praveen Kumar, Behnam Montazeri, Neelesh Bansod, Sarin Thomas, Inho Cho, Hyojeong Lee Seibert, Baijun Wu, Rui Yang, Yuliang Li, Kai Huang, Qianwen Yin, Abhishek Agarwal, Srinivas Vaduvatha, Weihuang Wang, Masoud Moshref, Tao Ji, Da...

2025
[54]

Network load balancing with in-network reordering support for RDMA

Cha Hwan Song, Xin Zhe Khooi, Raj Joshi, Inho Choi, Jialin Li, and Mun Choon Chan. Network load balancing with in-network reordering support for RDMA. In Proc. ACM SIGCOMM, 2023

2023
[55]

Ultra Ethernet specification 1.0

Ultra Ethernet Consortium. Ultra Ethernet specification 1.0. Industry specification, 2025. Released June 2025 under Linux Foundation JDF; https://ultraethernet.org/

2025
[56]

StaR: Breaking the scalability limit for RDMA

Xizheng Wang, Guo Chen, Xijin Yin, Huichen Dai, Bojie Li, Binzhang Fu, and Kun Tan. StaR: Breaking the scalability limit for RDMA. In Proc. IEEE ICNP, 2021

2021
[57]

SRNIC: A scalable architecture for RDMA NICs

Zilong Wang, Layong Luo, Qingsong Ning, Chaoliang Zeng, Wenxue Li, Xinchen Wan, Peng Xie, Tao Feng, Ke Cheng, Xiongfei Geng, Tianhao Wang, Weicheng Ling, Kejia Huo, Pingbo An, Kui Ji, Shideng Zhang, Bin Xu, Ruiqing Feng, Tao Ding, Kai Chen, and Chuanxiong Guo. SRNIC: A scalable architecture for RDMA NICs. In Proc. USENIX NSDI, 2023

2023
[58]

Justitia: Software multi-tenancy in hardware kernel-bypass networks

Yiwen Zhang, Yue Tan, Brent Stephens, and Mosharaf Chowdhury. Justitia: Software multi-tenancy in hardware kernel-bypass networks. In Proc. USENIX NSDI, 2022

2022
[59]

White-boxing RDMA with packet-granular software control

Chenxingyu Zhao, Jaehong Min, Ming Liu, and Arvind Krishnamurthy. White-boxing RDMA with packet-granular software control. In Proc. USENIX NSDI, 2025

2025
[60]

Congestion control for Large-Scale RDMA deployments

Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for Large-Scale RDMA deployments. In Proc. ACM SIGCOMM, 2015

2015
