
arxiv: 2605.28717 · v1 · pith:JDK75LSLnew · submitted 2026-05-27 · 💻 cs.AI · cs.AR · cs.NI

OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

Pith reviewed 2026-06-29 12:03 UTC · model grok-4.3

classification 💻 cs.AI · cs.AR · cs.NI
keywords OpenURMA · Unified Bus · RDMA · RoCE · remote memory access · datacenter networking · FPGA implementation · latency

The pith

OpenURMA's clean-room implementation of the Unified Bus protocol achieves ~500 ns end-to-end latency on 64-byte remote fetches, 4.37 times lower than a matched RoCE baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenURMA as the first open implementation of Huawei's Unified Bus (UB) specification for datacenter RDMA. UB decouples per-application endpoint state from per-host transport state and routes remote accesses through native CPU load/store to an on-chip controller instead of Queue Pair abstractions. This design change is shown to eliminate per-connection state bloat and multiple PCIe traversals. The work realizes UB at three tiers—synthesisable RTL on Alveo U50, cycle-level SystemC, and gem5—each paired with an OpenRoCE baseline for direct comparison. On the canonical 64-byte LOAD/READ operation the UB path records ~500 ns latency, 2.80 times higher throughput, and ~14 percent LUT occupancy.

Core claim

The central claim is that a faithful three-tier open realization of the public UB specification delivers a load/store remote-fetch path with ~500 ns end-to-end latency on the canonical 64-byte operation, 4.37 times below the matched OpenRoCE baseline of 2186 ns, while sustaining 2.80 times higher throughput and occupying only ~14 percent of a U50's LUTs.
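The headline ratio can be checked from the two figures quoted in the claim. A minimal Python sketch, using only the reported baseline (2186 ns) and the reported ratio (4.37×); the variable names are illustrative and not taken from the paper:

```python
# Values as reported in the paper's abstract.
ROCE_BASELINE_NS = 2186.0   # matched OpenRoCE READ latency
REPORTED_RATIO = 4.37       # reported latency ratio (RoCE / UB)

# Implied UB latency: baseline divided by the reported ratio.
ub_latency_ns = ROCE_BASELINE_NS / REPORTED_RATIO
print(f"implied UB latency: {ub_latency_ns:.1f} ns")
```

The result is about 500.2 ns, consistent with the rounded "~500 ns" figure.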

What carries the argument

The three-tier OpenURMA stack (synthesisable RTL on Alveo U50, cycle-level two-node SystemC simulator, gem5 full-system scaffold) that implements UB transport and transaction layers and is compared against a matched OpenRoCEv2 RC baseline.

If this is right

  • Connection context grows additively, as per-application endpoint state plus per-host transport state, instead of multiplying per (application, remote-endpoint) pair into hundreds of megabytes at 1024-application fanout.
  • Ordering guarantees become opt-in instead of mandatory for every operation.
  • Remote memory is reached via a single on-chip-bus controller load/store rather than a four-traversal PCIe round trip.
  • The measured resource footprint of 14 percent LUTs leaves headroom for additional on-NIC functions in the same silicon budget.
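The additive-versus-multiplicative state argument in the first bullet can be sketched in a few lines of Python. This is an illustration only: the function names and the 256-byte context size are assumptions, not values from the paper or the UB specification.

```python
def per_pair_state_bytes(apps: int, remotes: int, bytes_per_qp: int) -> int:
    """Queue Pair model: one context per (application, remote-endpoint) pair."""
    return apps * remotes * bytes_per_qp


def decoupled_state_bytes(apps: int, remotes: int,
                          bytes_per_endpoint: int,
                          bytes_per_transport: int) -> int:
    """Decoupled model: per-application endpoint state plus per-host transport state."""
    return apps * bytes_per_endpoint + remotes * bytes_per_transport


# Hypothetical context size, chosen only for illustration (not from the paper).
CTX_BYTES = 256

pair = per_pair_state_bytes(1024, 1024, CTX_BYTES)
additive = decoupled_state_bytes(1024, 1024, CTX_BYTES, CTX_BYTES)
print(pair / 2**20, "MiB vs", additive / 2**10, "KiB")  # prints "256.0 MiB vs 512.0 KiB"
```

Under this assumed entry size the per-pair model reaches hundreds of megabytes at a 1024 × 1024 fanout, while the additive model stays in the hundreds of kilobytes. The actual gap depends on the real context sizes.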

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • An open reference implementation allows other groups to test UB variants or port the design to different FPGA or ASIC targets without access to closed silicon.
  • The latency reduction suggests that similar abstraction changes could be explored for non-Huawei RDMA stacks if the spec remains public.
  • The gem5 scaffold provides a full-system model that could be extended to study interactions between UB and host OS or application runtimes.
  • Low LUT usage implies UB could be integrated into smaller or lower-cost network devices than current RoCE NICs.

Load-bearing premise

The three-tier OpenURMA implementation correctly and faithfully realizes the public UB specification without hidden optimizations or deviations that would not be present in a production closed-silicon realization.

What would settle it

Independent synthesis and cycle-accurate measurement of the released RTL on the same U50 platform yielding latency or throughput numbers materially different from the reported 500 ns / 2.80 times figures would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2605.28717 by Bojie Li.

Figure 1: The three architectural moves and their dependencies. RoCEv2 RC (top) puts the NIC behind PCIe, holds one Queue Pair per …
Figure 2: Architectural comparison. RoCE puts the NIC behind PCIe; it holds one Queue Pair per (application, remote-endpoint) pair, …
Figure 3: State models compared. RoCE binds one Queue Pair to every (application, remote-host) pair, so per-NIC state grows as …
Figure 4: Per-operation data path for a small synchronous read. The traditional work-queue-driven path (top) traverses four PCIe …
Figure 5: OpenURMA’s NIC as a ClickNP element graph. The TX path (top) flows from CPU doorbell to wire; the RX path (bottom) …
Figure 6: The pipeline carries two reorder buffers serving disjoint correctness contracts. Packet-sequence reordering at the transport …
Figure 7: Target-side SEND dispatch. RoCE demultiplexes through a shared completion queue plus an application event loop; UB’s …
Figure 8: Submission path per stack. RoCE traversals between CPU and NIC (dashed) sit on PCIe; UB carries the same hand-offs …
Figure 9: Post-route LUT budget by architectural role for both …
Figure 10: Per-element post-route LUT, sorted descending. All …
Figure 11: Raw NIC-pipeline microbenchmarks. (a) Per-stage cycle contribution, cumulative 24 cy at the wire. (b) Sustained WR rate …
Figure 12: Modeled RoCE-DMA RDMA WRITE latency (curves) vs published ConnectX-7 ranges (bands). … to 3,855×: the residual spec fields the MVP elides do not explain the gap; the (N+M) vs (N·M) split does. At (1024, 1024) that gap straddles the boundary between fits-in-on-chip-SRAM and spill-to-host-DRAM for a typical NIC …
Figure 13: Per-NIC connection state vs endpoint count …
Figure 16: Per-op latency under K-Jetty contention on one TP Channel. Linear PSN-allocator scaling; UB beats per-QP RoCE until K≈255.
[Plot text extracted alongside this caption: x-axis cluster size N (all-to-all); y-axis mean per-op latency (ns); legend: RoCE QP cache spill (N² > 512), UB §8.3 LD/ST, UB §8.4 URMA WR, RoCE BF, RoCE DMA]
Figure 17: Cluster-scale per-op latency vs node count. RoCE …
Figure 18: Total connection-setup time at symmetric …
Figure 19: Per-operation latency (CDF). All four NIC stacks on the three workloads; link-delay …
Figure 20: Op-rate vs in-flight depth on pointer-chase, link …
Figure 21: End-to-end latency vs one-way link delay on pointer …
Figure 23: Per-verb mean latency comparison. UB (UB …
Figure 24: UB LD/ST latency under three cache policies (write …
Figure 25: Page-swap baseline comparison. (a) Per-op latency CDF on a 64-K-key Zipfian read workload at …
Figure 26: End-to-end latency vs payload size on bulk-read. …
Figure 28: Per-op latency CDFs under jitter. UB’s tail ( …
Figure 29: READ vs WRITE latency per stack. RoCE READ …
Figure 30: Operating envelope: median (left) and p99 (right) latency vs sustained throughput per stack, open-loop Poisson arrivals.
Figure 31: Standalone TLM two-node throughput envelope. …
Figure 32: gem5 FS-mode sustained polled goodput vs back-to …
Figure 33: YCSB-A throughput (left) and p50 latency (right) vs concurrency across the four stacks.
Figure 34: Ordering cost in isolation. (a) Cycles from comple …
Figure 35: Latency (left) and throughput (right) vs strict-order fraction. UB scales linearly with mix; RoCE is flat (always-on strict …
Figure 37: Dual-NIC gem5-FS run after the OpenRoCE codec …
Figure 40: Per-WR mean latency vs N across the three CQE paths: all are per-access-overhead-bound, not amortisable setup. The ioctl floor is ∼23× the MMIO floor; ppoll is another ∼2.3× above ioctl.
[Plot text extracted alongside this caption: x-axis cumulative cycles consumed (1 cycle = 1 ns @ 1 GHz); module labels ethdec, jsched; values 5403, 1803; expTier2_atomic_gem5 (total cum_cycles=7206)]
Figure 41: Per-SC-module cycle decomposition during a full …
Figure 42: Per-WR mean latency vs WRITE payload at N=16 through paths (a) and (b). Flat within ±1 ns from 8 B to 4 KB on both paths: the per-access overhead dominates the per-payload cost. In-context ConnectX-7 comparison. The ioctl path’s 484 ns is 3.1–3.7× below Mellanox’s published 1500–1800 ns 8 B RDMA WRITE on ConnectX-7 [48, 22], and the UB-spec §8.3 proxy (3–6 ns) two orders lower, from the same gem5 stack…
Figure 43: OpenURMA gem5-FS per-WR latency (path-(a) polled MMIO) vs WireLoopback link delay, post-Tier-2. With the SC pipeline cycle count and the wire delay folded back into the CPU’s view (NICTopologySC::pending_wire_delay_), per-WR latency tracks base + 5 × link delay (1644 ns at 0 ns delay → 26.6 µs at 5 µs delay): the 5× slope reflects the wire round-trip (request out, TAACK back) plus the intermediate decode…
Figure 44: Goodput (left) and p99 tail (right) vs loss rate. Go-Back-N amplifies single-packet losses into 32-packet flights; the …
Figure 45: C-AQM vs DCQCN controller dynamics: congestion-window trajectory (left) and steady-state utilisation (right). Parameters …
Figure 46: Per-host fabric state vs coherence-domain / peer …
Figure 48: Per-coherent-write latency vs cluster size …
Figure 49: Multi-rack distance sweep: per-coherent-write la …
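The Figure 43 caption states a linear relationship: per-WR latency tracks base + 5 × link delay. A minimal Python sketch of that model, using only the two constants quoted in the caption; the function and constant names are illustrative:

```python
BASE_NS = 1644.0   # reported per-WR latency at 0 ns link delay (Figure 43 caption)
SLOPE = 5.0        # reported multiplier on the one-way link delay


def per_wr_latency_ns(link_delay_ns: float) -> float:
    """Linear model quoted in the caption: base + 5 x link delay."""
    return BASE_NS + SLOPE * link_delay_ns


# Check the caption's second data point: 5 us one-way link delay.
latency_us = per_wr_latency_ns(5000.0) / 1000.0
print(f"{latency_us:.1f} us")
```

At a 5 µs link delay the model gives 26,644 ns, which rounds to the 26.6 µs quoted in the caption.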
Original abstract

Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state for every (application, remote-endpoint) pair - hundreds of megabytes at 1024-application fanout - and pays a four-traversal PCIe round trip on a 64-byte operation, inflating latency an order of magnitude beyond the wire. Both follow from the Queue Pair over PCIe abstraction RDMA inherits from InfiniBand. Huawei's Unified Bus (UB), a public 2025 specification, changes the abstraction: it decouples per-application endpoint state from per-host transport state so connection context grows additively, exposes ordering as opt-in, and reaches remote memory through native CPU load/store to an on-chip-bus controller. UB ships in Huawei's closed Ascend 950 silicon. OpenURMA is the first clean-room open implementation of UB's transport and transaction layers, realised at three tiers - synthesisable RTL on Alveo U50, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold - each with a matched OpenRoCE (RoCEv2 RC) baseline. The contribution is the implementation, harness, and controlled comparison closed silicon does not admit. On the canonical 64-byte remote fetch - LOAD on UB-spec Sec.8.3, READ on RoCEv2 RC - UB's load/store path delivers ~500 ns end-to-end, 4.37x below the matched baseline (2186 ns), sustains 2.80x higher throughput, and fits in ~14% of a U50's LUTs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents OpenURMA as the first clean-room open implementation of Huawei's public Unified Bus (UB) specification, realized in three tiers (synthesizable RTL on Alveo U50 FPGA, cycle-level SystemC simulator, and gem5 full-system scaffold) with matched OpenRoCEv2 RC baselines. It reports that on the canonical 64-byte remote fetch (LOAD per UB-spec Sec.8.3 vs. READ on RoCE), the UB load/store path achieves ~500 ns end-to-end latency (4.37× below the 2186 ns baseline), 2.80× higher throughput, and occupies ~14% of U50 LUTs. The contribution centers on the implementation, harness, and controlled comparison that closed silicon does not permit.

Significance. If the three-tier implementations faithfully realize the public UB specification without hidden deviations, the work supplies the first reproducible open platform for studying UB's decoupled state, opt-in ordering, and native load/store path against conventional RDMA. The multi-tier design (RTL + SystemC + gem5) is a concrete strength that enables different fidelity levels and controlled experiments. This is valuable because UB currently exists only in closed Ascend 950 silicon.

major comments (1)
  1. [Abstract and evaluation section] Abstract and § on evaluation (performance numbers): the central claims of 500 ns latency, 4.37× improvement, and 2.80× throughput rest on the three-tier OpenURMA exactly reproducing UB-spec Sec.8.3 behavior (decoupled state, opt-in ordering, native load/store without reduced PCIe traversals or idealized shortcuts). The manuscript supplies no machine-checked correspondence, external test vectors, third-party audit, or workload descriptions to confirm fidelity; self-reported matching is the sole evidence. This is load-bearing for the comparison to the matched OpenRoCE baseline.
minor comments (1)
  1. [Abstract] The abstract states performance numbers but omits workload descriptions, error bars, or measurement methodology; these details should be added for reproducibility even if moved to an appendix.
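One way to address the minor comment is to report spread alongside each headline number. The following Python sketch summarises a list of per-operation latency samples; the function name is hypothetical and the sample values are synthetic placeholders, not measurements from the paper.

```python
import statistics


def summarize_latency(samples_ns):
    """Return mean, sample standard deviation, median, and nearest-rank p99."""
    ordered = sorted(samples_ns)
    # Nearest-rank p99: ceil(0.99 * n), computed with integer arithmetic.
    rank = max(1, -(-99 * len(ordered) // 100))
    return {
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "p50": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }


# Synthetic samples for illustration only (not data from the paper).
summary = summarize_latency([498, 501, 499, 503, 500])
print(summary)
```

Reporting the standard deviation and tail percentile next to the mean lets a reader judge whether a difference such as 4.37× is stable across runs.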

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for recognizing the multi-tier design as a strength. We address the major comment on implementation fidelity point by point below.

Point-by-point responses
  1. Referee: [Abstract and evaluation section] Abstract and § on evaluation (performance numbers): the central claims of 500 ns latency, 4.37× improvement, and 2.80× throughput rest on the three-tier OpenURMA exactly reproducing UB-spec Sec.8.3 behavior (decoupled state, opt-in ordering, native load/store without reduced PCIe traversals or idealized shortcuts). The manuscript supplies no machine-checked correspondence, external test vectors, third-party audit, or workload descriptions to confirm fidelity; self-reported matching is the sole evidence. This is load-bearing for the comparison to the matched OpenRoCE baseline.

    Authors: We agree that fidelity to UB-spec Sec.8.3 is load-bearing for the reported latency, throughput, and comparison results. The three tiers were developed as a clean-room implementation strictly following the public specification, with explicit attention to decoupled per-application state, opt-in ordering, and the native load/store path without idealized shortcuts or reduced PCIe traversals. The SystemC model is cycle-level, the gem5 scaffold is full-system, and the RTL is synthesizable on the Alveo U50; the OpenRoCEv2 RC baseline was realized in identical environments for controlled comparison. That said, the manuscript provides no machine-checked correspondence, external test vectors, or third-party audit. We will revise the evaluation section to add explicit workload descriptions, sample test vectors with their mapping to specification sections, and additional validation details to make the fidelity evidence more transparent. revision: partial
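The promised "sample test vectors with their mapping to specification sections" could take a shape like the Python sketch below. Everything here is hypothetical: the `SpecVector` record, its field names, and the placeholder byte strings are assumptions for illustration and do not reflect the actual UB wire encoding or the OpenURMA test harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecVector:
    spec_section: str    # specification section the vector exercises, e.g. "8.3"
    operation: str       # operation name, e.g. "LOAD"
    payload_bytes: int   # payload size in bytes
    request_hex: str     # encoded request (placeholder content, not real UB encoding)
    expected_hex: str    # expected response (placeholder content)


def check(vector: SpecVector, implementation) -> bool:
    """Feed the request bytes to an implementation callable and compare the reply."""
    actual = implementation(bytes.fromhex(vector.request_hex))
    return actual == bytes.fromhex(vector.expected_hex)


# Placeholder vector with an echo function standing in for a real implementation.
vec = SpecVector("8.3", "LOAD", 64, "00" * 64, "00" * 64)
print(check(vec, lambda request: request))
```

Keeping the specification section in each record makes it possible to report coverage per section, which is the kind of fidelity evidence the referee asks for.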

Circularity Check

0 steps flagged

Implementation and measurement paper with no derivation chain or predictions

full rationale

The manuscript describes a clean-room open implementation of the public UB specification realized in three tiers (RTL on U50, SystemC simulator, gem5 scaffold) and reports measured latency/throughput numbers against a matched OpenRoCE baseline. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text; the central claims are direct empirical outcomes of running the implemented hardware and simulators. Because there is no load-bearing derivation step that could reduce to its own inputs by construction, the paper is self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work rests on standard hardware description languages, cycle-accurate simulation, and the public UB specification.



Reference graph

Works this paper leans on

60 extracted references · 3 canonical work pages · 1 internal anchor

  1. Hasan Al Maruf and Mosharaf Chowdhury. Effectively prefetching remote memory with Leap. In Proc. USENIX ATC, 2020. Far-memory prefetch heuristic; cited in §8.2 as an example of software-side swap optimisation.

  2. Emmanuel Amaro, Christopher Branner-Augmon, Zhihong Luo, Amy Ousterhout, Marcos K. Aguilera, Aurojit Panda, Sylvia Ratnasamy, and Scott Shenker. Can far memory improve job throughput? In Proc. EuroSys, 2020. Introduces Fastswap; reports ∼1 µs kernel-side overhead and batched-prefetch swap-in, the basis of the second swap profile in §8.2.

  3. Mina Tahmasbi Arashloo, Alexey Lavrov, Manya Ghobadi, Jennifer Rexford, David Walker, and David Wentzlaff. Enabling programmable transport protocols in high-speed NICs. In Proc. USENIX NSDI, 2020.

  4. Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proc. ACM SoCC, 2010. The Yahoo! Cloud Serving Benchmark; we use the YCSB-A 50/50 Get-Put Zipfian workload in §8.3.

  5. CXL Consortium. Compute Express Link (CXL) Specification 3.1. https://www.computeexpresslink.org/, 2024.

  6. Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. FaRM: Fast remote memory. In Proc. USENIX NSDI, 2014.

  7. Haggai Eran, Lior Zeno, Maroun Tork, Gabi Malka, and Mark Silberstein. NICA: An infrastructure for inline acceleration of network applications. In Proc. USENIX ATC, 2019.

  8. Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, et al. Azure Accelerated Networking: SmartNICs in the public cloud. In Proc. USENIX NSDI, 2018.

  9. Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. RDMA over Ethernet for distributed training at Meta scale. In Proc. ACM SIGCOMM, 2024.

  10. Dan Gibson, Hema Hariharan, Eric Lance, Moray McLaren, Behnam Montazeri, Arjun Singh, Stephen Wang, Hassan M. G. Wassel, Zhehua Wu, Sunghwan Yoo, Raghuraman Balasubramanian, Prashant Chandra, Michael Cutforth, Peter Cuy, David Decotigny, Rakesh Gautam, Alex Iriza, Milo M. K. Martin, Rick Roy, Zuowei Shen, Ming Tan, Ye Tang, Monica Wong-Chan, Joe Zbici...

  11. Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. Efficient memory disaggregation with Infiniswap. In Proc. USENIX NSDI, 2017. Kernel-side overhead of 3–5 µs on the swap-in path is the parameter referenced in §8.2.

  12. Zhiyuan Guo, Yizhou Shan, Xuhao Luo, Yutong Huang, and Yiying Zhang. Clio: A hardware-software co-designed disaggregated memory system. In Proc. ACM ASPLOS, 2022.

  13. Tingbo He. A time scaling theory for multi-layer electronic systems. ChinaXiv, May 2026. chinarxiv-202605.00224. Perspective from Huawei Semiconductor: τ scaling as successor to geometric Moore’s-Law scaling; positions Unified Bus as the system-layer τ reduction mechanism with end-to-end remote-access latency from ∼10s of µs (TCP/IP-class) to ∼100 ns.

  14. Hongjing Huang, Jie Zhang, Xuzheng Chen, Ziyu Song, Jiajun Qin, and Zeke Wang. SwCC: Software-programmable and per-packet congestion control in RDMA engine. In Proc. USENIX ATC, 2025.

  15. Peihao Huang, Guo Chen, Xin Zhang, Can Liu, Hongyu Wang, Huijun Shen, Ying Bian, Yuanwei Lu, Zhenyuan Ruan, Bojie Li, Jiansong Zhang, Yongfeng Liu, and Zhigang Chen. Fast and scalable selective retransmission for RDMA. In Proc. IEEE INFOCOM, 2025.

  16. Peihao Huang, Xin Zhang, Zhigang Chen, Can Liu, and Guo Chen. LEFT: Lightweight and fast packet reordering for RDMA. In Proc. APNet, 2024.

  17. Huawei Technologies. UB-base-specification 2.0.1. https://www.unifiedbus.org/,

  18. Unified Bus consortium specification, available from the consortium’s documentation portal

  19. Huawei Technologies. Ascend 950 NPU architecture white paper. Huawei vendor white paper, May 2026. Architectural disclosure for the Ascend 950PR and 950DT NPUs; first publicly documented silicon implementing the Unified Bus spec, with URMA (asynchronous Write/Read/Send/Atomic via Jetty) and UB Memory (synchronous Load/Store + AtomicStore/Load/Swap/CAS) ...

  20. Stephen Ibanez, Alex Mallery, Serhat Arslan, Theo Jepsen, Muhammad Shahbaz, Nick McKeown, and Changhoon Kim. NanoTransport: A low-latency, programmable transport layer for NICs. In Proc. ACM SOSR, 2021.

  21. Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proc. USENIX OSDI, 2016.

  22. Anuj Kalia, Michael Kaminsky, and David G. Andersen. Datacenter RPCs can be general and fast. In Proc. USENIX NSDI, 2019.

  23. Antoine Kaufmann, Tim Stamler, Simon Peter, Naveen Kr. Sharma, Arvind Krishnamurthy, and Thomas Anderson. TAS: TCP acceleration as an OS service. In Proc. EuroSys, 2019. Reports detailed PCIe-class transaction latency decompositions used as parameter references in §5.

  24. Xinhao Kong, Jingrong Chen, Wei Bai, Yechen Xu, Mahmoud Elhaddad, Shachar Raindel, Jitendra Padhye, Alvin R. Lebeck, and Danyang Zhuo. Understanding RDMA microarchitecture resources for performance isolation. In Proc. USENIX NSDI, 2023.

  25. Xinhao Kong, Yibo Zhu, Huaping Zhou, Zhuo Jiang, Jianxi Ye, Chuanxiong Guo, and Danyang Zhuo. Collie: Finding performance anomalies in RDMA subsystems. In Proc. USENIX NSDI, 2022.

  26. Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan M. G. Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Michael Ryan, David Wetherall, and Amin Vahdat. Swift: Delay is simple and effective for congestion control in the datacenter. In Proc. ACM SIGCOMM, 2020.

  27. Yanfang Le, Rong Pan, Peter Newman, Jeremias Blendin, Abdul Kabbani, Vipin Jain, Raghava Sivaramu, and Francis Matus. STrack: A reliable multipath transport for AI/ML clusters. arXiv:2407.15266, 2024.

  28. Bojie Li. OpenClickNP: a clean-room reimplementation of ClickNP on Alveo U50. https://github.com/bojieli/OpenClickNP, 2025–2026.

  29. Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Datacenter sockets can be fast and compatible. In Proc. ACM SIGCOMM, 2019.

  30. Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, and Lintao Zhang. KV-Direct: High-performance in-memory key-value store with programmable NIC. In Proc. ACM SOSP, 2017.

  31. Bojie Li, Kun Tan, Layong Larry Luo, Yanqing Peng, Renqian Luo, Ningyi Xu, Yongqiang Xiong, Peng Cheng, and Enhong Chen. ClickNP: Highly flexible and high-performance network processing with reconfigurable hardware. In Proc. ACM SIGCOMM, 2016.

  32. Bojie Li, Zhilong Xiang, Xiang Wang, Hongru Jonathan Zhou, and Kun Tan. FastWake: Revisiting host network stack for interrupt-mode RDMA. In Proc. APNet, 2023.

  33. Bojie Li, Gefei Zuo, Wei Bai, and Lintao Zhang. 1Pipe: Scalable total order communication in data center networks. In Proc. ACM SIGCOMM, 2021.

  34. Qiang Li, Yixiao Gao, Xiaoliang Wang, Haonan Qiu, Yanfang Le, Derui Liu, Qiao Xiang, Fei Feng, Peng Zhang, Bo Li, Jianbo Dong, Lingbo Tang, Hongqiang Harry Liu, Shaozong Liu, Weijie Li, Rui Miao, Yaohui Wu, Zhiwu Wu, Chao Han, Lei Yan, Zheng Cao, Zhongjie Wu, Chen Tian, Guihai Chen, Dennis Cai, Jinbo Wu, Jiaji Zhu, Jiesheng Wu, and Jiwu Shu. Flor: An open...

  35. Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, Bowen Liu, Junxue Zhang, Kai Chen, and Bingyang Liu. Revisiting RDMA reliability for lossy fabrics. In Proc. ACM SIGCOMM, 2025. Best Student Paper, Honorable Mention.

  36. Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. HPCC: High precision congestion control. In Proc. ACM SIGCOMM, 2019.

  37. Xiaofeng Lin, Yu Chen, Xiaodong Li, et al. Fastsocket: An almost drop-in replacement for the Linux socket interface for High-Performance Networking. In Proc. USENIX ATC, 2017.

  38. Jiaqi Lou, Xinhao Kong, Jinghan Huang, Wei Bai, Nam Sung Kim, and Danyang Zhuo. Harmonic: Hardware-assisted RDMA performance isolation for public clouds. In Proc. USENIX NSDI, 2024.

  39. Jason Lowe-Power et al. The gem5 simulator: Version 20.0+. arXiv:2007.03152, 2020. Open-source cycle-level micro-architecture simulator with SystemC TLM 2.0 interoperability bridge; v24.0.0.1 is used as the future-work substrate for full-system integration of libopenurma_sc.a.

  40. Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. Memory efficient loss recovery for hardware-based transport in datacenter. In Proc. APNet, 2017.

  41. [41]

    Multi- Path transport for RDMA in datacenters

    Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. Multi- Path transport for RDMA in datacenters. InProc. USENIX NSDI, 2018

  42. [42]

    Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C. Evans, Steve Gribble, Nicholas Kidd, Roman Kononov, Gautam Kumar, Carl Mauer, Emily Mu- sick, Lena Olson, Erik Rubow, Michael Ryan, Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, and Amin Vahdat. Snap: A ...

  43. [43]

    TIMELY: RTT-based congestion control for the datacenter

    Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. TIMELY: RTT-based congestion control for the datacenter. InProc. ACM SIGCOMM, 2015

  44. [44]

    Revisiting network sup- port for RDMA

    Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Rat- nasamy, and Scott Shenker. Revisiting network sup- port for RDMA. InProc. ACM SIGCOMM, 2018. 32

  45. [45]

    NVLink: A high-bandwidth inter-GPU interconnect

    NVIDIA Corporation. NVLink: A high-bandwidth inter-GPU interconnect. Vendor whitepaper, 2014–

  46. [46]

    Successive generations of the NVLink fabric are described in the NVIDIA whitepaper series

  47. [47]

    NVIDIA BlueField-3 DPU datasheet

    NVIDIA Corporation. NVIDIA BlueField-3 DPU datasheet. NVIDIA Networking product brief, 2023. Available from NVIDIA’s data-processing-unit product page

  48. [48]

    NVIDIA Spectrum-X: Adaptive routing and telemetry-based congestion control for AI networks

    NVIDIA Networking. NVIDIA Spectrum-X: Adaptive routing and telemetry-based congestion control for AI networks. NVIDIA technical brief, 2024. Vendor description of multi-path adaptive-routing delivery over Spectrum-4 / BlueField-3 NICs; the closest commercially deployed point of comparison to UB’s TPG multi-path scheme

  49. [49]

    Hermit: Low-latency, high-throughput, and transparent remote memory via feedback-directed asynchrony

    Yifan Qiao, Chenxi Wang, Zhenyuan Ruan, Adam Belay, Qingda Lu, Yiying Zhang, Miryung Kim, and Guoqing Harry Xu. Hermit: Low-latency, high-throughput, and transparent remote memory via feedback-directed asynchrony. In Proc. USENIX NSDI, 2023. Asynchronous remote-memory swap with feedback-directed I/O; cited in §8.2 for the same workload regime as Infiniswa...

  50. [50]

    Designing high-performance, low-latency multi-cluster communication on modern InfiniBand networks

    Sebastian Ramos and Torsten Hoefler. Designing high-performance, low-latency multi-cluster communication on modern InfiniBand networks. In Proc. ACM HPDC, 2023. Reports ConnectX-7 PCIe round-trip latencies in the ∼300–500 ns range; we use this as the parameterised PCIe RTT in §5

  51. [51]

    StRoM: Smart remote memory

    David Sidler, Zeke Wang, Monica Chiosa, Amit Kulkarni, and Gustavo Alonso. StRoM: Smart remote memory. In Proc. EuroSys, 2020

  52. [52]

    1RMA: Re-envisioning remote memory access for multi-tenant datacenters

    Arjun Singhvi, Aditya Akella, Dan Gibson, Thomas F. Wenisch, Monica Wong-Chan, Sean Clark, Milo M. K. Martin, Moray McLaren, Prashant Chandra, Rob Cauble, et al. 1RMA: Re-envisioning remote memory access for multi-tenant datacenters. In Proc. ACM SIGCOMM, 2020

  53. [53]

    Arjun Singhvi, Nandita Dukkipati, Prashant Chandra, Hassan M. G. Wassel, Naveen Kr. Sharma, Anthony Rebello, Henry Schuh, Praveen Kumar, Behnam Montazeri, Neelesh Bansod, Sarin Thomas, Inho Cho, Hyojeong Lee Seibert, Baijun Wu, Rui Yang, Yuliang Li, Kai Huang, Qianwen Yin, Abhishek Agarwal, Srinivas Vaduvatha, Weihuang Wang, Masoud Moshref, Tao Ji, Da...

  54. [54]

    Network load balancing with in-network reordering support for RDMA

    Cha Hwan Song, Xin Zhe Khooi, Raj Joshi, Inho Choi, Jialin Li, and Mun Choon Chan. Network load balancing with in-network reordering support for RDMA. In Proc. ACM SIGCOMM, 2023

  55. [55]

    Ultra Ethernet specification 1.0

    Ultra Ethernet Consortium. Ultra Ethernet specification 1.0. Industry specification, 2025. Released June 2025 under Linux Foundation JDF; https://ultraethernet.org/

  56. [56]

    StaR: Breaking the scalability limit for RDMA

    Xizheng Wang, Guo Chen, Xijin Yin, Huichen Dai, Bojie Li, Binzhang Fu, and Kun Tan. StaR: Breaking the scalability limit for RDMA. In Proc. IEEE ICNP, 2021

  57. [57]

    SRNIC: A scalable architecture for RDMA NICs

    Zilong Wang, Layong Luo, Qingsong Ning, Chaoliang Zeng, Wenxue Li, Xinchen Wan, Peng Xie, Tao Feng, Ke Cheng, Xiongfei Geng, Tianhao Wang, Weicheng Ling, Kejia Huo, Pingbo An, Kui Ji, Shideng Zhang, Bin Xu, Ruiqing Feng, Tao Ding, Kai Chen, and Chuanxiong Guo. SRNIC: A scalable architecture for RDMA NICs. In Proc. USENIX NSDI, 2023

  58. [58]

    Justitia: Software multi-tenancy in hardware kernel-bypass networks

    Yiwen Zhang, Yue Tan, Brent Stephens, and Mosharaf Chowdhury. Justitia: Software multi-tenancy in hardware kernel-bypass networks. In Proc. USENIX NSDI, 2022

  59. [59]

    White-boxing RDMA with packet-granular software control

    Chenxingyu Zhao, Jaehong Min, Ming Liu, and Arvind Krishnamurthy. White-boxing RDMA with packet-granular software control. In Proc. USENIX NSDI, 2025

  60. [60]

    Congestion control for Large-Scale RDMA deployments

    Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for Large-Scale RDMA deployments. In Proc. ACM SIGCOMM, 2015