
arxiv: 2605.04842 · v2 · pith:LINIHV6Bnew · submitted 2026-05-06 · 💻 cs.DC

Communication Offloading on SmartNIC DPUs: A Quantitative Approach

Pith reviewed 2026-05-25 06:39 UTC · model grok-4.3

classification 💻 cs.DC
keywords SmartNIC DPU · communication offloading · asynchronous messaging · fire-and-forget · memory-to-communication ratio · DRAM traffic · performance evaluation

The pith

Offloading communication to SmartNIC DPUs speeds up host-dominated workloads by up to 1.55x when the memory-to-communication ratio is high.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper designs Buddy, an engine that moves asynchronous fire-and-forget message routing off the application process and onto SmartNIC DPUs or x86 cores. Evaluation across five applications shows that the memory-to-communication ratio predicts whether offloading improves performance. Host-heavy codes such as Quicksilver and Sparse Matrix Transpose reach up to a 1.55x speedup on the DPU. The same runs expose a 625x rise in DRAM traffic because the DPU lacks Direct Cache Access, pointing to a hardware requirement for future designs.

Core claim

The memory-to-communication ratio determines offloading benefit on SmartNIC DPUs; workloads dominated by host computation achieve up to 1.55x speedup when communication is moved to the DPU, yet the absence of Direct Cache Access produces a 625x increase in DRAM traffic.

What carries the argument

Buddy, the communication offloading engine that decouples message routing from the application process and runs on the DPU.
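
To make the decoupling concrete, here is a minimal sketch of fire-and-forget sending with per-destination message aggregation, the general technique the Buddy routing agent is described as using. It is not the paper's implementation: the class name `AggregatingRouter`, the `transmit` callback, and the `flush_bytes` threshold are all hypothetical, and a real agent would run in a separate process or on the DPU rather than in the caller's thread.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class AggregatingRouter:
    """Hypothetical stand-in for a routing agent; not the paper's Buddy API."""

    def __init__(self, transmit: Callable[[int, List[bytes]], None],
                 flush_bytes: int = 4096) -> None:
        self._transmit = transmit        # called once per aggregated batch
        self._flush_bytes = flush_bytes  # assumed threshold, not from the paper
        self._buffers: Dict[int, List[bytes]] = defaultdict(list)
        self._sizes: Dict[int, int] = defaultdict(int)

    def send(self, dest: int, payload: bytes) -> None:
        """Fire-and-forget: enqueue the message and return without waiting."""
        self._buffers[dest].append(payload)
        self._sizes[dest] += len(payload)
        if self._sizes[dest] >= self._flush_bytes:
            self.flush(dest)

    def flush(self, dest: int) -> None:
        """Hand everything buffered for one destination over as a single batch."""
        batch = self._buffers.pop(dest, [])
        self._sizes.pop(dest, None)
        if batch:
            self._transmit(dest, batch)

    def flush_all(self) -> None:
        """Drain every remaining buffer, e.g. at the end of a phase."""
        for dest in list(self._buffers):
            self.flush(dest)


# Usage: ten 8-byte messages to two destinations with a 16-byte flush threshold.
sent = []  # records (destination, messages in batch) for each network transfer
router = AggregatingRouter(lambda dest, batch: sent.append((dest, len(batch))),
                           flush_bytes=16)
for i in range(10):
    router.send(i % 2, b"12345678")
router.flush_all()
print(len(sent), "transfers for 10 messages")
```

In this toy run the ten small messages leave as six batches; larger thresholds trade latency for fewer, larger transfers.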

If this is right

  • Workloads whose host memory traffic greatly exceeds their communication traffic gain measurable speedup from DPU offloading.
  • Future SmartNIC hardware must add Direct Cache Access to keep DRAM traffic from exploding.
  • The fire-and-forget model can be supported on programmable DPUs without changing the host application interface.
  • The memory-to-communication ratio supplies a simple static rule for deciding whether to offload a given communication service.
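
As an illustration of how such a static rule might look, the sketch below ranks the five applications by the ratios quoted alongside Figure 7 on this page. Two things are assumptions rather than statements from the paper: the ratio is taken to be memory traffic divided by communication traffic (inferred from the remark that SSSP and Triangle Counting, with ratios below 1, have more communication than memory traffic), and the `threshold` of 2.0 is purely illustrative, chosen only so that it separates the two workloads the abstract names as benefiting.

```python
def memory_to_comm_ratio(memory_bytes: float, comm_bytes: float) -> float:
    """Memory traffic divided by communication traffic (assumed definition)."""
    if comm_bytes <= 0:
        raise ValueError("comm_bytes must be positive")
    return memory_bytes / comm_bytes


def should_offload(ratio: float, threshold: float = 2.0) -> bool:
    """Hypothetical static rule; the threshold is illustrative, not from the paper."""
    return ratio >= threshold


# Ratios as quoted next to Figure 7 (Quicksilver is given as "around 72").
reported_ratios = {
    "Quicksilver": 72.0,
    "Sparse Transpose": 2.3,
    "Histogram": 1.5,
    "SSSP": 0.56,
    "Triangle Counting": 0.53,
}

ranked = sorted(reported_ratios, key=reported_ratios.get, reverse=True)
candidates = [app for app in ranked if should_offload(reported_ratios[app])]

for app in ranked:
    print(f"{app:18s} ratio={reported_ratios[app]:6.2f} "
          f"offload={should_offload(reported_ratios[app])}")
```

Any real deployment would need to calibrate the threshold against measured speedups, which is exactly the generalization question the referee report raises.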

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the ratio rule generalizes, schedulers could decide at launch time whether to place communication on the DPU.
  • The DRAM-traffic penalty may restrict DPU use to latency-insensitive or bandwidth-rich networks until hardware changes.
  • Repeating the experiments on a DPU that does support Direct Cache Access would isolate the hardware contribution from the software design.

Load-bearing premise

That the five tested applications represent the range of workloads for which the memory-to-communication ratio reliably forecasts offloading gains, and that running Buddy on the DPU adds no unmeasured costs beyond the reported DRAM traffic.

What would settle it

A workload with a high memory-to-communication ratio that shows no speedup, or even a slowdown, when communication is offloaded to the DPU; or a DPU with Direct Cache Access that does not produce the 625x DRAM traffic increase.

Figures

Figures reproduced from arXiv: 2605.04842 by Andong Hu, Ivy Peng, Jacob Wahlgren, Maya Gokhale, Roger Pearce.

Figure 1: An overview of Nvidia BlueField-3 DPU architecture.
    Excerpt: … NIC by integrating programmable logic (e.g. ASIC/FPGA) on the data path, enabling host-side workloads to be offloaded to the NIC. Data Processing Units (DPUs) go one step further, featuring general-purpose CPUs and a standalone Linux software stack. Several vendors provide commercial DPU solutions with ARM cores, including AMD Pensando, Intel IPU, and Nvi…
Figure 2: An overview of the Buddy design, where the routing agent interfaces between local processes and remote nodes to aggregate individual messages.
    Excerpt: Overview. Frequent irregular communication is a challenge in HPC systems since it leads to many small inefficient messages across the network. Message aggregation combined with multi-hop routing, a core component for enabling the “fire-and-forget” model, provides hi…
Figure 3: Three offloading scenarios with arrows representing RDMA data transfer.
    Excerpt: We use a set of five applications to evaluate communication offloading. They come from a variety of domains and feature different communication patterns as shown in …
Figure 4: Speedup compared to no offloading.
Figure 7: Impact of tuning and optimizations on application performance.
    Excerpt: … scenarios, although it deviates slightly for None in Histogram. Quicksilver has by far the highest ratio around 72, explaining the low network utilization we observed, followed by Sparse Transpose with 2.3 and Histogram with 1.5. SSSP and Triangle Counting have more communication than memory traffic with ratios of 0.56 and 0.53, respectively. W…
Figure 8: Performance of scaling Histogram and Quicksilver with DPU offloading up to 8 nodes and the average transfer sizes.
    Excerpt: … around 8 buffers with 8.0x and 12.4x, respectively, while other applications were not significantly affected. Utilizing multiple threads in the routing agent improves performance for most applications, with up to 6.2x speedup at 7 or 8 threads in Histogram. Triangle Counting peaks at 5 threads …
Figure 10: Amount of data loaded from DRAM in the routing agent (log scale).
Original abstract

SmartNIC Data Processing Units (DPUs) offer a promising solution for saving high-end CPU resources by offloading tasks to programmable cores near the network interface. In this work, we explore the feasibility of SmartNIC DPUs in supporting an asynchronous communication model called "fire-and-forget", particularly its core message routing service. We design a communication offloading engine called Buddy that decouples communication tasks from the application process. Buddy runs flexibly on SmartNIC DPUs such as the Nvidia BlueField-3 DPU and generic x86 CPUs. Our evaluation results in five applications identify the memory-to-communication ratio as a key predictor of the offloading performance. Host-dominated workloads, such as Quicksilver and Sparse Matrix Transpose, achieved up to 1.55x speedup with communication offloaded to the DPU. We further identify a 625x increase in DRAM traffic due to the absence of Direct Cache Access support on the DPU, highlighting a critical need in future SmartNIC designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Buddy, a communication offloading engine for asynchronous 'fire-and-forget' message routing that can run on SmartNIC DPUs (e.g., Nvidia BlueField-3) or x86 CPUs. Evaluation on five applications identifies the memory-to-communication ratio as a key predictor of offloading benefit; host-dominated workloads such as Quicksilver and Sparse Matrix Transpose achieve up to 1.55x speedup when communication is offloaded to the DPU. The work also reports a 625x increase in DRAM traffic due to the absence of Direct Cache Access support on the DPU.

Significance. If the memory-to-communication ratio is shown to be a reliable, generalizable predictor, the results could offer actionable guidance for when offloading communication to DPUs yields net benefit and could inform hardware requirements for future SmartNIC designs. The explicit quantification of the DRAM traffic penalty provides a concrete data point on current DPU limitations.

major comments (2)
  1. [Abstract] Abstract: The central performance claims (1.55x speedup, 625x DRAM traffic increase) and the identification of the memory-to-communication ratio as predictor are stated without any description of the experimental setup, workload characteristics, how the ratio is computed, baseline configurations, or measurement methodology. This absence makes the numerical results unverifiable from the provided text.
  2. [Evaluation] Evaluation (implied by the five-application results): The claim that the memory-to-communication ratio is a 'key predictor' rests on only five applications. No evidence is given that the observed relationship holds beyond this small, potentially non-representative set, nor are controls shown for DPU-specific overheads other than the reported DRAM traffic increase.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the concerns about the abstract and the scope of the evaluation below, providing clarifications from the manuscript while noting where revisions can strengthen the presentation.

  1. Referee: [Abstract] Abstract: The central performance claims (1.55x speedup, 625x DRAM traffic increase) and the identification of the memory-to-communication ratio as predictor are stated without any description of the experimental setup, workload characteristics, how the ratio is computed, baseline configurations, or measurement methodology. This absence makes the numerical results unverifiable from the provided text.

    Authors: The abstract follows the conventional format of providing a concise summary of the problem, approach, and key findings without experimental details, which are instead fully described in the manuscript body. Section 3 details the Buddy design and DPU deployment on Nvidia BlueField-3; Section 4 describes the five applications (including Quicksilver and Sparse Matrix Transpose), how the memory-to-communication ratio is computed from application traces, the baseline configurations (host-only vs. DPU-offloaded), and the measurement methodology using hardware performance counters for speedup and DRAM traffic. We agree that a brief reference to the evaluation methodology could improve verifiability and will revise the abstract accordingly within length constraints.
    Revision: partial

  2. Referee: [Evaluation] Evaluation (implied by the five-application results): The claim that the memory-to-communication ratio is a 'key predictor' rests on only five applications. No evidence is given that the observed relationship holds beyond this small, potentially non-representative set, nor are controls shown for DPU-specific overheads other than the reported DRAM traffic increase.

    Authors: The manuscript selects the five applications specifically to span a range of memory-to-communication ratios and to include both host-dominated and communication-dominated workloads, allowing the ratio to be identified as a predictor from the measured speedups (up to 1.55x) and the 625x DRAM traffic increase due to missing Direct Cache Access. The paper does not assert that the relationship is proven for all possible workloads; it presents the ratio as an actionable predictor derived from these cases. We can expand the evaluation section with an explicit limitations paragraph discussing the sample size and the need for future validation across additional applications. Other DPU overheads were quantified via the same performance counters but were not dominant compared to the DRAM traffic penalty in the reported experiments.
    Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical speedups rest on direct measurement


The paper reports measured speedups (up to 1.55x) and a DRAM traffic increase (625x) from running five applications with and without the Buddy offloader on DPU vs host. The memory-to-communication ratio is presented as an observed predictor derived from those measurements, not from any equation, fitted parameter, or self-citation chain that reduces the result to its inputs by construction. No load-bearing self-citations, ansatzes, or uniqueness theorems appear in the provided text. The central claims are therefore self-contained against external benchmarks (the actual runs).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that communication can be fully decoupled without correctness loss and that the chosen applications expose the relevant performance trade-offs; no free parameters or invented physical entities are mentioned.

axioms (1)
  • [domain assumption] Fire-and-forget messaging can be decoupled from the application process and executed on a separate DPU core without altering application semantics.
    This premise underpins the design of Buddy and the claim that offloading is feasible.
invented entities (1)
  • Buddy communication offloading engine [no independent evidence]
    purpose: Decouples communication tasks from the application process and routes messages on the DPU.
    New system introduced to realize the offloading approach.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Post-Moore Technologies for Plasma Simulation: A Community Roadmap

    cs.ET · 2026-05 · unverdicted · novelty 4.0

    No single post-Moore technology replaces current HPC for plasma simulations, but FPGA-class accelerators offer near-term kernel offload, non-von Neumann architectures medium-term operator acceleration, and quantum com...
