pith. machine review for the scientific record.

arxiv: 2605.05628 · v1 · submitted 2026-05-07 · 💻 cs.AR · cs.DC

Recognition: unknown

Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems

Chen Zhang, Guangyu Sun, Haibo Wang, Jingwen Leng, Minyi Guo, Qijun Zhang, Yijia Diao, Zhe Zhou, Zhigang Ji, Zhipeng Tu, Zhiyao Li, Zhuoran Song, Zhuoshan Zhou

Pith reviewed 2026-05-08 04:37 UTC · model grok-4.3

classification 💻 cs.AR · cs.DC
keywords in-switch computing · tensor parallelism · LLM training · multi-GPU systems · compute-communication overlap · collective operations · microarchitecture extension

The pith

CAIS aligns in-switch communication modes with the memory semantics of LLM computation kernels to accelerate tensor parallelism on multi-GPU systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing in-switch methods such as NVLS focus only on reducing communication volume and ignore how LLM computation kernels access memory, leaving the compute and communication phases poorly overlapped. The paper presents CAIS, a framework that adds compute awareness to the switch through an extended ISA and microarchitecture. It also coordinates thread blocks to merge requests in time and optimizes dataflow at the graph level. Together these changes allow tighter integration of compute and communication, yielding 1.38× faster end-to-end training than NVLS-based solutions and 1.61× faster than T3, the strongest prior overlap method that does not use NVLS.
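A minimal timing sketch of what is at stake in the overlap claim: with phase-isolated execution a layer pays compute plus communication time in full, while compute-aware overlap hides part of the communication behind computation. All numbers and the overlap_fraction knob below are illustrative placeholders, not measurements from the paper.

    # Illustrative-only model of compute/communication overlap in one TP layer.
    # t_compute, t_comm, and overlap_fraction are hypothetical placeholders,
    # not values reported in the CAIS paper.

    def layer_time(t_compute: float, t_comm: float, overlap_fraction: float) -> float:
        """Exposed time for one layer when `overlap_fraction` of the
        communication is hidden behind computation (0 = fully serial,
        1 = ideal overlap)."""
        hidden = min(t_comm, t_compute) * overlap_fraction
        return t_compute + t_comm - hidden

    serial = layer_time(1.0, 0.6, 0.0)      # phase-isolated, e.g. plain NVLS
    overlapped = layer_time(1.0, 0.6, 0.9)  # tighter, CAIS-style overlap
    print(f"speedup from overlap alone: {serial / overlapped:.2f}x")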

Core claim

CAIS is the first compute-aware in-switch computing framework for tensor-parallel LLM workloads. It consists of a compute-aware ISA and microarchitecture to match communication modes to computation memory semantics, merging-aware thread block coordination to improve temporal alignment, and a graph-level dataflow optimizer for cross-kernel overlap. This design overcomes the phase isolation in prior approaches and yields the reported performance improvements on multi-GPU systems.

What carries the argument

Compute-aware ISA and microarchitecture extension that allows the switch to perform operations aligned with the memory semantics of LLM computation kernels rather than pure communication reduction.
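A schematic restatement, in Python for concreteness, of the mismatch the paper identifies around its Figure 3: AllGather + GEMM consumes data through memory reads while NVLS exposes only the push-mode multimem.st, and GEMM + Reduce-Scatter produces data through memory writes while NVLS exposes only the pull-mode multimem.ld_reduce. The dict contents mirror the paper's terminology; the loop is our own illustrative helper, not part of CAIS.

    # Encodes the mode mismatch CAIS targets (per the paper's Fig. 3 discussion).
    # The table values restate the paper's claim; the check loop is ours.

    REQUIRED_SEMANTICS = {          # what the computation kernel needs
        "AG-GEMM": "memory reads (pull)",
        "GEMM-RS": "memory writes (push)",
    }

    NVLS_PROVIDES = {               # what NVLS actually offers
        "AG-GEMM": "multimem.st (push mode)",
        "GEMM-RS": "multimem.ld_reduce (pull mode)",
    }

    for pattern in REQUIRED_SEMANTICS:
        need, have = REQUIRED_SEMANTICS[pattern], NVLS_PROVIDES[pattern]
        print(f"{pattern}: kernel needs {need}, NVLS provides {have} -> mismatch")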

If this is right

  • Tensor parallel training on multi-GPU systems can overlap compute and communication phases more effectively, reducing overall execution time.
  • Collective operations in LLM workloads benefit from reduced redundant transfers while maintaining compatibility with computation requirements.
  • End-to-end training throughput increases without requiring changes to the underlying LLM model structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hardware vendors could incorporate similar compute extensions into future network switches to support emerging AI workloads better.
  • The merging-aware coordination technique might apply to other distributed computing scenarios beyond LLMs.
  • Further optimizations could explore dynamic adaptation of the dataflow based on workload characteristics.

Load-bearing premise

The compute-aware ISA and microarchitecture extensions can be implemented in production switches while keeping area, power, and compatibility costs acceptable and without introducing new performance bottlenecks.

What would settle it

Implementation measurements on actual or simulated switch hardware. The premise holds if the extensions' area, power, and per-packet latency overheads stay within practical limits for commercial deployment and the end-to-end speedups survive; it fails if those overheads exceed deployment limits or if introduced bottlenecks erase the reported gains.

Figures

Figures reproduced from arXiv: 2605.05628 by Chen Zhang, Guangyu Sun, Haibo Wang, Jingwen Leng, Minyi Guo, Qijun Zhang, Yijia Diao, Zhe Zhou, Zhigang Ji, Zhipeng Tu, Zhiyao Li, Zhuoran Song, Zhuoshan Zhou.

Figure 1: Motivation for Compute-Aware In-Switch Computing in Tensor Parallelism. (a–b) Tensor Parallelism (TP) in LLM. (c–f) …
Figure 2: Computation-Communication Time When Scaling Up.
Figure 3: The System Architecture of CAIS. (Surrounding text: AllGather + GEMM (AG-GEMM) requires memory reads, but NVLS provides only multimem.st in push mode; GEMM + Reduce-Scatter (GEMM-RS) requires memory writes, but NVLS provides only multimem.ld_reduce in pull mode.)
Figure 4: Extension of the PTX Instructions.
Figure 6: In-switch Micro-Functions Workflow. (Surrounding text: once fetched data arrives, the entry state is updated to Load-Ready and the data is cached in the Content Array; later requests to the same address are served from the cache without reissuing memory transactions to the target GPU.)
Figure 5: Switch Micro-architecture for CAIS. (Surrounding text: merge-table entries track session state (Load-Wait, Load-Ready, or Reduction) and a counter of merged requests; when the last contributing request arrives, merged data is forwarded to requesters for loads or written to memory for reductions.)
Figure 8: Compiler and Architecture Support for TB Coordination.
Figure 9: (a) illustrates this concept with a portion of a transformer …
Figure 10: Illustration of Asymmetric Traffic. (Surrounding text: Asymmetric Kernel Overlapping balances complementary traffic between the two directions of the inter-chip link.)
Figure 11: End-to-End Model Speedup Across Training and Inference.
Figure 12: Sub-layer Performance Speedup. (Surrounding text on baselines: CoCoNet [19] and FuseLib [44] overlap GEMM-AllReduce through software scheduling; T3 [43] adds hardware-assisted fine-grained overlap between GEMM and ReduceScatter and is extended by the authors.)
Figure 13: (a) Required Merge Table Size with and without …
Figure 14: Performance Sensitivity to Merge Table Size.
Figure 15: Average Bandwidth Utilization per Sub-layer.
Figure 16: Bandwidth Utilization over Time for (a) CAIS-Base, …
Figure 18: Validation of Our Simulated NVLS.

Table II, recovered from the Figure 18 snippet (Experimental Validation of Scaling-down Setup):

  Setup | Hidden Size | FFN Hidden Size | Attention Heads | # SM | CAIS Speedup over TP-NVLS
  Full  | 8192        | 22528           | 64              | 132  | 1.43
  Half  | 4096        | 11264           | 32              | 66   | 1.40
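The Figure 5 and Figure 6 entries above describe merge tables that track per-address session state (Load-Wait, Load-Ready, Reduction) and serve repeated requests from a Content Array instead of reissuing memory transactions. The sketch below is a simplified reconstruction of just the load-merging path under stated assumptions (one switch, synchronous fetch, loads only); the names and structure are ours, not the paper's RTL.

    # Simplified model of the in-switch load-merging workflow sketched in
    # Figs. 5-6: the first load to an address goes to the owning GPU and its
    # entry waits in Load-Wait; once data returns it is cached (Load-Ready)
    # and later loads to the same address hit the cache. Illustrative only.
    from __future__ import annotations
    from dataclasses import dataclass, field

    @dataclass
    class MergeEntry:
        state: str = "Load-Wait"          # Load-Wait -> Load-Ready
        merged: int = 0                   # counter of merged requests
        waiters: list = field(default_factory=list)

    class SwitchMergeTable:
        def __init__(self):
            self.table: dict[int, MergeEntry] = {}
            self.content: dict[int, bytes] = {}  # the "Content Array"
            self.memory_reads = 0                # transactions sent to GPUs

        def load(self, gpu: int, addr: int, fetch) -> bytes | None:
            entry = self.table.get(addr)
            if entry is None:                    # first request: issue the read
                self.table[addr] = MergeEntry(waiters=[gpu])
                self.memory_reads += 1
                self.complete(addr, fetch(addr)) # data returns (synchronous here)
                return self.content[addr]
            entry.merged += 1                    # later request: merge it
            if entry.state == "Load-Ready":      # hit in the content array
                return self.content[addr]
            entry.waiters.append(gpu)            # still in flight: wait
            return None

        def complete(self, addr: int, data: bytes) -> None:
            self.content[addr] = data
            self.table[addr].state = "Load-Ready"

    switch = SwitchMergeTable()
    fetch = lambda addr: bytes([addr % 256] * 4)  # stand-in for a GPU memory read
    for gpu in range(4):                          # 4 GPUs read the same address
        switch.load(gpu, addr=0x100, fetch=fetch)
    print("memory reads issued:", switch.memory_reads)  # 1, not 4

Running it shows four GPU loads to the same address costing a single memory transaction, which is the redundancy reduction that merging provides.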
read the original abstract

Tensor parallelism (TP) in large-scale LLM inference and training introduces frequent collective operations that dominate inter-GPU communication. While in-switch computing, exemplified by NVLink SHARP (NVLS), accelerates collective operations by reducing redundant data transfer, its communication-centric design philosophy introduces the mismatch between its communication mode and the memory semantic requirement of LLM's computation kernel. Such a mismatch isolates the compute and communication phases, resulting in underutilized resources and limited overlap in multi-GPU systems. To address the limitation, we propose CAIS, the first Compute-Aware In-Switch computing framework that aligns communication modes with computation's memory semantics requirement. CAIS consists of three integral techniques: (1) compute-aware ISA and microarchitecture extension to enable compute-aware in-switch computing. (2) merging-aware TB (Thread Block) coordination to improve the temporal alignment for efficient request merging. (3) graph-level dataflow optimizer to achieve a tight cross-kernel overlap. Evaluations on LLM workloads show that CAIS achieves 1.38$\times$ average end-to-end training speedup over the SOTA NVLS-enabled solution, and 1.61$\times$ over T3, the SOTA compute-communicate overlap solutions but do not leverage NVLS, demonstrating its effectiveness in accelerating TP on multi-GPU systems.
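A quick consistency check on the abstract's two headline numbers, assuming both speedups are measured against the same CAIS runtime: together they imply the NVLS-enabled baseline runs roughly 1.17× faster than T3.

    # Implied ratio between the two baselines from the abstract's speedups;
    # assumes both 1.38x and 1.61x share the same CAIS runtime.
    speedup_over_nvls = 1.38  # CAIS vs. SOTA NVLS-enabled solution
    speedup_over_t3 = 1.61    # CAIS vs. T3 (overlap without NVLS)
    print(f"implied T3 / NVLS-baseline runtime ratio: "
          f"{speedup_over_t3 / speedup_over_nvls:.2f}")  # ~1.17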

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CAIS, the first Compute-Aware In-Switch computing framework for tensor parallelism (TP) in LLM training/inference on multi-GPU systems. It identifies a mismatch between communication-centric in-switch designs like NVLink SHARP (NVLS) and LLM kernel memory semantics, which limits compute-communicate overlap. CAIS introduces three techniques: (1) compute-aware ISA and microarchitecture extensions, (2) merging-aware thread-block coordination for request merging, and (3) a graph-level dataflow optimizer for cross-kernel overlap. Evaluations claim 1.38× average end-to-end training speedup over SOTA NVLS-enabled solutions and 1.61× over T3 (SOTA overlap without NVLS).

Significance. If the hardware extensions prove feasible without eroding gains, CAIS could meaningfully improve TP efficiency in large-scale LLM systems by enabling tighter compute-communicate alignment beyond current NVLS capabilities. The work provides a concrete hardware-software co-design path that builds directly on deployed in-switch fabrics, with potential for broader impact on distributed training throughput if the reported speedups hold under realistic constraints.

major comments (2)
  1. [Proposed Techniques / Microarchitecture Extensions] The central speedup claims (1.38× over NVLS, 1.61× over T3) rest on the assumption that the compute-aware ISA and microarchitecture extensions can be realized in production switches without unacceptable area, power, or per-packet latency overhead. No synthesis results, power estimates, or compatibility analysis with NVLink/NVLS fabrics are provided to quantify this; any added latency would directly undermine the overlap benefits asserted in the abstract and evaluation.
  2. [Evaluation] The evaluation section reports end-to-end speedups but supplies no details on experimental methodology, including LLM model sizes, number of GPUs, exact baseline implementations (e.g., how NVLS and T3 were configured), error bars, or ablation studies isolating each of the three techniques. This makes it impossible to assess whether the measurements support the central claim.
minor comments (2)
  1. [§3.1] Notation for the compute-aware ISA instructions is introduced without a clear table or diagram showing opcode semantics and how they map to existing NVLS collectives.
  2. [Abstract / Introduction] The abstract and introduction use 'SOTA' without citing the specific prior works for T3 and NVLS-enabled baselines in the first paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [Proposed Techniques / Microarchitecture Extensions] The central speedup claims (1.38× over NVLS, 1.61× over T3) rest on the assumption that the compute-aware ISA and microarchitecture extensions can be realized in production switches without unacceptable area, power, or per-packet latency overhead. No synthesis results, power estimates, or compatibility analysis with NVLink/NVLS fabrics are provided to quantify this; any added latency would directly undermine the overlap benefits asserted in the abstract and evaluation.

    Authors: We agree that the manuscript currently lacks quantitative hardware overhead analysis for the proposed compute-aware ISA and microarchitecture extensions. The design philosophy extends the programmable in-switch capabilities already deployed in NVLS rather than introducing entirely new hardware blocks, with the goal of preserving low per-packet latency through the merging-aware thread-block coordination. However, without explicit synthesis or power data, the overhead claims remain unquantified. In the revised manuscript we will add a new subsection presenting preliminary RTL synthesis results, area and power estimates, and a compatibility discussion with NVLink/NVLS fabrics. This addition will directly address whether any incremental latency could offset the reported overlap gains. revision: yes

  2. Referee: [Evaluation] The evaluation section reports end-to-end speedups but supplies no details on experimental methodology, including LLM model sizes, number of GPUs, exact baseline implementations (e.g., how NVLS and T3 were configured), error bars, or ablation studies isolating each of the three techniques. This makes it impossible to assess whether the measurements support the central claim.

    Authors: The referee correctly identifies that the evaluation section omits key methodological details. The reported speedups were obtained on standard LLM training workloads using multiple GPU counts, with NVLS enabled via the latest vendor libraries and T3 re-implemented from its original description. Nevertheless, these specifics, along with error bars and per-technique ablations, are not adequately documented. We will revise the evaluation section to include model sizes and configurations, exact GPU counts, baseline implementation details, standard-deviation error bars across repeated runs, and ablation studies that isolate the contribution of the ISA extensions, thread-block coordination, and graph-level optimizer. These additions will allow readers to evaluate the strength of the 1.38× and 1.61× claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external baselines

full rationale

The paper proposes CAIS with three techniques (compute-aware ISA/microarchitecture, merging-aware TB coordination, graph-level dataflow optimizer) and reports speedups from direct end-to-end evaluations on LLM workloads against external SOTA systems (NVLS-enabled solution and T3). No equations, parameter fittings, or derivations are shown that reduce to the paper's own inputs by construction. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the derivation chain. The performance numbers are presented as measured outcomes rather than tautological predictions, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that collective communication dominates tensor-parallel LLM execution and that the three proposed techniques can be implemented without prohibitive hardware or software overheads; the abstract introduces no free parameters, and the only invented entity is the CAIS system itself.

axioms (1)
  • domain assumption Collective operations dominate inter-GPU communication time in tensor-parallel LLM training and inference.
    Explicitly stated as the source of the performance bottleneck.
invented entities (1)
  • CAIS framework (no independent evidence)
    purpose: Align in-switch computing modes with LLM computation memory semantics
    Newly proposed system; no independent evidence outside the paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5583 in / 1346 out tokens · 55851 ms · 2026-05-08T04:37:28.231420+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1] A. Abdolrashidi, H. A. Esfeden, A. Jahanshahi, K. Singh, N. Abu-Ghazaleh, and D. Wong, "Blockmaestro: Enabling programmer-transparent task-based execution in GPU systems," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 333–346.
  2. [2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
  3. [3] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa, A. Jaleel, C.-J. Wu, and D. Nellans, "MCM-GPU: Multi-chip-module GPUs for continued performance scalability," ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 320–332, 2017.
  4. [4] C. Chen, X. Li, Q. Zhu, J. Duan, P. Sun, X. Zhang, and C. Yang, "Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning," in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 178–191.
  5. [5] D. De Sensi, S. Di Girolamo, S. Ashkboos, S. Li, and T. Hoefler, "Flare: Flexible in-network allreduce," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–16.
  6. [6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
  7. [7] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The Llama 3 herd of models," arXiv e-prints, pp. arXiv–2407, 2024.
  8. [8] J. Fei, C.-Y. Ho, A. N. Sahu, M. Canini, and A. Sapio, "Efficient sparse collective communication and its application to accelerate distributed deep learning," in Proceedings of the 2021 ACM SIGCOMM 2021 Conference, 2021, pp. 676–691.
  9. [9] D. Foley and J. Danskin, "Ultra-performance Pascal GPU and NVLink interconnect," IEEE Micro, vol. 37, no. 2, pp. 7–17, 2017.
  10. [10] N. Gebara, "In-network aggregation for shared machine learning clusters," Proceedings of Machine Learning and Systems (MLSys), 2021.
  11. [11] R. L. Graham, D. Bureddy, P. Lui, H. Rosenstock, G. Shainer, G. Bloch, D. Goldenberg, M. Dubman, S. Kotchubievsky, V. Koushnir et al., "Scalable hierarchical aggregation protocol (SHARP): A hardware architecture for efficient data reduction," in 2016 First International Workshop on Communication Optimizations in HPC (COMHPC). IEEE, 2016, pp. 1–10.
  12. [12] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.
  13. [13] P. Haghi, W. Krska, C. Tan, T. Geng, P. H. Chen, C. Greenwood, A. Guo, T. Hines, C. Wu, A. Li et al., "FLASH: FPGA-accelerated smart switches with GCN case study," in Proceedings of the 37th International Conference on Supercomputing, 2023, pp. 450–462.
  14. [14] P. Haghi, C. Tan, A. Guo, C. Wu, D. Liu, A. Li, A. Skjellum, T. Geng, and M. Herbordt, "SmartFuse: Reconfigurable smart switches to accelerate fused collectives in HPC applications," in Proceedings of the 38th ACM International Conference on Supercomputing, 2024, pp. 413–425.
  15. [15] Y. He, W. Wu, Y. Le, M. Liu, and C. Lao, "A generic service to provide in-network aggregation for key-value streams," in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 33–47.
  16. [16] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., "GPipe: Efficient training of giant neural networks using pipeline parallelism," Advances in Neural Information Processing Systems, vol. 32, 2019.
  17. [17] A. Ishii, D. Foley, E. Anderson, B. Dally, G. Dearth, L. Dennison, M. Hummel, and J. Schafer, "NVSwitch and DGX-2," in Hot Chips, 2018.
  18. [18] R. Jain, B. Tran, K. Chen, M. D. Sinclair, and S. Venkataraman, "PAL: A variability-aware policy for scheduling ML workloads in GPU clusters," in SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024, pp. 1–18.
  19. [19] A. Jangda, J. Huang, G. Liu, A. H. N. Sabet, S. Maleki, Y. Miao, M. Musuvathi, T. Mytkowicz, and O. Saarikivi, "Breaking the computation and communication abstraction barrier in distributed machine learning workloads," in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022, …
  20. [20] N. Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. E. Shaw, J. Kim, and W. J. Dally, "A detailed and flexible cycle-accurate network-on-chip simulator," in 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2013, pp. 86–96.
  21. [21] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and I. Stoica, "NetCache: Balancing key-value stores with fast in-network caching," in Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 121–136.
  22. [22] M. Khairy, V. Nikiforov, D. Nellans, and T. G. Rogers, "Locality-centric data and threadblock management for massive GPUs," in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 1022–1036.
  23. [23] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, "Accel-Sim: An extensible simulation framework for validated GPU modeling," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 473–486.
  24. [24] B. Klenk, N. Jiang, G. Thorson, and L. Dennison, "An in-network architecture for accelerating shared-memory multiprocessor collectives," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 996–1009.
  25. [25] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, "Reducing activation recomputation in large transformer models," Proceedings of Machine Learning and Systems, vol. 5, pp. 341–353, 2023.
  26. [26] C. Lao, Y. Le, K. Mahajan, Y. Chen, W. Wu, A. Akella, and M. Swift, "ATP: In-network aggregation for multi-tenant learning," in 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), 2021, pp. 741–761.
  27. [27] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, "Sequence parallelism: Long sequence training from system perspective," arXiv preprint arXiv:2105.13120, 2021.
  28. [28] S. Li and T. Hoefler, "Chimera: Efficiently training large-scale neural networks with bidirectional pipelines," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.
  29. [29] Y. Li, I.-J. Liu, Y. Yuan, D. Chen, A. Schwing, and J. Huang, "Accelerating distributed reinforcement learning with in-switch computing," in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 279–291.
  30. [30] H. Liao, "UB-Mesh: A new interconnection technology for large AI supernode," in 2025 IEEE Hot Chips 37 Symposium (HCS). IEEE Computer Society, 2025, pp. 1–13.
  31. [31] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy, and K. Atreya, "IncBricks: Toward in-network computation with an in-network cache," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, pp. 795–809.
  32. [32] S. Liu, Q. Wang, J. Zhang, W. Wu, Q. Lin, Y. Liu, M. Xu, M. Canini, R. C. Cheung, and J. He, "In-network aggregation with transport transparency for distributed training," in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 376–391.
  33. [33] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
  34. [34] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel, A. Ramirez, and D. Nellans, "Beyond the socket: NUMA-aware GPUs," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017, pp. 123–135.
  35. [35] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., "Efficient large-scale language model training on GPU clusters using Megatron-LM," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.
  36. [36] NVIDIA, "NVIDIA H100 Tensor Core GPU," https://www.nvidia.com/en-us/data-center/h100, 2022.
  37. [37] NVIDIA, "Upgrading multi-GPU interconnectivity with the third-generation NVIDIA NVSwitch," https://developer.nvidia.com/blog/upgrading-multi-gpu-interconnectivity-with-the-third-generation-nvidia-nvswitch/?ncid=so-nvsh-708451, 2022.
  38. [38] NVIDIA, "CUDA templates for linear algebra subroutines," https://github.com/NVIDIA/cutlass, 2024.
  39. [39] NVIDIA, "Introduction to NVIDIA DGX H100/H200 systems," https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html, 2024.
  40. [40] NVIDIA, "NVIDIA Collective Communication Library (NCCL) documentation," https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html, 2024.
  41. [41] NVIDIA, "NVIDIA GB200 NVL72," https://www.nvidia.com/en-us/data-center/gb200-nvl72/, 2024.
  42. [42] NVIDIA, "CUDA: New features and beyond," https://www.nvidia.com/en-us/on-demand/session/gtc25-s72383/, 2025.
  43. [43] S. Pati, S. Aga, M. Islam, N. Jayasena, and M. D. Sinclair, "T3: Transparent tracking & triggering for fine-grained overlap of compute & collectives," in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1146–1164.
  44. [44] K. Punniyamurthy, K. Hamidouche, and B. M. Beckmann, "Optimizing distributed ML communication with fused computation-collective operations," in SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024, pp. 1–17.
  45. [45] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, "ZeRO: Memory optimizations toward training trillion parameter models," in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16.
  46. [46] S. Rashidi, M. Denton, S. Sridharan, S. Srinivasan, A. Suresh, J. Nie, and T. Krishna, "Enabling compute-communication overlap in distributed deep learning training platforms," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 540–553.
  47. [47] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, "DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.
  48. [48] A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. Ports, and P. Richtárik, "Scaling distributed machine learning with in-network aggregation," in 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), 2021, pp. 785–808.
  49. [49] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training multi-billion parameter language models using model parallelism," arXiv preprint arXiv:1909.08053, 2019.
  50. [50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
  51. [51] R. Wang, D. Dong, F. Lei, J. Ma, K. Wu, and K. Lu, "RoAR: A router microarchitecture for in-network allreduce," in Proceedings of the 37th International Conference on Supercomputing, 2023, pp. 423–436.
  52. [52] S. Wang, J. Wei, A. Sabne, A. Davis, B. Ilbeyi, B. Hechtman, D. Chen, K. S. Murthy, M. Maggioni, Q. Zhang et al., "Overlap communication with dependent computation via decomposition in large deep learning models," in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2022, …
  53. [53] M. Yang, A. Baban, V. Kugel, J. Libby, S. Mackie, S. S. R. Kananda, C.-H. Wu, and M. Ghobadi, "Using Trio: Juniper Networks' programmable chipset for emerging in-network applications," in Proceedings of the ACM SIGCOMM 2022 Conference, 2022, pp. 633–648.
  54. [54] Y. Yuan, O. Alama, J. Fei, J. Nelson, D. R. Ports, A. Sapio, M. Canini, and N. S. Kim, "Unlocking the power of inline floating-point operations on programmable switches," in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 683–700.