pith. machine review for the scientific record.

arxiv: 2603.26438 · v2 · submitted 2026-03-27 · 💻 cs.AR · cs.DC

Recognition: no theorem link

A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 22:29 UTC · model grok-4.3

classification 💻 cs.AR cs.DC
keywords Network on Chip · Collective Communication · ML Accelerators · Multicast · Reduction · Direct Compute Access · GEMM · On-Chip Interconnect

The pith

A NoC with direct compute access to cores accelerates multicast by 5.3x and reductions by 2.8x for ML accelerators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a lightweight Network-on-Chip (NoC) that adds support for barrier synchronization, multicast, and reduction collectives, targeting the communication patterns inside large machine-learning accelerators. It introduces Direct Compute Access so that routers can perform reductions by reaching directly into compute cores rather than shuttling data through memory, which keeps collective traffic off the critical path in matrix-multiplication workloads. If the approach works, accelerators with thousands of processing elements can scale without communication latency and energy costs growing in proportion. Readers should care because standard unicast NoCs already struggle to keep up as model sizes and core counts increase.
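To make the multicast intuition concrete, consider a toy link-traffic count (a sketch under assumed XY routing and a corner source, not the paper's model): a unicast NoC injects one packet per destination, while an in-network multicast tree carries the payload over each spanning-tree link at most once.

    # Toy link-traffic model for an r x c mesh. Assumptions (XY routing,
    # corner source, single-packet payload) are illustrative, not the paper's.
    def unicast_link_traversals(r, c, src=(0, 0)):
        # one packet per destination, each traveling its Manhattan distance
        return sum(abs(x - src[0]) + abs(y - src[1])
                   for x in range(r) for y in range(c) if (x, y) != src)

    def tree_multicast_link_traversals(r, c):
        # a multicast tree delivers the payload once per spanning-tree edge
        return r * c - 1

    print(unicast_link_traversals(8, 8))         # 448 link traversals
    print(tree_multicast_link_traversals(8, 8))  # 63, roughly 7x less traffic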

Core claim

The authors show that a collective-capable NoC built around Direct Compute Access delivers 5.3x geomean speedup on multicast and 2.8x on reduction for payloads between 1 and 32 KiB, translating into estimated 3.8x and 2.4x overall performance gains plus 1.17x energy savings in GEMM workloads on large meshes, all at a 16.9 percent router-area cost compared with a baseline unicast design.
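For readers unfamiliar with the metric, a geomean speedup is the geometric mean of the per-payload-size speedups; the values below are hypothetical placeholders, not the paper's measurements.

    import math

    def geomean(xs):
        # geometric mean: n-th root of the product, computed in log space
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    # hypothetical speedups (baseline cycles / collective-NoC cycles) per KiB
    speedups_by_kib = {1: 7.0, 2: 6.2, 4: 5.6, 8: 4.9, 16: 4.4, 32: 4.0}
    print(f"geomean speedup: {geomean(list(speedups_by_kib.values())):.1f}x")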

What carries the argument

Direct Compute Access (DCA), a mechanism that grants the interconnect fabric direct access to the cores' computational resources so reductions can be executed inside the network with high throughput.
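As rough intuition for why DCA-style in-network reduction can win (a toy latency model with assumed constants, not the paper's microarchitecture): a software tree reduction pays a DMA launch, a local accumulate, and a barrier per round, while an in-network reduction streams contributions through routers that accumulate in flight.

    import math

    HOP, DMA_SETUP, BARRIER, ADD = 1, 50, 30, 2  # assumed cycle costs

    def sw_tree_reduction(num_clusters, elems):
        # ceil(log2(P)) rounds, each moving a partial result, adding locally,
        # and synchronizing with a barrier
        rounds = math.ceil(math.log2(num_clusters))
        return rounds * (DMA_SETUP + elems * (HOP + ADD) + BARRIER)

    def in_network_reduction(mesh_side, elems):
        # pipeline fill over the worst-case path, then one add per element
        # streamed through the routers
        fill = 2 * (mesh_side - 1) * HOP
        return fill + elems * (HOP + ADD)

    print(sw_tree_reduction(16, 2048))   # 24896 cycles in this toy model
    print(in_network_reduction(4, 2048)) # 6150 cycles, ~4x faster here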

Load-bearing premise

That the 16.9 percent router area overhead stays acceptable and that in-network reductions introduce neither new pipeline stalls nor coherence problems when used inside full GEMM workloads on large meshes.

What would settle it

End-to-end latency and energy measurements of complete GEMM workloads executed on a large-mesh simulator or prototype that includes the new NoC versus an otherwise identical baseline that uses only unicast traffic.
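A quick Amdahl-style bound shows why the translation from isolated collective speedups to end-to-end gains needs checking (a back-of-envelope sketch; the 80% communication fraction is illustrative, not from the paper):

    def end_to_end_speedup(f_comm, s_collective, overlap=0.0):
        # baseline = compute share (1 - f_comm) plus fully exposed
        # communication; collectives shrink communication by s_collective,
        # and a fraction `overlap` of what remains hides behind compute
        exposed = f_comm * (1.0 - overlap) / s_collective
        return 1.0 / ((1.0 - f_comm) + exposed)

    # with 80% of baseline runtime spent on multicast, a 5.3x collective
    # speedup alone yields ~2.85x end to end; reaching the reported 3.8x
    # would additionally require overlapping communication with compute
    print(round(end_to_end_speedup(0.80, 5.3), 2))               # 2.85
    print(round(end_to_end_speedup(0.80, 5.3, overlap=0.5), 2))  # 3.63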

Figures

Figures reproduced from arXiv: 2603.26438 by Chen Wu, Lorenzo Leone, Luca Benini, Luca Colagrande, Raphael Roth, Tim Fischer.

Figure 1: (a) Overview of the 5 × 4 collective-capable NoC system. (b) Cluster tile and its main components: (c) compute cluster, (d) network interface and (e) router with collective extensions. (f) Centralized reduction controller enabling arithmetic in-network computation. Highlighted in orange are all modules affected (partially highlighted) or introduced (fully highlighted) by our extensions.
Figure 2: (a) Area breakdown of the router for different hardware configurations. Percentages indicate the area overhead with respect to the baseline. (b) Runtime of the software and hardware barriers.
Figure 3: Placed-and-routed implementation of the cluster tile, with the FPUs, the router and the L1 SPM interconnect highlighted. The remaining area is occupied by the Snitch cores, L1 SPM, I$ subsystem and cluster DMA, which are not highlighted for clarity.
Figure 5: Runtime (in cycles) of: (a) a 1D multicast transfer; (b) the seq implementation for various settings of αi + δ, ∀ i > 0, labeled next to each curve; (c) a 2D multicast transfer.
Figure 6: Three software reduction implementations: (a) naive tree, (b) double-buffered tree, (c) pipelined sequential. Each block represents a DMA transfer: the containing row represents the initiator and the label indicates source and destination (source → destination). Red lines represent barriers. Colored blocks represent the reduction computations. The surrounding analysis models chunked transfer and compute times as tm = αm + (n/k)·βm and tc = αc + (n/k)·βc.
Figure 8: GEMM dataflows mapped onto a 4×4 tile-based architecture: (a) SUMMA GEMM (van de Geijn & Watts, 1995); (b) FusedConcatLinear GEMM (Potocnik et al., 2024). Colored background indicates L2 storage location (blue: m0, teal: m1, yellow: m2, red: m3). Colored arrows illustrate data movement: the tail marks the initiator, the color the source L2 tile, and the traversed clusters are the destinations.
Figure 9: (a) Runtime of the communication and computation phases of the SUMMA GEMM kernel. (b) Hardware vs. software reduction speedup for the FusedConcatLinear GEMM kernel. The X-axis uses a logarithmic (base 2) scale.
Figure 10: (a) reports the energy savings on the SUMMA GEMM kernel across different mesh sizes.
Figure 11: 2D naive sequential multicast.
Figure 12: 2D pipelined sequential multicast. The modeled runtime (Eq. 15 in the paper) is Tseq = tm + 2(c − 2)·max(tm, tc) + (k − 1)·tc + max(tm, tc) + 2(r − 2)·max(tm, tc) + k·tc + (2(c − 2) + 2(r − 2) + 2k)·δ.
Figure 13: 2D tree multicast.
Figure 14: 2D naive tree reduction.
Figure 15: 2D double-buffered tree reduction.
Figure 16: 2D pipelined sequential reduction.
read the original abstract

The exponential increase in Machine Learning (ML) model size and complexity has driven unprecedented demand for high-performance acceleration systems. As technology scaling enables the integration of thousands of computing elements onto a single die, the boundary between distributed and on-chip systems has blurred, making efficient on-chip collective communication increasingly critical. In this work, we present a lightweight, collective-capable Network on Chip (NoC) that supports efficient barrier synchronization alongside scalable, high-bandwidth multicast and reduction operations, co-designed for the next generation of ML accelerators. We introduce Direct Compute Access (DCA), a novel paradigm that grants the interconnect fabric direct access to the cores' computational resources, enabling high-throughput in-network reductions with a small 16.9% router area overhead. Through in-network hardware acceleration, we achieve 5.3x and 2.8x geomean speedups on multicast and reduction operations involving between 1 and 32 KiB of data, respectively. Furthermore, by keeping communication off the critical path in GEMM workloads, these features allow our architecture to scale efficiently to large meshes, resulting in up to 3.8x and 2.4x estimated performance gains through multicast and reduction support, respectively, compared to a baseline unicast NoC architecture, and up to 1.17x estimated energy savings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a lightweight collective-capable NoC for large-scale ML accelerators, introducing Direct Compute Access (DCA) to enable high-throughput in-network multicast, reduction, and barrier operations with a reported 16.9% router area overhead. It claims geomean speedups of 5.3x on multicast and 2.8x on reduction for 1-32 KiB data sizes, which translate to estimated GEMM performance gains of up to 3.8x and 2.4x respectively versus a unicast baseline, plus 1.17x energy savings, by keeping collectives off the critical path in large meshes.

Significance. If the off-critical-path assumption and translation from isolated collective benchmarks to full GEMM workloads hold, the work would offer a practical, low-overhead approach to scaling on-chip communication in thousand-core ML accelerators. The emphasis on hardware-accelerated collectives with quantified area cost addresses a timely bottleneck in distributed tensor computations.

major comments (2)
  1. [Abstract] The headline 3.8x/2.4x GEMM gains and 1.17x energy savings rest on the unverified premise that DCA-enabled reductions can be overlapped with compute without pipeline stalls, coherence round-trips, or mesh-scale congestion; no cycle-accurate traces, barrier-insertion analysis, or sensitivity results for >32x32 arrays are supplied to substantiate this.
  2. [Abstract] The 5.3x multicast and 2.8x reduction geomean speedups are measured only on standalone 1-32 KiB collectives; without explicit methodology, error bars, or workload traces, it is impossible to assess how reduction completion latency interacts with GEMM tile scheduling.
minor comments (1)
  1. [Abstract] Simulation assumptions, baseline router configuration, mesh dimensions, and energy-model details are omitted, making the reported numbers difficult to reproduce or compare.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concerns regarding the abstract's performance claims and evaluation methodology below, providing clarifications while noting where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] The headline 3.8x/2.4x GEMM gains and 1.17x energy savings rest on the unverified premise that DCA-enabled reductions can be overlapped with compute without pipeline stalls, coherence round-trips, or mesh-scale congestion; no cycle-accurate traces, barrier-insertion analysis, or sensitivity results for >32x32 arrays are supplied to substantiate this.

    Authors: The GEMM gains are estimates obtained by combining measured collective latencies with an analytical model of tile scheduling that assumes overlap is feasible. DCA is explicitly designed to enable this overlap by granting the NoC direct access to compute units for in-network reductions, avoiding core stalls and keeping collectives off the critical path. We agree that additional validation would be beneficial and will incorporate cycle-accurate simulation traces, barrier insertion analysis, and sensitivity results for meshes up to 64x64 in the revised manuscript. revision: yes

  2. Referee: [Abstract] The 5.3x multicast and 2.8x reduction geomean speedups are measured only on standalone 1-32 KiB collectives; without explicit methodology, error bars, or workload traces, it is impossible to assess how reduction completion latency interacts with GEMM tile scheduling.

    Authors: The reported geomean speedups come from cycle-accurate NoC simulations of standalone multicast and reduction operations across 1-32 KiB data sizes, with the full simulation methodology and configuration details provided in the Evaluation section. We will revise the manuscript to add error bars to the speedup figures, expand the explicit methodology description, and include a clearer discussion of how the measured latencies map to GEMM tile scheduling under the non-blocking collective assumption (sketched below). Full end-to-end workload traces are outside the paper's primary scope on NoC microarchitecture but can be referenced via the analytical model. revision: partial
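To make the non-blocking assumption in the response above concrete (a hedged sketch; the 1000/400-cycle latencies and 16 tile steps are illustrative, not measured values): per GEMM tile, an overlapped collective costs max(compute, collective) cycles, while a blocking one costs their sum, which is what decides whether collectives sit on the critical path.

    def tile_time(t_compute, t_collective, non_blocking=True):
        # overlapped collectives hide behind compute; blocking ones serialize
        return max(t_compute, t_collective) if non_blocking else t_compute + t_collective

    STEPS = 16  # hypothetical tile iterations in a SUMMA-style sweep
    for nb in (True, False):
        total = STEPS * tile_time(1000, 400, non_blocking=nb)
        print("overlapped" if nb else "blocking", total)  # 16000 vs 22400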

Circularity Check

0 steps flagged

No circularity: empirical hardware measurements and estimates stand independently

full rationale

This is a hardware architecture paper presenting a NoC design with Direct Compute Access for collectives. All headline claims (5.3x/2.8x geomean speedups on 1-32 KiB collectives, 3.8x/2.4x estimated GEMM gains, 1.17x energy savings, 16.9% router area overhead) are presented as direct outcomes of RTL simulation, synthesis, and workload modeling. No equations, fitted parameters, or self-citations are used to derive the performance numbers; the results are obtained by running the proposed hardware against a baseline unicast NoC on the same traffic patterns. The derivation chain is therefore self-contained and externally falsifiable via reproduction of the reported cycle counts and area numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claims rest on the feasibility of integrating DCA hardware with low area cost and on workload assumptions for GEMM that keep collectives off the critical path; no explicit free parameters are fitted in the abstract.

invented entities (1)
  • Direct Compute Access (DCA) no independent evidence
    purpose: Grants the interconnect fabric direct access to cores' computational resources for high-throughput in-network reductions
    New paradigm introduced to enable the reported speedups; independent evidence not provided in abstract

pith-pipeline@v0.9.0 · 5551 in / 1169 out tokens · 41762 ms · 2026-05-14T22:29:10.405898+00:00 · methodology

discussion (0)

