pith. machine review for the scientific record.

arxiv: 2603.26438 · v2 · submitted 2026-03-27 · 💻 cs.AR · cs.DC

Recognition: no theorem link

A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 22:29 UTC · model grok-4.3

classification 💻 cs.AR cs.DC
keywords Network on Chip · Collective Communication · ML Accelerators · Multicast · Reduction · Direct Compute Access · GEMM · On-Chip Interconnect

The pith

A NoC with direct compute access to cores accelerates multicast by 5.3x and reductions by 2.8x for ML accelerators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a lightweight Network-on-Chip (NoC) that adds support for barrier synchronization, multicast, and reduction collectives, targeting the communication patterns inside large machine-learning accelerators. It introduces Direct Compute Access so that routers can perform reductions by reaching directly into compute cores rather than shuttling data through memory, which keeps collective traffic off the critical path in matrix-multiplication workloads. If the approach works, accelerators with thousands of processing elements can scale without communication latency and energy costs growing in proportion. Readers should care because standard unicast NoCs already struggle to keep up as model sizes and core counts increase.
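To make the multicast intuition concrete, consider a toy link-traffic count (a sketch under assumed XY routing and a corner source, not the paper's model): a unicast NoC injects one packet per destination, while an in-network multicast tree carries the payload over each spanning-tree link at most once.

    # Toy link-traffic model for an r x c mesh. Assumptions (XY routing,
    # corner source, single-packet payload) are illustrative, not the paper's.
    def unicast_link_traversals(r, c, src=(0, 0)):
        # one packet per destination, each traveling its Manhattan distance
        return sum(abs(x - src[0]) + abs(y - src[1])
                   for x in range(r) for y in range(c) if (x, y) != src)

    def tree_multicast_link_traversals(r, c):
        # a multicast tree delivers the payload once per spanning-tree edge
        return r * c - 1

    print(unicast_link_traversals(8, 8))         # 448 link traversals
    print(tree_multicast_link_traversals(8, 8))  # 63, roughly 7x less traffic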

Core claim

The authors show that a collective-capable NoC built around Direct Compute Access delivers 5.3x geomean speedup on multicast and 2.8x on reduction for payloads between 1 and 32 KiB, translating into estimated 3.8x and 2.4x overall performance gains plus 1.17x energy savings in GEMM workloads on large meshes, all at a 16.9 percent router-area cost compared with a baseline unicast design.
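For readers unfamiliar with the metric, a geomean speedup is the geometric mean of the per-payload-size speedups; the values below are hypothetical placeholders, not the paper's measurements.

    import math

    def geomean(xs):
        # geometric mean: n-th root of the product, computed in log space
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    # hypothetical speedups (baseline cycles / collective-NoC cycles) per KiB
    speedups_by_kib = {1: 7.0, 2: 6.2, 4: 5.6, 8: 4.9, 16: 4.4, 32: 4.0}
    print(f"geomean speedup: {geomean(list(speedups_by_kib.values())):.1f}x")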

What carries the argument

Direct Compute Access (DCA), a mechanism that grants the interconnect fabric direct access to the cores' computational resources so reductions can be executed inside the network with high throughput.
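As rough intuition for why DCA-style in-network reduction can win (a toy latency model with assumed constants, not the paper's microarchitecture): a software tree reduction pays a DMA launch, a local accumulate, and a barrier per round, while an in-network reduction streams contributions through routers that accumulate in flight.

    import math

    HOP, DMA_SETUP, BARRIER, ADD = 1, 50, 30, 2  # assumed cycle costs

    def sw_tree_reduction(num_clusters, elems):
        # ceil(log2(P)) rounds, each moving a partial result, adding locally,
        # and synchronizing with a barrier
        rounds = math.ceil(math.log2(num_clusters))
        return rounds * (DMA_SETUP + elems * (HOP + ADD) + BARRIER)

    def in_network_reduction(mesh_side, elems):
        # pipeline fill over the worst-case path, then one add per element
        # streamed through the routers
        fill = 2 * (mesh_side - 1) * HOP
        return fill + elems * (HOP + ADD)

    print(sw_tree_reduction(16, 2048))   # 24896 cycles in this toy model
    print(in_network_reduction(4, 2048)) # 6150 cycles, ~4x faster here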

Load-bearing premise

That the 16.9 percent router area overhead stays acceptable and that in-network reductions introduce neither new pipeline stalls nor coherence problems when used inside full GEMM workloads on large meshes.

What would settle it

End-to-end latency and energy measurements of complete GEMM workloads executed on a large-mesh simulator or prototype that includes the new NoC versus an otherwise identical baseline that uses only unicast traffic.
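A quick Amdahl-style bound shows why the translation from isolated collective speedups to end-to-end gains needs checking (a back-of-envelope sketch; the 80% communication fraction is illustrative, not from the paper):

    def end_to_end_speedup(f_comm, s_collective, overlap=0.0):
        # baseline = compute share (1 - f_comm) plus fully exposed
        # communication; collectives shrink communication by s_collective,
        # and a fraction `overlap` of what remains hides behind compute
        exposed = f_comm * (1.0 - overlap) / s_collective
        return 1.0 / ((1.0 - f_comm) + exposed)

    # with 80% of baseline runtime spent on multicast, a 5.3x collective
    # speedup alone yields ~2.85x end to end; reaching the reported 3.8x
    # would additionally require overlapping communication with compute
    print(round(end_to_end_speedup(0.80, 5.3), 2))               # 2.85
    print(round(end_to_end_speedup(0.80, 5.3, overlap=0.5), 2))  # 3.63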

Figures

Figures reproduced from arXiv: 2603.26438 by Chen Wu, Lorenzo Leone, Luca Benini, Luca Colagrande, Raphael Roth, Tim Fischer.

Figure 1: (a) Overview of the 5 × 4 collective-capable NoC system. (b) Cluster tile and its main components: (c) compute cluster, (d) network interface and (e) router with collective extensions. (f) Centralized reduction controller enabling arithmetic in-network computation. Highlighted in orange are all modules affected (partially highlighted) or introduced (fully highlighted) by our extensions.
Figure 2: (a) Area breakdown of the router for different hardware configurations. Percentages indicate the area overhead with respect to the baseline. (b) Runtime of the software and hardware barriers.
Figure 3: Placed-and-routed implementation of the cluster tile, with the FPUs, the router and the L1 SPM interconnect highlighted. The remaining area is occupied by the Snitch cores, L1 SPM, I$ subsystem and cluster DMA, which are not highlighted for clarity.
Figure 5: Runtime (in cycles) of: (a) a 1D multicast transfer; (b) the seq implementation for various settings of αi + δ, ∀ i > 0, labeled next to each curve; (c) a 2D multicast transfer.
Figure 6: Three software reduction implementations: (a) naive tree, (b) double-buffered tree, (c) pipelined sequential. Each block represents a DMA transfer: the containing row represents the initiator and the label indicates source and destination (source → destination). Red lines represent barriers. Colored blocks represent the reduction computations. The surrounding analysis models chunked transfer and compute times as tm = αm + (n/k)·βm and tc = αc + (n/k)·βc.
Figure 8: GEMM dataflows mapped onto a 4×4 tile-based architecture: (a) SUMMA GEMM (van de Geijn & Watts, 1995); (b) FusedConcatLinear GEMM (Potocnik et al., 2024). Colored background indicates L2 storage location (blue: m0, teal: m1, yellow: m2, red: m3). Colored arrows illustrate data movement: the tail marks the initiator, the color the source L2 tile, and the traversed clusters are the destinations.
Figure 9: (a) Runtime of the communication and computation phases of the SUMMA GEMM kernel. (b) Hardware vs. software reduction speedup for the FusedConcatLinear GEMM kernel. The X-axis uses a logarithmic (base 2) scale.
Figure 10: (a) reports the energy savings on the SUMMA GEMM kernel across different mesh sizes.
Figure 11: 2D naive sequential multicast.
Figure 12: 2D pipelined sequential multicast. The modeled runtime (Eq. 15 in the paper) is Tseq = tm + 2(c − 2)·max(tm, tc) + (k − 1)·tc + max(tm, tc) + 2(r − 2)·max(tm, tc) + k·tc + (2(c − 2) + 2(r − 2) + 2k)·δ.
Figure 13: 2D tree multicast.
Figure 14: 2D naive tree reduction.
Figure 15: 2D double-buffered tree reduction.
Figure 16: 2D pipelined sequential reduction.
read the original abstract

The exponential increase in Machine Learning (ML) model size and complexity has driven unprecedented demand for high-performance acceleration systems. As technology scaling enables the integration of thousands of computing elements onto a single die, the boundary between distributed and on-chip systems has blurred, making efficient on-chip collective communication increasingly critical. In this work, we present a lightweight, collective-capable Network on Chip (NoC) that supports efficient barrier synchronization alongside scalable, high-bandwidth multicast and reduction operations, co-designed for the next generation of ML accelerators. We introduce Direct Compute Access (DCA), a novel paradigm that grants the interconnect fabric direct access to the cores' computational resources, enabling high-throughput in-network reductions with a small 16.9% router area overhead. Through in-network hardware acceleration, we achieve 5.3x and 2.8x geomean speedups on multicast and reduction operations involving between 1 and 32 KiB of data, respectively. Furthermore, by keeping communication off the critical path in GEMM workloads, these features allow our architecture to scale efficiently to large meshes, resulting in up to 3.8x and 2.4x estimated performance gains through multicast and reduction support, respectively, compared to a baseline unicast NoC architecture, and up to 1.17x estimated energy savings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a lightweight collective-capable NoC for large-scale ML accelerators, introducing Direct Compute Access (DCA) to enable high-throughput in-network multicast, reduction, and barrier operations with a reported 16.9% router area overhead. It claims geomean speedups of 5.3x on multicast and 2.8x on reduction for 1-32 KiB data sizes, which translate to estimated GEMM performance gains of up to 3.8x and 2.4x respectively versus a unicast baseline, plus 1.17x energy savings, by keeping collectives off the critical path in large meshes.

Significance. If the off-critical-path assumption and translation from isolated collective benchmarks to full GEMM workloads hold, the work would offer a practical, low-overhead approach to scaling on-chip communication in thousand-core ML accelerators. The emphasis on hardware-accelerated collectives with quantified area cost addresses a timely bottleneck in distributed tensor computations.

major comments (2)
  1. [Abstract] The headline 3.8x/2.4x GEMM gains and 1.17x energy savings rest on the unverified premise that DCA-enabled reductions can be overlapped with compute without pipeline stalls, coherence round-trips, or mesh-scale congestion; no cycle-accurate traces, barrier-insertion analysis, or sensitivity results for >32x32 arrays are supplied to substantiate this.
  2. [Abstract] The 5.3x multicast and 2.8x reduction geomean speedups are measured only on standalone 1-32 KiB collectives; without explicit methodology, error bars, or workload traces, it is impossible to assess how reduction completion latency interacts with GEMM tile scheduling.
minor comments (1)
  1. [Abstract] Simulation assumptions, baseline router configuration, mesh dimensions, and energy-model details are omitted, making the reported numbers difficult to reproduce or compare.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concerns regarding the abstract's performance claims and evaluation methodology below, providing clarifications while noting where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] The headline 3.8x/2.4x GEMM gains and 1.17x energy savings rest on the unverified premise that DCA-enabled reductions can be overlapped with compute without pipeline stalls, coherence round-trips, or mesh-scale congestion; no cycle-accurate traces, barrier-insertion analysis, or sensitivity results for >32x32 arrays are supplied to substantiate this.

    Authors: The GEMM gains are estimates obtained by combining measured collective latencies with an analytical model of tile scheduling that assumes overlap is feasible. DCA is explicitly designed to enable this overlap by granting the NoC direct access to compute units for in-network reductions, avoiding core stalls and keeping collectives off the critical path. We agree that additional validation would be beneficial and will incorporate cycle-accurate simulation traces, barrier insertion analysis, and sensitivity results for meshes up to 64x64 in the revised manuscript. revision: yes

  2. Referee: [Abstract] The 5.3x multicast and 2.8x reduction geomean speedups are measured only on standalone 1-32 KiB collectives; without explicit methodology, error bars, or workload traces, it is impossible to assess how reduction completion latency interacts with GEMM tile scheduling.

    Authors: The reported geomean speedups come from cycle-accurate NoC simulations of standalone multicast and reduction operations across 1-32 KiB data sizes, with the full simulation methodology and configuration details provided in the Evaluation section. We will revise the manuscript to add error bars to the speedup figures, expand the explicit methodology description, and include a clearer discussion of how the measured latencies map to GEMM tile scheduling under the non-blocking collective assumption (sketched below). Full end-to-end workload traces are outside the paper's primary scope on NoC microarchitecture but can be referenced via the analytical model. revision: partial
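To make the non-blocking assumption in the response above concrete (a hedged sketch; the 1000/400-cycle latencies and 16 tile steps are illustrative, not measured values): per GEMM tile, an overlapped collective costs max(compute, collective) cycles, while a blocking one costs their sum, which is what decides whether collectives sit on the critical path.

    def tile_time(t_compute, t_collective, non_blocking=True):
        # overlapped collectives hide behind compute; blocking ones serialize
        return max(t_compute, t_collective) if non_blocking else t_compute + t_collective

    STEPS = 16  # hypothetical tile iterations in a SUMMA-style sweep
    for nb in (True, False):
        total = STEPS * tile_time(1000, 400, non_blocking=nb)
        print("overlapped" if nb else "blocking", total)  # 16000 vs 22400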

Circularity Check

0 steps flagged

No circularity: empirical hardware measurements and estimates stand independently

full rationale

This is a hardware architecture paper presenting a NoC design with Direct Compute Access for collectives. All headline claims (5.3x/2.8x geomean speedups on 1-32 KiB collectives, 3.8x/2.4x estimated GEMM gains, 1.17x energy savings, 16.9% router area overhead) are presented as direct outcomes of RTL simulation, synthesis, and workload modeling. No equations, fitted parameters, or self-citations are used to derive the performance numbers; the results are obtained by running the proposed hardware against a baseline unicast NoC on the same traffic patterns. The derivation chain is therefore self-contained and externally falsifiable via reproduction of the reported cycle counts and area numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claims rest on the feasibility of integrating DCA hardware with low area cost and on workload assumptions for GEMM that keep collectives off the critical path; no explicit free parameters are fitted in the abstract.

invented entities (1)
  • Direct Compute Access (DCA) no independent evidence
    purpose: Grants the interconnect fabric direct access to cores' computational resources for high-throughput in-network reductions
    New paradigm introduced to enable the reported speedups; independent evidence not provided in abstract

pith-pipeline@v0.9.0 · 5551 in / 1169 out tokens · 41762 ms · 2026-05-14T22:29:10.405898+00:00 · methodology

discussion (0)

