Operator Fusion for LLM Inference on the Tensix Architecture

Jie Yu; Ke Li; Lili Liu; Qingbo Wu; Ruian Zhang; Wenzhu Wang

arxiv: 2606.09879 · v1 · pith:KN7ZSGHKnew · submitted 2026-06-03 · 💻 cs.LG

Pith reviewed 2026-06-28 06:59 UTC · model grok-4.3

classification 💻 cs.LG

keywords: operator fusion · LLM inference · Tensix architecture · RMSNorm · matrix multiplication · NoC multicast · data locality · on-device inference

The pith

Fusing RMSNorm with matrix multiplication enables back-to-back execution in on-chip SRAM to cut LLM inference latency on Tensix hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that fusing RMSNorm directly with the matrix multiplications in self-attention and the feed-forward network lets the memory-bound normalization and the compute-bound multiplication run consecutively inside on-chip SRAM. This avoids writing intermediate activations back to DRAM and reduces scheduling overhead on the Tensix mesh. A NoC multicast scheme lets row and column master cores distribute inputs and weights across the core array, so cores can work in parallel without adding DRAM pressure. Tests on Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B report up to 37.44 percent lower attention latency, up to 15.89 percent lower MLP latency, and up to 7.91 percent lower latency per decoder layer, while the Pearson correlation coefficient against the unfused baseline stays above 98.75 percent. The work targets the memory-bandwidth bottleneck that dominates on-device Transformer inference.

Core claim

By fusing RMSNorm with matrix multiplication in self-attention and in the FFN, the method enables back-to-back execution of memory-bound and compute-bound operators in on-chip SRAM to significantly reduce DRAM reads/writes of intermediate results and scheduling overhead. To support multi-core parallelism, a NoC-based multicast mechanism is leveraged in which row/column master nodes efficiently distribute inputs and weights across the core mesh, alleviating DRAM bandwidth contention. Experiments on the Wormhole platform show up to 37.44% latency reduction for attention and 15.89% for MLP, with up to 7.91% reduction per decoder layer, while Pearson Correlation Coefficient remains above 98.75%.

What carries the argument

RMSNorm fusion with matrix multiplication plus NoC multicast for row/column data distribution across the core mesh.
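The difference between the unfused and fused schedules can be illustrated with a minimal pure-Python sketch. It is not the authors' Tensix kernel: the function names, the toy shapes, and the use of a Python list as a stand-in for a DRAM-resident intermediate tensor are all assumptions made for illustration.

```python
import math

def rmsnorm_row(x, gamma, eps=1e-6):
    """RMSNorm of one activation row: x / sqrt(mean(x^2) + eps) * gamma."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]

def row_times_matrix(row, weight):
    """Multiply a length-K row vector by a K x N weight matrix."""
    n_cols = len(weight[0])
    return [sum(row[k] * weight[k][j] for k in range(len(row)))
            for j in range(n_cols)]

def unfused(x_rows, gamma, weight):
    """Two operators: the normalized tensor is fully materialized
    (a stand-in for a DRAM round trip) before the matmul starts."""
    normalized = [rmsnorm_row(r, gamma) for r in x_rows]
    intermediate_elems = sum(len(r) for r in normalized)
    out = [row_times_matrix(r, weight) for r in normalized]
    return out, intermediate_elems

def fused(x_rows, gamma, weight):
    """One operator: each normalized row is consumed by the matmul
    immediately, so no full intermediate tensor is stored."""
    out = [row_times_matrix(rmsnorm_row(r, gamma), weight) for r in x_rows]
    return out, 0

# Toy inputs (hypothetical values, 2 tokens x 4 hidden dims, 4 x 2 weights).
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
gamma = [1.0, 0.5, 2.0, 1.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]

ref, spilled = unfused(x, gamma, weight)
res, kept = fused(x, gamma, weight)
print("outputs match:", res == ref)
print("intermediate elements stored:", spilled, "vs", kept)
```

Both schedules perform the same arithmetic; the only thing that changes is whether the normalized activations exist as a separate stored tensor, which is the quantity the paper's fusion aims to eliminate from DRAM traffic.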

If this is right

Intermediate results stay in SRAM instead of returning to DRAM after each operator.
Scheduling overhead drops because fused kernels run without host intervention between them.
Multi-core bandwidth contention falls because multicast replaces repeated DRAM reads.
End-to-end decoder-layer latency improves by up to 7.91 percent on the tested models.
Numerical outputs remain consistent with the unfused baseline at PCC above 98.75 percent.
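The numerical-consistency check in the last point can be sketched as a plain Pearson correlation between flattened baseline and fused outputs. The sample values below are made up for illustration and are not measurements from the paper; only the 98.75 percent threshold comes from the reported results.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical flattened outputs of one layer (illustrative numbers only).
baseline_out = [0.10, -0.52, 1.30, 0.74, -1.10, 0.05]
fused_out = [0.11, -0.50, 1.28, 0.75, -1.12, 0.04]

pcc = pearson(baseline_out, fused_out)
meets_threshold = pcc > 0.9875  # threshold reported in the paper
print(f"PCC = {pcc:.4f}, above 98.75%: {meets_threshold}")
```

PCC measures linear agreement only, so a high value does not rule out a uniform scale or offset error; an absolute-error check alongside it would cover that case.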

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion pattern could be applied to other element-wise operations that sit between large matrix multiplies on similar mesh architectures.
Lower DRAM traffic may allow larger batch sizes or context lengths before hitting memory limits on the same hardware.
Extending the multicast scheme to non-uniform weight distributions might further reduce contention in deeper layers.
The approach could be tested on models larger than 4B parameters to check whether the relative gains remain constant.

Load-bearing premise

The Tensix NoC and SRAM can execute the fused operators and multicast data movement without creating new bottlenecks or numerical errors beyond those already measured.

What would settle it

Running the fused and unfused kernels on the same Wormhole hardware and Qwen models, and finding a latency reduction below 5 percent or a PCC below 98 percent, would falsify the efficiency claim.
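That criterion reduces to a small piece of arithmetic, sketched here. The helper names and the example latencies are hypothetical; the 5 percent and 98 percent thresholds are the ones stated above.

```python
def latency_reduction_pct(baseline_ms, fused_ms):
    """Relative latency reduction of the fused run, in percent."""
    return 100.0 * (baseline_ms - fused_ms) / baseline_ms

def claim_survives(baseline_ms, fused_ms, pcc,
                   min_reduction_pct=5.0, min_pcc=0.98):
    """True if the run clears both thresholds of the falsification test."""
    reduction = latency_reduction_pct(baseline_ms, fused_ms)
    return reduction >= min_reduction_pct and pcc >= min_pcc

# Hypothetical timings chosen to reproduce a 37.44% reduction.
print(latency_reduction_pct(10.0, 6.256))
print(claim_survives(10.0, 6.256, pcc=0.9875))
print(claim_survives(10.0, 9.7, pcc=0.99))
```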

Figures

Figures reproduced from arXiv: 2606.09879 by Jie Yu, Ke Li, Lili Liu, Qingbo Wu, Ruian Zhang, Wenzhu Wang.

**Figure 1.** Decoder-only architecture overview. Adjacent text from the paper: The core objective is to maximize on-chip data locality and minimize accesses to off-chip DRAM, thereby enhancing edge inference performance [4]. From an implementation perspective, operator fusion often models a neural network as a DAG and partitions it into fuseable subgraphs. In these subgraphs, outputs of upstream operators are directly consumed by downstream operator…

**Figure 2.** Tensix architecture overview. Adjacent text from the paper: … hardware units for matrix operations (FPU), specialized units for vector operations (SFPU), and 1.5 MB local SRAM. Typical dataflow: data is delivered to the core via the on-chip NoC, unpacked, processed by the specialized compute units, repacked, and then sent via the NoC to DRAM or other Tensix cores. As shown in Fig. 2b, the baby cores handle instruction scheduling and cont…

**Figure 3.** Single-Tensix operator fusion illustration.

**Figure 4.** Multi-Tensix operator fusion illustration.

**Figure 5.** Multicast acceleration illustration. Adjacent text from the paper (Section 4, Experiment Results): We use the Tenstorrent Wormhole N300 accelerator card as our evaluation platform. The device connects via PCIe to a ThinkPad X1 laptop running openKylin SP2, forming a representative edge inference environment. The N300 integrates two Tensix chips for a total of 128 Tensix cores, 24 GB GDDR6, and 192 MB SRAM. Its peak performance reaches 466 TFLOPS …

Original abstract

This study addresses on-device inference bottlenecks of Transformer models on Tenstorrent's Tensix architecture and proposes an operator fusion strategy that enhances data locality. RMSNorm is fused with matrix multiplication in self-attention and in the FFN, enabling back-to-back execution of memory-bound and compute-bound operators in on-chip SRAM to significantly reduce DRAM reads/writes of intermediate results and scheduling overhead. To support multi-core parallelism, a NoC-based multicast mechanism is leveraged in which row/column master nodes efficiently distribute inputs and weights across the core mesh, alleviating DRAM bandwidth contention. Experiments on the Wormhole platform with Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B show up to 37.44% latency reduction for attention and 15.89% for MLP, with up to 7.91% reduction per decoder layer, while Pearson Correlation Coefficient (PCC) remains above 98.75%, confirming significant end-to-end efficiency gains under numerical consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper applies known operator fusion to Tensix with RMSNorm-matmul pairing and NoC multicast, reporting latency cuts on small Qwen models, but omits sequence lengths and batch sizes so the SRAM locality claim stays unverified.

The main takeaway is that this is a targeted implementation paper for one hardware platform rather than a general advance in fusion techniques. It fuses RMSNorm with the matrix multiplications in both the attention and FFN blocks, then uses row/column master nodes on the NoC to multicast data across cores. That setup lets the fused operators run back-to-back in on-chip SRAM and cuts some DRAM traffic plus scheduling cost. On Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B it records up to 37.44% lower attention latency and up to 15.89% lower MLP latency, with per-decoder-layer gains of up to 7.91% and PCC above 98.75%.

What stands out is the concrete numbers on real Wormhole silicon and the explicit use of the architecture's multicast feature to handle multi-core distribution. Those details make the work useful for anyone already targeting Tensix.

The soft spot is the missing input dimensions. The central claim rests on activations staying in SRAM so the fused operators avoid DRAM spills, yet the abstract and reported experiments give no sequence lengths or batch sizes. Without those, it is impossible to check whether the measured speedups actually come from the fusion or simply from the NoC multicast. The paper would be stronger with a table of tested shapes and a short memory-footprint calculation.
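The kind of memory-footprint calculation asked for here can be sketched in a few lines. Several inputs are assumptions rather than facts from the paper: bfloat16 activations (2 bytes per element), the hidden sizes 896, 1024 and 2560 taken from the public model configurations and worth re-checking, and the illustrative sequence lengths. The 1.5 MB per-core SRAM and the 128-core count come from the figure excerpts above. Weights and circular buffers are ignored, so this is a lower bound on what must fit.

```python
BYTES_PER_ELEM = 2                 # assumption: bfloat16 activations
SRAM_PER_CORE = int(1.5 * 2**20)   # 1.5 MB per Tensix core, read as MiB here
NUM_CORES = 128                    # N300 total, per the paper excerpt

# Assumed hidden sizes from public model configs; verify before relying on them.
HIDDEN = {"Qwen2.5-0.5B": 896, "Qwen3-0.6B": 1024, "Qwen3-4B": 2560}

def activation_bytes(seq_len, hidden, batch=1):
    """Size of one [batch, seq_len, hidden] activation tensor in bytes."""
    return batch * seq_len * hidden * BYTES_PER_ELEM

def fits_one_core(seq_len, hidden, batch=1):
    """Would the whole activation tensor fit in a single core's SRAM?"""
    return activation_bytes(seq_len, hidden, batch) <= SRAM_PER_CORE

def per_core_share(seq_len, hidden, batch=1, cores=NUM_CORES):
    """Bytes per core if the tensor is sharded evenly across all cores."""
    return activation_bytes(seq_len, hidden, batch) / cores

for name, hidden in HIDDEN.items():
    for seq_len in (128, 1024):    # illustrative sequence lengths
        size = activation_bytes(seq_len, hidden)
        print(f"{name:13s} seq={seq_len:5d} {size / 2**20:5.2f} MiB "
              f"one core: {fits_one_core(seq_len, hidden)} "
              f"share: {per_core_share(seq_len, hidden) / 1024:.1f} KiB/core")
```

Under these assumptions a single activation tensor is small next to the aggregate SRAM, which suggests the binding constraint is more likely weight tiles and buffering than the normalized activations themselves; the paper would need to report its actual shapes to confirm that.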

This is for readers who optimize inference on Tensix or similar mesh architectures and want a worked example of RMSNorm fusion. It is not broad enough for most general LLM or compiler papers.

I would send it to peer review. The measurements are on actual hardware and the numerical check is there; a referee can ask for the missing dimensions and any code artifacts. It is not a foundational result, but the implementation is specific enough to be worth checking.

Referee Report

2 major / 0 minor

Summary. The paper proposes fusing RMSNorm with matrix multiplication in self-attention and FFN layers on Tenstorrent's Tensix architecture. This enables back-to-back execution of memory-bound and compute-bound operators in on-chip SRAM, reducing DRAM traffic and scheduling overhead. A NoC-based multicast mechanism supports multi-core parallelism by distributing inputs and weights. Experiments on Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B models report up to 37.44% latency reduction for attention, 15.89% for MLP, and 7.91% per decoder layer, with PCC above 98.75%.

Significance. If the reported latency gains are attributable to the fusion keeping intermediates in SRAM rather than solely to NoC multicast, the work could offer a practical optimization for LLM inference on this hardware. The numerical consistency metric provides some reassurance on correctness, but the absence of input dimensions and implementation details limits assessment of broader applicability.

major comments (2)

[Experiments] Experiments (abstract and §4): no sequence lengths or batch sizes are reported for the Qwen model latency measurements. This is load-bearing for the central claim that fused RMSNorm+matmul executes back-to-back in on-chip SRAM without DRAM spills, as the fit depends on activation sizes.
[Methods] Methods (abstract and §3): the paper provides no description of the fusion implementation, data layout in SRAM, or how the NoC multicast interacts with the fused operators. Without these details or error analysis, the support for the performance numbers and PCC threshold cannot be verified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and have revised the manuscript accordingly to improve experimental reporting and methodological transparency.

Referee: [Experiments] Experiments (abstract and §4): no sequence lengths or batch sizes are reported for the Qwen model latency measurements. This is load-bearing for the central claim that fused RMSNorm+matmul executes back-to-back in on-chip SRAM without DRAM spills, as the fit depends on activation sizes.

Authors: We agree this information is essential for assessing the SRAM residency claim. In the revised manuscript we have added the experimental configuration details to Section 4: all reported latency results use batch size 1 with sequence lengths of 128, 256, 512 and 1024 tokens. We have also inserted a short paragraph confirming that, for these dimensions on the tested Qwen models, the fused operator intermediates remain within on-chip SRAM capacity and incur no additional DRAM traffic.

revision: yes

Referee: [Methods] Methods (abstract and §3): the paper provides no description of the fusion implementation, data layout in SRAM, or how the NoC multicast interacts with the fused operators. Without these details or error analysis, the support for the performance numbers and PCC threshold cannot be verified.

Authors: We accept that the original submission lacked sufficient implementation detail. Section 3 has been expanded to describe the fusion kernel, the SRAM data layout chosen to keep RMSNorm outputs resident for the subsequent matmul, and the precise interaction between the fused operator and the NoC multicast mechanism. We have also added a dedicated error-analysis subsection that explains the rationale for the 98.75% PCC threshold based on the observed numerical results.

revision: yes

Circularity Check

0 steps flagged

No circularity; paper reports empirical measurements only

The manuscript describes an operator-fusion implementation for the Tensix architecture and presents measured latency reductions and PCC values on specific Qwen models. No equations, derivations, fitted parameters, uniqueness theorems, or self-citation chains appear in the provided text. All performance claims rest on direct experimental reporting rather than any reduction of a 'prediction' to its own inputs. The reader's circularity score of 1.0 is consistent with this assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review uses only the abstract; ledger therefore lists only the hardware assumptions required by the described fusion and multicast claims. No free parameters or new entities are stated.

axioms (2)

domain assumption Tensix cores support back-to-back execution of fused RMSNorm and matmul inside on-chip SRAM.
Directly required for the claimed reduction in DRAM accesses.
domain assumption NoC-based multicast distributes data across the core mesh without creating new bandwidth or correctness problems.
Required for the multi-core parallelism benefit described.
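The second assumption can be made concrete with a rough traffic model. This is a sketch under stated assumptions, not the paper's scheme in detail: the 8 x 8 grid, the matrix shapes, and the simplification that each core needs one row block of the input and one column block of the weights are all hypothetical.

```python
def dram_reads_unicast(a_elems, b_elems, grid_rows, grid_cols):
    """Every core fetches its own blocks from DRAM: each row block of A is
    read once per grid column, each column block of B once per grid row."""
    return a_elems * grid_cols + b_elems * grid_rows

def dram_reads_multicast(a_elems, b_elems):
    """Row/column master cores read each block once and multicast it over
    the NoC, so A and B each leave DRAM exactly once."""
    return a_elems + b_elems

# Hypothetical problem: A is 128 x 1024 activations, B is 1024 x 1024 weights.
M, K, N = 128, 1024, 1024
GRID_ROWS, GRID_COLS = 8, 8        # assumed core grid

a_elems, b_elems = M * K, K * N
unicast = dram_reads_unicast(a_elems, b_elems, GRID_ROWS, GRID_COLS)
multicast = dram_reads_multicast(a_elems, b_elems)
print(f"unicast:   {unicast} elements read from DRAM")
print(f"multicast: {multicast} elements read from DRAM")
print(f"reduction factor: {unicast / multicast:.1f}x")
```

The model counts DRAM reads only. Multicast moves the same data over the NoC instead, so whether it creates a new bottleneck depends on NoC bandwidth and synchronization cost, which is exactly what the assumption leaves open.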

pith-pipeline@v0.9.1-grok · 5716 in / 1108 out tokens · 34891 ms · 2026-06-28T06:59:26.177016+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 1 canonical work pages

[1] Yuan, Z., Shang, Y., Zhou, Y., Dong, Z., Zhou, Z., Xue, C., Wu, B., Li, Z., Gu, Q., Lee, Y.J., et al.: LLM inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363 (2024)

[2] Wormhole, https://tenstorrent.com/hardware/wormhole, [Online; accessed 2026-01-14]

[3] tenstorrent/tt-metal: :metal: Tt-nn operator library, and tt-metalium low level kernel programming model, https://github.com/tenstorrent/tt-metal/blob/main/METALIUM_GUIDE.md#tenstorrent-architecture-overview, [Online; accessed 2026-01-13]

[4] Wang, W., Li, K., Ji, B., et al.: A survey of AI inference technologies for on-device systems. IEEE Internet of Things Journal 12(24), 51927–51950 (2025)

[5] Alwani, M., Chen, H., Ferdman, M., et al.: Fused-layer CNN accelerators. In: Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 1–12 (2016)

[6] Cai, X., Wang, Y., Zhang, L.: Optimus: An operator fusion framework for deep neural networks. ACM Transactions on Embedded Computing Systems (TECS) 22(1), 1–26 (2022)

[7] Zheng, S., Chen, S., Gao, S., et al.: Tileflow: A framework for modeling fusion dataflow via tree-based analysis. In: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 1271–1288 (2023)

[8] Tenstorrent: tenstorrent/tt-metal: :metal: Tt-nn operator library, and tt-metalium low level kernel programming model, https://github.com/tenstorrent/tt-metal

[9] tt-metal/metalium_guide.md at main · tenstorrent/tt-metal, https://github.com/tenstorrent/tt-metal/blob/main/METALIUM_GUIDE.md, [Online; accessed 2026-03-12]

[10] Waterman, A., Lee, Y., Patterson, D.A., Asanovic, K.: The RISC-V instruction set manual, volume I: User-level ISA, version 2.0. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2014-54 p. 4 (2014)

[11] Contributors to Wikimedia projects: Single program, multiple data - Wikipedia (10 2004), https://en.wikipedia.org/wiki/Single_program,_multiple_data, [Online; accessed 2026-01-14]

[12] Brown, N., Barton, R.: Accelerating stencils on the Tenstorrent Grayskull RISC-V accelerator (Sep 2024)

[13] Brown, N., Davies, J., Clair, F.L.: Exploring Fast Fourier Transforms on the Tenstorrent Wormhole. In: Neuwirth, S., Paul, A.K., Weinzierl, T., Carson, E.C. (eds.) High Performance Computing. pp. 598–612. Springer Nature Switzerland, Cham (2026)

[14] Cavagna, H.P., Cesarini, D., Bartolini, A.: Assessing Tenstorrent’s RISC-V MatMul Acceleration Capabilities (May 2025)

[15] Thüning, M.: Attention in SRAM on Tenstorrent Grayskull (Jul 2024)

[16] Almerol, J.L., Boella, E., Spera, M., et al.: Accelerating Gravitational N-Body Simulations Using the RISC-V-Based Tenstorrent Wormhole. In: Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1729–1735. SC Workshops ’25, Association for Computing Machinery, New York, ...

[17] tenstorrent/ttnn-visualizer: A comprehensive tool for visualizing and analyzing model execution, offering interactive graphs, memory plots, tensor details, buffer overviews, operation flow graphs, and multi-instance support with file or ssh-based report loading, https://github.com/tenstorrent/ttnn-visualizer, [Online; accessed 2026-01-13]
