pith. machine review for the scientific record.

arxiv: 2605.07954 · v1 · submitted 2026-05-08 · 💻 cs.DC · cs.CE · cs.ET

Recognition: no theorem link

Stencil Computations on Cerebras Wafer-Scale Engine

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:57 UTC · model grok-4.3

classification 💻 cs.DC · cs.CE · cs.ET
keywords stencil computations · wafer-scale engine · high-performance computing · dataflow architecture · memory-bound kernels · roofline analysis · scientific simulations · performance speedup

The pith

CStencil maps two-dimensional stencil computations onto wafer-scale engines to deliver up to 342 times the performance of GPU implementations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for running two-dimensional stencil computations on a wafer-scale engine originally built for AI tasks. It shows that the engine's distributed on-chip memory and interconnect remove the memory access limits that constrain these kernels on GPUs. Performance measurements report large speedups relative to an adapted GPU solver, and resource analysis indicates the hardware reaches full utilization. If correct, this opens wafer-scale engines to a broader set of memory-bound scientific workloads that currently face the memory wall on conventional systems.

Core claim

The paper establishes that CStencil achieves speedups of up to 342x over an adapted single-precision GPU stencil solver, with a roofline model confirming that the wafer-scale engine's distributed SRAM and mesh interconnect eliminate off-chip memory bottlenecks and saturate both compute and memory resources for two-dimensional stencil computations.
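As a sanity check on the memory-wall framing, the roofline bound the claim invokes can be sketched in a few lines. The hardware figures below are illustrative A100-class numbers, not values taken from the paper, and the FLOP/byte accounting assumes ideal caching:

```python
def roofline_bound(peak_flops: float, mem_bw: float,
                   flops_per_point: float, bytes_per_point: float) -> float:
    """Attainable performance (FLOP/s) under the classic roofline model:
    min(compute peak, arithmetic intensity x memory bandwidth)."""
    ai = flops_per_point / bytes_per_point  # arithmetic intensity, FLOP/byte
    return min(peak_flops, ai * mem_bw)

# Illustrative A100-class figures: ~19.5 TFLOP/s FP32 peak, ~2 TB/s HBM.
# A radius-2 star stencil touches 9 points (~17 FLOPs per output) and,
# with perfect reuse, moves ~8 bytes per point (one FP32 load + one store).
bound = roofline_bound(19.5e12, 2.0e12, flops_per_point=17, bytes_per_point=8)
# bound = 4.25e12 FLOP/s, far below the 19.5e12 compute peak: memory-bound.
```

Raising `mem_bw` by distributing the grid across on-chip SRAM is exactly the lever that would move such a kernel from the bandwidth roof to the compute roof, which is what the paper's saturation claim amounts to.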

What carries the argument

The CStencil framework that maps stencil operations directly onto the distributed SRAM and mesh interconnect of the wafer-scale engine to avoid external memory traffic.
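For concreteness, the kind of kernel being mapped can be sketched as a radius-1 star stencil sweep; the coefficients here are illustrative, and on the wafer each PE would hold one subgrid in local SRAM and exchange halo cells with its neighbors over the mesh rather than touching external memory:

```python
import numpy as np

def star2d_1r(u: np.ndarray, c0: float = 0.5, c1: float = 0.125) -> np.ndarray:
    """One Jacobi sweep of a radius-1 (5-point) 2D star stencil.
    Illustrative coefficients; boundary cells are left unchanged."""
    out = u.copy()
    out[1:-1, 1:-1] = (c0 * u[1:-1, 1:-1]
                       + c1 * (u[:-2, 1:-1] + u[2:, 1:-1]
                               + u[1:-1, :-2] + u[1:-1, 2:]))
    return out
```

With c0 + 4·c1 = 1 a uniform field is a fixed point, which makes the kernel easy to spot-check.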

If this is right

  • Memory-bound scientific kernels can reach full hardware utilization on wafer-scale engines without off-chip memory stalls.
  • Two-dimensional stencil operations become compute-limited rather than memory-limited under the engine's data movement model.
  • Scientific algorithms outside the original AI target domain can be successfully ported when on-chip memory bandwidth is exploited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mapping strategy may extend to three-dimensional stencils or other grid-based algorithms such as finite-volume methods.
  • Hardware designs for future scientific computing could prioritize distributed on-chip SRAM to reduce reliance on external memory hierarchies.
  • Larger problem sizes typical of production simulations would likely preserve the observed utilization levels if the interconnect scales accordingly.

Load-bearing premise

The single-precision adaptation of the GPU stencil solver serves as a fair baseline and the chosen stencil sizes and problem sizes reflect realistic scientific workloads.

What would settle it

Measuring achieved operations per second against the theoretical peak on a larger production fluid-dynamics simulation would show whether the reported resource saturation holds outside the tested configurations.
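That test reduces to a simple utilization calculation. A minimal sketch, with all numbers hypothetical:

```python
def achieved_utilization(grid_points: int, flops_per_point: float,
                         runtime_s: float, peak_flops: float) -> float:
    """Fraction of theoretical peak achieved by one stencil sweep."""
    achieved = grid_points * flops_per_point / runtime_s  # FLOP/s
    return achieved / peak_flops

# Hypothetical run: a 10^6-point grid, 10 FLOPs per point, 1 ms per sweep,
# on hardware with a 2e10 FLOP/s peak -> 50% of peak.
util = achieved_utilization(1_000_000, 10, 1e-3, 2e10)
```

The question the review raises is whether this ratio stays near 1.0 for production-scale grids, not just the tested configurations.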

Figures

Figures reproduced from arXiv: 2605.07954 by Daniele De Sensi, Elia Belli.

Figure 1: Examples of stencil kernels: Star2d-2r on the left and
Figure 2: Components of a processing element (Source [24]).
Figure 4: CStencil communication layout on a 4×4 PE subarray. The configuration employs a checkerboard pattern to manage the 8 distinct communication colors (4 transmit, 4 receive) across the grid.
Figure 5: Visualization of the communication in a Star Pattern.
Figure 8: Visualization of the submatrices providing the North,
Figure 7: Comparison between naive scalar (left) and vectorized
Figure 10: Packing strategy. The left operand packs four tiles:
Figure 12: Validation of the simulator accuracy using the Star2d
Figure 11: Weak scaling comparison between the original Con
Figure 13: Weak scaling of various stencil patterns on WSE-3.
Figure 14: Absolute performance of CStencil (WSE-3) and
Figure 15: Speedup of CStencil over ConvStencil as a function of
Original abstract

Stencil computations are a fundamental kernel in scientific computing, critical for simulations in domains such as fluid dynamics and climate modeling. However, these computations are often memory-bound on traditional High-Performance Computing architectures like GPUs, struggling against the "Memory Wall". Simultaneously, the rise of AI-oriented hardware, such as the Cerebras Wafer-Scale Engine, offers massive core parallelism and high-bandwidth on-chip memory, though typically optimized for lower-precision workloads. This work investigates the viability of bridging this divergence by mapping stencil algorithms onto the Cerebras WSE-3. The study introduces CStencil, a novel framework designed to implement two-dimensional stencil computations on the WSE-3. To ensure a rigorous and fair performance evaluation, the research also adapts ConvStencil, a state-of-the-art GPU stencil solver, porting it from its original double-precision design to single-precision for execution on an NVIDIA A100 GPU. Experimental results show that the WSE-3's distributed SRAM and mesh interconnect effectively eliminate the off-chip memory bottlenecks common in GPU implementations. CStencil achieves speedups of up to 342x over the adapted ConvStencil version. A roofline model analysis further confirms that CStencil saturates the available compute and memory resources, demonstrating that the WSE dataflow architecture can be successfully repurposed for traditional scientific algorithms. These findings highlight the potential of the WSE-3 to deliver hardware utilization levels unattainable on conventional systems, offering a promising path toward overcoming the memory limitations of current HPC architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CStencil, a framework for mapping 2D stencil computations onto the Cerebras WSE-3 wafer-scale engine. It adapts the ConvStencil GPU solver from double to single precision for comparison on an NVIDIA A100, reports speedups of up to 342x, and uses roofline analysis to claim that CStencil saturates the WSE-3's compute and on-chip memory resources, concluding that the dataflow architecture can be successfully repurposed for traditional scientific stencil kernels.

Significance. If the baseline comparison holds after verification, the result would demonstrate that AI-oriented wafer-scale hardware can deliver high utilization on memory-bound scientific workloads that are typically limited by the memory wall on GPUs, with potential implications for HPC applications in fluid dynamics and climate modeling.

major comments (2)
  1. [Abstract] The claim that porting ConvStencil to single precision 'ensure[s] a rigorous and fair performance evaluation' is load-bearing for the 342x speedup and 'unattainable on conventional systems' conclusions, yet the manuscript provides no evidence that the adapted baseline reaches near-peak A100 HBM bandwidth or compute utilization (e.g., via shared-memory tiling or register blocking).
  2. [Experimental results, implied by abstract timing claims] The reported speedups and roofline saturation rest on direct timing against the adapted ConvStencil, but without reported error bars, exact stencil orders/sizes, problem dimensions, or a comparison against an independently optimized single-precision GPU stencil implementation, it is impossible to determine whether the speedup reflects WSE architectural advantage or baseline under-optimization.
minor comments (2)
  1. [Abstract] The description of 'two-dimensional stencil computations' does not specify the stencil radius or order used in the reported experiments.
  2. The manuscript lacks any mention of verification steps (e.g., numerical correctness checks against a reference CPU implementation) for the CStencil results.
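The verification step the second minor comment asks for is cheap to add: run the optimized kernel and a naive reference over the same grid and compare. A minimal sketch in NumPy, with a loop reference standing in for a CPU check and a vectorized sweep standing in for the accelerated kernel (coefficients illustrative):

```python
import numpy as np

def star2d_1r_naive(u, c0=0.5, c1=0.125):
    """Reference loop implementation of a radius-1 star stencil sweep."""
    out = u.copy()
    n, m = u.shape
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i, j] = (c0 * u[i, j]
                         + c1 * (u[i - 1, j] + u[i + 1, j]
                                 + u[i, j - 1] + u[i, j + 1]))
    return out

def star2d_1r_fast(u, c0=0.5, c1=0.125):
    """Vectorized sweep standing in for the optimized kernel under test."""
    out = u.copy()
    out[1:-1, 1:-1] = (c0 * u[1:-1, 1:-1]
                       + c1 * (u[:-2, 1:-1] + u[2:, 1:-1]
                               + u[1:-1, :-2] + u[1:-1, 2:]))
    return out

def verify(shape=(64, 64), seed=0, atol=1e-6):
    """Compare the fast kernel against the reference on random input."""
    u = np.random.default_rng(seed).random(shape, dtype=np.float32)
    return bool(np.allclose(star2d_1r_fast(u), star2d_1r_naive(u), atol=atol))
```

The tolerance absorbs the reordering of single-precision additions between the two implementations.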

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] The claim that porting ConvStencil to single precision 'ensure[s] a rigorous and fair performance evaluation' is load-bearing for the 342x speedup and 'unattainable on conventional systems' conclusions, yet the manuscript provides no evidence that the adapted baseline reaches near-peak A100 HBM bandwidth or compute utilization (e.g., via shared-memory tiling or register blocking).

    Authors: We agree that additional evidence regarding the performance of the adapted ConvStencil baseline would enhance the rigor of our comparison. Although the original ConvStencil paper demonstrates high utilization on GPUs, our adaptation to single precision on the A100 does not include explicit roofline analysis in the current manuscript. We will revise the paper to include a roofline model for the GPU baseline, reporting achieved HBM bandwidth and compute utilization to substantiate the fairness of the evaluation. revision: yes

  2. Referee: [Experimental results, implied by abstract timing claims] The reported speedups and roofline saturation rest on direct timing against the adapted ConvStencil, but without reported error bars, exact stencil orders/sizes, problem dimensions, or a comparison against an independently optimized single-precision GPU stencil implementation, it is impossible to determine whether the speedup reflects WSE architectural advantage or baseline under-optimization.

    Authors: The full manuscript provides details on the stencil orders, sizes, and problem dimensions in the experimental setup section. However, we acknowledge that these could be more prominently featured, and we will add a dedicated table summarizing all experimental parameters. Regarding error bars, the reported timings are from repeated kernel executions with low variance due to the deterministic nature of the computations on both platforms; we will include standard deviation where applicable. For the baseline, we adapted ConvStencil, which is a published state-of-the-art implementation, to single precision using its original optimization strategies including tiling. While we did not develop a new independent GPU implementation, we believe this provides a fair comparison. We will clarify the specific adaptations made in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No circularity; experimental timings and standard roofline analysis are independent of inputs

full rationale

The paper's central claims rest on direct wall-clock measurements of CStencil versus an adapted external ConvStencil baseline on A100, plus a conventional roofline model that bounds achieved bandwidth and compute utilization. No equations, fitted parameters, or self-citations are invoked to derive the reported speedups or saturation conclusions; the numbers are produced by running the implementations on hardware. The adaptation of ConvStencil is presented as an external reference point rather than a quantity defined from the WSE results themselves. This is the typical non-circular case for a systems-performance paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim rests on the experimental implementation of CStencil and the fairness of the GPU baseline adaptation.

pith-pipeline@v0.9.0 · 5570 in / 1045 out tokens · 32689 ms · 2026-05-11T02:57:45.308434+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    Compute substrate for software 2.0,

    J. Vasiljevic, L. Bajic, D. Capalija, S. Sokorac, D. Ignjatovic, L. Bajic, M. Trajkovic, I. Hamer, I. Matosevic, A. Cejkov, U. Aydonat, T. Zhou, S. Z. Gilani, A. Paiva, J. Chu, D. Maksimovic, S. A. Chin, Z. Moudallal, A. Rakhmati, S. Nijjar, A. Bhullar, B. Drazic, C. Lee, J. Sun, K.-M. Kwong, J. Connolly, M. Dooley, H. Farooq, J. Y. T. Chen, M. Walker, K...

  2. [2]

    Cerebras Wafer-Scale Cluster Data Sheet,

    Cerebras Systems, “Cerebras Wafer-Scale Cluster Data Sheet,” Cerebras Systems, Tech. Rep., 2024, accessed on 2025-11-02. [Online]. Available: https://8968533.fs1.hubspotusercontent-na2.net/hubfs/8968533/Cerebras%20Wafer%20Scale%20Cluster%20datasheet%20-%20final.pdf

  3. [3]

    SambaNova SN40L,

    R. Prabhakar, R. Sivaramakrishnan, D. Gandhi, Y. Du, M. Wang, X. Song, K. Zhang, T. Gao, A. Wang, X. Li, Y. Sheng, J. Brot, D. Sokolov, A. Vivek, C. Leung, A. Sabnis, J. Bai, T. Zhao, M. Gottscho, D. Jackson, M. Luttrell, M. K. Shah, Z. Chen, K. Liang, S. Jain, U. Thakker, D. Huang, S. Jairath, K. J. Brown, and K. Olukotun, “SambaNova SN40L: Scaling the...

  4. [4]

    In-Datacenter Performance Analysis of a Tensor Processing Unit,

    N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-L. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan,...

  5. [5]

    Graphcore,

    S. Knowles, “Graphcore,” in 2021 IEEE Hot Chips 33 Symposium (HCS), 2021, pp. 1–25

  6. [6]

    ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores,

    Y. Chen, K. Li, Y. Wang, D. Bai, L. Wang, L. Ma, L. Yuan, Y. Zhang, T. Cao, and M. Yang, “ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores,” in Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’24. New York, NY, USA: Association for Computing Machinery...

  7. [7]

    The Landscape of Parallel Computing Research: A View from Berkeley,

    K. Asanović, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, “The Landscape of Parallel Computing Research: A View from Berkeley,” Tech. Rep. UCB/EECS-2006-183, Dec 2006. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

  8. [8]

    J. R. Cannon, The one-dimensional heat equation. Foreword by Felix E. Browder, ser. Encycl. Math. Appl. Cambridge University Press, Cambridge, 1984, vol. 23

  9. [9]

    Asynchronous computations for solving the acoustic wave propagation equation,

    K. Akbudak, H. Ltaief, V. Etienne, R. Abdelkhalak, T. Tonellot, and D. Keyes, “Asynchronous computations for solving the acoustic wave propagation equation,” The International Journal of High Performance Computing Applications, vol. 34, no. 4, pp. 377–393, 2020. [Online]. Available: https://doi.org/10.1177/1094342020923027

  10. [10]

    J. Tu, G. H. Yeoh, and C. Liu, Computational Fluid Dynamics: A Practical Approach, 3rd ed. Butterworth-Heinemann, 2018

  11. [11]

    The millennium prize problems,

    Clay Mathematics Institute, “The millennium prize problems,” https://www.claymath.org/millennium-prize-problems, 2000, accessed: 2025-10-26

  12. [12]

    DRStencil: Exploiting Data Reuse within Low-order Stencil on GPU,

    X. You, H. Yang, Z. Jiang, Z. Luan, and D. Qian, “DRStencil: Exploiting Data Reuse within Low-order Stencil on GPU,” in 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCi...

  13. [13]

    A portable framework for accelerating stencil computations on modern node architectures,

    R. Sai, J. Mellor-Crummey, J. Xu, and M. Araya-Polo, “A portable framework for accelerating stencil computations on modern node architectures,” 2024. [Online]. Available: https://arxiv.org/abs/2309.04671

  14. [14]

    Gt4py: High performance stencils for weather and climate applications using python,

    E. G. Paredes, L. Groner, S. Ubbiali, H. Vogt, A. Madonna, K. Mariotti, F. Cruz, L. Benedicic, M. Bianco, J. VandeVondele, and T. C. Schulthess, “Gt4py: High performance stencils for weather and climate applications using python,” 2023. [Online]. Available: https://arxiv.org/abs/2311.08322

  15. [15]

    Architecture and performance of devito, a system for automated stencil computation,

    F. Luporini, M. Lange, M. Louboutin, N. Kukreja, J. Hückelheim, C. Yount, P. Witte, P. H. J. Kelly, F. J. Herrmann, and G. J. Gorman, “Architecture and performance of devito, a system for automated stencil computation,” 2020. [Online]. Available: https://arxiv.org/abs/1807.03032

  16. [16]

    High Performance Convolutional Neural Networks for Document Processing,

    K. Chellapilla, S. Puri, and P. Simard, “High Performance Convolutional Neural Networks for Document Processing,” in Tenth International Workshop on Frontiers in Handwriting Recognition. La Baule, France: Université de Rennes 1, Oct. 2006, HAL Id: inria-00112631. [Online]. Available: https://hal.inria.fr/inria-00112631

  17. [17]

    Toward accelerated stencil computation by adapting tensor core unit on GPU,

    X. Liu, Y. Liu, H. Yang, J. Liao, M. Li, Z. Luan, and D. Qian, “Toward accelerated stencil computation by adapting tensor core unit on GPU,” in Proceedings of the 36th ACM International Conference on Supercomputing, ser. ICS ’22. New York, NY, USA: Association for Computing Machinery, 2022. [Online]. Available: https://doi.org/10.1145/3524059.3532392

  18. [18]

    LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores,

    Y. Zhang, K. Li, L. Yuan, J. Cheng, Y. Zhang, T. Cao, and M. Yang, “LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC ’24. IEEE Press, 2024. [Online]. Available: https://doi.org/10.1109/SC41406.2024.00059

  19. [19]

    SPTCStencil: Using Sparse Tensor Cores for Stencil Computation,

    Q. Gu, C. Wu, H. Shi, and J. Yao, “SPTCStencil: Using Sparse Tensor Cores for Stencil Computation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.22035

  20. [20]

    SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation,

    Q. Li, K. Li, H. Han, L. Yuan, J. Chen, Y. Zhang, Y. Chen, H. An, T. Cao, and M. Yang, “SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation.” [Online]. Available: https://arxiv.org/abs/2506.22969

  22. [22]

    Fast Stencil-Code Computation on a Wafer-Scale Processor,

    K. Rocki, D. V. Essendelft, I. Sharapov, R. Schreiber, M. Morrison, V. Kibardin, A. Portnoy, J. F. Dietiker, M. Syamlal, and M. James, “Fast Stencil-Code Computation on a Wafer-Scale Processor,” 2020. [Online]. Available: https://arxiv.org/abs/2010.03660

  23. [23]

    Massively scalable stencil algorithm,

    M. Jacquelin, M. Araya-Polo, and J. Meng, “Massively scalable stencil algorithm,” 2022. [Online]. Available: https://arxiv.org/abs/2204.03775

  24. [24]

    SpaDA: A Spatial Dataflow Architecture Programming Language

    L. Gianinazzi, T. Ben-Nun, and T. Hoefler, “SpaDA: A Spatial Dataflow Architecture Programming Language,” 2025. [Online]. Available: https://arxiv.org/abs/2511.09447

  25. [25]

    Cerebras SDK Documentation (Version 1.4.0),

    Cerebras Systems, Cerebras SDK Documentation (Version 1.4.0), 2025. [Online]. Available: https://sdk.cerebras.net/

  26. [26]

    Near-Optimal Wafer-Scale Reduce,

    P. Luczynski, L. Gianinazzi, P. Iff, L. Wilson, D. De Sensi, and T. Hoefler, “Near-Optimal Wafer-Scale Reduce,” in Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC ’24. ACM, Jun. 2024, pp. 334–347. [Online]. Available: http://dx.doi.org/10.1145/3625549.3658693

  27. [27]

    CUDA C++ Programming Guide,

    NVIDIA Corporation, CUDA C++ Programming Guide, 2025, Section 10.24: Warp Matrix Functions; accessed: 18-Oct-2025. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-matrix-functions