Stencil Computations on Cerebras Wafer-Scale Engine
Pith reviewed 2026-05-11 02:57 UTC · model grok-4.3
The pith
CStencil maps two-dimensional stencil computations onto the Cerebras wafer-scale engine, delivering up to 342x the performance of an adapted state-of-the-art GPU implementation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that CStencil achieves speedups of up to 342x over an adapted single-precision GPU stencil solver. A roofline model confirms that the wafer-scale engine's distributed SRAM and mesh interconnect eliminate off-chip memory bottlenecks and saturate both compute and memory resources for two-dimensional stencil computations.
What carries the argument
The CStencil framework that maps stencil operations directly onto the distributed SRAM and mesh interconnect of the wafer-scale engine to avoid external memory traffic.
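The mapping style this describes can be sketched with a toy model: the grid is split into per-PE tiles, each tile lives in its own local array (standing in for a PE's SRAM), and every step exchanges one-cell halos with mesh neighbors before applying the stencil. This is an illustrative sketch of the general halo-exchange pattern, not CStencil's actual code; the 5-point Jacobi update, tile size, and zero boundary are assumptions.

```python
import numpy as np

def jacobi_step(grid):
    """One 5-point Jacobi update on a padded array (halo cells included)."""
    return 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                   grid[1:-1, :-2] + grid[1:-1, 2:])

def wse_like_step(tiles, n_py, n_px):
    """Update a grid stored as per-PE tiles, exchanging one-cell halos with
    mesh neighbors first -- a toy stand-in for the SRAM + mesh model."""
    new_tiles = {}
    for (py, px), tile in tiles.items():
        padded = np.zeros((tile.shape[0] + 2, tile.shape[1] + 2))
        padded[1:-1, 1:-1] = tile
        # Halo exchange: copy boundary rows/columns from mesh neighbors.
        if py > 0:        padded[0, 1:-1]  = tiles[(py - 1, px)][-1, :]
        if py < n_py - 1: padded[-1, 1:-1] = tiles[(py + 1, px)][0, :]
        if px > 0:        padded[1:-1, 0]  = tiles[(py, px - 1)][:, -1]
        if px < n_px - 1: padded[1:-1, -1] = tiles[(py, px + 1)][:, 0]
        new_tiles[(py, px)] = jacobi_step(padded)
    return new_tiles

# Split a 64x64 grid across a 4x4 "mesh" of 16x16 tiles, then step once.
grid = np.random.rand(64, 64)
tiles = {(py, px): grid[py*16:(py+1)*16, px*16:(px+1)*16].copy()
         for py in range(4) for px in range(4)}
tiles = wse_like_step(tiles, 4, 4)
```

With this decomposition each update touches only tile-local data plus the exchanged halos, which is what removes the need for off-chip traffic on hardware where every tile fits in on-chip SRAM.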
If this is right
- Memory-bound scientific kernels can reach full hardware utilization on wafer-scale engines without off-chip memory stalls.
- Two-dimensional stencil operations become compute-limited rather than memory-limited under the engine's data movement model.
- Scientific algorithms outside the original AI target domain can be successfully ported when on-chip memory bandwidth is exploited.
Where Pith is reading between the lines
- The same mapping strategy may extend to three-dimensional stencils or other grid-based algorithms such as finite-volume methods.
- Hardware designs for future scientific computing could prioritize distributed on-chip SRAM to reduce reliance on external memory hierarchies.
- Larger problem sizes typical of production simulations would likely preserve the observed utilization levels if the interconnect scales accordingly.
Load-bearing premise
The single-precision adaptation of the GPU stencil solver serves as a fair baseline, and the chosen stencil orders and problem sizes reflect realistic scientific workloads.
What would settle it
Measuring achieved operations per second against the theoretical peak on a larger production fluid-dynamics simulation would show whether the reported resource saturation holds outside the tested configurations.
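Such a check boils down to a roofline comparison: compute the kernel's arithmetic intensity, take the ceiling min(peak compute, intensity × peak bandwidth), and report achieved performance as a fraction of that ceiling. A minimal sketch; the peak figures and the per-point FLOP/byte counts in the example are placeholders, not WSE-3 or A100 specifications.

```python
def roofline_utilization(flops, bytes_moved, seconds, peak_flops, peak_bw):
    """Fraction of the roofline ceiling achieved by one measured kernel run.

    flops, bytes_moved: total work and memory traffic of the kernel
    seconds: measured wall-clock time
    peak_flops: peak compute rate (FLOP/s); peak_bw: memory bandwidth (B/s)
    """
    intensity = flops / bytes_moved                 # FLOP per byte
    ceiling = min(peak_flops, intensity * peak_bw)  # roofline bound
    achieved = flops / seconds
    return achieved / ceiling

# Example: one sweep of a 2D 5-point single-precision Jacobi stencil, modeled
# as 4 FLOPs (three adds, one multiply) and 8 bytes (one read + one write)
# per point, against placeholder peak numbers.
n = 4096 * 4096
util = roofline_utilization(flops=4 * n, bytes_moved=8 * n, seconds=1e-3,
                            peak_flops=1e14, peak_bw=2e12)
```

A utilization near 1.0 on both platforms would indicate genuine saturation; a low value on the baseline would instead point to under-optimization.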
Original abstract
Stencil computations are a fundamental kernel in scientific computing, critical for simulations in domains such as fluid dynamics and climate modeling. However, these computations are often memory-bound on traditional High-Performance Computing architectures like GPUs, struggling against the "Memory Wall". Simultaneously, the rise of AI-oriented hardware, such as the Cerebras Wafer-Scale Engine, offers massive core parallelism and high-bandwidth on-chip memory, though typically optimized for lower-precision workloads. This work investigates the viability of bridging this divergence by mapping stencil algorithms onto the Cerebras WSE-3. The study introduces CStencil, a novel framework designed to implement two-dimensional stencil computations on the WSE-3. To ensure a rigorous and fair performance evaluation, the research also adapts ConvStencil, a state-of-the-art GPU stencil solver, porting it from its original double-precision design to single-precision for execution on an NVIDIA A100 GPU. Experimental results show that the WSE-3's distributed SRAM and mesh interconnect effectively eliminate the off-chip memory bottlenecks common in GPU implementations. CStencil achieves speedups of up to 342x over the adapted ConvStencil version. A roofline model analysis further confirms that CStencil saturates the available compute and memory resources, demonstrating that the WSE dataflow architecture can be successfully repurposed for traditional scientific algorithms. These findings highlight the potential of the WSE-3 to deliver hardware utilization levels unattainable on conventional systems, offering a promising path toward overcoming the memory limitations of current HPC architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CStencil, a framework for mapping 2D stencil computations onto the Cerebras WSE-3 wafer-scale engine. It adapts the ConvStencil GPU solver from double to single precision for comparison on an NVIDIA A100, reports speedups of up to 342x, and uses roofline analysis to claim that CStencil saturates the WSE-3's compute and on-chip memory resources, concluding that the dataflow architecture can be successfully repurposed for traditional scientific stencil kernels.
Significance. If the baseline comparison holds after verification, the result would demonstrate that AI-oriented wafer-scale hardware can deliver high utilization on memory-bound scientific workloads that are typically limited by the memory wall on GPUs, with potential implications for HPC applications in fluid dynamics and climate modeling.
major comments (2)
- [Abstract] The claim that porting ConvStencil to single precision 'ensure[s] a rigorous and fair performance evaluation' is load-bearing for the 342x speedup and 'unattainable on conventional systems' conclusions, yet the manuscript provides no evidence that the adapted baseline reaches near-peak A100 HBM bandwidth or compute utilization (e.g., via shared-memory tiling or register blocking).
- [Experimental results] The reported speedups and roofline saturation (implied by abstract timing claims) rest on direct timing against the adapted ConvStencil, but without reported error bars, exact stencil orders and sizes, problem dimensions, or a comparison against an independently optimized single-precision GPU stencil implementation, it is impossible to determine whether the speedup reflects a WSE architectural advantage or baseline under-optimization.
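The evidence this comment asks for reduces to one number: the fraction of peak HBM bandwidth the baseline actually achieves. A minimal sketch of that estimate, where the ideal one-read-one-write traffic model and the ~2 TB/s placeholder for A100-class HBM bandwidth are assumptions:

```python
def hbm_efficiency(nx, ny, seconds, dtype_bytes=4, peak_bw=2.0e12):
    """Achieved fraction of peak memory bandwidth for one 2D stencil sweep.

    Assumes the ideal traffic model: each grid point is read once and written
    once (perfect on-chip reuse of neighbor values), dtype_bytes per value.
    peak_bw is a placeholder for the device's peak bandwidth in bytes/s.
    """
    bytes_moved = 2 * nx * ny * dtype_bytes  # one read + one write per point
    achieved_bw = bytes_moved / seconds
    return achieved_bw / peak_bw
```

An efficiency well below 1.0 for the adapted baseline would support the under-optimization concern; a value near 1.0 would support the fairness claim.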
minor comments (2)
- [Abstract] The description of 'two-dimensional stencil computations' does not specify the stencil radius or order used in the reported experiments.
- The manuscript lacks any mention of verification steps (e.g., numerical correctness checks against a reference CPU implementation) for the CStencil results.
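A verification step of the kind this comment asks for is straightforward: compare the accelerated single-precision result against a naive double-precision reference under a relative tolerance. A hedged sketch; the 5-point Jacobi kernel, zero boundary, and tolerance are illustrative choices, not taken from the paper.

```python
import numpy as np

def reference_stencil(grid, steps):
    """Naive double-precision 5-point Jacobi reference (zero boundary)."""
    g = grid.astype(np.float64)
    for _ in range(steps):
        p = np.pad(g, 1)  # zero-pad the boundary
        g = 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])
    return g

def verify(fast_result, grid, steps, rtol=1e-4):
    """Check an accelerated single-precision result against the fp64 reference."""
    ref = reference_stencil(grid, steps)
    err = np.max(np.abs(fast_result - ref)) / np.max(np.abs(ref))
    return err <= rtol

# Example: run the same kernel in float32 and verify it against the reference.
rng = np.random.default_rng(0)
grid = rng.random((128, 128), dtype=np.float32)
fast = grid.copy()
for _ in range(10):
    p = np.pad(fast, 1)
    fast = (0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] +
                    p[1:-1, :-2] + p[1:-1, 2:])).astype(np.float32)
```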
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.
Point-by-point responses
Referee: [Abstract] The claim that porting ConvStencil to single precision 'ensure[s] a rigorous and fair performance evaluation' is load-bearing for the 342x speedup and 'unattainable on conventional systems' conclusions, yet the manuscript provides no evidence that the adapted baseline reaches near-peak A100 HBM bandwidth or compute utilization (e.g., via shared-memory tiling or register blocking).
Authors: We agree that additional evidence regarding the performance of the adapted ConvStencil baseline would enhance the rigor of our comparison. Although the original ConvStencil paper demonstrates high utilization on GPUs, our adaptation to single precision on the A100 does not include explicit roofline analysis in the current manuscript. We will revise the paper to include a roofline model for the GPU baseline, reporting achieved HBM bandwidth and compute utilization to substantiate the fairness of the evaluation. revision: yes
Referee: [Experimental results] The reported speedups and roofline saturation (implied by abstract timing claims) rest on direct timing against the adapted ConvStencil, but without reported error bars, exact stencil orders and sizes, problem dimensions, or a comparison against an independently optimized single-precision GPU stencil implementation, it is impossible to determine whether the speedup reflects a WSE architectural advantage or baseline under-optimization.
Authors: The full manuscript provides details on the stencil orders, sizes, and problem dimensions in the experimental setup section. However, we acknowledge that these could be more prominently featured, and we will add a dedicated table summarizing all experimental parameters. Regarding error bars, the reported timings are from repeated kernel executions with low variance due to the deterministic nature of the computations on both platforms; we will include standard deviation where applicable. For the baseline, we adapted ConvStencil, which is a published state-of-the-art implementation, to single precision using its original optimization strategies including tiling. While we did not develop a new independent GPU implementation, we believe this provides a fair comparison. We will clarify the specific adaptations made in the revised manuscript. revision: partial
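Reporting timings with dispersion, as promised here, takes only a few lines: discard warm-up runs, repeat the kernel, and report mean and standard deviation. A generic sketch in which `kernel` is a hypothetical stand-in for whatever is measured; on a GPU one would additionally synchronize the device around each timing, which is omitted here.

```python
import statistics
import time

def time_kernel(kernel, repeats=10, warmup=2):
    """Run `kernel` repeatedly; return (mean, stdev) of wall-clock seconds."""
    for _ in range(warmup):  # discard cold-start effects (caches, JIT, etc.)
        kernel()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

# Example: time a trivial stand-in workload.
mean_s, std_s = time_kernel(lambda: sum(range(10000)))
```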
Circularity Check
No circularity; the experimental timings and the standard roofline analysis are direct hardware measurements, not quantities derived from the paper's own claims
Full rationale
The paper's central claims rest on direct wall-clock measurements of CStencil versus an adapted external ConvStencil baseline on A100, plus a conventional roofline model that bounds achieved bandwidth and compute utilization. No equations, fitted parameters, or self-citations are invoked to derive the reported speedups or saturation conclusions; the numbers are produced by running the implementations on hardware. The adaptation of ConvStencil is presented as an external reference point rather than a quantity defined from the WSE results themselves. This is the typical non-circular case for a systems-performance paper.
Reference graph
Works this paper leans on
- [1] J. Vasiljevic, L. Bajic, D. Capalija, S. Sokorac, D. Ignjatovic, L. Bajic, M. Trajkovic, I. Hamer, I. Matosevic, A. Cejkov, U. Aydonat, T. Zhou, S. Z. Gilani, A. Paiva, J. Chu, D. Maksimovic, S. A. Chin, Z. Moudallal, A. Rakhmati, S. Nijjar, A. Bhullar, B. Drazic, C. Lee, J. Sun, K.-M. Kwong, J. Connolly, M. Dooley, H. Farooq, J. Y. T. Chen, M. Walker, K..., "Compute substrate for software 2.0," 2021.
- [2] Cerebras Systems, "Cerebras Wafer-Scale Cluster Data Sheet," Tech. Rep., 2024, accessed 2025-11-02. [Online]. Available: https://8968533.fs1.hubspotusercontent-na2.net/hubfs/8968533/Cerebras%20Wafer%20Scale%20Cluster%20datasheet%20-%20final.pdf
- [3] R. Prabhakar, R. Sivaramakrishnan, D. Gandhi, Y. Du, M. Wang, X. Song, K. Zhang, T. Gao, A. Wang, X. Li, Y. Sheng, J. Brot, D. Sokolov, A. Vivek, C. Leung, A. Sabnis, J. Bai, T. Zhao, M. Gottscho, D. Jackson, M. Luttrell, M. K. Shah, Z. Chen, K. Liang, S. Jain, U. Thakker, D. Huang, S. Jairath, K. J. Brown, and K. Olukotun, "SambaNova SN40L: Scaling the..."
- [4] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan..., "In-Datacenter Performance Analysis of a Tensor Processing Unit."
- [5] S. Knowles, "Graphcore," in 2021 IEEE Hot Chips 33 Symposium (HCS), 2021, pp. 1–25.
- [6] Y. Chen, K. Li, Y. Wang, D. Bai, L. Wang, L. Ma, L. Yuan, Y. Zhang, T. Cao, and M. Yang, "ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores," in Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '24. New York, NY, USA: Association for Computing Machinery...
- [7] K. Asanović, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley," Tech. Rep. UCB/EECS-2006-183, Dec 2006. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
- [8] J. R. Cannon, The one-dimensional heat equation. Foreword by Felix E. Browder, ser. Encycl. Math. Appl., vol. 23. Cambridge University Press, Cambridge, 1984.
- [9] K. Akbudak, H. Ltaief, V. Etienne, R. Abdelkhalak, T. Tonellot, and D. Keyes, "Asynchronous computations for solving the acoustic wave propagation equation," The International Journal of High Performance Computing Applications, vol. 34, no. 4, pp. 377–393, 2020. [Online]. Available: https://doi.org/10.1177/1094342020923027
- [10] J. Tu, G. H. Yeoh, and C. Liu, Computational Fluid Dynamics: A Practical Approach, 3rd ed. Butterworth-Heinemann, 2018.
- [11] Clay Mathematics Institute, "The millennium prize problems," https://www.claymath.org/millennium-prize-problems, 2000, accessed 2025-10-26.
- [12] X. You, H. Yang, Z. Jiang, Z. Luan, and D. Qian, "DRStencil: Exploiting Data Reuse within Low-order Stencil on GPU," in 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCi...
- [13] R. Sai, J. Mellor-Crummey, J. Xu, and M. Araya-Polo, "A portable framework for accelerating stencil computations on modern node architectures," 2024. [Online]. Available: https://arxiv.org/abs/2309.04671
- [14] E. G. Paredes, L. Groner, S. Ubbiali, H. Vogt, A. Madonna, K. Mariotti, F. Cruz, L. Benedicic, M. Bianco, J. VandeVondele, and T. C. Schulthess, "Gt4py: High performance stencils for weather and climate applications using python," 2023. [Online]. Available: https://arxiv.org/abs/2311.08322
- [15] F. Luporini, M. Lange, M. Louboutin, N. Kukreja, J. Hückelheim, C. Yount, P. Witte, P. H. J. Kelly, F. J. Herrmann, and G. J. Gorman, "Architecture and performance of Devito, a system for automated stencil computation," 2020. [Online]. Available: https://arxiv.org/abs/1807.03032
- [16] K. Chellapilla, S. Puri, and P. Simard, "High Performance Convolutional Neural Networks for Document Processing," in Tenth International Workshop on Frontiers in Handwriting Recognition. La Baule, France: Université de Rennes 1, Oct. 2006, HAL Id: inria-00112631. [Online]. Available: https://hal.inria.fr/inria-00112631
- [17] X. Liu, Y. Liu, H. Yang, J. Liao, M. Li, Z. Luan, and D. Qian, "Toward accelerated stencil computation by adapting tensor core unit on GPU," in Proceedings of the 36th ACM International Conference on Supercomputing, ser. ICS '22. New York, NY, USA: Association for Computing Machinery, 2022. [Online]. Available: https://doi.org/10.1145/3524059.3532392
- [18] Y. Zhang, K. Li, L. Yuan, J. Cheng, Y. Zhang, T. Cao, and M. Yang, "LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores," in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC '24. IEEE Press, 2024. [Online]. Available: https://doi.org/10.1109/SC41406.2024.00059
- [19] Q. Gu, C. Wu, H. Shi, and J. Yao, "SPTCStencil: Using Sparse Tensor Cores for Stencil Computation," 2025. [Online]. Available: https://arxiv.org/abs/2506.22035
- [20] Q. Li, K. Li, H. Han, L. Yuan, J. Chen, Y. Zhang, Y. Chen, H. An, T. Cao, and M. Yang, "SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation." [Online]. Available: https://arxiv.org/abs/2506.22969
- [22] K. Rocki, D. V. Essendelft, I. Sharapov, R. Schreiber, M. Morrison, V. Kibardin, A. Portnoy, J. F. Dietiker, M. Syamlal, and M. James, "Fast Stencil-Code Computation on a Wafer-Scale Processor," 2020. [Online]. Available: https://arxiv.org/abs/2010.03660
- [23] M. Jacquelin, M. Araya-Polo, and J. Meng, "Massively scalable stencil algorithm," 2022. [Online]. Available: https://arxiv.org/abs/2204.03775
- [24] L. Gianinazzi, T. Ben-Nun, and T. Hoefler, "SpaDA: A Spatial Dataflow Architecture Programming Language," 2025. [Online]. Available: https://arxiv.org/abs/2511.09447
- [26] P. Luczynski, L. Gianinazzi, P. Iff, L. Wilson, D. De Sensi, and T. Hoefler, "Near-Optimal Wafer-Scale Reduce," in Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '24. ACM, Jun. 2024, pp. 334–347. [Online]. Available: http://dx.doi.org/10.1145/3625549.3658693
- [27] NVIDIA Corporation, CUDA C++ Programming Guide, 2025, accessed 18-Oct-2025, Section 10.24: Warp Matrix Functions. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-matrix-functions