PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction

Cheng Huang, Chutong Ding, Guangtao Xue, Guodong Yang, Jian Cao, Kaixuan Zhang, Liping Zhang, Luping Wang, Shiyou Qian, Shuhao Zhang, Yunfan Cui

classification cs.PF cs.AR

keywords pipeweave, performance, analytical, kernel, models, accurate, across, complex

The rapid expansion of Transformer-based large language models has dramatically increased the need for high-performance GPUs. As a result, there is growing demand for fast, accurate, and widely generalizable GPU performance models to support next-generation hardware selection and system-level exploration. However, current data-driven methods are limited: they generalize poorly across hardware and inadequately model the complex production-level kernels common in modern inference stacks. To address these issues, we present PipeWeave, a unified GPU modeling framework. This approach first employs an analytical model to quantify a given kernel's demands on the GPU's heterogeneous instruction pipelines. These analytical features are then fed into a machine learning (ML) model to capture complex cross-pipeline interactions and resource dependencies, enabling high-fidelity performance prediction. Our evaluation across 11 GPU types from four generations of major architectures on two widely used serving systems demonstrates that PipeWeave delivers high fidelity and strong generalizability. It achieves accurate predictions, with only 6.1% average error at the kernel level and 8.5% for end-to-end inference, reducing the error of state-of-the-art methods by 6.7x and 4.4x, respectively. We also demonstrate PipeWeave's value "beyond simulation" by utilizing its performance ceiling to diagnose implementation shortcomings and guide the optimization of a production fused MoE Triton kernel, achieving up to 1.7x speedup. Code is available at https://github.com/zksainx/pipeweave.
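
The two-stage idea in the abstract (analytical per-pipeline features, then a learned model on top) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Gpu` and `Kernel` fields, the reduction to two pipelines (math and memory), and the linear least-squares fit standing in for the ML stage are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    # Hypothetical hardware description; real GPUs expose many more pipelines.
    peak_flops: float      # peak math throughput, FLOP/s
    mem_bandwidth: float   # peak memory bandwidth, bytes/s


@dataclass
class Kernel:
    # Hypothetical workload description.
    flops: float           # floating-point operations issued
    bytes_moved: float     # bytes read from and written to memory


def analytical_features(kernel, gpu):
    """Stage 1: time (seconds) each pipeline would be busy if it ran alone."""
    return (kernel.flops / gpu.peak_flops,
            kernel.bytes_moved / gpu.mem_bandwidth)


def fit_weights(features, latencies):
    """Stage 2 stand-in: least squares for latency ~ w0*f0 + w1*f1.

    Solves the 2x2 normal equations directly. A real learned model would be
    nonlinear so that it can capture interactions between pipelines.
    """
    a = sum(f0 * f0 for f0, _ in features)
    b = sum(f0 * f1 for f0, f1 in features)
    d = sum(f1 * f1 for _, f1 in features)
    r0 = sum(f[0] * y for f, y in zip(features, latencies))
    r1 = sum(f[1] * y for f, y in zip(features, latencies))
    det = a * d - b * b
    if det == 0:
        raise ValueError("features are collinear; cannot fit two weights")
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det)


def predict(weights, feats):
    """Predicted latency (seconds) for one kernel's analytical features."""
    return weights[0] * feats[0] + weights[1] * feats[1]
```

In this sketch the analytical stage normalizes a kernel's demands by the hardware's peak rates, so the fitted weights describe the kernel's behavior relative to the hardware rather than absolute timings on one device. That separation is one plausible reason a hybrid model could transfer across GPU types better than a purely data-driven one.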

Forward citations

Cited by 1 Pith paper

WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
cs.PF 2026-04 unverdicted novelty 6.0

WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with dra...