PassNet: Scaling Large Language Models for Graph Compiler Pass Generation
Pith reviewed 2026-06-29 07:50 UTC · model grok-4.3
The pith
Fine-tuning a small LLM on 4K PassNet trajectories yields 2.67x gains that approach frontier-model results on compiler pass generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pass generation by LLMs is the appropriate abstraction for automated compiler optimization; PassNet supplies the first large-scale dataset and benchmark for it, and fine-tuning on a few thousand trajectories produces 2.67x improvement that approaches the best frontier models, showing the main remaining limit is consistency across subgraphs rather than raw capability.
What carries the argument
PassNet ecosystem, built from PassNet-Dataset of 18K+ computational graphs and PassBench of 200 tasks scored under the Error-aware Speedup Score (ES_t) with layered integrity checks.
If this is right
- LLM-generated passes can be inserted directly into existing compiler pipelines to recover performance on the long-tail workloads where default compilation slows models down.
- Targeted fine-tuning on a few thousand trajectories is sufficient to move a small model close to frontier performance, indicating that scale of data matters less than targeted trajectories.
- Consistency across subgraphs, not peak capability on any single subgraph, is the binding constraint on current LLM pass generators.
- PassNet supplies reusable training infrastructure that can be used to iterate on LLM-driven compiler optimization without starting from scratch.
Where Pith is reading between the lines
- If the approach generalizes, compilers could ship with lightweight fine-tuned models that adapt passes to each incoming model rather than relying on a fixed set of hand-written passes.
- The same trajectory-generation loop could be applied to other structured transformation problems such as query-plan rewriting or hardware-mapping decisions.
- Extending the benchmark to include multi-pass sequences or interactions between passes would test whether the current single-pass framing understates or overstates the achievable gains.
Load-bearing premise
The 200 curated long-tail tasks and the Error-aware Speedup Score produce a representative, hard-to-exploit measure of pass-generation quality that holds for graphs outside the benchmark.
What would settle it
A fine-tuned model that scores near the top on all PassBench tasks yet produces no measurable end-to-end speedups on a fresh collection of 100 real-world models drawn from the same distribution would falsify the claim that the benchmark tracks useful generalization.
Figures
read the original abstract
Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PassNet as the first large-scale ecosystem for LLM-driven compiler pass generation. It comprises PassNet-Dataset (18K unique graphs from 100K real-world models) and PassBench (200 curated long-tail fusible tasks spanning 2,060 subgraphs), evaluated under the Error-aware Speedup Score (ES_t) with layered integrity defenses. Key results: frontier LLMs trail TorchInductor by 37% in aggregate on PassBench yet achieve up to 3x speedup on individual subgraphs; fine-tuning a small model on ~4K trajectories yields a 2.67x improvement that approaches frontier performance. The work positions PassNet as public training infrastructure for LLM-based compiler optimization.
Significance. If the benchmark and metric hold, the work supplies a concrete, publicly released dataset and evaluation harness that directly targets the pass-generation abstraction rather than isolated kernel synthesis. The reported headroom (37% aggregate gap, 3x per-subgraph wins, 2.67x from modest fine-tuning) would constitute a falsifiable, actionable signal for the community and would validate the claim that consistency rather than raw capability is the current bottleneck.
major comments (3)
- [Abstract] Abstract: the central performance numbers (2.67x, 37% gap, 3x on individuals) are stated without any derivation, exclusion criteria, variance estimates, or statistical tests. Because these quantities are measured exclusively on PassBench and underpin both the headroom claim and the 'live training infrastructure' conclusion, the absence of supporting methods details is load-bearing.
- [Abstract] Abstract: the 200-task curation from the 2,060 subgraphs is described only as 'curated long-tail fusible tasks' with no sampling protocol, stratification criteria, or bias audit. This directly affects the weakest assumption that PassBench supplies a representative, generalizable signal.
- [Abstract] Abstract: the claim that ES_t together with its 'layered integrity defenses' yields an 'unexploitable' measure is asserted without any concrete demonstration (e.g., red-team results, adversarial pass examples, or formal argument) that the defenses block passes satisfying static checks yet degrading on unseen subgraphs. This property is load-bearing for all reported speedups.
minor comments (1)
- The public release of data, benchmarks, and tooling is a clear strength and should be accompanied by explicit repository links or DOIs in the main text rather than left as a closing sentence.
Simulated Author's Rebuttal
We appreciate the referee's careful reading of the abstract and the identification of areas where additional methodological details are needed to support our claims. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central performance numbers (2.67x, 37% gap, 3x on individuals) are stated without any derivation, exclusion criteria, variance estimates, or statistical tests. Because these quantities are measured exclusively on PassBench and underpin both the headroom claim and the 'live training infrastructure' conclusion, the absence of supporting methods details is load-bearing.
Authors: The reported numbers are derived from the full PassBench evaluation in Sections 5 and 6. Specifically, the 2.67x is the improvement in mean ES_t from fine-tuning, the 37% gap is the aggregate difference versus TorchInductor, and the 3x is the max individual subgraph improvement. Variance estimates from repeated evaluations and exclusion criteria (e.g., invalid graphs) are provided in the experimental setup. We will revise the abstract to include brief references to these sections and note the basis for the statistics. revision: yes
-
Referee: [Abstract] Abstract: the 200-task curation from the 2,060 subgraphs is described only as 'curated long-tail fusible tasks' with no sampling protocol, stratification criteria, or bias audit. This directly affects the weakest assumption that PassBench supplies a representative, generalizable signal.
Authors: Section 3.2 details the curation: subgraphs were sampled from the long-tail of PassNet-Dataset based on frequency, filtered for fusibility, and stratified across operator categories and sizes, with a bias audit against the full distribution. We will incorporate a short description of the sampling protocol and stratification into the abstract to clarify the representativeness. revision: yes
-
Referee: [Abstract] Abstract: the claim that ES_t together with its 'layered integrity defenses' yields an 'unexploitable' measure is asserted without any concrete demonstration (e.g., red-team results, adversarial pass examples, or formal argument) that the defenses block passes satisfying static checks yet degrading on unseen subgraphs. This property is load-bearing for all reported speedups.
Authors: We agree that the abstract's phrasing implies a stronger guarantee than demonstrated. The defenses are specified in Section 4.3 as a combination of static analysis, runtime verification, and consistency checks across subgraphs. However, we do not provide red-team results or a formal argument in the manuscript. We will revise the abstract to describe the defenses without asserting that the measure is 'unexploitable', instead noting that they are intended to mitigate exploitation risks. revision: yes
Circularity Check
No significant circularity; empirical results on external benchmark
full rationale
The paper's core claims rest on constructing PassNet-Dataset from 100K real-world models, curating PassBench from 2,060 subgraphs, and reporting measured speedups (including the 2.67x fine-tuning gain) under the ES_t metric. No equations, fitted parameters, or predictions are defined in terms of the reported outcomes; no self-citations are invoked as load-bearing uniqueness theorems; and the derivation chain consists of standard data collection, fine-tuning, and evaluation steps that do not reduce to self-reference by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
KernelBench: Can LLMs Write Efficient GPU Kernels?
Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels?arXiv preprint arXiv:2502.10517,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, et al. Cuda agent: Large-scale agentic rl for high-performance cuda kernel generation.arXiv preprint arXiv:2602.24286,
-
[3]
Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, et al. Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta.arXiv preprint arXiv:2512.23236,
-
[4]
Bladedisc: Optimizing dynamic shape machine learning workloads via compiler approach.Proc
Zhen Zheng, Zaifeng Pan, Dalin Wang, Kai Zhu, Wenyi Zhao, Tianyou Guo, Xiafei Qiu, Minmin Sun, Junjie Bai, Feng Zhang, Xiaoyong Du, Jidong Zhai, and Wei Lin. Bladedisc: Optimizing dynamic shape machine learning workloads via compiler approach.Proc. ACM Manag. Data, pages 206:1–206:29, 2023a. URLhttps://doi.org/10.1145/3617327. Baidu PaddlePaddle Team. Cin...
-
[5]
Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. A survey of large language models for code: Evolution, benchmarking, and future trends.arXiv preprint arXiv:2311.10372, 2023b. Ke Liu, Qinglin Wang, Xiang Chen, Guang Yang, Yigui Feng, Gencheng Liu, and Jie Liu. Evaluating and improving framework-based parallel c...
-
[6]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Chen, Yufan Yuan, Yuzhe Zhang, Bowen Li, Haotian Qian, Pengfei He, Ruiqi Lyu, Yikun Ma, Ziru Yu, et al. OpenHands: An open platform for AI software developers as generalist agents.arXiv preprint arXiv:2407.16741,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Compiler-r1: Towards agentic compiler auto-tuning with reinforcement learning
Haolin Pan, Hongyu Lin, Haoran Luo, Yang Liu, Kaichun Yao, Libo Zhang, Mingjie Xing, and Yanjun Wu. Compiler-r1: Towards agentic compiler auto-tuning with reinforcement learning. arXiv preprint arXiv:2506.15701,
-
[9]
Stark: Strategic team of agents for refining kernels.arXiv preprint arXiv:2510.16996,
Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, and Shuang Yang. Stark: Strategic team of agents for refining kernels.arXiv preprint arXiv:2510.16996,
-
[10]
Kevin: Multi-turn rl for generating cuda kernels.arXiv preprint arXiv:2507.11948,
Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, and Silas Alberti. Kevin: Multi-turn rl for generating cuda kernels.arXiv preprint arXiv:2507.11948,
-
[11]
URLhttps://arxiv.org/abs/2507.23194. Qirui Zhou, Yuanbo Wen, Ruizhi Chen, Ke Gao, Weiqiang Xiong, Ling Li, Qi Guo, Yanjun Wu, and Yunji Chen. Qimeng-gemm: Automatically generating high-performance matrix multiplication code by exploiting large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 22982–22990,
-
[12]
Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, and Chris Shum. Cuda-l1: Improving cuda optimiza- tion via contrastive reinforcement learning.arXiv preprint arXiv:2507.14111,
-
[13]
12 Sharan Narang and Baidu Research
URLhttps://arxiv.org/abs/2505.18574. 12 Sharan Narang and Baidu Research. Deepbench: Benchmarking deep learning operations on different hardware,
-
[14]
doi: 10.1109/CGO53902.2022.9741258
ISBN 9781665405843. doi: 10.1109/CGO53902.2022.9741258. URL https://doi.org/10.1109/CGO53902.2022.9741258. Aiden Grossman, Ludger Paehler, Konstantinos Parasyris, Tal Ben-Nun, Jacob Hegna, William S. Moses, Jose M Monsalve Diaz, Mircea Trofin, and Johannes Doerfert. Compile: A large IR dataset from production sources.Journal of Data-centric Machine Learni...
-
[15]
doi: 10.1109/tpds.2020.3030548
ISSN 2161-9883. doi: 10.1109/tpds.2020.3030548. Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents.arXiv preprint arXiv:2504.07164,
-
[16]
GLM-5: from Vibe Coding to Agentic Engineering
URL https://arxiv.org/abs/2602.15763. MiniMaxAI,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
URLhttps://arxiv.org/abs/2505.09388. 14 A Dataset Quality Constraints A user’s model, wrapped by thepass_net.extract, is symbolically traced to generate a standard- ized set of files. This set forms a complete PassNet graph, including the high-level IR of the com- putation graph (model.py), metadata for inputs and weights (input_meta.py, weight_meta.py), ...
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.