pith. machine review for the scientific record.

arxiv: 2604.26666 · v2 · submitted 2026-04-29 · 💻 cs.DC · cs.PF


FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow


Pith reviewed 2026-05-07 10:55 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords agentic kernel synthesis · CUTLASS transpilation · PyTorch optimization · transformer acceleration · compositional kernel generation · GPU pattern discovery · deep learning performance · auto-tuned kernels

The pith

A three-stage agentic workflow discovers patterns in PyTorch graphs and realizes them as verified CUTLASS kernels to accelerate transformer modules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FACT as a framework that applies AI agents across three stages to improve the speed of deep learning code written in PyTorch. First the system inspects a traced computation graph to match subgraphs against known optimization rules and retrieve examples. Next it turns each pattern into a CUTLASS kernel, checks correctness, and tunes it for the target GPU. Finally the kernels are assembled into a complete module that replaces the original PyTorch execution. A sympathetic reader would care because vendor libraries and compilers rely on fixed catalogs of optimizations that often miss opportunities for specific models, forcing experts to write custom low-level code. The method keeps the agent work anchored in mature CUTLASS templates so that basic correctness and efficiency do not have to be rediscovered from scratch.
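Stage 1 as described — matching subgraphs of the traced graph against a registry of optimization rules and ranking the hits — can be sketched in a few lines. The registry contents, pattern names, and priorities below are illustrative assumptions, not FACT's actual rule set.

```python
# Sketch of Stage 1 (pattern discovery): match op sequences in a traced
# graph against a registry of known optimization patterns. The registry
# entries and priorities are invented for illustration.
from typing import List, Tuple

# Hypothetical pattern registry: op subsequence -> (pattern name, priority)
PATTERN_REGISTRY = {
    ("matmul", "add", "relu"): ("gemm_bias_relu_epilogue", 10),
    ("matmul", "softmax", "matmul"): ("fused_attention_core", 20),
    ("layernorm", "matmul"): ("layernorm_gemm_fusion", 5),
}

def discover_patterns(trace: List[str]) -> List[Tuple[str, int, int]]:
    """Return (pattern_name, start_index, priority), highest priority first."""
    hits = []
    for ops, (name, prio) in PATTERN_REGISTRY.items():
        n = len(ops)
        for i in range(len(trace) - n + 1):
            if tuple(trace[i:i + n]) == ops:
                hits.append((name, i, prio))
    return sorted(hits, key=lambda h: -h[2])

# A toy traced transformer-block op list
trace = ["layernorm", "matmul", "softmax", "matmul", "add", "relu"]
print(discover_patterns(trace))
```

In the paper's pipeline each hit would then be handed to Stage 2 for CUTLASS realization; here the output is just the prioritized pattern list.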

Core claim

FACT is a three-stage agent-driven workflow that optimizes PyTorch modules through multi-pattern composition while grounding synthesis in CUTLASS C++. Pattern discovery inspects the traced graph, matches subgraphs to optimization rules, retrieves vetted examples, and outputs prioritized patterns. Pattern realization implements each pattern as a CUTLASS kernel, verifies it, and auto-tunes it. Pattern composition assembles the extensions into an optimized module for benchmarking. On Level 1 GEMM problems, auto-tuned CUTLASS kernels achieve 1.06x-1.18x speedups on A100 and 0.84x-1.80x performance variations on H100 over cuBLAS. On Level 3 transformer blocks against the PyTorch eager baseline, FACT achieves 2.03x speedup on MiniGPT and 1.41x on Llama 3 8B.

What carries the argument

The three-stage agentic workflow of pattern discovery on the traced graph, CUTLASS-based pattern realization with verification and auto-tuning, and pattern composition into an optimized module.

If this is right

  • Auto-tuned CUTLASS kernels achieve 1.06x-1.18x speedups on A100 and 0.84x-1.80x variations on H100 over cuBLAS for square, batched, and large-K matrix multiplies.
  • The full workflow reaches 2.03x speedup on MiniGPT transformer blocks over PyTorch eager execution, exceeding Inductor at 1.89x and TensorRT at 1.85x.
  • On Llama 3 8B the same workflow produces 1.41x speedup over PyTorch eager, better than the 1.17x from Inductor and 1.18x from TensorRT.
  • The dynamic pattern registry combined with agentic discovery supplies a practical route from any traced PyTorch module to deployable kernels.
  • The approach couples graph-level pattern finding with architecture-specific auto-tuning so that optimizations remain current as new hardware appears.
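The architecture-specific auto-tuning in these bullets is, at its core, a sweep over kernel configurations that keeps the fastest one (the starred points in Figure 4). A sketch under stated assumptions: the tile/stage search space and the toy cost model below are invented for illustration; FACT times real CUTLASS template instantiations on-device.

```python
# Sketch of the auto-tuning step: sweep tile configurations and keep the
# fastest. The configuration space and the toy cost model are assumptions;
# they stand in for real on-GPU timing of CUTLASS template parameters.
from itertools import product

def toy_latency(tile_m, tile_n, stages, M=4096, N=4096):
    """Invented cost model: tile count divided by pipeline stages,
    plus an arbitrary penalty for larger tiles."""
    waves = -(-M // tile_m) * -(-N // tile_n)          # ceil-div tile count
    return waves / stages + 0.01 * (tile_m + tile_n)

def autotune(tiles_m=(64, 128, 256), tiles_n=(64, 128, 256), stages=(2, 3, 4)):
    """Return the best (config, latency) over the full sweep."""
    return min(
        (((tm, tn, s), toy_latency(tm, tn, s))
         for tm, tn, s in product(tiles_m, tiles_n, stages)),
        key=lambda item: item[1],
    )

best_cfg, best_lat = autotune()
print(best_cfg)  # best (tile_m, tile_n, stages) under the toy model
```

Substituting measured kernel latencies for `toy_latency` turns this into the exhaustive per-architecture sweep the paper describes.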

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged pattern approach could be applied to other high-level frameworks whose execution graphs can be traced and mapped to low-level kernels.
  • Extending the registry with additional vetted examples might allow the agents to handle operations that current libraries still treat inefficiently.
  • Running the workflow inside training loops could let kernels adapt automatically as model shapes or hardware change during development.
  • The separation of discovery from realization suggests that future agents could propose entirely new optimization patterns beyond those already encoded in CUTLASS.

Load-bearing premise

The agentic pattern discovery, CUTLASS realization, and composition stages can consistently produce correct, verified kernels that outperform mature libraries without introducing bugs or requiring extensive manual oversight.
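The "verified" half of this premise presupposes a numerical check of every candidate kernel against the framework reference. A minimal sketch of such a check, with pure-Python stand-ins for the kernels and tolerances chosen as assumptions rather than taken from the paper:

```python
# Sketch of the Stage 2 verification step: compare a candidate kernel's
# output against a reference implementation within a tolerance. The
# tolerances are assumed, not reported values from the paper.
import math

def ref_matmul(a, b):
    """Reference GEMM over row-major nested lists (stands in for PyTorch)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def verify(candidate, a, b, rtol=1e-5, atol=1e-8):
    """True iff every element of candidate(a, b) matches the reference."""
    ref = ref_matmul(a, b)
    out = candidate(a, b)
    return all(
        math.isclose(out[i][j], ref[i][j], rel_tol=rtol, abs_tol=atol)
        for i in range(len(ref)) for j in range(len(ref[0]))
    )

# A "candidate kernel" that happens to agree with the reference
candidate = lambda a, b: ref_matmul(a, b)
print(verify(candidate, [[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))  # True
```

Whether the real workflow's verification is this strict — and over which input distributions — is exactly what the referee report below asks the authors to document.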

What would settle it

Applying the workflow to a new set of transformer operations or a different GPU and measuring either numerical errors in the output or performance no better than TensorRT would show that the three stages do not reliably deliver correct and faster kernels.

Figures

Figures reproduced from arXiv: 2604.26666 by Dimitrios S. Nikolopoulos, Sina Heidari.

Figure 1. Pattern mapping hierarchy.
Figure 2. Three-stage agentic workflow for whole-model kernel optimization.
Figure 3. Three-level hierarchy of CUTLASS kernel synthesis.
Figure 4. Auto-tuning results for all three Level 1 problems on A100. Each point is one configuration; stars mark the best.
Figure 5. Backend speedup comparison across Problems 1, 3, 6, and 44.
Figure 6. KernelBench 44_MiniGPTBlock on A100 with (𝐵, 𝑇, 𝐶) = (128, 512, 768). (a) End-to-end latency for PyTorch baseline, single-pattern optimizations, and both patterns composed. (b) Speedup relative to the PyTorch baseline measured in the same benchmark run as each optimized configuration (ablations leave the non-target subgraph in eager PyTorch).
Original abstract

Deep learning compilers and vendor libraries deliver strong baseline performance but their performance is bounded by finite, engineer-curated catalogs. When these omit needed optimizations, practitioners substitute hand-written CUDA or CUTLASS, demanding expertise in GPU microarchitecture and C++ template metaprogramming. Recent LLM-based agents target kernel generation in raw CUDA, forcing rediscovery of optimizations already encoded in mature libraries. We present FACT (Framework for Agentic CUTLASS Transpilation), a three-stage agent-driven workflow optimizing PyTorch modules through multi-pattern composition while grounding synthesis in CUTLASS C++. Pattern discovery inspects the traced graph, matches subgraphs to optimization rules, retrieves vetted examples, and outputs prioritized patterns. Pattern realization implements each pattern as a CUTLASS kernel, verifies, and auto-tunes. Pattern composition assembles extensions into an optimized module for benchmarking. We evaluate the workflow on KernelBench across NVIDIA A100 and H100 GPUs. On Level 1 GEMM problems (square, batched, large-K matrix multiply), auto-tuned CUTLASS kernels achieve 1.06x-1.18x speedups on A100 and 0.84x-1.80x performance variations on H100 over cuBLAS. On Level 3 transformer blocks against PyTorch eager baseline, FACT achieves 2.03x speedup on MiniGPT (vs. Inductor: 1.89x, TensorRT: 1.85x) and 1.41x on Llama 3 8B (vs. Inductor: 1.17x, TensorRT: 1.18x). Our framework couples agentic graph-level pattern discovery with architecture-specific auto-tuning and a dynamic pattern registry, offering a practical path from traced PyTorch modules to deployable kernels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents FACT, a three-stage agentic workflow (pattern discovery via LLM on traced PyTorch graphs, CUTLASS-based realization with verification and auto-tuning, and dynamic composition) for synthesizing optimized kernels from PyTorch modules. It claims concrete speedups on KernelBench: 1.06x-1.18x over cuBLAS on Level-1 GEMM problems (A100) with variations on H100, and on Level-3 transformer blocks 2.03x on MiniGPT and 1.41x on Llama 3 8B over PyTorch eager (outperforming Inductor and TensorRT).

Significance. If the workflow's reliability can be demonstrated, the grounding of agentic discovery in vetted CUTLASS kernels rather than raw CUDA generation would be a meaningful advance for automated optimization of DL modules, offering a practical bridge between high-level frameworks and architecture-specific performance. The use of a dynamic pattern registry and auto-tuning is a positive design choice that could support reproducibility.

major comments (2)
  1. [Evaluation (Level 3 results paragraph)] The Level-3 transformer block results (2.03x MiniGPT, 1.41x Llama 3 8B) are load-bearing for the central claim yet supply no quantitative metrics on pattern discovery success rate, number of LLM calls per pattern, fraction of patterns that pass verification and compose without error, or retry/failure statistics. This directly undermines assessment of whether the three-stage process reliably yields correct kernels without extensive manual oversight, as required by the weakest assumption.
  2. [Evaluation (abstract and Level 1/3 reporting)] No error bars, run counts, data exclusion rules, or verification procedures are reported for any speedup numbers (including the Level-1 GEMM range of 1.06x-1.18x on A100). This leaves the empirical claims without visible supporting derivation or analysis, consistent with the noted soundness gap.
minor comments (2)
  1. [Abstract] The abstract states performance 'variations' on H100 without clarifying whether these are speedups or regressions; a table or explicit breakdown would improve clarity.
  2. [Evaluation] Consider adding a dedicated subsection or table in the evaluation that reports agentic workflow statistics (success rates, LLM query counts, verification pass rates) to make the methodology reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for highlighting gaps in our empirical reporting. We address each major comment below and will incorporate revisions to improve transparency on workflow reliability and statistical rigor.

Point-by-point responses
  1. Referee: [Evaluation (Level 3 results paragraph)] The Level-3 transformer block results (2.03x MiniGPT, 1.41x Llama 3 8B) are load-bearing for the central claim yet supply no quantitative metrics on pattern discovery success rate, number of LLM calls per pattern, fraction of patterns that pass verification and compose without error, or retry/failure statistics. This directly undermines assessment of whether the three-stage process reliably yields correct kernels without extensive manual oversight, as required by the weakest assumption.

    Authors: We agree that these intermediate metrics are necessary to fully substantiate the reliability of the three-stage workflow. The current manuscript prioritizes end-to-end speedups but does not report success rates, LLM call counts, verification pass fractions, or retry statistics. In the revised version we will add a dedicated paragraph and table in the Evaluation section that quantifies: (i) pattern discovery success rate over the KernelBench Level-3 modules, (ii) average LLM calls per discovered pattern, (iii) fraction of patterns that passed CUTLASS verification and composed without error, and (iv) any retry or failure counts observed during our runs. These additions will directly address concerns about manual oversight. revision: yes

  2. Referee: [Evaluation (abstract and Level 1/3 reporting)] No error bars, run counts, data exclusion rules, or verification procedures are reported for any speedup numbers (including the Level-1 GEMM range of 1.06x-1.18x on A100). This leaves the empirical claims without visible supporting derivation or analysis, consistent with the noted soundness gap.

    Authors: The observation is correct; the manuscript omits these methodological details. While all reported speedups were obtained from repeated executions on A100 and H100 hardware, the submission did not include run counts, error bars, exclusion rules, or verification procedures. We will revise the Evaluation and abstract sections to report: the number of runs per benchmark (10), standard-deviation error bars on all speedup figures, confirmation that no data were excluded beyond standard outlier filtering for hardware noise, and a concise description of the numerical verification steps used to confirm kernel correctness against PyTorch references. This will supply the missing derivation for both Level-1 and Level-3 results. revision: yes
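The statistics the simulated authors promise — per-benchmark mean speedup with standard-deviation error bars over 10 runs — reduce to simple summary computations. A sketch with invented latencies, not measured values from the paper:

```python
# Sketch of the promised statistical reporting: speedup mean and sample
# standard deviation over repeated runs. The latencies are invented.
import statistics

def speedup_stats(baseline_ms, optimized_ms):
    """Per-run speedups paired run-by-run, summarized as (mean, stdev)."""
    speedups = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    return statistics.mean(speedups), statistics.stdev(speedups)

baseline = [10.1, 10.0, 9.9, 10.2, 10.0, 10.1, 9.8, 10.0, 10.1, 9.9]  # eager, ms
optimized = [5.0, 4.9, 5.1, 5.0, 4.8, 5.0, 5.1, 4.9, 5.0, 5.0]       # optimized, ms
mean, std = speedup_stats(baseline, optimized)
print(f"{mean:.2f}x ± {std:.2f}")
```

Pairing runs before dividing (rather than dividing the means) keeps per-run hardware noise from inflating the reported speedup, which is the kind of methodological detail the referee asks to see stated.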

Circularity Check

0 steps flagged

No circularity; empirical benchmarks against external baselines

Full rationale

The paper describes a three-stage agentic workflow (pattern discovery on traced graphs, CUTLASS realization with verification, and composition) and reports measured speedups on KernelBench, MiniGPT, and Llama 3 8B against cuBLAS, PyTorch Inductor, and TensorRT. These are direct empirical results, not derived quantities. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in a load-bearing role that would make any claim equivalent to its inputs by construction. The derivation chain is self-contained and externally falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; none can be extracted or inferred from the given text.

pith-pipeline@v0.9.0 · 5635 in / 1282 out tokens · 89755 ms · 2026-05-07T10:55:13.464155+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 9 canonical work pages

  1. [1]

    Jason Ansel, Peng Wu, Horace He, Animesh Jain, Mario Lezcano, Doricha Lazowska, Peter Gan, Melissa Gormish, Zhiqiang Chen, Mu Li Li, Zain Li, Selu Ramezani, et al. 2024. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In Proceedings of the 29th ACM International Conference on Architectural Support...

  2. [2]

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 578–594.

  3. [3]

    Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, and Hao Zhou. 2026. CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation. arXiv preprint arXiv:2602.24286 (2026). doi:10.48550/arXiv.2602.24286

  4. [4]

    Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, and Christos Kozyrakis. 2026. KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning. arXiv preprint arXiv:2602.14293 (2026). doi:10.48550/arXiv.2602.14293

  5. [5]

    Charles Hong, Sahil Bhatia, Alvin Cheung, and Sophia Shao. 2025. Autocomp: LLM-Driven Code Optimization for Tensor Accelerators. In MLArchSys 2025 (Oral). https://openreview.net/forum?id=bPdQZedlsr

  6. [6]

    Yoon Noh Lee, Yongseung Yu, and Yongjun Park. 2025. CUrator: An Efficient LLM Execution Engine with Optimized Integration of CUDA Libraries. In Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization (CGO ’25). ACM, 209–224. doi:10.1145/3696443.3708944

  7. [7]

    Mingzhen Li, Hailong Yang, Shanjun Zhang, Fengwei Yu, Ruihao Gong, and Yi Liu. 2023. Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs. In Proceedings of the 52nd International Conference on Parallel Processing (ICPP ’23), 786–796. doi:10.1145/3605573.3605596

  8. [8]

    Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, and Caiwen Ding. 2026. StitchCUDA: An Automated Multi-Agents End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning. arXiv preprint arXiv:2603.02637 (2026). doi:10.48550/arXiv.2603.02637

  9. [9]

    NVIDIA Corporation. 2020. NVIDIA A100 Tensor Core GPU Datasheet. Datasheet. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf

  10. [10]

    NVIDIA Corporation. 2025. CUTLASS Documentation. https://docs.nvidia.com/cutlass/latest/. Accessed: 2026-03-16

  11. [11]

    Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, and John D. Owens. 2023. Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU. In Proceedings of the 28th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’23), 367–. doi:10.1145/3572848.3577479

  12. [12]

    Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026.

  13. [13]

    Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. KernelBench: Can LLMs Write Efficient GPU Kernels? arXiv preprint arXiv:2502.10517 (2025)

  14. [14]

    Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. 2025. Astra: A Multi-Agent System for GPU Kernel Performance Optimization. arXiv preprint arXiv:2509.07506 (2025). doi:10.48550/arXiv.2509.07506

  15. [15]

    Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 863–879.