Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs
Pith reviewed 2026-05-08 17:41 UTC · model grok-4.3
The pith
Kerncap automates extraction of isolated, recompilable GPU kernels from large AMD applications via HSA dispatch interception and virtual-address-faithful memory closure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kerncap intercepts dispatches at the HSA runtime for both HIP and Triton, bridging Triton's JIT-only metadata into HSA-level capture via a lightweight Python compile-hook shim. It performs an address-space closure of all device memory to create a virtual-address-faithful snapshot that preserves embedded device pointers without DWARF metadata or pointer chasing, locates kernel sources, and emits self-contained reproducer projects. HIP reproducers use a Clang VFS overlay for source-level recompilation without modifying the original build system; Triton reproducers are tuning-pinned to bind the captured autotuner configuration into the artifact and preserve the JIT kernel's numerical contract.
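Tuning-pinning can be sketched as follows. This is an illustrative Python sketch under assumed names, not Kerncap's actual artifact format: Triton separates compile-time meta-parameters (e.g. block sizes) from launch options such as `num_warps` and `num_stages`, and a pinned reproducer replays both exactly as captured instead of re-running the autotuner.

```python
# Hypothetical captured autotuner configuration; the key names BLOCK_M/BLOCK_N
# are illustrative, num_warps/num_stages are standard Triton launch options.
pinned_cfg = {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 8, "num_stages": 2}

def split_config(cfg):
    """Split a captured config into compile-time meta-parameters and
    launch options, so a reproducer can replay the exact tuned variant."""
    launch = {k: cfg[k] for k in ("num_warps", "num_stages") if k in cfg}
    meta = {k: v for k, v in cfg.items() if k not in launch}
    return meta, launch

meta, launch = split_config(pinned_cfg)
# In a real Triton reproducer this would be replayed as, roughly:
#   kernel[grid](*captured_args, **meta, **launch)
# bypassing @triton.autotune entirely, which is what preserves the
# JIT kernel's numerical contract.
```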
What carries the argument
Virtual-address-faithful address-space closure of device memory that captures all reachable pointers and data from HSA dispatch points to form complete, dependency-free snapshots.
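The closure idea can be illustrated with a toy model (a Python sketch, not Kerncap's code; `AddressSpace` and the addresses are invented): because every region is restored at its original virtual address, a pointer embedded inside one buffer still resolves after restore, with no DWARF type metadata and no pointer chasing.

```python
import struct

class AddressSpace:
    """Toy device address space: base virtual address -> buffer contents."""
    def __init__(self):
        self.regions = {}

    def alloc(self, base, size):
        self.regions[base] = bytearray(size)

    def write_ptr(self, region_base, offset, target_va):
        # Embed a 64-bit device pointer inside a buffer, as a pointer
        # table (e.g. an MoE weight pool) would.
        struct.pack_into("<Q", self.regions[region_base], offset, target_va)

    def read_ptr(self, region_base, offset):
        return struct.unpack_from("<Q", self.regions[region_base], offset)[0]

def closure_snapshot(space):
    # Dump every live region together with its base VA. No per-buffer type
    # info is needed: validity of embedded pointers follows from restoring
    # each region at the *same* virtual address.
    return {base: bytes(buf) for base, buf in space.regions.items()}

def restore(snapshot):
    restored = AddressSpace()
    for base, data in snapshot.items():
        restored.regions[base] = bytearray(data)
    return restored
```

For example, a pointer table at `0x7000` whose first slot points at weights at `0x9000` survives a snapshot/restore round trip unchanged, because both regions come back at their original addresses.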
If this is right
- Kernels from complex applications such as vLLM Mixture-of-Experts can be isolated while preserving numerical contracts through pointer indirection.
- The edit-recompile-validate loop for kernel tuning reduces from multi-hour manual processes to a single automated command.
- Generated reproducers serve directly as evaluation substrates for autotuning agents and LLM-driven kernel generators.
- Extraction succeeds across traditional HPC and ML domains on CDNA2, CDNA3, and RDNA3 architectures with snapshot sizes from 152 MB to 30 GB.
Where Pith is reading between the lines
- The captured reproducers could be reused as portable test cases for cross-vendor kernel validation once similar capture hooks exist on other runtimes.
- Address-space closure might enable automated differential testing by comparing isolated kernel outputs against full-application runs under varying inputs.
- Integration into build pipelines could allow continuous monitoring of kernel performance regressions without rebuilding entire applications each time.
- The method opens a path to snapshot-based kernel archaeology, letting developers study historical kernel versions extracted from production workloads.
Load-bearing premise
Intercepting dispatches at the HSA runtime and performing a virtual-address-faithful address-space closure will always produce a complete, dependency-free snapshot that permits successful source-level recompilation and numerical validation without manual reconstruction of build flags, runtime inputs, or missing device pointers.
What would settle it
A kernel dispatch whose device pointers or memory regions lie outside the reachable closure from the intercepted HSA launch, causing the generated reproducer to fail recompilation or produce numerically different results from the original application.
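This falsifier is mechanically checkable. A minimal sketch, assuming the snapshot records each captured region as a `(base_va, size)` pair (the helper names are hypothetical): a dispatch is covered only if every pointer argument lands inside some captured region.

```python
def in_closure(ptr, regions):
    """True if a device pointer falls inside any captured (base_va, size) region."""
    return any(base <= ptr < base + size for base, size in regions)

def uncovered(ptr_args, regions):
    """Pointer arguments of a dispatch that escape the closure; a non-empty
    result is exactly the failure mode the falsifier describes."""
    return [p for p in ptr_args if not in_closure(p, regions)]
```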
Original abstract
Iterative GPU kernel tuning is bottlenecked by the scale of the applications that host the kernels. Rapid iteration requires isolating the kernel so it can be edited, recompiled, and validated without rebuilding the full application -- but manual isolation requires reconstructing build flags, dispatch configuration, and runtime inputs by hand, so developers usually settle for slow in-place edits. We present Kerncap, an automated kernel extraction tool that intercepts dispatches at the HSA runtime for both HIP and Triton, bridging Triton's JIT-only metadata into HSA-level capture via a lightweight Python compile-hook shim. Kerncap performs an address-space closure of all device memory -- a virtual-address-faithful snapshot that preserves embedded device pointers without DWARF metadata or pointer chasing -- locates kernel sources, and emits self-contained reproducer projects. HIP reproducers use a Clang VFS overlay for source-level recompilation without modifying the original build system; Triton reproducers are tuning-pinned, binding the captured autotuner configuration into the artifact to preserve the JIT kernel's numerical contract. Across six real-world HIP and Triton workloads spanning traditional HPC and ML domains on three AMD GPU architectures (CDNA2, CDNA3, RDNA3), Kerncap extracts and validates kernels from snapshots ranging from 152 MB to 30 GB -- including a VA-faithful capture of vLLM's Mixture-of-Experts weight pool reached through pointer indirection. On our llama-cpp case study, Kerncap's edit-recompile-validate loop achieves a 13.6x speedup over the traditional workflow, reducing kernel isolation from a multi-hour process to a single command. The resulting reproducers also serve as a substrate for autotuning agents and LLM-driven kernel generators that need rapid, isolated evaluation of candidates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Kerncap, a tool for automated extraction and isolation of GPU kernels from HIP and Triton applications on AMD GPUs. It intercepts kernel dispatches at the HSA runtime, performs a virtual-address-faithful address-space closure to snapshot device memory (including pointer indirection without DWARF), locates sources, and emits self-contained reproducer projects. HIP reproducers use Clang VFS overlays for recompilation; Triton ones pin autotuner configs. Evaluation on six real-world workloads across CDNA2/CDNA3/RDNA3 architectures (snapshots 152 MB to 30 GB, including vLLM MoE) shows successful extraction/validation, with a 13.6x speedup on the llama-cpp case study reducing isolation to a single command.
Significance. If the capture mechanism proves robust across workloads, Kerncap would meaningfully accelerate iterative kernel tuning and debugging for large-scale GPU applications in HPC and ML. The reported 13.6x reduction in isolation time, support for massive snapshots with indirection, and potential as a substrate for autotuning agents or LLM-driven generators represent practical engineering contributions. The cross-architecture evaluation on diverse domains strengthens the case for adoption if limitations are addressed.
Major comments (3)
- [§3] §3 (HSA Interception and Address-Space Closure): The central claim that intercepting dispatches plus VA-faithful closure (without DWARF or explicit pointer chasing) always yields a complete, dependency-free snapshot permitting source-level recompilation and numerical validation is load-bearing. The description does not provide a formal argument or exhaustive enumeration of captured vs. uncaptured elements (e.g., post-dispatch allocations, host-side scalars not passed as arguments, or build-system preprocessor state not replicated by VFS).
- [§5] §5 (Evaluation and Case Studies): The six-workload results and vLLM MoE example demonstrate success under tested conditions, but the paper does not report attempts to trigger or measure the failure modes raised by the assumption (post-dispatch allocations, complex indirection not materialized at dispatch, missing runtime inputs). Without such negative results or a limitations subsection quantifying coverage, the generalizability of the 13.6x speedup and 'single command' claim remains under-supported.
- [§4.2] §4.2 (Reproducer Generation for HIP/Triton): The Clang VFS overlay and Triton autotuner pinning are presented as sufficient to preserve the numerical contract, but no concrete evidence (e.g., build-flag reconstruction accuracy or comparison of original vs. reproducer compiler invocations) is given for cases where include paths or JIT metadata differ from the captured state.
Minor comments (2)
- A table summarizing the six workloads (domain, size, architecture, kernel type) would improve readability and allow readers to assess coverage.
- The abstract and introduction use 'VA-faithful' without an early definition or reference to the precise closure algorithm; a short paragraph in §2 would help.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important areas for strengthening the presentation of Kerncap's capture guarantees and evaluation. We address each major comment below, clarifying the technical approach where the manuscript was concise and committing to revisions that add explicit enumeration, a limitations discussion, and supporting evidence for reproducer fidelity.
Point-by-point responses
Referee: [§3] The central claim that intercepting dispatches plus VA-faithful closure (without DWARF or explicit pointer chasing) always yields a complete, dependency-free snapshot permitting source-level recompilation and numerical validation is load-bearing. The description does not provide a formal argument or exhaustive enumeration of captured vs. uncaptured elements (e.g., post-dispatch allocations, host-side scalars not passed as arguments, or build-system preprocessor state not replicated by VFS).
Authors: We agree that an explicit enumeration strengthens the paper. Kerncap's address-space closure operates by walking the HSA virtual address space at dispatch time and dumping every allocated device buffer (including those reached only via embedded pointers), which is why the vLLM MoE weights were captured without DWARF. Captured elements are: all device memory regions live at the intercepted dispatch, kernel launch parameters, and (for HIP) the source files referenced by the kernel symbol. Uncaptured by design are: allocations performed after the dispatch, host-side scalars never passed as kernel arguments, and any build-system state (e.g., preprocessor macros) that is not part of the captured source tree. We will add a concise enumeration table in §3 and a dedicated limitations paragraph that explicitly lists these boundaries. revision: partial
Referee: [§5] The six-workload results and vLLM MoE example demonstrate success under tested conditions, but the paper does not report attempts to trigger or measure the failure modes raised by the assumption (post-dispatch allocations, complex indirection not materialized at dispatch, missing runtime inputs). Without such negative results or a limitations subsection quantifying coverage, the generalizability of the 13.6x speedup and 'single command' claim remains under-supported.
Authors: The evaluation intentionally exercised large, pointer-indirect workloads (vLLM MoE at 30 GB) and cross-architecture cases to stress the closure mechanism. We did not, however, include deliberate negative experiments that would trigger post-dispatch allocations or missing runtime inputs. We will add a new limitations subsection in §5 that quantifies coverage across the six workloads, discusses the classes of kernels that would fail (e.g., those allocating device memory after launch), and reports the observed success rate on the tested suite. This will better bound the 13.6x claim to workloads whose memory footprint is materialized by dispatch time. revision: yes
Referee: [§4.2] The Clang VFS overlay and Triton autotuner pinning are presented as sufficient to preserve the numerical contract, but no concrete evidence (e.g., build-flag reconstruction accuracy or comparison of original vs. reproducer compiler invocations) is given for cases where include paths or JIT metadata differ from the captured state.
Authors: The VFS overlay is constructed directly from the paths and contents recorded at capture; the reproducer therefore uses exactly the same include search order and file contents as the original process. For Triton, the pinned autotuner configuration is the exact set of tuning knobs that produced the captured kernel. While we did not include a side-by-side compiler-invocation diff in the original manuscript, we have since verified that the generated build commands match the originals for the evaluated HIP cases. We will add a short table in §4.2 showing the captured versus reproduced compiler flags and include paths for two representative workloads, together with a statement that numerical equivalence was confirmed by running the reproducers against the original application outputs. revision: partial
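For context, Clang's VFS overlay is a standard feature: a YAML/JSON file, passed via `-ivfsoverlay`, that remaps original source paths to other files on disk. A hedged sketch of how a reproducer might synthesize one (the paths and the `make_overlay` helper are hypothetical; the overlay schema and the flag are Clang's own):

```python
import json

def make_overlay(mapping):
    """Build a Clang VFS overlay mapping original source paths to
    captured copies, so the original include paths resolve unchanged.
    mapping: original path -> path of the captured copy on disk."""
    roots = []
    for orig, captured in mapping.items():
        directory, _, name = orig.rpartition("/")
        roots.append({
            "name": directory or "/",
            "type": "directory",
            "contents": [
                {"name": name, "type": "file", "external-contents": captured}
            ],
        })
    # JSON is valid YAML, so Clang accepts the file as written.
    return {"version": 0, "roots": roots}

overlay = make_overlay({
    "/src/app/kernels/gemm.hip": "/capture/files/gemm.hip",  # hypothetical paths
})
with open("overlay.yaml", "w") as f:
    json.dump(overlay, f, indent=2)
# Recompile with the original paths intact:
#   clang++ -ivfsoverlay overlay.yaml ... /src/app/kernels/gemm.hip
```

Because the compiler sees the original paths, the original build system and include search order need no modification, which is the property the response above relies on.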
Circularity Check
No circularity: applied systems paper with external empirical validation
Full rationale
This is a systems-engineering paper describing an automated kernel extraction tool that intercepts HSA dispatches and performs virtual-address-faithful address-space closure. It contains no mathematical derivations, equations, fitted parameters, or predictions that could reduce to prior inputs by construction. All claims are supported by implementation details and evaluation on six independent real-world workloads (HIP and Triton) spanning multiple AMD GPU architectures, with no load-bearing self-citations or self-referential definitions. The work is validated against external benchmarks rather than its own constructions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption HSA runtime dispatches can be intercepted to capture all kernel launch parameters and device memory states in a virtual-address-faithful manner without DWARF metadata or explicit pointer chasing.
- domain assumption A Clang VFS overlay can supply the necessary source files for recompilation of HIP kernels without requiring changes to the original application build system.
Reference graph
Works this paper leans on
- [1] AMD. 2026. ROCm Compute Profiler Documentation. https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/
- [2] AMD. 2026. ROCProfiler-SDK Documentation. https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/
- [3] AMD. 2026. ROCTracer Documentation. https://rocm.docs.amd.com/projects/roctracer/en/latest/
- [4] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG]. https://arxiv.org/abs/2205.14135
- [5] Georgi Gerganov and contributors. 2023. llama.cpp: LLM inference in C/C++. https://github.com/ggml-org/llama.cpp. Accessed 2026-04-17.
- [6] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- [7] NVIDIA. 2026. CUDA-GDB: The NVIDIA CUDA Debugger. https://developer.nvidia.com/cuda-gdb
- [8] NVIDIA. 2026. CUPTI: Checkpoint API. https://docs.nvidia.com/cupti/api/group__CUPTI__CHECKPOINT__API.html
- [9] NVIDIA. 2026. NVIDIA Nsight Compute: Kernel Profiling Guide. https://docs.nvidia.com/nsight-compute/index.html
- [11] Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. KernelBench: Can LLMs Write Efficient GPU Kernels? arXiv:2502.10517 [cs.LG]. https://arxiv.org/abs/2502.10517
- [12] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library.
- [13] Sivasankaran Rajamanickam, Seher Acer, Luc Berger-Vergiat, Vinh Dang, Nathan Ellingwood, Evan Harvey, Brian Kelley, Christian R. Trott, Jeremiah Wilke, and Ichitaro Yamazaki. 2021. Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels. arXiv:2103.11991 [cs.MS]. https://arxiv.org/abs/2103.11991
- [15] The Clang Team. 2026. JSON Compilation Database Format Specification. https://clang.llvm.org/docs/JSONCompilationDatabase.html
- [16] A. P. Thompson, H. M. Aktulga, R. Berger, D. S. Bolintineanu, W. M. Brown, P. S. Crozier, P. J. in 't Veld, A. Kohlmeyer, S. G. Moore, T. D. Nguyen, R. Shan, M. J. Stevens, J. Tranchida, C. Trott, and S. J. Plimpton. 2022. LAMMPS - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Comp. Phys. Comm.
- [17] Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2019). ACM, New York, NY, USA, 10–19. doi:10.1145/3315508.3329973
- [18] Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, and Emad Barsoum. Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks. arXiv:2507.23194 [cs.CL]. https://arxiv.org/abs/2507.23194