Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs
Pith reviewed 2026-05-08 17:41 UTC · model grok-4.3
The pith
Kerncap automates extraction of isolated, recompilable GPU kernels from large AMD applications via HSA dispatch interception and virtual-address-faithful memory closure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kerncap intercepts dispatches at the HSA runtime for both HIP and Triton, bridging Triton's JIT-only metadata into HSA-level capture via a lightweight Python compile-hook shim. It performs an address-space closure of all device memory to create a virtual-address-faithful snapshot that preserves embedded device pointers without DWARF metadata or pointer chasing, locates kernel sources, and emits self-contained reproducer projects. HIP reproducers use a Clang VFS overlay for source-level recompilation without modifying the original build system; Triton reproducers are tuning-pinned to bind the captured autotuner configuration into the artifact and preserve the JIT kernel's numerical contract.
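Tuning-pinning can be sketched as follows. This is an illustrative Python sketch under assumed names, not Kerncap's actual artifact format: Triton separates compile-time meta-parameters (e.g. block sizes) from launch options such as `num_warps` and `num_stages`, and a pinned reproducer replays both exactly as captured instead of re-running the autotuner.

```python
# Hypothetical captured autotuner configuration; the key names BLOCK_M/BLOCK_N
# are illustrative, num_warps/num_stages are standard Triton launch options.
pinned_cfg = {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 8, "num_stages": 2}

def split_config(cfg):
    """Split a captured config into compile-time meta-parameters and
    launch options, so a reproducer can replay the exact tuned variant."""
    launch = {k: cfg[k] for k in ("num_warps", "num_stages") if k in cfg}
    meta = {k: v for k, v in cfg.items() if k not in launch}
    return meta, launch

meta, launch = split_config(pinned_cfg)
# In a real Triton reproducer this would be replayed as, roughly:
#   kernel[grid](*captured_args, **meta, **launch)
# bypassing @triton.autotune entirely, which is what preserves the
# JIT kernel's numerical contract.
```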
What carries the argument
Virtual-address-faithful address-space closure of device memory that captures all reachable pointers and data from HSA dispatch points to form complete, dependency-free snapshots.
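The closure idea can be illustrated with a toy model (a Python sketch, not Kerncap's code; `AddressSpace` and the addresses are invented): because every region is restored at its original virtual address, a pointer embedded inside one buffer still resolves after restore, with no DWARF type metadata and no pointer chasing.

```python
import struct

class AddressSpace:
    """Toy device address space: base virtual address -> buffer contents."""
    def __init__(self):
        self.regions = {}

    def alloc(self, base, size):
        self.regions[base] = bytearray(size)

    def write_ptr(self, region_base, offset, target_va):
        # Embed a 64-bit device pointer inside a buffer, as a pointer
        # table (e.g. an MoE weight pool) would.
        struct.pack_into("<Q", self.regions[region_base], offset, target_va)

    def read_ptr(self, region_base, offset):
        return struct.unpack_from("<Q", self.regions[region_base], offset)[0]

def closure_snapshot(space):
    # Dump every live region together with its base VA. No per-buffer type
    # info is needed: validity of embedded pointers follows from restoring
    # each region at the *same* virtual address.
    return {base: bytes(buf) for base, buf in space.regions.items()}

def restore(snapshot):
    restored = AddressSpace()
    for base, data in snapshot.items():
        restored.regions[base] = bytearray(data)
    return restored
```

For example, a pointer table at `0x7000` whose first slot points at weights at `0x9000` survives a snapshot/restore round trip unchanged, because both regions come back at their original addresses.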
If this is right
- Kernels from complex applications such as vLLM Mixture-of-Experts can be isolated while preserving numerical contracts through pointer indirection.
- The edit-recompile-validate loop for kernel tuning reduces from multi-hour manual processes to a single automated command.
- Generated reproducers serve directly as evaluation substrates for autotuning agents and LLM-driven kernel generators.
- Extraction succeeds across traditional HPC and ML domains on CDNA2, CDNA3, and RDNA3 architectures with snapshot sizes from 152 MB to 30 GB.
Where Pith is reading between the lines
- The captured reproducers could be reused as portable test cases for cross-vendor kernel validation once similar capture hooks exist on other runtimes.
- Address-space closure might enable automated differential testing by comparing isolated kernel outputs against full-application runs under varying inputs.
- Integration into build pipelines could allow continuous monitoring of kernel performance regressions without rebuilding entire applications each time.
- The method opens a path to snapshot-based kernel archaeology, letting developers study historical kernel versions extracted from production workloads.
Load-bearing premise
Intercepting dispatches at the HSA runtime and performing a virtual-address-faithful address-space closure will always produce a complete, dependency-free snapshot that permits successful source-level recompilation and numerical validation without manual reconstruction of build flags, runtime inputs, or missing device pointers.
What would settle it
A kernel dispatch whose device pointers or memory regions lie outside the reachable closure from the intercepted HSA launch, causing the generated reproducer to fail recompilation or produce numerically different results from the original application.
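This falsifier is mechanically checkable. A minimal sketch, assuming the snapshot records each captured region as a `(base_va, size)` pair (the helper names are hypothetical): a dispatch is covered only if every pointer argument lands inside some captured region.

```python
def in_closure(ptr, regions):
    """True if a device pointer falls inside any captured (base_va, size) region."""
    return any(base <= ptr < base + size for base, size in regions)

def uncovered(ptr_args, regions):
    """Pointer arguments of a dispatch that escape the closure; a non-empty
    result is exactly the failure mode the falsifier describes."""
    return [p for p in ptr_args if not in_closure(p, regions)]
```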
Original abstract
Iterative GPU kernel tuning is bottlenecked by the scale of the applications that host the kernels. Rapid iteration requires isolating the kernel so it can be edited, recompiled, and validated without rebuilding the full application -- but manual isolation requires reconstructing build flags, dispatch configuration, and runtime inputs by hand, so developers usually settle for slow in-place edits. We present Kerncap, an automated kernel extraction tool that intercepts dispatches at the HSA runtime for both HIP and Triton, bridging Triton's JIT-only metadata into HSA-level capture via a lightweight Python compile-hook shim. Kerncap performs an address-space closure of all device memory -- a virtual-address-faithful snapshot that preserves embedded device pointers without DWARF metadata or pointer chasing -- locates kernel sources, and emits self-contained reproducer projects. HIP reproducers use a Clang VFS overlay for source-level recompilation without modifying the original build system; Triton reproducers are tuning-pinned, binding the captured autotuner configuration into the artifact to preserve the JIT kernel's numerical contract. Across six real-world HIP and Triton workloads spanning traditional HPC and ML domains on three AMD GPU architectures (CDNA2, CDNA3, RDNA3), Kerncap extracts and validates kernels from snapshots ranging from 152 MB to 30 GB -- including a VA-faithful capture of vLLM's Mixture-of-Experts weight pool reached through pointer indirection. On our llama-cpp case study, Kerncap's edit-recompile-validate loop achieves a 13.6x speedup over the traditional workflow, reducing kernel isolation from a multi-hour process to a single command. The resulting reproducers also serve as a substrate for autotuning agents and LLM-driven kernel generators that need rapid, isolated evaluation of candidates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Kerncap, a tool for automated extraction and isolation of GPU kernels from HIP and Triton applications on AMD GPUs. It intercepts kernel dispatches at the HSA runtime, performs a virtual-address-faithful address-space closure to snapshot device memory (including pointer indirection without DWARF), locates sources, and emits self-contained reproducer projects. HIP reproducers use Clang VFS overlays for recompilation; Triton ones pin autotuner configs. Evaluation on six real-world workloads across CDNA2/CDNA3/RDNA3 architectures (snapshots 152 MB to 30 GB, including vLLM MoE) shows successful extraction/validation, with a 13.6x speedup on the llama-cpp case study reducing isolation to a single command.
Significance. If the capture mechanism proves robust across workloads, Kerncap would meaningfully accelerate iterative kernel tuning and debugging for large-scale GPU applications in HPC and ML. The reported 13.6x reduction in isolation time, support for massive snapshots with indirection, and potential as a substrate for autotuning agents or LLM-driven generators represent practical engineering contributions. The cross-architecture evaluation on diverse domains strengthens the case for adoption if limitations are addressed.
Major comments (3)
- [§3] §3 (HSA Interception and Address-Space Closure): The central claim that intercepting dispatches plus VA-faithful closure (without DWARF or explicit pointer chasing) always yields a complete, dependency-free snapshot permitting source-level recompilation and numerical validation is load-bearing. The description does not provide a formal argument or exhaustive enumeration of captured vs. uncaptured elements (e.g., post-dispatch allocations, host-side scalars not passed as arguments, or build-system preprocessor state not replicated by VFS).
- [§5] §5 (Evaluation and Case Studies): The six-workload results and vLLM MoE example demonstrate success under tested conditions, but the paper does not report attempts to trigger or measure the failure modes raised by the assumption (post-dispatch allocations, complex indirection not materialized at dispatch, missing runtime inputs). Without such negative results or a limitations subsection quantifying coverage, the generalizability of the 13.6x speedup and 'single command' claim remains under-supported.
- [§4.2] §4.2 (Reproducer Generation for HIP/Triton): The Clang VFS overlay and Triton autotuner pinning are presented as sufficient to preserve the numerical contract, but no concrete evidence (e.g., build-flag reconstruction accuracy or comparison of original vs. reproducer compiler invocations) is given for cases where include paths or JIT metadata differ from the captured state.
Minor comments (2)
- A table summarizing the six workloads (domain, size, architecture, kernel type) would improve readability and allow readers to assess coverage.
- The abstract and introduction use 'VA-faithful' without an early definition or reference to the precise closure algorithm; a short paragraph in §2 would help.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important areas for strengthening the presentation of Kerncap's capture guarantees and evaluation. We address each major comment below, clarifying the technical approach where the manuscript was concise and committing to revisions that add explicit enumeration, a limitations discussion, and supporting evidence for reproducer fidelity.
Point-by-point responses
Referee: [§3] The central claim that intercepting dispatches plus VA-faithful closure (without DWARF or explicit pointer chasing) always yields a complete, dependency-free snapshot permitting source-level recompilation and numerical validation is load-bearing. The description does not provide a formal argument or exhaustive enumeration of captured vs. uncaptured elements (e.g., post-dispatch allocations, host-side scalars not passed as arguments, or build-system preprocessor state not replicated by VFS).
Authors: We agree that an explicit enumeration strengthens the paper. Kerncap's address-space closure operates by walking the HSA virtual address space at dispatch time and dumping every allocated device buffer (including those reached only via embedded pointers), which is why the vLLM MoE weights were captured without DWARF. Captured elements are: all device memory regions live at the intercepted dispatch, kernel launch parameters, and (for HIP) the source files referenced by the kernel symbol. Uncaptured by design are: allocations performed after the dispatch, host-side scalars never passed as kernel arguments, and any build-system state (e.g., preprocessor macros) that is not part of the captured source tree. We will add a concise enumeration table in §3 and a dedicated limitations paragraph that explicitly lists these boundaries. revision: partial
Referee: [§5] The six-workload results and vLLM MoE example demonstrate success under tested conditions, but the paper does not report attempts to trigger or measure the failure modes raised by the assumption (post-dispatch allocations, complex indirection not materialized at dispatch, missing runtime inputs). Without such negative results or a limitations subsection quantifying coverage, the generalizability of the 13.6x speedup and 'single command' claim remains under-supported.
Authors: The evaluation intentionally exercised large, pointer-indirect workloads (vLLM MoE at 30 GB) and cross-architecture cases to stress the closure mechanism. We did not, however, include deliberate negative experiments that would trigger post-dispatch allocations or missing runtime inputs. We will add a new limitations subsection in §5 that quantifies coverage across the six workloads, discusses the classes of kernels that would fail (e.g., those allocating device memory after launch), and reports the observed success rate on the tested suite. This will better bound the 13.6x claim to workloads whose memory footprint is materialized by dispatch time. revision: yes
Referee: [§4.2] The Clang VFS overlay and Triton autotuner pinning are presented as sufficient to preserve the numerical contract, but no concrete evidence (e.g., build-flag reconstruction accuracy or comparison of original vs. reproducer compiler invocations) is given for cases where include paths or JIT metadata differ from the captured state.
Authors: The VFS overlay is constructed directly from the paths and contents recorded at capture; the reproducer therefore uses exactly the same include search order and file contents as the original process. For Triton, the pinned autotuner configuration is the exact set of tuning knobs that produced the captured kernel. While we did not include a side-by-side compiler-invocation diff in the original manuscript, we have since verified that the generated build commands match the originals for the evaluated HIP cases. We will add a short table in §4.2 showing the captured versus reproduced compiler flags and include paths for two representative workloads, together with a statement that numerical equivalence was confirmed by running the reproducers against the original application outputs. revision: partial
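For context, Clang's VFS overlay is a standard feature: a YAML/JSON file, passed via `-ivfsoverlay`, that remaps original source paths to other files on disk. A hedged sketch of how a reproducer might synthesize one (the paths and the `make_overlay` helper are hypothetical; the overlay schema and the flag are Clang's own):

```python
import json

def make_overlay(mapping):
    """Build a Clang VFS overlay mapping original source paths to
    captured copies, so the original include paths resolve unchanged.
    mapping: original path -> path of the captured copy on disk."""
    roots = []
    for orig, captured in mapping.items():
        directory, _, name = orig.rpartition("/")
        roots.append({
            "name": directory or "/",
            "type": "directory",
            "contents": [
                {"name": name, "type": "file", "external-contents": captured}
            ],
        })
    # JSON is valid YAML, so Clang accepts the file as written.
    return {"version": 0, "roots": roots}

overlay = make_overlay({
    "/src/app/kernels/gemm.hip": "/capture/files/gemm.hip",  # hypothetical paths
})
with open("overlay.yaml", "w") as f:
    json.dump(overlay, f, indent=2)
# Recompile with the original paths intact:
#   clang++ -ivfsoverlay overlay.yaml ... /src/app/kernels/gemm.hip
```

Because the compiler sees the original paths, the original build system and include search order need no modification, which is the property the response above relies on.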
Circularity Check
No circularity: applied systems paper with external empirical validation
Full rationale
This is a systems-engineering paper describing an automated kernel extraction tool that intercepts HSA dispatches and performs virtual-address-faithful address-space closure. It contains no mathematical derivations, equations, fitted parameters, or predictions that could reduce to prior inputs by construction. All claims are supported by implementation details and evaluation on six independent real-world workloads (HIP and Triton) spanning multiple AMD GPU architectures, with no load-bearing self-citations or self-referential definitions. The work is validated against external benchmarks rather than its own constructions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption HSA runtime dispatches can be intercepted to capture all kernel launch parameters and device memory states in a virtual-address-faithful manner without DWARF metadata or explicit pointer chasing.
- domain assumption A Clang VFS overlay can supply the necessary source files for recompilation of HIP kernels without requiring changes to the original application build system.
Reference graph
Works this paper leans on
- [1] AMD. 2026. ROCm Compute Profiler Documentation. https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/
- [2] AMD. 2026. ROCProfiler-SDK Documentation. https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/
- [3] AMD. 2026. ROCTracer Documentation. https://rocm.docs.amd.com/projects/roctracer/en/latest/
- [4] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG]. https://arxiv.org/abs/2205.14135
- [5] Georgi Gerganov and contributors. 2023. llama.cpp: LLM inference in C/C++. https://github.com/ggml-org/llama.cpp. Accessed 2026-04-17.
- [6] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- [7] NVIDIA. 2026. CUDA-GDB: The NVIDIA CUDA Debugger. https://developer.nvidia.com/cuda-gdb
- [8] NVIDIA. 2026. CUPTI: Checkpoint API. https://docs.nvidia.com/cupti/api/group__CUPTI__CHECKPOINT__API.html
- [9] NVIDIA. 2026. NVIDIA Nsight Compute: Kernel Profiling Guide. https://docs.nvidia.com/nsight-compute/index.html
- [11] Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. KernelBench: Can LLMs Write Efficient GPU Kernels? arXiv:2502.10517 [cs.LG]. https://arxiv.org/abs/2502.10517
- [12] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library.
- [13] Sivasankaran Rajamanickam, Seher Acer, Luc Berger-Vergiat, Vinh Dang, Nathan Ellingwood, Evan Harvey, Brian Kelley, Christian R. Trott, Jeremiah Wilke, and Ichitaro Yamazaki. 2021. Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels. arXiv:2103.11991 [cs.MS]. https://arxiv.org/abs/2103.11991
- [15] The Clang Team. 2026. JSON Compilation Database Format Specification. https://clang.llvm.org/docs/JSONCompilationDatabase.html
- [16] A. P. Thompson, H. M. Aktulga, R. Berger, D. S. Bolintineanu, W. M. Brown, P. S. Crozier, P. J. in 't Veld, A. Kohlmeyer, S. G. Moore, T. D. Nguyen, R. Shan, M. J. Stevens, J. Tranchida, C. Trott, and S. J. Plimpton. 2022. LAMMPS - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Comp. Phys. Comm.
- [17] Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2019). ACM, New York, NY, USA, 10–19. doi:10.1145/3315508.3329973
- [18] Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, and Emad Barsoum. Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks. arXiv:2507.23194 [cs.CL]. https://arxiv.org/abs/2507.23194