
arxiv: 2606.02963 · v1 · pith:DZY2U7LEnew · submitted 2026-06-01 · 💻 cs.LG

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

Pith reviewed 2026-06-28 15:00 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM kernel generation · cross-platform kernels · AI accelerators · agentic refinement · operator fusion · inference optimization · Triton kernels · mixed-precision execution

The pith

KForge uses two LLM agents in an iterative loop to generate kernels that outperform the reported baselines on both NVIDIA and Intel accelerators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KForge to address the cost of hand-writing high-performance kernels for many different AI accelerators as inference pipelines become more heterogeneous. It relies on a generation agent that produces and fixes code using compilation and correctness signals, paired with a performance-analysis agent that reads profiling output and recommends changes. The loop runs functional passes until the kernel is correct, then optimization passes to close the gap with tuned baselines. On NVIDIA B200 the generated kernels yield a 2.12% end-to-end throughput gain over TensorRT-LLM on the gpt-oss-20b inference benchmark; on Intel Arc B580 they deliver a 5.13× geometric-mean speedup on GEMM-heavy KernelBench Level 2 workloads, mainly through operator fusion and mixed precision.

Core claim

KForge demonstrates that an iterative refinement process alternating between a generation agent and a performance-analysis agent can produce kernels that are both correct and competitive in performance across NVIDIA and Intel accelerators, with the agents using compilation feedback, correctness checks, and profiling data to drive successive improvements.

What carries the argument

The dual-agent iterative refinement loop that alternates functional passes for correctness with optimization passes guided by performance analysis.
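The control flow of such a loop can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `refine`, `generate`, `check`, `profile`, and `analyze` are hypothetical, the LLM agents are replaced by toy functions, and an incorrect optimization candidate is simply discarded rather than sent back through a functional pass.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Result:
    kernel: Optional[str]     # best correct kernel found, or None
    latency: float            # latency of that kernel (inf if none)
    functional_iters: int     # attempts spent reaching correctness
    optimization_iters: int   # optimization passes attempted


def refine(
    generate: Callable[[Optional[str], str], str],  # (previous kernel, feedback) -> kernel
    check: Callable[[str], Tuple[bool, str]],       # kernel -> (correct?, feedback)
    profile: Callable[[str], float],                # kernel -> latency
    analyze: Callable[[str, float], str],           # (kernel, latency) -> recommendation
    max_functional: int = 5,
    max_optimization: int = 3,
) -> Result:
    # Functional pass: regenerate until the candidate compiles and is correct.
    kernel, feedback, ok, f_iters = None, "initial request", False, 0
    while f_iters < max_functional and not ok:
        kernel = generate(kernel, feedback)
        ok, feedback = check(kernel)
        f_iters += 1
    if not ok:
        return Result(None, float("inf"), f_iters, 0)

    # Optimization pass: keep a candidate only if it is correct and faster.
    best, best_latency, o_iters = kernel, profile(kernel), 0
    for _ in range(max_optimization):
        o_iters += 1
        advice = analyze(best, best_latency)
        candidate = generate(best, advice)
        still_ok, _ = check(candidate)
        if not still_ok:
            continue
        latency = profile(candidate)
        if latency < best_latency:
            best, best_latency = candidate, latency
    return Result(best, best_latency, f_iters, o_iters)


# Toy stand-ins (hypothetical): kernels are strings, "fusing" halves the cost.
def toy_generate(prev: Optional[str], feedback: str) -> str:
    if prev is None:
        return "v0-buggy"
    if feedback == "fix bug":
        return "v1"
    if feedback == "fuse ops":
        return prev + "+fused"
    return prev


def toy_check(kernel: str) -> Tuple[bool, str]:
    buggy = "buggy" in kernel
    return (not buggy, "fix bug" if buggy else "ok")


def toy_profile(kernel: str) -> float:
    return 10.0 / (1 + kernel.count("+fused"))


def toy_analyze(kernel: str, latency: float) -> str:
    return "fuse ops"


result = refine(toy_generate, toy_check, toy_profile, toy_analyze)
```

The iteration counters are returned alongside the kernel so that a run of the loop leaves behind the attempt statistics a reader would need to judge its reliability.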

If this is right

  • Kernels for new hardware backends can be produced without writing low-level code by hand.
  • Operator fusion and mixed-precision choices can be discovered automatically through the agents' recommendations.
  • End-to-end inference pipelines can achieve higher throughput when each stage runs on its best-suited accelerator.
  • The same refinement loop applies to both well-supported and less-supported programming models.
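To make the fusion point concrete, the sketch below contrasts an unfused GEMM followed by two elementwise passes with a fused version that applies the same operations while each output element is still in the accumulator. It is plain Python for readability, not a Triton kernel, and the choice of bias add plus ReLU as the tail operations is an assumption for illustration; the source does not define which tail ops the workloads contain. Mixed precision is not shown.

```python
from typing import List

Matrix = List[List[float]]


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Reference matrix multiply: (n x k) @ (k x m)."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)] for row in a]


def unfused(a: Matrix, b: Matrix, bias: List[float]) -> Matrix:
    """Three separate passes, each materializing a full intermediate matrix."""
    c = gemm(a, b)                                                 # pass 1: GEMM
    c = [[v + bias[j] for j, v in enumerate(row)] for row in c]    # pass 2: bias add
    return [[max(v, 0.0) for v in row] for row in c]               # pass 3: ReLU


def fused(a: Matrix, b: Matrix, bias: List[float]) -> Matrix:
    """One pass: bias add and ReLU applied as an epilogue on the accumulator."""
    k, m = len(b), len(b[0])
    out = []
    for row in a:
        out_row = []
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += row[p] * b[p][j]
            out_row.append(max(acc + bias[j], 0.0))  # epilogue, no intermediate matrix
        out.append(out_row)
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]
fused_out = fused(a, b, bias)
unfused_out = unfused(a, b, bias)
```

Both functions compute the same result; the fused form avoids writing and re-reading the intermediate matrices, which on real hardware is where the memory-traffic savings come from.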

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce the amount of specialized expertise needed when adding support for emerging accelerators.
  • Similar agent loops might transfer to generating optimized code in other low-level domains beyond kernel synthesis.
  • Success on Intel hardware suggests the method could help close performance gaps on backends that lack mature hand-tuned libraries.
  • Extending the agents to additional programming models would test how far the cross-platform claim generalizes.

Load-bearing premise

The two LLM agents can reliably interpret errors and profiling data to reach both correctness and competitive performance on new backends without human oversight or detailed failure analysis.

What would settle it

On a fresh accelerator or workload set, the generated kernels remain slower than the faster of the standard baselines or fail to reach correctness after repeated iterations.

Figures

Figures reproduced from arXiv: 2606.02963 by Ankita Nayak, Burak Bartan, Natalie Serrino, Taras Sereda, Tom St.John, Zain Asgar.

Figure 1: Iterative program synthesis and optimization loop using LLMs. The workflow consists of two main phases: (1) a functional pass that iteratively refines …

Figure 2: Given a PyTorch reference, KForge selects and lowers to an appropriate target programming model for each AI accelerator, supporting CUDA, CuTe, …
Original abstract

Production inference increasingly targets a heterogeneous mix of accelerators. Agentic pipelines interleave reasoning, tool calls, and multi-agent coordination, each with distinct compute and memory profiles. For optimal efficiency, each stage should run on the accelerator best suited to it. This creates a systems challenge: each pipeline now requires high-performance kernels across a growing set of hardware backends and programming models. Writing these kernels by hand is time-consuming, demands deep low-level expertise, and does not scale as kernel complexity grows. Recently, Large Language Models (LLMs) have been leveraged for automatic kernel generation, but challenges in low-level code generation and cross-backend generalization persist. We present KForge, a cross-platform framework built around an iterative refinement loop driven by two collaborating LLM-based agents: a generation agent that produces and progressively refines kernels using compilation and correctness feedback, and a performance-analysis agent that interprets profiling data, from programmatic APIs to GUI-based tools, and emits recommendations that steer the next round of synthesis. The loop alternates between functional passes, which drive a candidate to correctness, and optimization passes, which close the performance gap to hand-tuned baselines. We evaluate KForge on two backends with very different baseline reference availability. On NVIDIA B200, KForge achieves a 2.12% improvement in end-to-end throughput compared to TensorRT-LLM on the gpt-oss-20b inference speed benchmark. On Intel Arc B580, KForge generates Triton kernels achieving a 5.13× geometric mean speedup over the faster of PyTorch eager and torch.compile on 37 GEMM + tail-ops workloads from KernelBench Level 2, primarily via operator fusion and mixed-precision execution.
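The Intel figure is a geometric mean of per-workload speedups, each measured against whichever baseline is faster for that workload. A minimal sketch of that metric follows; the function name and the latency values are hypothetical, not data from the paper.

```python
import math
from typing import Sequence


def geomean_speedup(
    eager_ms: Sequence[float],
    compiled_ms: Sequence[float],
    kernel_ms: Sequence[float],
) -> float:
    """Geometric-mean speedup over the faster of two baselines, per workload."""
    if not kernel_ms or not (len(eager_ms) == len(compiled_ms) == len(kernel_ms)):
        raise ValueError("need equal-length, non-empty latency lists")
    log_sum = 0.0
    for eager, compiled, kernel in zip(eager_ms, compiled_ms, kernel_ms):
        if min(eager, compiled, kernel) <= 0:
            raise ValueError("latencies must be positive")
        baseline = min(eager, compiled)       # the stronger baseline for this workload
        log_sum += math.log(baseline / kernel)
    return math.exp(log_sum / len(kernel_ms))  # mean in log space, then exponentiate


# Hypothetical latencies in milliseconds for two workloads.
example = geomean_speedup(
    eager_ms=[8.0, 9.0],
    compiled_ms=[4.0, 12.0],
    kernel_ms=[1.0, 1.0],
)
```

In this toy case the per-workload speedups are 4× and 9×, so the geometric mean is 6×, lower than the 6.5× arithmetic mean. Averaging in log space keeps a few very large speedups from dominating the summary.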

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces KForge, a cross-platform framework that uses a dual-agent LLM pipeline (generation agent + performance-analysis agent) to iteratively synthesize and optimize kernels via alternating functional-correctness and performance-optimization passes. It reports a 2.12% end-to-end throughput gain versus TensorRT-LLM on NVIDIA B200 for gpt-oss-20b inference and a 5.13× geometric-mean speedup versus PyTorch baselines on 37 KernelBench Level-2 GEMM+tail workloads on Intel Arc B580, attributing gains to operator fusion and mixed precision.

Significance. If the empirical claims can be substantiated with full experimental controls and reliability metrics for the agent loop, the work would demonstrate a practical route to automated, cross-backend kernel generation that scales beyond single-vendor hand-tuned libraries. The dual-agent separation of concerns and use of both programmatic and GUI profiling data are distinctive.

major comments (2)
  1. [Abstract / §4] Abstract and §4 (Evaluation): the central performance claims (2.12% on B200, 5.13× geomean on Arc B580) rest on the dual-agent loop reliably driving kernels from incorrect to correct to competitive, yet the manuscript supplies no iteration counts, success/failure rates, per-workload attempt statistics, or characterization of stalls or external intervention. Without these data it is impossible to attribute the reported deltas to the method rather than selective reporting of successful runs.
  2. [Abstract / §4] Abstract and §4: no error bars, run counts, workload-selection criteria, or statistical significance tests are reported for either benchmark, making it impossible to determine whether the observed speedups are robust or sensitive to particular seeds, prompts, or workload subsets.
minor comments (2)
  1. [§3] The description of how the performance-analysis agent ingests GUI-based profiling output is underspecified; a concrete example of a recommendation trace would clarify the interface.
  2. [§4] KernelBench Level-2 workload selection criteria and the exact definition of “tail-ops” should be stated explicitly so that the 37-workload set can be reproduced.
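The loop statistics requested in major comment 1 could be tabulated from per-workload run logs along the following lines. This is a sketch only: the `RunLog` fields, the `summarize` function, and the sample records are hypothetical, since the source does not describe the authors' log format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RunLog:
    workload: str
    functional_iters: int      # attempts in the functional pass
    optimization_iters: int    # attempts in the optimization pass
    correct: bool              # did the final kernel pass correctness checks?
    manual_intervention: bool  # did a human step in at any point?


def summarize(logs: List[RunLog]) -> Dict[str, float]:
    """Aggregate success rates and iteration counts over all attempted workloads."""
    if not logs:
        raise ValueError("no logs to summarize")
    n = len(logs)
    correct = [r for r in logs if r.correct]
    unassisted = [r for r in correct if not r.manual_intervention]
    return {
        "success_rate": len(correct) / n,
        "unassisted_rate": len(unassisted) / n,
        "mean_functional_iters": mean(r.functional_iters for r in logs),
        "mean_optimization_iters": (
            mean(r.optimization_iters for r in correct) if correct else 0.0
        ),
    }


# Hypothetical records, including a failed workload so it is not silently dropped.
logs = [
    RunLog("w1", 2, 3, True, False),
    RunLog("w2", 5, 0, False, False),
    RunLog("w3", 1, 2, True, True),
    RunLog("w4", 4, 1, True, False),
]
summary = summarize(logs)
```

Dividing by all attempted workloads, failures included, is the design choice that addresses the selective-reporting concern.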

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our experimental methodology. We address each major comment below and commit to revisions that provide the requested statistics and controls.

  1. Referee: [Abstract / §4] Abstract and §4 (Evaluation): the central performance claims (2.12% on B200, 5.13× geomean on Arc B580) rest on the dual-agent loop reliably driving kernels from incorrect to correct to competitive, yet the manuscript supplies no iteration counts, success/failure rates, per-workload attempt statistics, or characterization of stalls or external intervention. Without these data it is impossible to attribute the reported deltas to the method rather than selective reporting of successful runs.

    Authors: We agree that the current manuscript lacks sufficient detail on the agent loop dynamics. In the revised version we will add a new subsection to §4 that reports, for both benchmarks: (i) mean and per-workload iteration counts for functional-correctness and performance-optimization passes, (ii) overall success/failure rates across all workloads attempted, (iii) characterization of any stalls or external interventions, and (iv) the fraction of kernels that reached correctness versus those that required additional manual guidance. These data will be drawn from our existing experimental logs and will allow readers to assess the reliability of the reported speedups. revision: yes

  2. Referee: [Abstract / §4] Abstract and §4: no error bars, run counts, workload-selection criteria, or statistical significance tests are reported for either benchmark, making it impossible to determine whether the observed speedups are robust or sensitive to particular seeds, prompts, or workload subsets.

    Authors: We acknowledge the omission of statistical rigor. The revised manuscript will include: (1) explicit workload-selection criteria and the full list of the 37 KernelBench Level-2 workloads, (2) results aggregated over at least five independent runs per workload using different random seeds for the LLM agents, (3) error bars (standard deviation) on all reported speedups and the 2.12% throughput figure, and (4) paired statistical significance tests (e.g., Wilcoxon signed-rank) against the respective baselines. These additions will be placed in §4 and the abstract will be updated to reference the new robustness analysis. revision: yes
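The promised error bars and paired test can be illustrated with the standard library alone. The sketch below reports mean and sample standard deviation of per-run speedups and computes an exact two-sided Wilcoxon signed-rank p-value by enumerating sign assignments, which is only feasible for small samples. All function names and latency values are hypothetical.

```python
from itertools import product
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def speedup_summary(baseline: Sequence[float], candidate: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-run speedups."""
    speedups = [b / c for b, c in zip(baseline, candidate)]
    return mean(speedups), stdev(speedups)


def average_ranks(values: List[float]) -> List[float]:
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_exact(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Exact two-sided Wilcoxon signed-rank p-value (zero differences dropped)."""
    diffs = [b - c for b, c in zip(baseline, candidate) if b != c]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** len(diffs)


# Hypothetical paired latencies (ms) from five runs.
baseline_ms = [10.0, 12.0, 9.0, 11.0, 13.0]
kernel_ms = [8.0, 9.0, 8.5, 7.0, 8.0]
avg_speedup, spread = speedup_summary(baseline_ms, kernel_ms)
p_value = wilcoxon_exact(baseline_ms, kernel_ms)
```

With only five pairs the smallest attainable two-sided p-value is 2/32 = 0.0625, so a test over five runs of one workload cannot reach the conventional 0.05 threshold; pairing across the 37 workloads gives the test far more room. For samples that large, enumeration is impractical and a library routine such as `scipy.stats.wilcoxon` would be the usual choice.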

Circularity Check

0 steps flagged

No circularity; empirical results on external benchmarks

The paper describes an agentic kernel-generation system and reports measured speedups (2.12% on NVIDIA B200 vs TensorRT-LLM; 5.13× geomean on Intel Arc B580 vs PyTorch baselines) as direct experimental outcomes on the gpt-oss-20b inference benchmark and KernelBench Level 2 workloads, respectively. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. The central claims rest on external hardware benchmarks rather than any derivation that reduces to the method's own inputs by construction. This is the normal case for a systems paper whose value is in the measured deltas, not in a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, axioms, or invented entities.


