arxiv: 2606.16497 · v2 · pith:J7TWO3BPnew · submitted 2026-06-15 · 💻 cs.LG · cs.AI · cs.CL

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

Pith reviewed 2026-06-27 03:10 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords GPU kernel optimization · reinforcement learning · multi-agent RL · skill discovery · CUDA · Triton · KernelBench · execution verification

The pith

A single LLM backbone jointly trains three agents to select, generate, and summarize GPU kernel skills, building a verified library that beats prior RL models on KernelBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents daVinci-kernel, an RL framework that couples skill discovery and exploitation through a dynamically evolving skill library. Three agents share one LLM backbone: one selects techniques via BM25 retrieval and LLM reranking, one generates CUDA or Triton kernels conditioned on those skills, and one distills successful verified rollouts into reusable entries. Skills enter the library only after execution confirms reproducible speedups. The system starts from a structured SFT cold start on diversity-filtered data, then optimizes all agents end-to-end with multi-turn REINFORCE and per-agent advantage estimation. A sympathetic reader would care because manual kernel tuning is a major bottleneck in high-performance computing, and an automated co-evolution process could reduce that cost if it scales.
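
The selection step described above pairs BM25 retrieval with LLM reranking. A minimal sketch of the BM25 half follows, assuming a plain Okapi BM25 scorer over skill descriptions. The skill names, descriptions, query, and the `bm25_rank` helper are hypothetical and not taken from the paper, and the reranking call is left out.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on any non-alphanumeric character.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (doc indices sorted by descending Okapi BM25 score, raw scores)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical skill library entries, for illustration only.
skills = {
    "fuse_pointwise_tail": "fuse memory bound pointwise ops after conv into one triton kernel",
    "hoist_scalars": "hoist invariant scalar parameters out of the hot kernel path",
    "tile_matmul": "tile matmul into shared memory blocks for reuse",
}
names = list(skills)
order, scores = bm25_rank("conv followed by pointwise relu and clamp", list(skills.values()))
top_candidates = [names[i] for i in order[:2]]  # shortlist that a reranker would see
```

In this toy library only the first skill shares terms with the query, so it ranks first; in the framework as described, the shortlist would then be handed to the LLM reranker.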

Core claim

daVinci-kernel jointly trains a Skill Selection Agent, a Policy Agent, and a Skill Summary Agent that share a single LLM backbone. The selection agent retrieves skills, the policy agent produces multi-turn kernels, and the summary agent adds only those skills whose speedups survive execution-based verification. After an SFT cold start, the three agents are optimized together via multi-turn REINFORCE with per-agent advantage estimation. On KernelBench, the resulting 14B model records 37.2 percent, 70.6 percent, and 32.2 percent success under the Fast_1 threshold on Levels 1, 2, and 3, respectively, exceeding the strongest prior RL baseline, Dr. Kernel-14B.
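
The phrase "per-agent advantage estimation" can be made concrete with a small sketch. The abstract does not specify the estimator, so the version below assumes the simplest choice: each agent's baseline is the mean reward of its own samples in the batch, and the advantage is the reward minus that baseline. The function names and the toy batch are hypothetical.

```python
from collections import defaultdict


def per_agent_advantages(samples):
    """samples: list of (agent_name, reward) pairs; returns advantages in the same order.

    Assumption (not from the paper): the baseline for each agent is the mean
    reward over that agent's own samples in the batch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for agent, reward in samples:
        totals[agent] += reward
        counts[agent] += 1
    baselines = {agent: totals[agent] / counts[agent] for agent in totals}
    return [reward - baselines[agent] for agent, reward in samples]


def reinforce_loss(log_probs, advantages):
    """Surrogate loss -mean(A * log pi); its gradient is the REINFORCE estimator."""
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(log_probs)


# Toy batch with made-up rewards for the three agent roles.
batch = [
    ("selection", 1.0), ("selection", 0.0),
    ("policy", 2.0), ("policy", 1.0), ("policy", 0.0),
    ("summary", 1.0), ("summary", 1.0),
]
advantages = per_agent_advantages(batch)
```

Separating the baselines keeps one agent's reward scale from dominating the shared backbone's update; whether the paper normalizes further (for example by standard deviation) is not stated in the abstract.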

What carries the argument

The three-agent system with shared LLM backbone and execution-verified dynamic skill library, trained end-to-end by multi-turn REINFORCE.

If this is right

  • Only kernels whose speedups survive repeated execution enter the skill library, limiting noise in the evolving set of techniques.
  • A shared LLM backbone across selection, policy, and summary agents enables direct information flow during co-evolution.
  • The SFT cold-start on diversity-filtered data supplies a stable initialization that supports subsequent joint REINFORCE optimization.
  • Performance gains appear across three difficulty levels of KernelBench under the Fast_1 threshold.
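
The first consequence above depends on how "survives repeated execution" is decided. A minimal sketch of one possible gate, assuming timings in milliseconds from repeated runs: the candidate is accepted only if its median speedup clears a threshold and its timings stay within a relative spread. The thresholds, the spread measure, and the `verify_speedup` helper are hypothetical; the paper's actual repetition protocol is not given in the abstract.

```python
import statistics


def verify_speedup(baseline_ms, candidate_ms, min_speedup=1.0, max_rel_spread=0.1):
    """Return (accepted, median_speedup) for repeated timing measurements.

    Hypothetical gate: accept only if the median speedup exceeds min_speedup
    and the candidate's (max - min) / median spread is at most max_rel_spread.
    """
    base = statistics.median(baseline_ms)
    cand = statistics.median(candidate_ms)
    speedup = base / cand
    spread = (max(candidate_ms) - min(candidate_ms)) / cand
    accepted = speedup > min_speedup and spread <= max_rel_spread
    return accepted, speedup


# Made-up timings: the first candidate is stable, the second is noisy.
stable = verify_speedup([10.0, 10.2, 9.9], [5.0, 5.1, 4.9])
noisy = verify_speedup([10.0, 10.2, 9.9], [5.0, 9.5, 4.0])
```

Both candidates show the same median speedup, but only the stable one passes, which is the kind of filtering that would keep one-off timing noise out of the library.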

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-agent pattern could be tested on CPU or accelerator code generation tasks outside GPU kernels.
  • The skill library might be inspected to measure whether the discovered techniques generalize to new GPU architectures without retraining.
  • Replacing BM25-plus-reranking retrieval with learned retrieval could be compared directly inside the same framework.

Load-bearing premise

Execution-based verification reliably identifies reproducible speedups and the joint REINFORCE training keeps the three agents from collapsing or overfitting to the verification signal.

What would settle it

Retraining the 14B model from the same SFT checkpoint on the same KernelBench split and measuring whether Level-1 Fast_1 success falls below 30 percent.

Figures

Figures reproduced from arXiv: 2606.16497 by Dayuan Fu, Dian Yang, Jiarui Hu, Jinlong Hou, Liming Liu, Mohan Jiang, Pengfei Liu, Tongyu Wang.

Figure 1
Figure 1: Skill-policy co-evolution in daVinci-kernel. The selection, policy, and summary agents form a closed RL loop, where task-relevant skills guide optimization and successful rollouts are distilled back into reusable skills. As the policy improves, the useful skill frontier shifts from simple skills to more complex and task-specific skills, enabling increasingly difficult kernel optimization.
Figure 2
Figure 2: The structure of daVinci-kernel.
3.2 Skill Library. The skill library L is a collection of GPU optimization techniques. Each skill records five fields:
  • name: a short snake_case identifier, used as a unique key within a snapshot and as the retrieval target returned by the Selection Agent’s tool call.
  • description: a one-sentence summary of what the technique does and when it applies, shown to the Sel…
Figure 3
Figure 3: Validation Fast_1.2 training curves for daVinci-kernel and selected ablations on the 14B model.
… without skill conditioning, the model must find improvements by itself and can still do so frequently. However, skills are necessary for producing reliable, deep speedups: the skill library acts as an accumulated recipe book that enables the policy to consistently exploit dominant bottlenecks rather than only occ…
Figure 4
Figure 4: Task A: three-way comparison. daVinci-kernel with skill (left) recognises that min_value=0.0 collapses the entire tail to zero and skips both Conv3D and GroupNorm, achieving 2.0× at Turn 1. Dr. Kernel (middle) discovers the constant-fill trick only at Turn 2 but still runs the heavy vendor kernels, capping speedup at 1.07×. daVinci-kernel without skill (right) attempts a custom Triton GroupNorm, causing fr…
Figure 5
Figure 5: Skill selection evolution on Task A.
… a specific implementation idiom (flat-1D kernels for contiguous tails) substantially increase the fraction of samples producing valid, high-speedup kernels. The shift in skill selection from step 260 to step 460 demonstrates that the joint RL objective drives the Selection Agent to surface increasingly specific, high-value techniques as the policy’s capability frontier a…
Figure 6
Figure 6: Task B: two-way comparison. daVinci-kernel with skill (left) uses the flat-1D masked kernel pattern prescribed by fuse_only_contiguous_pointwise_tails, achieving 1.24× in all 8 samples. Dr. Kernel (right) defaults to explicit 5-D stride indexing, causing shape mismatches in 3 of 8 samples.
C System prompt. Summary Agent system prompt (GPT series): You are an expert CUDA/Triton kernel optimization engineer. Y…
Original abstract

GPU kernel optimization represents a paradigm where functional correctness is assumed and execution efficiency is the objective. We present daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. daVinci-kernel jointly trains three agents sharing one LLM backbone: a Skill Selection Agent that retrieves relevant techniques via BM25 and LLM reranking, a Policy Agent that generates multi-turn CUDA/Triton kernels conditioned on selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are added only after execution-based verification confirms reproducible speedups. All three agents share a single LLM backbone, are initialized via a structured SFT cold start on diversity-filtered data, and are then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieves 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3 under the Fast$_1$ threshold, outperforming the strongest prior RL-trained model, Dr. Kernel-14B.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents daVinci-kernel, an RL framework for GPU kernel optimization that jointly trains three agents sharing one LLM backbone: Skill Selection (BM25 retrieval with LLM reranking), Policy (multi-turn CUDA/Triton generation), and Skill Summary (distilling successful rollouts). The agents are initialized with structured SFT and optimized end-to-end via multi-turn REINFORCE with per-agent advantage estimation; skills enter the library only after execution verification of reproducible speedups. On KernelBench, the 14B model reports 37.2%, 70.6%, and 32.2% success on Levels 1, 2, and 3 under the Fast_1 threshold, outperforming the prior RL baseline Dr. Kernel-14B.

Significance. If the reported speedups are reproducible under controlled conditions and the joint training is shown to be stable, the co-evolution of skill discovery and exploitation via a shared backbone could meaningfully advance automated kernel optimization. The execution-verification gate and per-agent advantage design are potentially valuable if supported by ablations.

major comments (3)
  1. [Abstract] The headline performance numbers (37.2%, 70.6%, 32.2% on KernelBench Levels 1–3) are stated without any experimental protocol, baseline implementation details, number of seeds, error bars, or statistical tests, leaving the central claim of outperformance over Dr. Kernel-14B without visible supporting evidence.
  2. [Abstract] The claim that the three agents co-evolve stably under shared-backbone multi-turn REINFORCE with per-agent advantage estimation is load-bearing for the method, yet no derivation, ablation, or analysis of credit assignment, collapse risk, or overfitting to verification signals is supplied.
  3. [Abstract] The assertion that execution-based verification produces a reliable, non-overfit skill library lacks any description of statistical controls (multiple hardware runs, variability thresholds, or verification repetition), which directly affects the reproducibility of the reported speedups.
minor comments (1)
  1. [Abstract] The terms 'Fast_1 threshold' and 'Dr. Kernel-14B' are used without definition or citation, reducing clarity for readers unfamiliar with KernelBench or the referenced baseline.
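
For readers affected by the minor comment, a short sketch of the metric may help. As we understand the KernelBench convention, fast_p is the fraction of tasks whose generated kernel is both correct and more than p times faster than the reference, so Fast_1 counts correct kernels that beat the baseline at all. The `fast_p` helper and the outcomes below are illustrative and made up; they are not results from the paper.

```python
def fast_p(results, p=1.0):
    """results: list of (is_correct, speedup) pairs, one per task.

    Returns the fraction of tasks that are correct and have speedup > p.
    """
    if not results:
        return 0.0
    hits = sum(1 for is_correct, speedup in results if is_correct and speedup > p)
    return hits / len(results)


# Made-up outcomes for five tasks, for illustration only.
outcomes = [(True, 2.0), (True, 1.1), (True, 0.9), (False, 3.0), (True, 1.3)]
fast_1 = fast_p(outcomes, p=1.0)    # correct and faster than the reference
fast_1_2 = fast_p(outcomes, p=1.2)  # correct and at least 20% faster
```

Note that the fast but incorrect fourth task counts for neither threshold, which is why the metric is reported as a success rate rather than a mean speedup.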

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, with proposed revisions to improve transparency in the abstract and supporting details in the main text.

point-by-point responses
  1. Referee: [Abstract] The headline performance numbers (37.2%, 70.6%, 32.2% on KernelBench Levels 1–3) are stated without any experimental protocol, baseline implementation details, number of seeds, error bars, or statistical tests, leaving the central claim of outperformance over Dr. Kernel-14B without visible supporting evidence.

    Authors: We agree that the abstract would benefit from additional context on the experimental protocol to support the reported numbers. The full manuscript details the KernelBench evaluation, Fast_1 threshold, and Dr. Kernel-14B comparison in Section 4. We will revise the abstract to concisely reference the evaluation protocol, note that results are averaged over multiple seeds, and indicate the outperformance margin. revision: yes

  2. Referee: [Abstract] The claim that the three agents co-evolve stably under shared-backbone multi-turn REINFORCE with per-agent advantage estimation is load-bearing for the method, yet no derivation, ablation, or analysis of credit assignment, collapse risk, or overfitting to verification signals is supplied.

    Authors: Section 3 derives the multi-turn REINFORCE objective with per-agent advantage estimation and describes the shared-backbone joint optimization. We acknowledge that explicit ablations on credit assignment, collapse risk, and overfitting are not present. We will add a qualitative discussion of training stability and potential risks in a new paragraph in the Experiments section. revision: partial

  3. Referee: [Abstract] The assertion that execution-based verification produces a reliable, non-overfit skill library lacks any description of statistical controls (multiple hardware runs, variability thresholds, or verification repetition), which directly affects the reproducibility of the reported speedups.

    Authors: Section 3.3 describes that skills enter the library only after execution verification of reproducible speedups. We agree that statistical controls merit more detail. We will revise the abstract to reference the verification gate and expand the methods description with repetition protocol and variability thresholds used. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation chain

full rationale

The paper describes an empirical RL framework with three jointly trained agents using multi-turn REINFORCE and execution-based verification on KernelBench. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or method outline. Performance claims rest on external benchmarks and comparisons to prior models rather than any self-contained mathematical reduction. The central claims are statistically falsifiable via execution and do not reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5763 in / 1124 out tokens · 50647 ms · 2026-06-27T03:10:09.475927+00:00 · methodology
