Recognition: 1 Lean theorem link
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
Pith reviewed 2026-05-14 22:14 UTC · model grok-4.3
The pith
Kernel-Smith evolves GPU kernels by maintaining an archive of top programs and using execution feedback to guide revisions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kernel-Smith maintains a population of executable kernel candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. Long-horizon trajectories are converted into step-centric supervision and reinforcement learning signals by retaining only correctness-preserving, high-gain revisions. Under this unified evolutionary protocol the 235B RL variant attains the highest average speedup on KernelBench with the NVIDIA Triton backend and surpasses frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus; the same recipe on the MetaX MACA backend yields a 30B model (Kernel-Smith-MACA-30B) that surpasses much larger open models such as DeepSeek-V3.2-think and Qwen3-235B-2507-think.
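The trajectory-to-signal conversion can be pictured as a small filter over revision steps. This is a hedged sketch, not the authors' code: the `Step` fields, the per-parent speedup table, and the `min_gain` threshold are all illustrative assumptions.

```python
# Illustrative sketch: turn a long-horizon evolution trajectory into
# step-centric training pairs by keeping only correctness-preserving,
# high-gain revisions. All names here are assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class Step:
    parent_src: str   # kernel source before the revision
    child_src: str    # kernel source after the revision
    compiled: bool    # did the child compile?
    correct: bool     # does the child match the reference within tolerance?
    speedup: float    # child speedup over the reference implementation

def extract_training_pairs(trajectory, parent_speedups, min_gain=1.05):
    """Keep (parent, child) pairs where the child compiles, stays correct,
    and beats its parent's speedup by at least a factor of `min_gain`."""
    pairs = []
    for step in trajectory:
        base = parent_speedups.get(step.parent_src, 1.0)
        if step.compiled and step.correct and step.speedup >= min_gain * base:
            pairs.append((step.parent_src, step.child_src))
    return pairs
```

Under this filter, a revision that merely preserves correctness without a clear gain, or a fast but incorrect one, contributes no training signal.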
What carries the argument
The evolutionary agent that preserves an archive of top and diverse executable programs and feeds back structured execution signals on compilation, correctness, and speedup, paired with a post-training recipe that extracts local-improvement signals from full trajectories.
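The loop this machinery implies can be sketched in a few lines. The archive policy, parent-selection rule, and function names below are assumptions for illustration, not the paper's implementation.

```python
# Minimal archive-based evolutionary loop in the spirit described above.
# `propose_revision` stands in for the LLM; `evaluate` stands in for the
# backend-specific evaluation service. Both are illustrative assumptions.
import random

def evolve(seed_program, propose_revision, evaluate, generations=10,
           top_k=4, diverse_k=4):
    """Keep an archive of (program, result) pairs; each generation, draw a
    parent from the top performers plus random diverse members, revise it
    with structured feedback in context, and archive correct children."""
    archive = [(seed_program, evaluate(seed_program))]
    for _ in range(generations):
        ranked = sorted(archive, key=lambda e: e[1]["speedup"], reverse=True)
        pool = ranked[:top_k] + random.sample(archive, min(diverse_k, len(archive)))
        parent, feedback = random.choice(pool)
        child = propose_revision(parent, feedback)
        result = evaluate(child)
        # Only compiling, numerically correct candidates enter the archive.
        if result["compiled"] and result["correct"]:
            archive.append((child, result))
    return max(archive, key=lambda e: e[1]["speedup"])
```

In the paper's setting, `propose_revision` would be an LLM call conditioned on the compilation/correctness/speedup feedback, and `evaluate` the Triton or MACA evaluation service.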
If this is right
- The trained model functions as a reliable local improver inside any evolutionary loop rather than only as a one-shot generator.
- The identical workflow transfers across hardware backends with only backend-specific evaluation services required.
- Kernels produced by the method can be integrated directly into production inference systems.
- The same archive-plus-feedback mechanism scales to additional operator types beyond the evaluated set.
Where Pith is reading between the lines
- The step-centric supervision approach could be applied to other long-horizon code search tasks where small correct edits matter more than single-shot generation.
- Extending the feedback to include additional hardware counters such as memory bandwidth or power draw might further improve cross-device portability.
- If the archive curation proves robust, the method reduces dependence on proprietary model scale for domain-specific hardware optimization.
Load-bearing premise
An archive of high-performing diverse programs plus structured run-time feedback on compilation, correctness, and speedup is sufficient to produce reliable long-horizon gains without excessive compute or convergence to poor local solutions.
What would settle it
Running the evolutionary loop with the same archive and feedback but without the RL post-training step on the identical KernelBench suite and measuring whether average speedups fall below the reported SOTA numbers.
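Scoring that experiment needs an agreed aggregate. The sketch below assumes one common convention (failed kernels scored as zero speedup, arithmetic mean over tasks), which may differ from KernelBench's exact protocol.

```python
def mean_speedup(speedups, failed_value=0.0):
    """Average speedup over benchmark tasks. `None` marks a kernel that
    failed to compile or was incorrect; it is scored as `failed_value`
    (an assumed convention, not necessarily KernelBench's)."""
    return sum(failed_value if s is None else s for s in speedups) / len(speedups)

def ablation_gap(with_rl, without_rl):
    """Positive gap = the RL post-training step adds speedup beyond what
    the archive-plus-feedback search achieves on its own."""
    return mean_speedup(with_rl) - mean_speedup(without_rl)
```

A near-zero gap on the identical suite would indicate the evolutionary search, not the post-training, carries the reported SOTA numbers.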
Original abstract
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Kernel-Smith, a framework combining a stable evaluation-driven evolutionary agent (maintaining populations of executable kernel candidates with structured feedback on compilation, correctness, and speedup) and an evolution-oriented post-training recipe that converts long-horizon trajectories into step-centric supervision and RL signals. Under a unified protocol, the 235B RL variant achieves SOTA average speedup on KernelBench with the Nvidia Triton backend, outperforming proprietary models such as Gemini-3.0-pro and Claude-4.6-opus; a 30B variant is shown to surpass large open models on the MetaX MACA backend, with the workflow also yielding upstream contributions to production systems including SGLang and LMDeploy.
Significance. If the empirical claims hold under rigorous verification, the work is significant for demonstrating a practical, transferable recipe for LLM-driven GPU kernel optimization that bridges controlled evolutionary search with real-world deployment impact. The combination of archive-based diversity maintenance, backend-specific evaluators, and trajectory-to-RL conversion offers a concrete path for long-horizon improvement in high-performance computing, with potential to generalize across heterogeneous platforms.
major comments (2)
- [Abstract / Experiments] Abstract and Experiments (assumed §4–5): The manuscript reports SOTA average speedup ratios and outperformance of frontier models on KernelBench but supplies no experimental details on baseline definitions, exact evaluation protocol, statistical significance tests, variance across runs, or ablation studies isolating the contribution of the evolutionary archive versus the RL post-training. This absence is load-bearing for the central empirical claim and prevents assessment of whether the speedups are robust or sensitive to implementation choices.
- [§3] §3 (Evolutionary Agent): The description of the archive of top-performing and diverse programs plus structured execution feedback is presented as sufficient to drive reliable long-horizon improvement, yet no analysis is given of convergence behavior, diversity metrics, or failure modes when the population collapses; this assumption underpins the reliability of the entire framework and requires quantitative support.
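The diversity and collapse analysis requested above has a natural starting metric: average pairwise edit distance within the population. A minimal sketch (Levenshtein distance over program text; the helper names are illustrative):

```python
# Sketch of a population-diversity metric: mean pairwise Levenshtein
# distance. A value near 0 signals the population has collapsed onto
# (near-)identical programs. Names are illustrative assumptions.
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def mean_pairwise_distance(programs):
    """Average edit distance over all unordered pairs in the population."""
    pairs = list(combinations(programs, 2))
    if not pairs:
        return 0.0
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)
```

Tracking this quantity per generation would make collapse events visible alongside the speedup curves.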
minor comments (2)
- [Abstract / §2] Notation for model variants (e.g., Kernel-Smith-235B-RL vs. Kernel-Smith-MACA-30B) should be defined consistently in a single table or section to avoid reader confusion across backends.
- [§4] The claim of 'seamless adaptation across heterogeneous platforms' would benefit from a brief discussion of any backend-specific engineering effort required beyond the evaluation services.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The points raised highlight important areas where additional rigor will strengthen the manuscript. We address each major comment below and commit to incorporating the requested details and analyses in the revised version.
Point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments (assumed §4–5): The manuscript reports SOTA average speedup ratios and outperformance of frontier models on KernelBench but supplies no experimental details on baseline definitions, exact evaluation protocol, statistical significance tests, variance across runs, or ablation studies isolating the contribution of the evolutionary archive versus the RL post-training. This absence is load-bearing for the central empirical claim and prevents assessment of whether the speedups are robust or sensitive to implementation choices.
Authors: We agree that the current presentation lacks sufficient experimental detail to fully substantiate the claims. In the revised manuscript we will expand the Experiments section (and update the abstract for consistency) with: precise baseline definitions including prompting templates and inference settings used for Gemini-3.0-pro and Claude-4.6-opus; the complete KernelBench evaluation protocol (compilation, correctness verification, timing methodology, and hardware configuration); statistical significance tests (paired t-tests and bootstrap confidence intervals across runs); variance statistics (mean and standard deviation over five independent seeds); and ablation studies that separately disable the archive-based diversity maintenance and the RL post-training stage. We will also release the evaluation harness and raw logs to support reproducibility. revision: yes
-
Referee: [§3] §3 (Evolutionary Agent): The description of the archive of top-performing and diverse programs plus structured execution feedback is presented as sufficient to drive reliable long-horizon improvement, yet no analysis is given of convergence behavior, diversity metrics, or failure modes when the population collapses; this assumption underpins the reliability of the entire framework and requires quantitative support.
Authors: We acknowledge that quantitative characterization of the evolutionary dynamics is necessary. In the revised §3 we will add: convergence plots showing average and best-case speedup across generations; diversity metrics (average pairwise edit distance and functional similarity within the maintained population); and an explicit discussion of failure modes, including observed cases of population collapse together with the mitigation provided by the top-performing/diverse archive. These analyses will be supported by data collected from the KernelBench runs reported in the paper. revision: yes
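The significance testing promised in the responses above could follow a standard percentile bootstrap on per-task speedup differences. A hedged sketch, not the authors' harness:

```python
# Percentile-bootstrap confidence interval for the mean difference in
# per-task speedup between two variants (e.g. with vs. without RL
# post-training). Parameter names are illustrative assumptions.
import random
import statistics

def bootstrap_ci(sample_a, sample_b, n_boot=10000, alpha=0.05, rng=None):
    """CI for mean(a - b) over paired per-task measurements."""
    rng = rng or random.Random(0)
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    boots = []
    for _ in range(n_boot):
        resample = [rng.choice(diffs) for _ in diffs]
        boots.append(statistics.fmean(resample))
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

An interval excluding zero would support the claim that the RL post-training contributes real speedup rather than run-to-run noise.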
Circularity Check
No significant circularity detected
Full rationale
The paper presents an empirical framework for evolutionary kernel optimization using population-based search, structured execution feedback, and RL post-training on trajectories. All central claims (SOTA speedup on KernelBench, outperformance of proprietary models, cross-backend validation) are grounded in external benchmark results rather than any derivation, equation, or first-principles result. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; the protocol is self-contained against independent benchmarks and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Execution feedback on compilation, correctness, and speedup is sufficient to guide evolutionary improvement.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction — tagged unclear
The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup."
What do these tags mean?
- matches — The paper's claim is directly supported by a theorem in the formal canon.
- supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses — The paper appears to rely on the theorem as machinery.
- contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
- unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, and Silas Alberti. Kevin: Multi-turn RL for generating CUDA kernels. arXiv:2507.11948, 2025.
- [2] Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, and Ion Stoica. K-Search: LLM kernel generation via co-evolving intrinsic world model. arXiv:2602.19128, 2026.
- [3] Mark Chen et al. Evaluating large language models trained on code. arXiv:2107.03374, 2021.
- [4] Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv:2601.07372, 2026.
- [5] LMDeploy Contributors. LMDeploy: A toolkit for compressing, deploying, and serving LLMs. https://github.com/InternLM/lmdeploy, 2023.
- [6] XTuner Contributors. XTuner: A toolkit for efficiently fine-tuning large models. https://github.com/InternLM/xtuner, 2023.
- [7] Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, et al. CUDA Agent: Large-scale agentic RL for high-performance CUDA kernel generation. arXiv:2602.24286, 2026.
- [8] Yue Guan, Yichen Lin, Xu Zhao, Jianzhu Yao, Xinwei Qiang, Zhongkai Yu, Pramod Viswanath, Yufei Ding, and Adnan Aziz. TritonGym: A benchmark for agentic LLM workflows in Triton GPU code generation.
- [9] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- [10] Kelun Lei, Hailong Yang, Huaitao Zhang, Xin You, Kaige Zhang, Zhongzhi Luan, Yi Liu, and Depei Qian. PRAGMA: A profiling-reasoned multi-agent framework for automatic kernel optimization. arXiv:2511.06345, 2025.
- [11] Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, et al. AutoTriton: Automatic Triton programming with reinforcement learning in LLMs. arXiv:2507.05687, 2025.
- [12] Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, and Chris Shum. CUDA-L1: Improving CUDA optimization via contrastive reinforcement learning. arXiv:2507.14111, 2025.
- [13] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv:2512.02556, 2025.
- [14]
- [15] Leland McInnes, John Healy, Steve Astels, et al. hdbscan: Hierarchical density based clustering. J. Open Source Softw., 2(11):205, 2017.
- [16] MiniMax. MiniMax M2.5: Built for real-world productivity. https://www.minimax.io/news/minimax-m25, 2026.
- [17] Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv:1504.04909, 2015.
- [18] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131, 2025.
- [19] Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. KernelBench: Can LLMs write efficient GPU kernels? arXiv:2502.10517, 2025.
- [20] Dezhi Ran, Shuxiao Xie, Mingfang Ji, Ziyue Hua, Mengzhou Wu, Yuan Cao, Yuzhe Guo, Yu Hao, Linyi Li, Yitao Hu, et al. KernelBand: Boosting LLM-based kernel optimization with a hierarchical and hardware-aware multi-armed bandit. arXiv:2511.18868, 2025.
- [21] Baptiste Rozière et al. Code Llama: Open foundation models for code. arXiv:2308.12950, 2023.
- [22] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv:1909.08053, 2019.
- [23] Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, and Chris Shum. CUDA-L2: Surpassing cuBLAS performance for matrix multiplication through reinforcement learning. arXiv:2512.02551, 2025.
- [24] Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, and Yang Liu. KernelSkill: A multi-agent framework for GPU kernel optimization, 2026.
- [25] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv:2602.02276, 2026.
- [26] Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Mohamed S. Abdelfattah, et al. Fine-tuning GPT-5 for GPU kernel generation. arXiv:2602.11000, 2026.
- [27] Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. Astra: A multi-agent system for GPU kernel performance optimization. arXiv:2509.07506, 2025.
- [28] Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, and Tian Zhang. MultiKernelBench: A multi-platform benchmark for kernel generation. arXiv e-prints, 2025.
- [29] Darrell Whitley, Soraya Rana, and Robert B. Heckendorn. The island model genetic algorithm: On separability, population size and convergence. Journal of Computing and Information Technology, 7(1):33–47, 1999.
- [30] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv:2505.09388, 2025.
- [31] Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv:2601.16175, 2026.
- [32] Jialing Zhang et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023.
- [33] Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. CudaForge: An agent framework with hardware feedback for CUDA kernel optimization. arXiv:2511.01884, 2025.
- [34] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, et al. SGLang: Efficient execution of structured language model programs. arXiv:2312.07104, 2023.
- [35] Jiace Zhu, Wentao Chen, Qi Fan, Zhixing Ren, Junying Wu, Xing Zhe Chai, Chotiwit Rungrueangwutthinon, Yehan Ma, and An Zou. CudaBench: Benchmarking LLMs for text-to-CUDA generation. arXiv:2603.02236, 2026.
- [36] Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, et al. Intern-S1-Pro: Scientific multimodal foundation model at trillion scale, 2026.
discussion (0)