AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

Chuyun Shen; Jiehao Wu; Junjie Sheng; Wenhao Li; Xiangfeng Wang; Zixiao Huang

arxiv: 2603.23566 · v2 · pith:EPOFIQQ6new · submitted 2026-03-24 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 10:11 UTC · model grok-4.3

keywords Ascend NPU · AscendC operators · episodic agent · operator optimization · tiling configurations · evolutionary search · profiling in the loop · kernel rewriting

The pith

AscendOptimizer is an episodic agent that learns Ascend NPU operator optimizations by rewinding proven kernels to extract reusable experience and by running profiling-driven evolutionary search for tiling and data movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the scarcity of public optimization examples for AscendC operators on Ascend NPUs, where performance hinges on both a host-side tiling program and the kernel itself. It builds missing knowledge directly from hardware execution rather than external datasets. For kernels, the agent removes optimizations from strong implementations in a controlled manner and retains only the removals that measurably degrade speed as reusable rewriting experience. For the host side, it couples evolutionary search with on-device profiling to discover valid, high-performance tiling configurations. On 101 real operators this yields a 1.21 times geometric-mean speedup over the open-source baseline while outperforming Best-of-N sampling and OpenEvolve under equal evaluation budgets.

Core claim

AscendOptimizer is an episodic agent that improves AscendC operators by rewinding optimizations from strong implementations to create reusable experience and by performing profiling-in-the-loop evolutionary search for host-side tiling and data-movement configurations, leading to a 1.21x geometric mean speedup over the open-source baseline on 101 operators with 53.47 percent outperforming their references.
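The two headline numbers in this claim are aggregate statistics over per-operator latencies. As a minimal sketch of how such figures are typically computed, assuming speedup is defined as baseline latency divided by optimized latency (the function names and the latency values below are illustrative and do not come from the paper):

```python
import math

def geometric_mean_speedup(baseline_us, optimized_us):
    """Geometric mean of per-operator speedups (baseline time / optimized time)."""
    ratios = [b / o for b, o in zip(baseline_us, optimized_us)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def fraction_faster(baseline_us, optimized_us):
    """Share of operators whose optimized latency is strictly below the baseline."""
    wins = sum(1 for b, o in zip(baseline_us, optimized_us) if o < b)
    return wins / len(baseline_us)

# Made-up latencies in microseconds, for illustration only.
baseline = [100.0, 80.0, 50.0, 40.0]
optimized = [50.0, 80.0, 25.0, 80.0]

gm = geometric_mean_speedup(baseline, optimized)
ff = fraction_faster(baseline, optimized)
print(f"geo-mean speedup: {gm:.3f}x, faster than reference: {ff:.0%}")
```

The geometric mean is the usual choice for ratios because a 2x gain on one operator and a 2x loss on another cancel out, which an arithmetic mean would not reflect.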

What carries the argument

The episodic agent combining controlled optimization removal for kernel rewriting experience with profiling-in-the-loop evolutionary search for tiling configurations.
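The rewind half of this mechanism can be sketched as a leave-one-out loop. This is an illustration under stated assumptions, not the authors' implementation: `rewind_experience`, `toy_measure_us`, the `min_slowdown` threshold, and the optimization names are all hypothetical, and the toy latency model stands in for on-device profiling.

```python
from typing import Callable, Dict, FrozenSet, List

def rewind_experience(
    optimizations: List[str],
    measure_us: Callable[[FrozenSet[str]], float],
    min_slowdown: float = 1.05,  # assumed threshold for a "measurable" slowdown
) -> List[Dict[str, object]]:
    """Remove one optimization at a time; keep those whose removal hurts."""
    full = frozenset(optimizations)
    fast_us = measure_us(full)
    records = []
    for opt in optimizations:
        slow_us = measure_us(full - {opt})
        slowdown = slow_us / fast_us
        if slowdown >= min_slowdown:
            records.append({"optimization": opt, "slowdown": slowdown})
    # Largest measured impact first.
    return sorted(records, key=lambda r: r["slowdown"], reverse=True)

# Toy stand-in for profiling: each enabled optimization saves a fixed latency.
SAVINGS_US = {
    "remove_pipe_barrier": 40.0,
    "larger_dma_burst": 15.0,
    "rename_variables": 0.0,
}

def toy_measure_us(enabled: FrozenSet[str]) -> float:
    return 100.0 - sum(SAVINGS_US[o] for o in enabled)

experience = rewind_experience(list(SAVINGS_US), toy_measure_us)
for record in experience:
    print(record["optimization"], round(record["slowdown"], 2))
```

In this toy run the cosmetic change is filtered out because removing it does not change latency, which mirrors the idea of keeping only removals that measurably degrade speed.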

If this is right

Kernel structure and host-side scheduling can be improved jointly without large external kernel databases.
Performance gains appear consistently across different evaluation budgets and outperform simple sampling baselines.
More than half the optimized operators exceed the speed of their original reference implementations.
The same workflow can be applied to additional AscendC operators beyond the 101-operator benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rewind-and-reuse pattern could transfer to operator optimization on other hardware platforms that also lack public high-performance examples.
Accumulated experience across many operators might eventually lower the number of evaluations needed for each new operator.
The approach could be extended to discover operator fusions or alternative pipelining strategies not present in the original references.

Load-bearing premise

Removing optimizations from strong kernels in a controlled way yields reusable, generalizable experience that helps rewrite new operators, and profiling-guided evolutionary search finds valid high-performance tiling configurations within modest evaluation budgets.
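The second half of the premise, profiling-guided evolutionary search over tiling configurations, can be illustrated with a small elitist loop. Everything specific here is an assumption made for the sketch: the two-parameter configuration (tile length, buffer count), the `UB_BYTES` capacity, the validity rule, and the synthetic `toy_profile_us` cost model that replaces real on-device profiling.

```python
import random

UB_BYTES = 192 * 1024  # hypothetical on-chip buffer capacity
TILE_CHOICES = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
BUFFER_CHOICES = [1, 2, 4]

def is_valid(cfg):
    """A configuration is valid if its float32 buffers fit on chip."""
    tile_len, buffers = cfg
    return tile_len * buffers * 4 <= UB_BYTES

def toy_profile_us(cfg):
    """Synthetic latency; stands in for running the operator on device."""
    tile_len, buffers = cfg
    total = 1 << 20
    steps = total / tile_len
    return steps * (2.0 / buffers) + total / 4096

def mutate(cfg, rng):
    """Resample one of the two tiling parameters."""
    tile_len, buffers = cfg
    if rng.random() < 0.5:
        tile_len = rng.choice(TILE_CHOICES)
    else:
        buffers = rng.choice(BUFFER_CHOICES)
    return (tile_len, buffers)

def evolve(budget, population_size=4, seed=0):
    """Return the best (latency_us, cfg) found within `budget` trials."""
    rng = random.Random(seed)
    population = []
    for _ in range(budget):  # every trial, valid or not, consumes budget
        if len(population) < population_size:
            cand = (rng.choice(TILE_CHOICES), rng.choice(BUFFER_CHOICES))
        else:
            k = min(2, len(population))
            parent = min(rng.sample(population, k))[1]  # small tournament
            cand = mutate(parent, rng)
        if not is_valid(cand):
            continue  # invalid tilings are rejected before profiling
        population.append((toy_profile_us(cand), cand))
        population = sorted(population)[:population_size]  # keep the fastest
    return population[0] if population else None

best = evolve(budget=40)
print(best)
```

Counting invalid candidates against the budget is a design choice of this sketch; the paper's own accounting of invalid trials is not specified in the text above.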

What would settle it

On a new held-out set of AscendC operators, AscendOptimizer fails to produce higher geometric-mean speedup than Best-of-N sampling or OpenEvolve when given identical per-operator evaluation budgets.

Figures

Figures reproduced from arXiv: 2603.23566 by Chuyun Shen, Jiehao Wu, Junjie Sheng, Wenhao Li, Xiangfeng Wang, Zixiao Huang.

**Figure 1.** Overview of AscendOptimizer. Stage I performs evolutionary-guided program search with hardware-in-the-loop profiling feedback to discover valid high-performance configurations; Stage II bootstraps optimization experience via optimization rewind and applies retrieval-augmented kernel optimization to address structural bottlenecks. The two stages are executed in an alternating loop, where improvements from o…

**Figure 2.** CDF of per-operator speedups achieved by …

**Figure 3.** Semantic landscape of optimization strategies via embedding clustering. Each optimization record (Title …

**Figure 4.** Optimization trajectory of the "foreach pow scalar and tensor" operator …

**Figure 5.** Illustration of a key scheduling rewrite in …

**Figure 6.** Base tiling function Tbase with evolution markers, synthesized from the operator code and attributes S. Method details: optimization thought from rewind. "optimization_point": { "title": "Elimination of Pipeline Serialization and Enhancement of DMA Transfer Efficiency", "description": "The Fast Version achieves significantly higher performance by addressing three critical architectural inefficiencies …

**Figure 7.** Optimization thought example: key changes from a slow to a fast implementation. The figure shows a …

Original abstract

Optimizing AscendC (Ascend C) operators for Ascend NPUs is difficult for two reasons. First, unlike CUDA, the ecosystem offers few public kernels to learn from. Second, performance depends on a coupled two-part implementation: a host-side tiling program that controls data movement and a kernel program that schedules and pipelines computation. We present AscendOptimizer, an episodic agent that builds missing optimization knowledge from execution itself. For kernel optimization, AscendOptimizer rewinds strong implementations by removing optimizations in a controlled way, then keeps the changes whose removal measurably hurts performance as reusable experience for later rewriting. For host-side optimization, it runs profiling-in-the-loop evolutionary search to find valid, fast tiling and data-movement configurations directly from hardware feedback. This combination lets the agent improve kernel structure and host-side scheduling together. On a benchmark of 101 real AscendC operators, AscendOptimizer achieves a 1.21x geometric-mean speedup over the open-source baseline, and 53.47% of operators run faster than their references. Given the same budget of evaluations per operator, AscendOptimizer consistently outperforms Best-of-N sampling and OpenEvolve in terms of geometric-mean speedup, fast_p tail speedup ratios, and overall optimization progress across varying budgets.
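The abstract does not define fast_p. A minimal sketch, assuming the KernelBench-style reading in which fast_p is the fraction of operators that are both correct and faster than the reference by more than a factor p (the speedups and correctness flags below are invented for illustration):

```python
def fast_p(speedups, correct, p):
    """Fraction of operators that are correct and whose speedup exceeds p."""
    hits = sum(1 for s, ok in zip(speedups, correct) if ok and s > p)
    return hits / len(speedups)

# Illustrative per-operator results, not taken from the paper.
speedups = [2.0, 1.0, 1.3, 0.8, 1.6]
correct = [True, True, True, True, False]

# Sweeping p traces the tail of the speedup distribution.
curve = {p: fast_p(speedups, correct, p) for p in (1.0, 1.2, 1.5)}
print(curve)
```

Under this reading, the share of operators running faster than their references corresponds to p = 1, and larger thresholds describe how heavy the tail of large speedups is.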

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AscendOptimizer gets a 1.21x geo-mean speedup on 101 AscendC operators by rewinding strong kernels for experience and running hardware-feedback evolutionary search, but the results do not isolate how much the rewinding step adds over search alone.

The main takeaway is that this paper shows a practical way to optimize AscendC operators on Ascend NPUs, where public kernels are scarce. It uses an episodic agent that rewinds optimizations from already-strong implementations, keeps the optimizations whose removal measurably hurts performance as reusable experience, and combines that with profiling-in-the-loop evolutionary search for host-side tiling and data movement. On the 101-operator benchmark it reports a 1.21x geometric-mean speedup over the open-source baseline and beats Best-of-N and OpenEvolve at equal evaluation budgets, with 53.47% of operators running faster than their references.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AscendOptimizer, an episodic agent for optimizing AscendC operators on Ascend NPUs. It rewinds optimizations from strong kernels in a controlled manner to extract reusable experience for later rewriting, while employing profiling-in-the-loop evolutionary search to discover valid tiling and data-movement configurations on the host side. On a benchmark of 101 real AscendC operators, it reports a 1.21x geometric-mean speedup over the open-source baseline (with 53.47% of operators faster than references) and consistent outperformance versus Best-of-N sampling and OpenEvolve under matched evaluation budgets.

Significance. If the central claims hold after addressing the noted gaps, the work offers a practical method for building optimization knowledge in hardware ecosystems with limited public kernels, by combining experience extraction via controlled rewinding with hardware-feedback evolutionary search. This could be relevant for emerging NPUs where traditional tuning resources are scarce.

major comments (2)

[Abstract / Method] Abstract and method description: The headline 1.21x geometric-mean speedup and outperformance claims rest on the episodic rewinding+reuse mechanism producing transferable experience that improves rewriting beyond what profiling-in-the-loop evolutionary search achieves alone. No ablation or transfer analysis is provided to isolate this contribution (e.g., comparing full AscendOptimizer against search-only variants on the same 101 operators), leaving open the possibility that gains are driven primarily by the search component rather than the rewinding loop.
[Experiments] Experimental section: The soundness of the performance claims is difficult to evaluate because the manuscript provides no details on experimental controls, statistical testing, operator selection criteria for the 101-operator benchmark, or potential measurement biases (e.g., warm-up, variance across runs). This directly affects the reliability of the reported geometric-mean speedup and fast_p tail ratios.
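The measurement controls requested in the second major comment (warm-up, repeated runs, reported variance) follow a common pattern. The sketch below shows that pattern with a host-side wall-clock timer; it is an assumption-laden illustration, since real Ascend measurements would come from the device profiler, and `benchmark`, `warmup`, and `repeats` are hypothetical names.

```python
import statistics
import time

def benchmark(fn, warmup=5, repeats=20):
    """Time `fn` after warm-up runs and summarize the repeated samples."""
    for _ in range(warmup):
        fn()  # discard warm-up runs (caches, lazy initialization)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),  # robust to outliers
        "stdev_s": statistics.stdev(samples),    # run-to-run variance
        "runs": len(samples),
    }

# Stand-in workload; a real harness would launch the operator instead.
result = benchmark(lambda: sum(range(10_000)))
print(result)
```

Reporting the median alongside the standard deviation makes it easier to judge whether a small per-operator speedup exceeds run-to-run noise.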

minor comments (2)

[Method] Clarify the precise definition of 'controlled removal' of optimizations and how 'measurable hurt' is quantified to ensure reproducibility of the experience extraction step.
[Method] Add explicit discussion of invalid or suboptimal trials encountered during evolutionary search and how the budget is allocated across operators.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better substantiate the contributions of AscendOptimizer. We will revise the manuscript to include a direct ablation isolating the rewinding mechanism and to expand the experimental details for improved reproducibility and reliability assessment.

Referee: [Abstract / Method] Abstract and method description: The headline 1.21x geometric-mean speedup and outperformance claims rest on the episodic rewinding+reuse mechanism producing transferable experience that improves rewriting beyond what profiling-in-the-loop evolutionary search achieves alone. No ablation or transfer analysis is provided to isolate this contribution (e.g., comparing full AscendOptimizer against search-only variants on the same 101 operators), leaving open the possibility that gains are driven primarily by the search component rather than the rewinding loop.

Authors: We acknowledge the value of a direct ablation to isolate the episodic rewinding and experience-reuse component. Our existing comparisons to Best-of-N sampling and OpenEvolve under matched evaluation budgets already show that the full AscendOptimizer outperforms pure search-based approaches on the 101 operators. Nevertheless, to address the referee's point explicitly, we will add an ablation study in the revised manuscript that evaluates a search-only variant (profiling-in-the-loop evolutionary search without the rewinding loop) against the complete system on the identical benchmark, reporting geometric-mean speedup and fast_p ratios. This will clarify the incremental benefit of the rewinding mechanism. revision: yes
Referee: [Experiments] Experimental section: The soundness of the performance claims is difficult to evaluate because the manuscript provides no details on experimental controls, statistical testing, operator selection criteria for the 101-operator benchmark, or potential measurement biases (e.g., warm-up, variance across runs). This directly affects the reliability of the reported geometric-mean speedup and fast_p tail ratios.

Authors: We agree that additional methodological details are required. In the revised version we will expand the Experiments section with a new subsection that specifies: (i) the criteria used to select the 101 real AscendC operators, (ii) the full measurement protocol including warm-up iterations, number of repeated runs per configuration, and how variance is handled, (iii) the statistical procedures applied (e.g., reporting means with standard deviations and any significance tests), and (iv) controls for hardware and environmental variability. These additions will allow readers to assess the robustness of the 1.21x geometric-mean speedup and the 53.47% fast_p figure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from direct hardware benchmarks

The paper describes an episodic agent that rewinds optimizations from strong kernels to extract reusable experience and applies profiling-in-the-loop evolutionary search for tiling configurations. All reported performance claims (1.21x geo-mean speedup on 101 operators, outperformance vs. Best-of-N and OpenEvolve under fixed evaluation budgets) are obtained via direct execution measurements on Ascend NPU hardware. No mathematical derivations, equations, or self-referential definitions are present that reduce a claimed result to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked to justify core mechanisms. The approach is self-contained against external benchmarks and falsifiable through the described experimental protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on empirical hardware feedback and search heuristics with no explicit mathematical axioms or invented entities stated; free parameters such as search population size or rewinding thresholds are not detailed in the abstract.

pith-pipeline@v0.9.0 · 5773 in / 1225 out tokens · 62732 ms · 2026-05-21T10:11:55.049098+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

[1] Xin Ai, Bing Zhang, Qiange Wang, Yanfeng Zhang, Hao Yuan, Shufeng Gong, and Ge Yu. NeutronAscend: Optimizing GNN training with Ascend AI processors. ACM Transactions on Architecture and Code Optimization, 22(4):1–26, 2025.

[2] Martin Andrews and Sam Witteveen. GPU Kernel Scientist: An LLM-driven framework for iterative kernel optimization. arXiv preprint arXiv:2506.20807, 2025.

[3] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. Tiramisu: A polyhedral compiler for expressing fast and portable code. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2019.

[4] Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, and Silas Alberti. Kevin: Multi-turn RL for generating CUDA kernels. arXiv preprint arXiv:2507.11948, 2025.

[5] Uday Bondhugula, Albert Hartono, J Ramanujam, and P Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In The ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2008.

[6] Xinzi Cao, Jianyang Zhai, Pengfei Li, Zhiheng Hu, Cen Yan, Bingxu Mu, Guanghuan Fang, Bin She, Jiayu Li, Yihan Su, et al. AscendKernelGen: A systematic study of LLM-based kernel generation for neural processing units. arXiv preprint arXiv:2601.07160, 2026.

[7] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In The 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 578–594, 2018.

[8] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. The 36th International Conference on Neural Information Processing Systems (NeurIPS), 2022.

[9] Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, and Shuang Yang. STARK: Strategic team of agents for refining kernels. arXiv preprint arXiv:2510.16996, 2025.

[10] Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, and Qingfu Zhang. EvoEngineer: Mastering automated CUDA kernel code evolution with large language models. arXiv preprint arXiv:2510.03760, 2025.

[11] Kelun Lei, Hailong Yang, Huaitao Zhang, Xin You, Kaige Zhang, Zhongzhi Luan, Yi Liu, and Depei Qian. PRAGMA: A profiling-reasoned multi-agent framework for automatic kernel optimization. arXiv preprint arXiv:2511.06345, 2025.

[12] Haonan Li, Keyu Man, Partha Kanuparthy, Hanning Chen, Wei Sun, Sreen Tallam, Chenguang Zhu, Kevin Zhu, and Zhiyun Qian. TritonForge: Profiling-guided framework for automated Triton kernel optimization. arXiv preprint arXiv:2512.09196, 2025.

[13] Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, WangHaojie WangHaojie, Jianrong Wang, Xu Han, et al. TritonBench: Benchmarking large language model capabilities for generating Triton operators. In Findings of the Association for Computational Linguistics: ACL 2025, 2025.

[14] Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. The deep learning compiler: A comprehensive survey. IEEE Transactions on Parallel and Distributed Systems, 32(3):708–727, 2020.

[15] Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, et al. AutoTriton: Automatic triton programming with reinforcement learning in LLMs. arXiv preprint arXiv:2507.05687, 2025.

[16] Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, and Caiwen Ding. Stitchcuda: An automated multi-agents end-to-end gpu programing framework with rubric-based agentic reinforcement learning. arXiv preprint arXiv:2603.02637, 2026.

[17] Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, and Chris Shum. CUDA-L1: Improving CUDA optimization via contrastive reinforcement learning. arXiv preprint arXiv:2507.14111, 2025.

[18] Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, et al. KernelEvolve: Scaling agentic kernel coding for heterogeneous AI accelerators at Meta. arXiv preprint arXiv:2512.23236, 2025.

[19] Salli Moustafa. Accelerating sparse matrix-matrix multiplication with the Ascend AI core. In The 5th Workshop on Accelerated Machine Learning (AccML), 2023.

[20] Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517, 2025.

[21] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In The 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2013.

[22] ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, et al. Seed-coder: Let the code model curate data for itself. arXiv preprint arXiv:2506.03524, 2025.

[23] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. The 38th Conference on Neural Information Processing Systems (NeurIPS), 2024.

[24] Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent, 2025.

[25] Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, and Chris Shum. CUDA-L2: Surpassing cuBLAS performance for matrix multiplication through reinforcement learning. arXiv preprint arXiv:2512.02551, 2025.

[26] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: An intermediate language and compiler for tiled neural network computations. In The 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), 2019.

[27] Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, and Emad Barsoum. Geak: Introducing triton kernel AI agent & evaluation benchmarks. arXiv preprint arXiv:2507.23194, 2025.

[28] Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, Wenhao Xie, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, et al. TileLang: A composable tiled programming model for AI systems. arXiv preprint arXiv:2504.17577, 2025.

[29] Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. Astra: A multi-agent system for GPU kernel performance optimization. arXiv preprint arXiv:2509.07506, 2025.

[30] Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, and Tian Zhang. MultiKernelBench: A multi-platform benchmark for kernel generation. arXiv preprint arXiv:2507.17773, 2025.

[31] Jiin Woo, Shaowei Zhu, Allen Nie, Zhen Jia, Yida Wang, and Youngsuk Park. TritonRL: Training LLMs to think and code triton without cheating. arXiv preprint arXiv:2510.17891, 2025.

[32] Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. Mirage: A multi-level superoptimizer for tensor programs. In The 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2025.

[33] Yi Zhai, Yu Zhang, Shuo Liu, Xiaomeng Chu, Jie Peng, Jianmin Ji, and Yanyong Zhang. TLP: A deep learning-based cost model for tensor program tuning. In The 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023.

[34] Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh Anand Sontakke, Joseph J Lim, Jesse Thomason, Erdem Biyik, and Jesse Zhang. ReWiND: Language-guided rewards teach robot policies without new demonstrations. arXiv preprint arXiv:2505.10911, 2025.

[35] Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. CudaForge: An agent framework with hardware feedback for CUDA kernel optimization. arXiv preprint arXiv:2511.01884, 2025.

[36] Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, et al. AKG: Automatic kernel generation for neural processing units using polyhedral transformations. In The 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI), 2021.

[37] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. Ansor: Generating high-performance tensor programs for deep learning. In The 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020.

[38] Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E Gonzalez, Ion Stoica, and Ameer Haj Ali. TenSet: A large-scale program performance dataset for learned tensor compilers. In The 35th Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021.

[39] Yuhang Zhou, Zhibin Wang, Guyue Liu, Shipeng Li, Xi Lin, Zibo Wang, Yongzhong Wang, Fuchun Wei, Jingyi Zhang, Zhiheng Hu, et al. Squeezing operator performance potential for the Ascend architecture. In The 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025.

[40] Yuhang Zhou, Zibo Wang, Zhibin Wang, Ruyi Zhang, Chen Tian, Xiaoliang Wang, Wanchun Dou, Guihai Chen, Bingqiang Wang, Yonghong Tian, et al. Accelerating model training on Ascend chips: An industrial system for profiling, analysis and optimization. In 2025 USENIX Annual Technical Conference (USENIX ATC), 2025.

[41] Hongyu Zhu, Amar Phanishayee, and Gennady Pekhimenko. Daydream: Accurately estimating the efficacy of optimizations for DNN training. In 2020 USENIX Annual Technical Conference (USENIX ATC), 2020.

A Additional Experimental Details. Final benchmark size. After these checks and adjustments, we retain 127 operators for all reported experiments. Category Oper...

[42] Pipeline pipelining: The Slow Version invokes `PipeBarrier<PIPE_ALL>()` inside the inner loop for every complex element. On the Ascend AI Core, this forces a full stall across MTE1/MTE2/MTE3 and the Vector units, preventing overlap between data movement and computation. Removing these barriers restores the intended decoupled pipeline and enables ...

[43] DMA burst efficiency: The Slow Version sets `maxDataCount = 2` (1 complex float = 8 bytes), far below the typical 32B/64B-efficient burst sizes on MTE. Increasing it to `maxDataCount = 8` (32 bytes) improves the payload-to-overhead ratio for DMA commands.

[44] Optimized data paths: Switching from `CopyInPad` (via `DataCopyPad`) to `CopyIn` (via `DataCopy`), and enabling aligned mode in `CopyOut`, allows the operator to take the high-performance aligned DMA path. `DataCopyPad` is generally slower due to extra handling for non-contiguous or unaligned accesses, while `DataCopy` maps more directly to effici...

[1] [1]

NeutronAscend: Op- timizing GNN training with Ascend AI processors.ACM Transactions on Architecture and Code Optimization, 22(4):1–26, 2025

Xin Ai, bing zhang, Qiange Wang, Yanfeng Zhang, Hao Yuan, Shufeng Gong, and Ge Yu. NeutronAscend: Op- timizing GNN training with Ascend AI processors.ACM Transactions on Architecture and Code Optimization, 22(4):1–26, 2025

work page 2025

[2] [2]

GPU Kernel Scientist: An LLM-driven framework for iterative kernel optimization.arXiv preprint arXiv:2506.20807, 2025

Martin Andrews and Sam Witteveen. GPU Kernel Scientist: An LLM-driven framework for iterative kernel optimization.arXiv preprint arXiv:2506.20807, 2025

work page arXiv 2025

[3] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. Tiramisu: A polyhedral compiler for expressing fast and portable code. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2019.

[4] Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, and Silas Alberti. Kevin: Multi-turn RL for generating CUDA kernels. arXiv preprint arXiv:2507.11948, 2025.

[5] Uday Bondhugula, Albert Hartono, J Ramanujam, and P Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In The ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2008.

[6] Xinzi Cao, Jianyang Zhai, Pengfei Li, Zhiheng Hu, Cen Yan, Bingxu Mu, Guanghuan Fang, Bin She, Jiayu Li, Yihan Su, et al. AscendKernelGen: A systematic study of LLM-based kernel generation for neural processing units. arXiv preprint arXiv:2601.07160, 2026.

[7] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In The 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 578–594, 2018.

[8] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In The 36th International Conference on Neural Information Processing Systems (NeurIPS), 2022.

[9] Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, and Shuang Yang. STARK: Strategic team of agents for refining kernels. arXiv preprint arXiv:2510.16996, 2025.

[10] Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, and Qingfu Zhang. EvoEngineer: Mastering automated CUDA kernel code evolution with large language models. arXiv preprint arXiv:2510.03760, 2025.

[11] Kelun Lei, Hailong Yang, Huaitao Zhang, Xin You, Kaige Zhang, Zhongzhi Luan, Yi Liu, and Depei Qian. PRAGMA: A profiling-reasoned multi-agent framework for automatic kernel optimization. arXiv preprint arXiv:2511.06345, 2025.

[12] Haonan Li, Keyu Man, Partha Kanuparthy, Hanning Chen, Wei Sun, Sreen Tallam, Chenguang Zhu, Kevin Zhu, and Zhiyun Qian. TritonForge: Profiling-guided framework for automated Triton kernel optimization. arXiv preprint arXiv:2512.09196, 2025.

[13] Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, et al. TritonBench: Benchmarking large language model capabilities for generating Triton operators. In Findings of the Association for Computational Linguistics: ACL 2025, 2025.

[14] Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. The deep learning compiler: A comprehensive survey. IEEE Transactions on Parallel and Distributed Systems, 32(3):708–727, 2020.

[15] Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, et al. AutoTriton: Automatic Triton programming with reinforcement learning in LLMs. arXiv preprint arXiv:2507.05687, 2025.

[16] Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, and Caiwen Ding. Stitchcuda: An automated multi-agents end-to-end GPU programing framework with rubric-based agentic reinforcement learning. arXiv preprint arXiv:2603.02637, 2026.

[17] Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, and Chris Shum. CUDA-L1: Improving CUDA optimization via contrastive reinforcement learning. arXiv preprint arXiv:2507.14111, 2025.

[18] Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, et al. KernelEvolve: Scaling agentic kernel coding for heterogeneous AI accelerators at Meta. arXiv preprint arXiv:2512.23236, 2025.

[19] Salli Moustafa. Accelerating sparse matrix-matrix multiplication with the Ascend AI core. In The 5th Workshop on Accelerated Machine Learning (AccML), 2023.

[20] Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517, 2025.

[21] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In The 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2013.

[22] ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, et al. Seed-coder: Let the code model curate data for itself. arXiv preprint arXiv:2506.03524, 2025.

[23] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In The 38th Conference on Neural Information Processing Systems (NeurIPS), 2024.

[24] Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent, 2025.

[25] Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, and Chris Shum. CUDA-L2: Surpassing cuBLAS performance for matrix multiplication through reinforcement learning. arXiv preprint arXiv:2512.02551, 2025.

[26] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: An intermediate language and compiler for tiled neural network computations. In The 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), 2019.

[27] Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, and Emad Barsoum. Geak: Introducing Triton kernel AI agent & evaluation benchmarks. arXiv preprint arXiv:2507.23194, 2025.

[28] Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, Wenhao Xie, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, et al. TileLang: A composable tiled programming model for AI systems. arXiv preprint arXiv:2504.17577, 2025.

[29] Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. Astra: A multi-agent system for GPU kernel performance optimization. arXiv preprint arXiv:2509.07506, 2025.

[30] Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, and Tian Zhang. MultiKernelBench: A multi-platform benchmark for kernel generation. arXiv preprint arXiv:2507.17773, 2025.

[31] Jiin Woo, Shaowei Zhu, Allen Nie, Zhen Jia, Yida Wang, and Youngsuk Park. TritonRL: Training LLMs to think and code Triton without cheating. arXiv preprint arXiv:2510.17891, 2025.

[32] Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. Mirage: A multi-level superoptimizer for tensor programs. In The 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2025.

[33] Yi Zhai, Yu Zhang, Shuo Liu, Xiaomeng Chu, Jie Peng, Jianmin Ji, and Yanyong Zhang. TLP: A deep learning-based cost model for tensor program tuning. In The 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023.

[34] Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh Anand Sontakke, Joseph J Lim, Jesse Thomason, Erdem Biyik, and Jesse Zhang. ReWiND: Language-guided rewards teach robot policies without new demonstrations. arXiv preprint arXiv:2505.10911, 2025.

[35] Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. CudaForge: An agent framework with hardware feedback for CUDA kernel optimization. arXiv preprint arXiv:2511.01884, 2025.

[36] Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, et al. AKG: Automatic kernel generation for neural processing units using polyhedral transformations. In The 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI), 2021.

[37] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. Ansor: Generating high-performance tensor programs for deep learning. In The 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020.

[38] Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E Gonzalez, Ion Stoica, and Ameer Haj Ali. TenSet: A large-scale program performance dataset for learned tensor compilers. In The 35th Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021.

[39] Yuhang Zhou, Zhibin Wang, Guyue Liu, Shipeng Li, Xi Lin, Zibo Wang, Yongzhong Wang, Fuchun Wei, Jingyi Zhang, Zhiheng Hu, et al. Squeezing operator performance potential for the Ascend architecture. In The 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025.

[40] Yuhang Zhou, Zibo Wang, Zhibin Wang, Ruyi Zhang, Chen Tian, Xiaoliang Wang, Wanchun Dou, Guihai Chen, Bingqiang Wang, Yonghong Tian, et al. Accelerating model training on Ascend chips: An industrial system for profiling, analysis and optimization. In 2025 USENIX Annual Technical Conference (USENIX ATC), 2025.

[41] Hongyu Zhu, Amar Phanishayee, and Gennady Pekhimenko. Daydream: Accurately estimating the efficacy of optimizations for DNN training. In 2020 USENIX Annual Technical Conference (USENIX ATC), 2020.

A Additional Experimental Details

Final benchmark size. After these checks and adjustments, we retain 127 operators for all reported experiments.

Category Oper...
