
arxiv: 2603.23566 · v2 · pith:EPOFIQQ6new · submitted 2026-03-24 · 💻 cs.LG · cs.AI

AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

Pith reviewed 2026-05-21 10:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Ascend NPU · AscendC operators · episodic agent · operator optimization · tiling configurations · evolutionary search · profiling in the loop · kernel rewriting

The pith

AscendOptimizer is an episodic agent that learns Ascend NPU operator optimizations by rewinding proven kernels to extract reusable experience and by running profiling-driven evolutionary search for tiling and data movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the scarcity of public optimization examples for AscendC operators on Ascend NPUs, where performance hinges on both a host-side tiling program and the kernel itself. It builds the missing knowledge directly from hardware execution rather than from external datasets. For kernels, the agent removes optimizations from strong implementations in a controlled manner and retains, as reusable rewriting experience, only those optimizations whose removal measurably degrades speed. For the host side, it couples evolutionary search with on-device profiling to discover valid, high-performance tiling configurations. On 101 real operators this yields a 1.21x geometric-mean speedup over the open-source baseline while outperforming Best-of-N sampling and OpenEvolve under equal evaluation budgets.
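The rewind step described above can be sketched as a one-at-a-time ablation loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the optimization names, the `min_slowdown` threshold, and the toy latency model standing in for on-device profiling are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List


def rewind_experience(
    optimizations: List[str],
    measure_latency: Callable[[FrozenSet[str]], float],
    min_slowdown: float = 1.05,  # hypothetical threshold for "measurably hurts"
) -> List[Dict[str, object]]:
    """Remove one optimization at a time; keep those whose removal slows the kernel."""
    full = frozenset(optimizations)
    baseline = measure_latency(full)
    experience = []
    for opt in optimizations:
        slowdown = measure_latency(full - {opt}) / baseline
        if slowdown >= min_slowdown:
            experience.append({"optimization": opt, "slowdown": round(slowdown, 3)})
    return experience


# Toy latency model (microseconds) standing in for real profiling; numbers are invented.
PENALTY_US = {"double_buffering": 40.0, "aligned_copy": 15.0, "loop_unroll": 1.0}


def toy_latency(enabled: FrozenSet[str]) -> float:
    # Each missing optimization adds its penalty to a 100 us base latency.
    return 100.0 + sum(p for name, p in PENALTY_US.items() if name not in enabled)


records = rewind_experience(list(PENALTY_US), toy_latency)
print(records)
```

In this toy setup only the two removals that cost at least 5% are retained; the one with a 1% effect is discarded, which mirrors the idea of keeping only changes with a measurable performance impact.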

Core claim

AscendOptimizer is an episodic agent that improves AscendC operators by rewinding optimizations from strong implementations to create reusable experience and by performing profiling-in-the-loop evolutionary search for host-side tiling and data-movement configurations. On 101 operators this yields a 1.21x geometric-mean speedup over the open-source baseline, with 53.47% of operators outperforming their references.
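For readers less familiar with the two headline metrics, the sketch below shows how a geometric-mean speedup and a "fraction faster than reference" figure are typically computed from per-operator speedups. The speedup values are invented for illustration, and the threshold-based fraction is an assumed reading of the fast_p-style metric, which the source does not define.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean via the mean of logs; values must be positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def fraction_faster(speedups: Sequence[float], threshold: float = 1.0) -> float:
    """Share of operators whose speedup strictly exceeds the threshold."""
    return sum(s > threshold for s in speedups) / len(speedups)


# Invented per-operator speedups (reference latency / optimized latency).
toy_speedups = [2.0, 0.5, 4.0, 1.0]

print(geometric_mean(toy_speedups))   # fourth root of 4.0, about 1.414
print(fraction_faster(toy_speedups))  # 0.5
```

As an arithmetic aside, 54 of 101 operators rounds to 53.47%, so the reported percentage is consistent with 54 operators beating their references; that count is an inference, not a number stated in the source.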

What carries the argument

An episodic agent that combines controlled optimization removal, which yields kernel-rewriting experience, with profiling-in-the-loop evolutionary search for tiling configurations.

If this is right

  • Kernel structure and host-side scheduling can be improved jointly without large external kernel databases.
  • Performance gains hold across different evaluation budgets, and the agent outperforms simple sampling baselines.
  • More than half the optimized operators exceed the speed of their original reference implementations.
  • The same workflow can be applied to additional AscendC operators beyond the 101-operator benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rewind-and-reuse pattern could transfer to operator optimization on other hardware platforms that also lack public high-performance examples.
  • Accumulated experience across many operators might eventually lower the number of evaluations needed for each new operator.
  • The approach could be extended to discover operator fusions or alternative pipelining strategies not present in the original references.

Load-bearing premise

Removing optimizations from strong kernels in a controlled way yields reusable, generalizable experience that helps rewrite new operators, and profiling-guided evolutionary search finds valid high-performance tiling configurations within modest evaluation budgets.
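The second half of this premise, search under a fixed evaluation budget with validity checks, can be illustrated with a small evolutionary loop. Everything here is a hedged sketch: the two tiling knobs, the `ASSUMED_BUFFER_BYTES` capacity, and the synthetic `profile` cost model are hypothetical stand-ins for real on-device profiling, and the mutation scheme is not taken from the paper.

```python
import random
from typing import Optional, Tuple

Config = Tuple[int, int]  # (tile_len, buffer_num): hypothetical tiling knobs

ASSUMED_BUFFER_BYTES = 192 * 1024  # illustrative on-chip capacity, not a spec value
TOTAL_ELEMS = 1 << 20              # elements processed by the toy operator


def profile(cfg: Config) -> Optional[float]:
    """Synthetic stand-in for on-device profiling; None marks an invalid config."""
    tile_len, buffer_num = cfg
    if tile_len * buffer_num * 4 > ASSUMED_BUFFER_BYTES:
        return None  # tiles would not fit in the assumed on-chip buffer
    n_tiles = -(-TOTAL_ELEMS // tile_len)      # ceiling division
    overlap = 0.6 if buffer_num >= 2 else 1.0  # double buffering hides some compute
    return n_tiles * (2.0 + 0.001 * tile_len * overlap)


def mutate(cfg: Config, rng: random.Random) -> Config:
    tile_len, buffer_num = cfg
    if rng.random() < 0.7:
        tile_len = tile_len * 2 if rng.random() < 0.5 else tile_len // 2
        tile_len = max(64, min(1 << 16, tile_len))  # clamp to a plausible range
    else:
        buffer_num = 3 - buffer_num  # toggle between 1 and 2
    return tile_len, buffer_num


def evolve(seed_cfg: Config, budget: int, pop_size: int = 4, seed: int = 0):
    """Run mutation-based search until `budget` profile calls have been spent."""
    rng = random.Random(seed)
    seed_latency = profile(seed_cfg)
    if seed_latency is None:
        raise ValueError("seed configuration must be valid")
    population = [(seed_latency, seed_cfg)]
    evaluations = 1
    while evaluations < budget:
        _, parent = rng.choice(population)
        child = mutate(parent, rng)
        latency = profile(child)
        evaluations += 1  # invalid candidates still consume budget
        if latency is None:
            continue
        population = sorted(set(population + [(latency, child)]))[:pop_size]
    best_latency, best_cfg = population[0]
    return best_cfg, best_latency, evaluations


best_cfg, best_latency, used = evolve(seed_cfg=(256, 1), budget=40)
print(best_cfg, round(best_latency, 1), used)
```

Counting invalid candidates against the budget matters for a fair comparison with baselines such as Best-of-N: a method that proposes many configurations that fail validation pays for them in lost evaluations.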

What would settle it

On a new held-out set of AscendC operators, AscendOptimizer fails to produce higher geometric-mean speedup than Best-of-N sampling or OpenEvolve when given identical per-operator evaluation budgets.

Figures

Figures reproduced from arXiv: 2603.23566 by Chuyun Shen, Jiehao Wu, Junjie Sheng, Wenhao Li, Xiangfeng Wang, Zixiao Huang.

Figure 1: Overview of AscendOptimizer. Stage I performs evolutionary-guided program search with hardware-in-the-loop profiling feedback to discover valid high-performance configurations; Stage II bootstraps optimization experience via optimization rewind and applies retrieval-augmented kernel optimization to address structural bottlenecks. The two stages are executed in an alternating loop, where improvements from o…

Figure 2: CDF of per-operator speedups achieved by …

Figure 3: Semantic landscape of optimization strategies via embedding clustering. Each optimization record (Title …

Figure 4: Optimization trajectory of the "foreach pow scalar and tensor" operator …

Figure 5: Illustration of a key scheduling rewrite in …

Figure 6: Base tiling function Tbase with evolution markers, synthesized from the operator code and attributes S.

Method details: optimization thought from rewind

    "optimization_point": {
      "title": "Elimination of Pipeline Serialization and Enhancement of DMA Transfer Efficiency",
      "description": "The Fast Version achieves significantly higher performance by addressing three critical architectural inefficiencies …

Figure 7: Optimization thought example: key changes from a slow to a fast implementation. The figure shows a …
Original abstract

Optimizing AscendC (Ascend C) operators for Ascend NPUs is difficult for two reasons. First, unlike CUDA, the ecosystem offers few public kernels to learn from. Second, performance depends on a coupled two-part implementation: a host-side tiling program that controls data movement and a kernel program that schedules and pipelines computation. We present AscendOptimizer, an episodic agent that builds missing optimization knowledge from execution itself. For kernel optimization, AscendOptimizer rewinds strong implementations by removing optimizations in a controlled way, then keeps the changes whose removal measurably hurts performance as reusable experience for later rewriting. For host-side optimization, it runs profiling-in-the-loop evolutionary search to find valid, fast tiling and data-movement configurations directly from hardware feedback. This combination lets the agent improve kernel structure and host-side scheduling together. On a benchmark of 101 real AscendC operators, AscendOptimizer achieves a 1.21x geometric-mean speedup over the open-source baseline, and 53.47% of operators run faster than their references. Given the same budget of evaluations per operator, AscendOptimizer consistently outperforms Best-of-N sampling and OpenEvolve in terms of geometric-mean speedup, fast_p tail speedup ratios, and overall optimization progress across varying budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AscendOptimizer, an episodic agent for optimizing AscendC operators on Ascend NPUs. It rewinds optimizations from strong kernels in a controlled manner to extract reusable experience for later rewriting, while employing profiling-in-the-loop evolutionary search to discover valid tiling and data-movement configurations on the host side. On a benchmark of 101 real AscendC operators, it reports a 1.21x geometric-mean speedup over the open-source baseline (with 53.47% of operators faster than references) and consistent outperformance versus Best-of-N sampling and OpenEvolve under matched evaluation budgets.

Significance. If the central claims hold after addressing the noted gaps, the work offers a practical method for building optimization knowledge in hardware ecosystems with limited public kernels, by combining experience extraction via controlled rewinding with hardware-feedback evolutionary search. This could be relevant for emerging NPUs where traditional tuning resources are scarce.

major comments (2)
  1. [Abstract / Method] Abstract and method description: The headline 1.21x geometric-mean speedup and outperformance claims rest on the episodic rewinding+reuse mechanism producing transferable experience that improves rewriting beyond what profiling-in-the-loop evolutionary search achieves alone. No ablation or transfer analysis is provided to isolate this contribution (e.g., comparing full AscendOptimizer against search-only variants on the same 101 operators), leaving open the possibility that gains are driven primarily by the search component rather than the rewinding loop.
  2. [Experiments] Experimental section: The soundness of the performance claims is difficult to evaluate because the manuscript provides no details on experimental controls, statistical testing, operator selection criteria for the 101-operator benchmark, or potential measurement biases (e.g., warm-up, variance across runs). This directly affects the reliability of the reported geometric-mean speedup and fast_p tail ratios.
minor comments (2)
  1. [Method] Clarify the precise definition of 'controlled removal' of optimizations and how 'measurable hurt' is quantified to ensure reproducibility of the experience extraction step.
  2. [Method] Add explicit discussion of invalid or suboptimal trials encountered during evolutionary search and how the budget is allocated across operators.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better substantiate the contributions of AscendOptimizer. We will revise the manuscript to include a direct ablation isolating the rewinding mechanism and to expand the experimental details for improved reproducibility and reliability assessment.

Point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description: The headline 1.21x geometric-mean speedup and outperformance claims rest on the episodic rewinding+reuse mechanism producing transferable experience that improves rewriting beyond what profiling-in-the-loop evolutionary search achieves alone. No ablation or transfer analysis is provided to isolate this contribution (e.g., comparing full AscendOptimizer against search-only variants on the same 101 operators), leaving open the possibility that gains are driven primarily by the search component rather than the rewinding loop.

    Authors: We acknowledge the value of a direct ablation to isolate the episodic rewinding and experience-reuse component. Our existing comparisons to Best-of-N sampling and OpenEvolve under matched evaluation budgets already show that the full AscendOptimizer outperforms pure search-based approaches on the 101 operators. Nevertheless, to address the referee's point explicitly, we will add an ablation study in the revised manuscript that evaluates a search-only variant (profiling-in-the-loop evolutionary search without the rewinding loop) against the complete system on the identical benchmark, reporting geometric-mean speedup and fast_p ratios. This will clarify the incremental benefit of the rewinding mechanism. revision: yes

  2. Referee: [Experiments] Experimental section: The soundness of the performance claims is difficult to evaluate because the manuscript provides no details on experimental controls, statistical testing, operator selection criteria for the 101-operator benchmark, or potential measurement biases (e.g., warm-up, variance across runs). This directly affects the reliability of the reported geometric-mean speedup and fast_p tail ratios.

    Authors: We agree that additional methodological details are required. In the revised version we will expand the Experiments section with a new subsection that specifies: (i) the criteria used to select the 101 real AscendC operators, (ii) the full measurement protocol including warm-up iterations, number of repeated runs per configuration, and how variance is handled, (iii) the statistical procedures applied (e.g., reporting means with standard deviations and any significance tests), and (iv) controls for hardware and environmental variability. These additions will allow readers to assess the robustness of the 1.21x geometric-mean speedup and the 53.47% fast_p figure. revision: yes
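The measurement protocol discussed in this exchange (warm-up iterations, repeated runs, variance reporting) can be sketched as a small timing harness. This is an illustrative host-side example only: it uses wall-clock timing from the Python standard library on a placeholder workload, whereas real operator measurements would come from the device profiler, and the `warmup` and `repeats` defaults are arbitrary assumptions rather than values from the paper.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 5, repeats: int = 20) -> Dict[str, float]:
    """Time `fn` after discarding warm-up runs; report central tendency and spread."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "mean_s": statistics.fmean(samples),
        "stdev_s": statistics.stdev(samples),  # needs at least two samples
        "runs": float(repeats),
    }


# Placeholder workload standing in for one operator launch.
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

Reporting the median alongside the standard deviation makes it easier to judge whether a small speedup, such as one close to 1.0x, exceeds run-to-run noise.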

Circularity Check

0 steps flagged

No circularity: empirical results from direct hardware benchmarks

Full rationale

The paper describes an episodic agent that rewinds optimizations from strong kernels to extract reusable experience and applies profiling-in-the-loop evolutionary search for tiling configurations. All reported performance claims (1.21x geo-mean speedup on 101 operators, outperformance vs. Best-of-N and OpenEvolve under fixed evaluation budgets) are obtained via direct execution measurements on Ascend NPU hardware. No mathematical derivations, equations, or self-referential definitions are present that reduce a claimed result to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked to justify core mechanisms. The approach is self-contained against external benchmarks and falsifiable through the described experimental protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on empirical hardware feedback and search heuristics with no explicit mathematical axioms or invented entities stated; free parameters such as search population size or rewinding thresholds are not detailed in the abstract.

pith-pipeline@v0.9.0 · 5773 in / 1225 out tokens · 62732 ms · 2026-05-21T10:11:55.049098+00:00 · methodology

