pith. sign in

arxiv: 2605.29734 · v1 · pith:IHYKQSPUnew · submitted 2026-05-28 · 💻 cs.CL

HTAM: Hierarchical Transition-Attended Memory for Operator Optimization

Pith reviewed 2026-06-29 07:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords HTAMHierarchical Transition GraphLLM-based operator optimizationGPU kernel generationCUDA code optimizationstructured memorycoarse-to-fine framework
0
0 comments X

The pith

HTAM organizes LLM optimization experience in a two-level graph to select global directions and retrieve local strategies for CUDA kernel code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HTAM to improve automatic optimization of GPU kernels using large language models. Existing approaches struggle because broad hints are hard to apply while specific details overwhelm the search process. HTAM creates a two-level graph to store broad optimization directions separately from detailed tactics and records how one leads to the other. At each step the model chooses a direction from the current situation and past steps then pulls the matching details to write the code. Tests across the full KernelBench set show higher rates of correct and fast solutions.

Core claim

HTAM builds a two-level Hierarchical Transition Graph to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step it selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. This coarse-to-fine framework addresses the granularity mismatch in LLM-based operator optimization.

What carries the argument

The two-level Hierarchical Transition Graph that stores coarse global directions at one level, detailed local strategies at the other, and transition experience between steps.

Load-bearing premise

That a two-level graph can organize experience at the right granularity so global selection plus local retrieval produces effective CUDA code without enlarging the search space or obscuring bottlenecks.

What would settle it

Experiments on the KernelBench suite where HTAM shows no gains in correctness, fast-solution rate, or speedup compared to LLM baselines would disprove the claim.

Figures

Figures reproduced from arXiv: 2605.29734 by Chengqing Zong, Chen Wang, Mingyang Yi, Tianhe Jia, Xuwen Xiang, Yining Zhang, Yue Wang, Zedong Dan.

Figure 1
Figure 1. Figure 1: HTAM completes operator optimization in a [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Human expert operator optimization typically [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overall framework of HTAM. The current implementation and evaluation feedback are summarized into a [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Hierarchical Transition Graph (HTG), where [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Swish case study. The table shows step-wise [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Step-wise speedup curve for the Swish case [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
read the original abstract

High-performance GPU kernels are essential for efficient LLM deployment, yet optimizing them remains expertise-intensive. Recent LLM-based code generation makes automatic GPU operator generation promising, but operator optimization remains a hardware-aware search problem. Existing LLM-based methods face a granularity mismatch: coarse hints are reusable but hard to execute, whereas detailed memories are actionable but enlarge the search space and obscure optimization bottlenecks. The key challenge is therefore to organize optimization experience at an appropriate granularity. To address this issue, this paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based operator optimization. HTAM builds a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step, HTAM selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite demonstrate that HTAM consistently improves correctness, fast-solution rate, and speedup over LLM-based baselines, while backend and Robust-KBench studies indicate transferable benefits from structured memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based GPU operator optimization. It constructs a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience. At each evolution step, the method selects a global direction from the current state and recent history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite are reported to show consistent gains in correctness, fast-solution rate, and speedup over LLM-based baselines, with additional backend and Robust-KBench studies indicating transferable benefits from the structured memory.

Significance. If the reported gains hold under rigorous evaluation, the hierarchical organization of optimization experience at appropriate granularity could meaningfully advance automatic, hardware-aware kernel generation for LLMs. The approach directly targets the granularity mismatch between reusable coarse hints and actionable but search-space-enlarging detailed memories, offering a structured alternative to flat memory or prompt-based methods.

major comments (1)
  1. [Abstract / Experimental Evaluation] The central experimental claim (consistent improvements on the full KernelBench suite) is presented without any description of baselines, number of trials, statistical tests, ablation results on the two-level HTG components, or definitions of the reported metrics. This absence prevents assessment of whether the data actually support the claim that structured memory yields transferable benefits.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the abstract's experimental description. We agree that greater self-containment in the abstract will strengthen the presentation of our claims and will revise accordingly while preserving the abstract's length constraints.

read point-by-point responses
  1. Referee: [Abstract / Experimental Evaluation] The central experimental claim (consistent improvements on the full KernelBench suite) is presented without any description of baselines, number of trials, statistical tests, ablation results on the two-level HTG components, or definitions of the reported metrics. This absence prevents assessment of whether the data actually support the claim that structured memory yields transferable benefits.

    Authors: We acknowledge the referee's point that the abstract, as currently written, does not enumerate these experimental details. The full manuscript addresses them in the body: Section 4.1 specifies the LLM-based baselines (direct generation, flat memory, and prompt-only variants), Section 3.3 defines the three metrics (correctness rate, fast-solution rate within 10 attempts, and geometric-mean speedup), Section 4.2 reports results aggregated over multiple independent runs with standard deviation, Section 4.4 contains the two-level HTG ablations, and Sections 4.5–4.6 present the backend portability and Robust-KBench transfer experiments. To make the central claim verifiable from the abstract itself, we will add a concise clause listing the primary baselines, the number of runs, and the metric definitions. We believe this addresses the concern without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes HTAM, a coarse-to-fine framework using a two-level Hierarchical Transition Graph (HTG) to organize global directions and local strategies for LLM-based CUDA optimization. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or description. The central claims rest on experimental improvements on KernelBench rather than any internal reduction to inputs by construction. This is a standard methodological proposal with independent empirical validation and no detectable circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; all arrays left empty.

pith-pipeline@v0.9.1-grok · 5745 in / 1213 out tokens · 33363 ms · 2026-06-29T07:24:50.963409+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 35 canonical work pages · 22 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  2. [2]

    Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, and Egor Bogomolov. 2025. https://arxiv.org/abs/2510.23393 The best of n worlds: Aligning reinforcement learning with best-of-n sampling via max@k optimisation

  3. [3]

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R \'e , and Azalia Mirhoseini. 2024. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787

  4. [4]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

  5. [5]

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. https://arxiv.org/abs/1802.04799 Tvm: An automated end-to-end optimizing compiler for deep learning . Preprint, arXiv:1802.04799

  6. [6]

    Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, and An Zou. 2025. https://arxiv.org/abs/2506.09092 Cuda-llm: Llms can write efficient cuda kernels

  7. [7]

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. https://arxiv.org/abs/2504.19413 Mem0: Building production-ready ai agents with scalable long-term memory

  8. [8]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261

  9. [9]

    Weinan Dai, Hanlin Wu, Qiying Yu, Huan ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, and Hao Zhou. 2026. https://arxiv.org/abs/2602.24286 Cuda agent: Large-scale agentic rl for high-performance cuda kernel generation

  10. [10]

    Tri Dao. 2023. https://arxiv.org/abs/2307.08691 Flashattention-2: Faster attention with better parallelism and work partitioning . Preprint, arXiv:2307.08691

  11. [11]

    DeepSeek-AI . 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf Deepseek-v4: Towards highly efficient million-token context intelligence . Technical report. Accessed: 2026-05-18

  12. [12]

    Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, and Christos Kozyrakis. 2026. https://arxiv.org/abs/2602.14293 Kernelblaster: Continual cross-task cuda optimization via memory-augmented in-context reinforcement learning

  13. [13]

    Thomas Faingnaert, Tim Besard, and Bjorn De Sutter. 2021. Flexible performant gemm kernels on gpus. IEEE Transactions on Parallel and Distributed Systems, 33(9):2230--2248

  14. [14]

    Junfeng Gong, Zhiyi Wei, Junying Chen, Cheng Liu, and Huawei Li. 2025. https://arxiv.org/abs/2510.19873 From large to small: Transferring cuda optimization expertise via reasoning graph

  15. [15]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

  16. [16]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. https://arxiv.org/abs/2401.14196 Deepseek-coder: When the large language model meets programming -- the rise of code intelligence

  17. [17]

    Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, and Shi-Min Hu. 2026. https://arxiv.org/abs/2603.07169 Making llms optimize multi-scenario cuda kernels like experts

  18. [18]

    Yiyang Jin, Kunzhao Xu, Hang Li, Xueting Han, Yanmin Zhou, Cheng Li, and Jing Bai. 2025. https://arxiv.org/abs/2506.11442 Reveal: Self-evolving code agents via reliable self-verification

  19. [19]

    Robert Tjarko Lange, Qi Sun, Aaditya Prasad, Maxence Faldor, Yujin Tang, and David Ha. 2025. https://arxiv.org/abs/2509.14279 Towards robust agentic cuda kernel benchmarking, verification, and optimization

  20. [20]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. https://arxiv.org/abs/2005.11401 Retrieval-augmented generation for knowledge-intensive nlp tasks

  21. [21]

    Xiuyu Li, Jinkai Zhang, Mingyang Yi, Yu Li, Longqiang Wang, Yue Wang, and Ju Fan. 2026 a . Ets: Energy-guided test-time scaling for training-free rl alignment. arXiv preprint arXiv:2601.21484

  22. [22]

    Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, and Tieying Zhang. 2026 b . https://arxiv.org/abs/2602.00994 Reasoning and tool-use compete in agentic rl:from quantifying interference to disentangled tuning

  23. [23]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, and 1 others. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437

  24. [24]

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others. 2025. Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556

  25. [25]

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. 2025. Towards efficient generative large language model serving: A survey from algorithms to systems. ACM Computing Surveys, 58(1):1--37

  26. [26]

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Cand \`e s, and Tatsunori B Hashimoto. 2025. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286--20332

  27. [27]

    Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. 2025. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 16(5):1--72

  28. [28]

    Alexander Novikov, Ng \^a n V \ u , Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, and 1 others. 2025. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131

  29. [29]

    KernelBench: Can LLMs Write Efficient GPU Kernels?

    Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. https://arxiv.org/abs/2502.10517 Kernelbench: Can llms write efficient gpu kernels?

  30. [30]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. https://arxiv.org/abs/2310.08560 Memgpt: Towards llms as operating systems

  31. [31]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, and 7 others. 2024. https://arxiv.org/abs/2308.12950 Code ...

  32. [32]

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. https://arxiv.org/abs/2303.11366 Reflexion: Language agents with verbal reinforcement learning

  33. [33]

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10--19

  34. [34]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. https://arxiv.org/abs/1706.03762 Attention is all you need

  35. [35]

    Jiaxing Wang, Deping Xiang, Jin Xu, Mingyang Yi, Guoqiang Gong, Zicheng Zhang, Haoran Li, Pengzhang Liu, Zhen Chen, Ke Zhang, and 1 others. 2026. Tandem: Bi-level data mixture optimization with twin networks. Advances in Neural Information Processing Systems, 38:144720--144752

  36. [36]

    Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, Wenhao Xie, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, and 1 others. 2025. Tilelang: A composable tiled programming model for ai systems. arXiv preprint arXiv:2504.17577

  37. [37]

    Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. 2025. https://arxiv.org/abs/2509.07506 Astra: A multi-agent system for gpu kernel performance optimization . Preprint, arXiv:2509.07506

  38. [38]

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. https://arxiv.org/abs/2502.12110 A-mem: Agentic memory for llm agents

  39. [39]

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and 1 others. 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving. Proceedings of Machine Learning and Systems, 7

  40. [40]

    Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. 2025. https://arxiv.org/abs/2511.01884 Cudaforge: An agent framework with hardware feedback for cuda kernel optimization

  41. [41]

    Gonzalez, and Ion Stoica

    Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2023. https://arxiv.org/abs/2006.06762 Ansor: Generating high-performance tensor programs for deep learning . Preprint, arXiv:2006.06762

  42. [42]

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2023. https://arxiv.org/abs/2305.10250 Memorybank: Enhancing large language models with long-term memory

  43. [43]

    Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. 2022. https://www.usenix.org/conference/osdi22/presentation/zhu ROLLER : Fast and efficient tensor compilation for deep learning . In 16th USENIX Symposium on Operating Syst...

  44. [44]

    Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, and Ling Li. 2025. https://arxiv.org/abs/2511.20100 Qimeng-kernel: Macro-thinking micro-coding paradigm for llm-based high-performance gpu kernel generation . Preprint, arXiv:2511.20100

  45. [45]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  46. [46]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...