HTAM: Hierarchical Transition-Attended Memory for Operator Optimization
Pith reviewed 2026-06-29 07:24 UTC · model grok-4.3
The pith
HTAM organizes LLM optimization experience in a two-level graph to select global directions and retrieve local strategies for CUDA kernel code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HTAM builds a two-level Hierarchical Transition Graph to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step it selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. This coarse-to-fine framework addresses the granularity mismatch in LLM-based operator optimization.
What carries the argument
The two-level Hierarchical Transition Graph that stores coarse global directions at one level, detailed local strategies at the other, and transition experience between steps.
Load-bearing premise
That a two-level graph can organize experience at the right granularity so global selection plus local retrieval produces effective CUDA code without enlarging the search space or obscuring bottlenecks.
What would settle it
Experiments on the KernelBench suite where HTAM shows no gains in correctness, fast-solution rate, or speedup compared to LLM baselines would disprove the claim.
Figures
read the original abstract
High-performance GPU kernels are essential for efficient LLM deployment, yet optimizing them remains expertise-intensive. Recent LLM-based code generation makes automatic GPU operator generation promising, but operator optimization remains a hardware-aware search problem. Existing LLM-based methods face a granularity mismatch: coarse hints are reusable but hard to execute, whereas detailed memories are actionable but enlarge the search space and obscure optimization bottlenecks. The key challenge is therefore to organize optimization experience at an appropriate granularity. To address this issue, this paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based operator optimization. HTAM builds a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step, HTAM selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite demonstrate that HTAM consistently improves correctness, fast-solution rate, and speedup over LLM-based baselines, while backend and Robust-KBench studies indicate transferable benefits from structured memory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based GPU operator optimization. It constructs a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience. At each evolution step, the method selects a global direction from the current state and recent history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite are reported to show consistent gains in correctness, fast-solution rate, and speedup over LLM-based baselines, with additional backend and Robust-KBench studies indicating transferable benefits from the structured memory.
Significance. If the reported gains hold under rigorous evaluation, the hierarchical organization of optimization experience at appropriate granularity could meaningfully advance automatic, hardware-aware kernel generation for LLMs. The approach directly targets the granularity mismatch between reusable coarse hints and actionable but search-space-enlarging detailed memories, offering a structured alternative to flat memory or prompt-based methods.
major comments (1)
- [Abstract / Experimental Evaluation] The central experimental claim (consistent improvements on the full KernelBench suite) is presented without any description of baselines, number of trials, statistical tests, ablation results on the two-level HTG components, or definitions of the reported metrics. This absence prevents assessment of whether the data actually support the claim that structured memory yields transferable benefits.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the abstract's experimental description. We agree that greater self-containment in the abstract will strengthen the presentation of our claims and will revise accordingly while preserving the abstract's length constraints.
read point-by-point responses
-
Referee: [Abstract / Experimental Evaluation] The central experimental claim (consistent improvements on the full KernelBench suite) is presented without any description of baselines, number of trials, statistical tests, ablation results on the two-level HTG components, or definitions of the reported metrics. This absence prevents assessment of whether the data actually support the claim that structured memory yields transferable benefits.
Authors: We acknowledge the referee's point that the abstract, as currently written, does not enumerate these experimental details. The full manuscript addresses them in the body: Section 4.1 specifies the LLM-based baselines (direct generation, flat memory, and prompt-only variants), Section 3.3 defines the three metrics (correctness rate, fast-solution rate within 10 attempts, and geometric-mean speedup), Section 4.2 reports results aggregated over multiple independent runs with standard deviation, Section 4.4 contains the two-level HTG ablations, and Sections 4.5–4.6 present the backend portability and Robust-KBench transfer experiments. To make the central claim verifiable from the abstract itself, we will add a concise clause listing the primary baselines, the number of runs, and the metric definitions. We believe this addresses the concern without requiring new experiments. revision: yes
Circularity Check
No significant circularity
full rationale
The paper proposes HTAM, a coarse-to-fine framework using a two-level Hierarchical Transition Graph (HTG) to organize global directions and local strategies for LLM-based CUDA optimization. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or description. The central claims rest on experimental improvements on KernelBench rather than any internal reduction to inputs by construction. This is a standard methodological proposal with independent empirical validation and no detectable circular patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [2]
-
[3]
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R \'e , and Azalia Mirhoseini. 2024. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[5]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. https://arxiv.org/abs/1802.04799 Tvm: An automated end-to-end optimizing compiler for deep learning . Preprint, arXiv:1802.04799
work page internal anchor Pith review Pith/arXiv arXiv 2018
- [6]
-
[7]
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. https://arxiv.org/abs/2504.19413 Mem0: Building production-ready ai agents with scalable long-term memory
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Weinan Dai, Hanlin Wu, Qiying Yu, Huan ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, and Hao Zhou. 2026. https://arxiv.org/abs/2602.24286 Cuda agent: Large-scale agentic rl for high-performance cuda kernel generation
-
[10]
Tri Dao. 2023. https://arxiv.org/abs/2307.08691 Flashattention-2: Faster attention with better parallelism and work partitioning . Preprint, arXiv:2307.08691
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
DeepSeek-AI . 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf Deepseek-v4: Towards highly efficient million-token context intelligence . Technical report. Accessed: 2026-05-18
2026
- [12]
-
[13]
Thomas Faingnaert, Tim Besard, and Bjorn De Sutter. 2021. Flexible performant gemm kernels on gpus. IEEE Transactions on Parallel and Distributed Systems, 33(9):2230--2248
2021
- [14]
-
[15]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. https://arxiv.org/abs/2401.14196 Deepseek-coder: When the large language model meets programming -- the rise of code intelligence
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [17]
- [18]
- [19]
-
[20]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. https://arxiv.org/abs/2005.11401 Retrieval-augmented generation for knowledge-intensive nlp tasks
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[21]
Xiuyu Li, Jinkai Zhang, Mingyang Yi, Yu Li, Longqiang Wang, Yue Wang, and Ju Fan. 2026 a . Ets: Energy-guided test-time scaling for training-free rl alignment. arXiv preprint arXiv:2601.21484
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[22]
Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, and Tieying Zhang. 2026 b . https://arxiv.org/abs/2602.00994 Reasoning and tool-use compete in agentic rl:from quantifying interference to disentangled tuning
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[23]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, and 1 others. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others. 2025. Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. 2025. Towards efficient generative large language model serving: A survey from algorithms to systems. ACM Computing Surveys, 58(1):1--37
2025
-
[26]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Cand \`e s, and Tatsunori B Hashimoto. 2025. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286--20332
2025
-
[27]
Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. 2025. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 16(5):1--72
2025
-
[28]
Alexander Novikov, Ng \^a n V \ u , Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, and 1 others. 2025. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
KernelBench: Can LLMs Write Efficient GPU Kernels?
Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. https://arxiv.org/abs/2502.10517 Kernelbench: Can llms write efficient gpu kernels?
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. https://arxiv.org/abs/2310.08560 Memgpt: Towards llms as operating systems
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, and 7 others. 2024. https://arxiv.org/abs/2308.12950 Code ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. https://arxiv.org/abs/2303.11366 Reflexion: Language agents with verbal reinforcement learning
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10--19
2019
-
[34]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. https://arxiv.org/abs/1706.03762 Attention is all you need
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[35]
Jiaxing Wang, Deping Xiang, Jin Xu, Mingyang Yi, Guoqiang Gong, Zicheng Zhang, Haoran Li, Pengzhang Liu, Zhen Chen, Ke Zhang, and 1 others. 2026. Tandem: Bi-level data mixture optimization with twin networks. Advances in Neural Information Processing Systems, 38:144720--144752
2026
- [36]
- [37]
-
[38]
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. https://arxiv.org/abs/2502.12110 A-mem: Agentic memory for llm agents
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and 1 others. 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving. Proceedings of Machine Learning and Systems, 7
2025
- [40]
-
[41]
Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2023. https://arxiv.org/abs/2006.06762 Ansor: Generating high-performance tensor programs for deep learning . Preprint, arXiv:2006.06762
-
[42]
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2023. https://arxiv.org/abs/2305.10250 Memorybank: Enhancing large language models with long-term memory
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. 2022. https://www.usenix.org/conference/osdi22/presentation/zhu ROLLER : Fast and efficient tensor compilation for deep learning . In 16th USENIX Symposium on Operating Syst...
2022
-
[44]
Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, and Ling Li. 2025. https://arxiv.org/abs/2511.20100 Qimeng-kernel: Macro-thinking micro-coding paradigm for llm-based high-performance gpu kernel generation . Preprint, arXiv:2511.20100
-
[45]
online" 'onlinestring :=
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
-
[46]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.