Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization
Pith reviewed 2026-05-08 04:38 UTC · model grok-4.3
The pith
Optimas automates performance optimization by guiding LLMs with analytics to generate correct, faster code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optimas generates 100% correct code and improves performance in over 98.82% of 3,410 experiments on 10 benchmarks and two HPC mini-applications, with average gains of 8.02% to 79.09% on NVIDIA GPUs, by using LLMs to map performance diagnostics to literature-backed transformations in a unified pipeline.
What carries the argument
Optimas's multi-agent workflow, which unifies performance-insight extraction, LLM-based code generation from diagnostics, execution, and validation in a single pipeline.
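A minimal sketch of how such a pipeline could be wired together, assuming the insight-extraction, generation, and execution agents are supplied as callables. Every name below (Insight, RunResult, optimize_once) is hypothetical; the paper does not publish its agent interfaces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Insight:
    metric: str          # e.g., a profiler counter such as achieved occupancy
    diagnosis: str       # e.g., "uncoalesced global memory loads"
    transformation: str  # the literature-backed fix the LLM should apply

@dataclass
class RunResult:
    runtime_s: float     # wall time of one execution
    output: bytes        # program output used for the correctness check

def optimize_once(
    source: str,
    reports: list[str],
    extract: Callable[[list[str]], list[Insight]],  # agent: profiler reports -> insights
    generate: Callable[[str, list[Insight]], str],  # agent: LLM rewrite guided by insights
    execute: Callable[[str], RunResult],            # agent: compile and run the code
) -> str:
    """One pass of the insight -> generate -> execute -> validate loop."""
    baseline = execute(source)
    candidate = generate(source, extract(reports))
    result = execute(candidate)
    if result.output != baseline.output:  # validation gate: reject any behavioral change
        return source
    # Accept the rewrite only if it is both correct and faster.
    return candidate if result.runtime_s < baseline.runtime_s else source
```

The design point worth noting is that validation gates acceptance: a candidate that changes program output is rejected regardless of how fast it runs.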
Load-bearing premise
That large language models can reliably map performance diagnostics to correct code transformations without missing edge cases or introducing subtle bugs.
What would settle it
Running the framework on additional, previously untested applications and observing cases of incorrect code or zero performance gain would falsify the reliability claims.
Original abstract
Large language models (LLMs) show promise for automated code optimization. However, without performance context, they struggle to produce correct and effective code transformations. Existing performance tools can identify bottlenecks but stop short of generating actionable code changes. Consequently, performance optimization continues to be a time-intensive and manual endeavor, typically undertaken only by experts with detailed architectural understanding. To bridge this gap, we introduce Optimas, a modular, fully automated, end-to-end generative AI framework built on a multi-agent workflow. Optimas uses LLMs to map performance diagnostics from multiple reports to established, literature-backed code transformations, while unifying insight extraction, code generation, execution, and validation within a single pipeline. Across 3,410 real-world experiments on 10 benchmarks and two HPC mini-applications, Optimas generates 100% correct code and improves performance in over 98.82% of those experiments, achieving average gains of 8.02%-79.09% on NVIDIA GPUs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Optimas, a modular multi-agent generative AI framework that uses LLMs to map performance diagnostics from multiple reports to established, literature-backed code transformations. It unifies insight extraction, code generation, execution, and validation in a single pipeline. The central empirical claim is that across 3,410 experiments on 10 benchmarks and two HPC mini-applications, the framework generates 100% correct code and improves performance in over 98.82% of cases, with average gains of 8.02%-79.09% on NVIDIA GPUs.
Significance. If the reported correctness and performance results hold under rigorous validation, Optimas could meaningfully advance automated performance optimization in HPC and GPU computing by bridging diagnostic tools with generative models. The modular multi-agent design and grounding in literature-backed transformations are strengths that support reproducibility and credibility. The scale of the experimental evaluation (3,410 runs) is notable and, if properly documented, would strengthen the case for practical impact.
Major comments (2)
- [Abstract] The claim of 100% correct code generation across 3,410 experiments is load-bearing for the central contribution. The abstract states that Optimas unifies generation with validation but provides no description of the validation procedure (e.g., full test-suite execution, differential output checking on held-out inputs, compilation/runtime success only, or semantic equivalence checking); one such procedure is sketched after this list. Without this detail, subtle behavioral alterations cannot be ruled out, directly undermining both the correctness rate and the reported performance gains.
- [Abstract, Experimental Evaluation] No information is given on how the LLM agents select specific transformations from the literature or on the baselines used for comparison (e.g., standard compiler flags, existing auto-tuners, or manual expert optimizations). This absence makes it impossible to assess whether the reported gains of 8.02%-79.09% are attributable to the framework or to other factors, weakening the evaluation of the approach's effectiveness.
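For concreteness, here is a minimal sketch of the second validation option named above, differential output checking on held-out inputs. It assumes both binaries print whitespace-separated floating-point values to stdout; the function name, the I/O convention, and the tolerance are illustrative, not drawn from the manuscript.

```python
import subprocess
import numpy as np

def outputs_match(binary_a: str, binary_b: str, inputs: list[str],
                  rtol: float = 1e-5) -> bool:
    """Run both binaries on every held-out input and compare their outputs
    within a floating-point tolerance."""
    for inp in inputs:
        out_a = subprocess.run([binary_a, inp], capture_output=True,
                               text=True, check=True).stdout
        out_b = subprocess.run([binary_b, inp], capture_output=True,
                               text=True, check=True).stdout
        a = np.array([float(tok) for tok in out_a.split()])
        b = np.array([float(tok) for tok in out_b.split()])
        # Shape mismatch or any element outside tolerance counts as a failure.
        if a.shape != b.shape or not np.allclose(a, b, rtol=rtol):
            return False
    return True
```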
Minor comments (1)
- The manuscript should include full details on the specific LLM models, prompt templates, temperature settings, and exact benchmark versions used to enable reproducibility of the 3,410 experiments.
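One lightweight way to satisfy this comment would be to archive a structured metadata record with every run. A sketch follows; all field names and values are placeholders, since the manuscript discloses none of these details.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ExperimentConfig:
    llm_model: str         # exact, pinned model identifier
    prompt_template: str   # full template text, or a hash of it
    temperature: float     # sampling temperature used for generation
    benchmark: str         # benchmark or mini-application name
    benchmark_version: str # exact version or commit of the benchmark
    gpu: str               # device the measurement was taken on

cfg = ExperimentConfig(
    llm_model="<undisclosed>",
    prompt_template="<undisclosed>",
    temperature=0.0,               # placeholder; not reported
    benchmark="<benchmark name>",
    benchmark_version="<undisclosed>",
    gpu="<NVIDIA GPU model>",      # paper says only "NVIDIA GPUs"
)
print(json.dumps(asdict(cfg), indent=2))  # archive alongside each of the 3,410 runs
```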
Simulated Author's Rebuttal
We appreciate the referee's positive evaluation of Optimas's potential impact and the comprehensive nature of our experiments. We have reviewed the major comments and will make revisions to enhance the clarity of the abstract and experimental details as outlined below.
Point-by-point responses
Referee: The claim of 100% correct code generation across 3,410 experiments is load-bearing for the central contribution. The abstract states that Optimas unifies generation with validation but provides no description of the validation procedure (e.g., full test-suite execution, differential output checking on held-out inputs, compilation/runtime success only, or semantic equivalence checking). Without this detail, subtle behavioral alterations cannot be ruled out, directly undermining both the correctness rate and the reported performance gains.
Authors: We agree that the abstract would benefit from a brief description of the validation procedure to support the 100% correctness claim. The manuscript describes validation as an integrated step in the pipeline involving code execution and verification for correctness. We will revise the abstract to include a concise summary of this procedure, specifying that it encompasses test execution and equivalence checking. This change will clarify how behavioral alterations are ruled out and strengthen the central claim. revision: yes
Referee: No information is given on how the LLM agents select specific transformations from the literature or on the baselines used for comparison (e.g., standard compiler flags, existing auto-tuners, or manual expert optimizations). This absence makes it impossible to assess whether the reported gains of 8.02%-79.09% are attributable to the framework or to other factors, weakening the evaluation of the approach's effectiveness.
Authors: We acknowledge that explicit details on the transformation selection process and baselines would improve the evaluation. The multi-agent design maps diagnostics to literature-backed transformations via agent reasoning, and experiments use standard compiler optimizations as a baseline. We will revise the abstract and expand the experimental evaluation section to describe the selection mechanism and explicitly list the baselines, enabling better assessment of the reported gains. revision: yes
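A minimal sketch of the baseline comparison described in this response, under the assumption that the original and the transformed source are built with the same standard optimization flags (here nvcc -O3, an assumed flag set) and that "gain" means relative runtime reduction. File names are hypothetical, and end-to-end wall time stands in for kernel-level profiling.

```python
import subprocess
import statistics
import time

def build(src: str, out: str) -> None:
    # Same compiler and flags for baseline and candidate, so the measured
    # difference is attributable to the source transformation alone.
    subprocess.run(["nvcc", "-O3", src, "-o", out], check=True)

def median_runtime(binary: str, reps: int = 10) -> float:
    """Median of repeated end-to-end runs, to damp timing noise."""
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        subprocess.run([binary], check=True)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

build("kernel_baseline.cu", "./baseline")   # original source
build("kernel_optimas.cu", "./optimized")   # LLM-transformed source
t_base = median_runtime("./baseline")
t_opt = median_runtime("./optimized")
print(f"gain: {100 * (t_base - t_opt) / t_base:.2f}%")  # the paper's % framing
```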
Circularity Check
No circularity: results are direct empirical measurements
Full rationale
The paper describes a multi-agent LLM framework for code optimization and reports outcomes from 3,410 direct experiments on benchmarks and mini-applications. Claims of 100% correctness and performance gains are presented as observed results from running the pipeline, not as predictions or derivations that reduce to fitted parameters, self-definitions, or self-citation chains. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear in the abstract or described methodology. The evaluation is self-contained against external benchmarks and test suites, satisfying the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: large language models can map performance diagnostics to established code transformations without errors.