
arxiv: 2607.02370 · v1 · pith:ADUVNSRTnew · submitted 2026-07-02 · 💻 cs.SE · cs.AI

Understanding Agent-Based Patching of Compiler Missed Optimizations

Pith reviewed 2026-07-03 08:32 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords agent-based patching · compiler missed optimizations · LLVM · patch generalization · optimization scope · historical knowledge augmentation · coding agents · pull request retrieval

The pith

Coding agents often optimize specific LLVM missed optimization cases but produce patches whose scope only partially matches or overlaps with developer-intended changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how well coding agents can patch missed optimizations in compilers, where the core difficulty is generalizing from a reported case to similar ones rather than fixing only the immediate example. It constructs a benchmark from real-world LLVM issues and directly compares the optimization scope of agent patches against those written by developers. Results indicate that agents frequently optimize the given example, yet many patches cover only part of the intended scope, overlap with it only partially, or sometimes extend beyond the reference. The work also tests augmentation methods that retrieve and distill knowledge from prior LLVM optimization pull requests, finding that these improve alignment with developer generalization patterns and deliver practical benefits on real-world IR.

Core claim

Patching a compiler missed optimization requires generalizing beyond the reported case to cover similar situations. On a benchmark of real-world LLVM missed optimization issues, coding agents commonly optimize the supplied examples, but many generated patches cover only part of the developer-intended scope, partially overlap with it, or in some cases generalize beyond the reference patch. Augmentation techniques that leverage historical LLVM optimization pull requests via retrieval and distillation measurably increase the degree of developer-aligned generalization and produce practical improvements when applied to real-world IR.

What carries the argument

Comparison of optimization scope between agent-generated patches and developer reference patches on a benchmark of real-world LLVM missed optimization issues.
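
One way to picture the scope comparison is as a relation between two sets: the test inputs that the agent patch optimizes and the test inputs that the developer patch optimizes. The sketch below is illustrative only. The page does not describe the paper's actual classification procedure, so the category names, the `classify_scope` function, and the idea of representing scope as a set of test identifiers are all assumptions made for this example.

```python
from enum import Enum


class ScopeRelation(Enum):
    """Hypothetical labels for how an agent patch's scope relates to the reference."""
    EQUIVALENT = "equivalent"
    PARTIAL_COVERAGE = "partial coverage"   # agent handles a strict subset of reference cases
    BEYOND_REFERENCE = "beyond reference"   # agent handles every reference case and more
    PARTIAL_OVERLAP = "partial overlap"     # each patch handles cases the other misses
    DISJOINT = "disjoint"                   # no shared optimized cases


def classify_scope(agent_hits: set[str], reference_hits: set[str]) -> ScopeRelation:
    """Compare the sets of test inputs that each patch manages to optimize.

    Assumption: each string identifies one test input (for example one IR
    function) that the patched compiler optimizes.
    """
    if agent_hits == reference_hits:
        return ScopeRelation.EQUIVALENT
    if not (agent_hits & reference_hits):
        return ScopeRelation.DISJOINT
    if agent_hits < reference_hits:
        return ScopeRelation.PARTIAL_COVERAGE
    if agent_hits > reference_hits:
        return ScopeRelation.BEYOND_REFERENCE
    return ScopeRelation.PARTIAL_OVERLAP


# Hypothetical test identifiers, not taken from the paper's benchmark.
print(classify_scope({"t1"}, {"t1", "t2"}).value)
print(classify_scope({"t1", "t3"}, {"t1", "t2"}).value)
```

Under this framing, a patch that fixes only the reported example lands in the partial-coverage bucket whenever the reference patch also handles neighbouring cases.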

If this is right

  • Agents can generate initial patches for missed optimizations but require additional mechanisms to ensure full scope alignment with developer intent.
  • Retrieval and distillation of prior pull requests measurably improve how well agent patches match the generalization level chosen by developers.
  • The same augmentation approach yields measurable benefits when the resulting patches are applied to real-world LLVM intermediate representation.
  • Patching tasks that involve generalization beyond a single example remain a distinct challenge even when the agent succeeds on the reported case.
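
The retrieval step mentioned above can be sketched as ranking prior pull request descriptions by similarity to a new issue. This is a minimal sketch under stated assumptions: it uses plain bag-of-words cosine similarity as a stand-in, which is not necessarily what the paper uses, and the pull request identifiers and descriptions are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(issue: str, prior_prs: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k prior PRs most similar to the issue text."""
    query = tokenize(issue)
    ranked = sorted(
        prior_prs,
        key=lambda pr_id: cosine(query, tokenize(prior_prs[pr_id])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical PR ids and descriptions, invented for this example.
prior_prs = {
    "pr-a": "fold icmp of masked value using and with constant range",
    "pr-b": "vectorize loop with reduction of floating point add",
    "pr-c": "simplify select of icmp eq with zero",
}
issue = "missed fold of icmp ult with masked and operand"
print(retrieve(issue, prior_prs))
```

The retrieved descriptions would then be placed in the agent's context; how the paper distills them into reusable knowledge is not described on this page.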

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent systems for code editing may benefit from explicit scope-inference steps that go beyond example-level fixes.
  • The partial-overlap pattern observed here could appear in other maintenance domains where changes must apply to families of similar code rather than isolated instances.
  • Re-running the evaluation on LLVM issues reported after the benchmark construction date would test whether the observed generalization gap persists over time.

Load-bearing premise

The constructed benchmark of real-world LLVM missed optimization issues sufficiently represents the generalization requirements that human developers apply when patching.

What would settle it

Collect a fresh set of LLVM missed optimization reports not used in the original benchmark, have the same agents generate patches, and measure whether the distribution of scope coverage (partial, overlapping, beyond-reference) matches the statistics reported in the paper.
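
A simple way to compare the two distributions is the total variation distance between the category proportions of the original and the fresh benchmark. The sketch below assumes each patch has already been given one scope label; the label names and the counts are hypothetical and are not results from the paper. A statistical test would be needed to judge significance, which this sketch does not attempt.

```python
from collections import Counter


def proportions(labels: list[str]) -> dict[str, float]:
    """Turn a list of per-patch scope labels into category proportions."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint ones."""
    keys = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical labels for illustration only.
original = ["partial"] * 2 + ["overlap"] * 2
fresh = ["partial"] * 3 + ["beyond"] * 1
print(total_variation(proportions(original), proportions(fresh)))
```

A distance near zero on a genuinely fresh set of issues would support the reported pattern; a large distance would suggest the benchmark's distribution does not carry over.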

Figures

Figures reproduced from arXiv: 2607.02370 by Batu Guan, Shaohua Li, Zirui Wang.

Figure 1: Illustration of how LLVM repository corresponds to benchmark components.
Figure 2: Fuzz-based generalization assessment.
Figure 3: Workflow of RAG- and distillation-based augmentation.
Figure 4: Outcome transitions under baseline and different augmentation strategies. Issues whose generated patches fail to compile …
Figure 5: Accumulated wins and losses of augmented patches …
Figure 6: Project-level cumulative optimization hits of baseline, …
Figure 9: An LLM-generated test that golden patch does not handle.
  Extracted figure text:
  1. Retrieved context
       define i1 @src(i32 %x) {
         %and = and i32 %x, -8
         %cmp = icmp ult i32 %and, 1
         ret i1 %cmp
       }
       define i1 @tgt(i32 %x) {
         %cmp = icmp ult i32 %x, 8
         ret i1 %cmp
       }
     The key idea is to reason about the value range represented by the masked expression.
  2. Agent reasoning
       icmp ult x, 5 -> x in [0, 5)
       icmp eq (x & -2), 2 -> x in [2, 4)
       Since [2…
Figure 10: How RAG guides the agent from retrieved masked-comparison knowledge to range-based patch generation. The …
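
The `@src` / `@tgt` pair and the two range statements in the Figure 9 text can be sanity-checked by brute force. The sketch below is an illustration, not the paper's validation method: it models the comparisons at 8 bits rather than the `i32` used in the figure so that every input can be enumerated, and the helper names are invented here. What the agent concludes from the ranges is truncated in the caption, so the sketch only checks the ranges themselves.

```python
BITS = 8                      # assumption: reduced width so all inputs can be enumerated
MASK = (1 << BITS) - 1


def to_unsigned(value: int) -> int:
    """Two's-complement bit pattern of value at BITS width."""
    return value & MASK


def src(x: int) -> bool:
    # icmp ult (and x, -8), 1
    return (x & to_unsigned(-8)) < 1


def tgt(x: int) -> bool:
    # icmp ult x, 8
    return x < 8


# The two functions agree on every 8-bit input.
equivalent = all(src(x) == tgt(x) for x in range(1 << BITS))

# Inputs satisfying icmp eq (x & -2), 2, stated in the figure as x in [2, 4).
masked_eq = {x for x in range(1 << BITS) if (x & to_unsigned(-2)) == 2}

print(equivalent, sorted(masked_eq))
```

Exhaustive enumeration only works at small widths; for real `i32` IR a translation validator would be the appropriate tool.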
Original abstract

Compiler missed optimizations refer to cases in which compilers failed to optimize certain code. It takes many compiler developers' efforts to implement or patch such missed optimizations. In this paper, we present a systematic study of how well agents patch compiler missed optimizations. We identify a significant challenge that patching a missed optimization requires more than just fixing the reported case, and instead requires generalizing to similar cases. We construct a benchmark of real-world LLVM missed optimization issues and compare agent-generated patches with patches from developers in terms of optimization scope. Our results show that coding agents often optimize the given examples, but many generated patches either cover only part of the developer-intended scope or partially overlap with it; in some cases, they further generalize beyond the reference patch. We further introduce historical-knowledge augmentation techniques that leverage prior LLVM optimization pull requests through retrieval and distillation, showing that they improve developer-aligned generalization and yield practical benefits when applied to real-world IR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a systematic study of coding agents patching real-world LLVM missed optimizations. It constructs a benchmark from developer-reported issues, compares agent-generated patches to reference developer patches on optimization scope (finding frequent partial coverage, partial overlap, or over-generalization by agents), and proposes historical-knowledge augmentation via retrieval and distillation from prior LLVM PRs that improves alignment with developer scope and yields practical benefits on IR.

Significance. If the results hold, the work usefully documents generalization challenges for agents on compiler tasks and shows that retrieval/distillation from historical patches can measurably improve developer-aligned scope; the augmentation techniques constitute a concrete, reusable contribution that could inform agent tooling for optimization-related SE tasks.

major comments (2)
  1. [Benchmark Construction] The central evaluation treats the chosen developer reference patches as defining the correct optimization scope (partial coverage, overlap, or over-generalization), yet the manuscript provides no independent validation or inter-rater study confirming that these references match the generalization decisions human compiler developers would typically make on similar cases. This assumption is load-bearing for all reported mismatch rates.
  2. [§4] §4 (Evaluation) and the abstract state quantitative results on scope but supply no details on benchmark construction criteria, exact scope-classification procedure, statistical methods, controls for patch size, or inter-annotator agreement; without these the data-to-claim link cannot be assessed.
minor comments (2)
  1. Clarify the precise definition and operationalization of 'optimization scope' and 'generalize beyond the reference patch' with examples or pseudocode.
  2. The paper would benefit from an explicit limitations subsection discussing selection bias in the LLVM issues chosen for the benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our benchmark and evaluation methodology. We will revise the manuscript to address these points by expanding the relevant sections with additional details and discussion.

Point-by-point responses
  1. Referee: [Benchmark Construction] The central evaluation treats the chosen developer reference patches as defining the correct optimization scope (partial coverage, overlap, or over-generalization), yet the manuscript provides no independent validation or inter-rater study confirming that these references match the generalization decisions human compiler developers would typically make on similar cases. This assumption is load-bearing for all reported mismatch rates.

    Authors: Developer patches from real LLVM pull requests serve as our reference because they embody the generalization decisions made by experienced compiler engineers in practice. We acknowledge the absence of a separate inter-rater study with multiple independent developers. In the revision we will add an explicit subsection in §4 discussing this design choice, its rationale, and the associated limitations, while retaining the developer patches as the primary reference. revision: partial

  2. Referee: [§4] §4 (Evaluation) and the abstract state quantitative results on scope but supply no details on benchmark construction criteria, exact scope-classification procedure, statistical methods, controls for patch size, or inter-annotator agreement; without these the data-to-claim link cannot be assessed.

    Authors: We will substantially expand §4 (and update the abstract if needed) to document: (i) the precise criteria used to select the benchmark issues from the LLVM issue tracker, (ii) the step-by-step scope-classification procedure with definitions and examples for partial coverage, partial overlap, and over-generalization, (iii) the statistical methods and tests applied, (iv) any controls or matching performed for patch size, and (v) clarification on the annotation process (including whether multiple annotators were used and any agreement measures). These additions will make the evaluation fully reproducible and the link from data to claims transparent. revision: yes

Circularity Check

0 steps flagged

No circularity detected in empirical evaluation

Full rationale

The paper is an empirical study comparing agent-generated patches to developer patches on a constructed benchmark of real-world LLVM missed optimizations. No equations, derivations, fitted parameters, or self-definitional constructs are present in the provided text. The evaluation uses developer patches as the reference scope by design for the comparison task, which does not constitute a reduction to inputs by construction under the specified circularity patterns. No load-bearing self-citations, uniqueness theorems, or ansatzes are identified. The work is self-contained as an observational analysis against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5684 in / 965 out tokens · 19700 ms · 2026-07-03T08:32:32.434014+00:00 · methodology

