pith. machine review for the scientific record.

arxiv: 2604.12147 · v2 · submitted 2026-04-13 · 💻 cs.SE · cs.AI · cs.CL

Recognition: unknown

Evaluating Plan Compliance in Autonomous Programming Agents

Jatin Ganhotra, Martin Hirzel, Reyhaneh Jabbarvand, Saman Dehghan, Shuyang Liu


Pith reviewed 2026-05-10 14:48 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords plan compliance · autonomous agents · programming agents · software engineering · LLM agents · task planning · agent trajectories · issue resolution

The pith

Programming agents often ignore given plans but resolve more issues when periodically reminded to follow them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines 16,991 trajectories generated by autonomous programming agents as they attempt to fix software issues. It shows that agents without an explicit plan fall back on incomplete or inconsistent workflows they internalized during training. Supplying a standard plan raises success rates, and inserting periodic reminders further cuts violations and lifts performance. A mismatched or subpar plan, however, drives success below the no-plan baseline, and adding extra early phases can hurt results when they clash with the agent's internal approach. The findings point to a training gap: models memorize task-specific workflows rather than learning to follow an instructed plan as guidance.

Core claim

Without an explicit plan, agents fall back on workflows internalized during training, which are often incomplete, overfit, or inconsistently applied. Providing the standard plan improves issue resolution, and periodic plan reminders can mitigate plan violations and improve task success. A subpar plan hurts performance even more than no plan at all. Augmenting a plan with additional task-relevant phases in the early stage can degrade performance, particularly when these phases do not align with the model's internal problem-solving strategy.
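
The reminder mechanism itself is simple to picture. Below is a minimal sketch of interleaving periodic plan reminders into a reason-act-observe loop; the agent interface, message schema, plan wording, and cadence are all illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: periodic plan reminders in a reason-act-observe loop.
# The agent methods (act, execute, is_done), the message schema, and the
# reminder cadence are assumptions for illustration, not the paper's code.

PLAN = (
    "Follow these phases in order: (N)avigate to the relevant files, "
    "(R)eproduce the issue, (P)atch the code, (V)alidate the fix."
)

def run_with_reminders(agent, task, max_steps=50, remind_every=10):
    """Run an agent loop, re-injecting the plan every `remind_every` steps."""
    messages = [
        {"role": "system", "content": PLAN},
        {"role": "user", "content": task},
    ]
    for step in range(1, max_steps + 1):
        if step % remind_every == 0:
            # The reminder: repeat the plan so it stays in recent context
            # instead of drifting out of the window as the trajectory grows.
            messages.append({"role": "user", "content": "Reminder: " + PLAN})
        action = agent.act(messages)         # reason, then choose an action
        observation = agent.execute(action)  # run the action in the sandbox
        messages.append({"role": "assistant", "content": action})
        messages.append({"role": "user", "content": observation})
        if agent.is_done(observation):
            break
    return messages
```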

What carries the argument

Systematic tracking of plan compliance across 16,991 agent trajectories under eight different plan variations.
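
The figure captions below reference compliance metrics (PPF, PPC, POC) whose definitions do not survive this rendering. As a rough illustration of what tracking compliance over a phase-labeled trajectory can look like, here is a sketch of two stand-in metrics, phase coverage and order compliance; the phase names and both definitions are assumptions, not the paper's.

```python
# Illustrative stand-in compliance metrics over a phase-labeled trajectory.
# The plan's phases and both metric definitions are assumptions for this
# sketch; they are not the paper's PPF/PPC/POC.

PLAN_ORDER = ["navigate", "reproduce", "patch", "validate"]

def phase_coverage(phases):
    """Fraction of planned phases that occur at least once."""
    seen = set(phases)
    return sum(p in seen for p in PLAN_ORDER) / len(PLAN_ORDER)

def order_compliance(phases):
    """Longest common subsequence between the order of first occurrences
    and the planned order, normalized by plan length."""
    firsts = []
    for p in phases:
        if p in PLAN_ORDER and p not in firsts:
            firsts.append(p)
    m, n = len(firsts), len(PLAN_ORDER)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # classic LCS table
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1
                if firsts[i] == PLAN_ORDER[j]
                else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[m][n] / n

# An agent that patched before navigating and never reproduced the issue:
trajectory = ["patch", "navigate", "validate"]
print(phase_coverage(trajectory))   # 0.75: "reproduce" never occurs
print(order_compliance(trajectory)) # 0.5: only 2 of 4 phases in plan order
```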

If this is right

  • Providing a standard plan improves issue resolution compared to running without any plan.
  • Periodic reminders reduce plan violations and raise overall task success.
  • A poorly aligned plan lowers performance below the level achieved with no plan at all.
  • Adding extra phases early in a plan can decrease success when those phases conflict with the agent's internal strategy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Benchmark results may overstate agent reasoning ability when plan adherence is not separately verified.
  • Training methods could be designed to reward adaptive plan use instead of workflow memorization.
  • Agent evaluation suites might add compliance checks as a standard reporting requirement to separate strategic execution from other factors.

Load-bearing premise

Plan compliance can be measured reliably and objectively from the sequence of actions in agent trajectories alone.
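
To make this premise concrete: a trajectory-only measurement needs some mapping from raw actions to plan phases. The sketch below shows a deterministic rule-based mapper of the general kind such a pipeline might use; every regex, the phase taxonomy, and the fallback label are invented for illustration and are surely cruder than whatever the paper uses.

```python
import re

# Hypothetical rule-based action -> phase mapper. The regexes and the
# "other" fallback are assumptions for this sketch, not the paper's rules.
PHASE_RULES = [
    ("reproduce", re.compile(r"reproduce|repro\.py|python .*bug")),
    ("validate",  re.compile(r"pytest|unittest|tox|make test")),
    ("patch",     re.compile(r"str_replace|edit|apply_patch|git diff")),
    ("navigate",  re.compile(r"\bls\b|\bcat\b|grep|find|search|open ")),
]

def classify_action(command: str) -> str:
    """Map one shell/tool command to a plan phase; first matching rule wins."""
    for phase, pattern in PHASE_RULES:
        if pattern.search(command):
            return phase
    return "other"  # ambiguous cases could go to a secondary (LLM) judge

def trajectory_phases(commands):
    """Phase-label an entire trajectory of commands."""
    return [classify_action(c) for c in commands]

print(trajectory_phases([
    "grep -rn 'ValueError' src/",
    "python reproduce.py",
    "str_replace src/core.py ...",
    "pytest tests/test_core.py",
]))  # ['navigate', 'reproduce', 'patch', 'validate']
```

The fragility of any such mapper is exactly why the premise is load-bearing: a command like `python reproduce.py && pytest` straddles two phases, and every tie-breaking rule bakes an assumption into the compliance numbers.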

What would settle it

A set of human-annotated trajectories that shows compliance rates differing substantially from the automated measurement used in the study.
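
Scoring such a comparison is mechanical once the annotations exist. A minimal sketch of inter-rater agreement via Cohen's κ between an automated mapper and a human annotator, assuming one phase label per action (the label sequences below are invented):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters emit the same label at random.
    expected = sum(
        counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    return (observed - expected) / (1 - expected)

auto  = ["navigate", "reproduce", "patch", "patch", "validate", "other"]
human = ["navigate", "reproduce", "patch", "navigate", "validate", "other"]
print(round(cohens_kappa(auto, human), 3))  # 0.793
```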

Figures

Figures reproduced from arXiv:2604.12147 by Jatin Ganhotra, Martin Hirzel, Reyhaneh Jabbarvand, Saman Dehghan, Shuyang Liu.

Figure 1: Illustrative examples of agent trajectories and their corresponding …
Figure 2: Standard plan compliance metrics and success rate for studied trajectories across all models.
Figure 3: Phase flow analysis under Standard plan (…).
Figure 4: Phase flow analysis under No Plan setting. Agents still exhibit traces of Standard plan phases (…).
Figure 5: Impact of No Plan setting on the success rate.
Figure 6: Phase flow analysis under No Reproduction (…).
Figure 7: Impact of No Reproduction plan (a) and No Validation plan (b) on the success rate.
Figure 9: No Validation plan compliance metrics. In contrast, DeepSeek models achieve nearly-perfect PPF values, indicating that their trajectories do not observe Reproduction. However, they suffer the most from lower PPC and POC, confirmed by phase flow analysis. Similarly, these two models suffer from an incomplete plan, reflected in the drop in success rate. DeepSeek-R1's performance drop is more substantial. …
Figure 10: Phase flow analysis under Regression Testing (…).
Figure 11: Impact of Regression Testing (a) and Summary plan (b) on the success rate.
Figure 12: Regression Testing plan compliance metrics.
Figure 13: Summary plan compliance metrics. …
Figure 14: Phase flow analysis under Reordered (L★(Φ) = N P R V) and Reminded (L★(Φ) = N R P V) settings.
Figure 15: Impact of Reordered plan (a) and Reminded plan (b) on the success rate.
Figure 16: Reordered plan compliance metrics.
Figure 17: Reminded plan compliance metrics.
Figure 18: Phase flow analysis under Standard plan for SWE-bench Pro (…).
Figure 19: Compliance metrics on SWE-bench Verified …
Original abstract

Agents aspire to eliminate the need for task-specific prompt crafting through autonomous reason-act-observe loops. Still, they are commonly instructed to follow a task-specific plan for guidance, e.g., to resolve software issues following phases for navigation, reproduction, patch, and validation. Unfortunately, it is unknown to what extent agents actually follow such instructed plans. Without such an analysis, determining the extent agents comply with a given plan, it is impossible to assess whether a solution was reached through correct strategic reasoning or through other means, e.g., data contamination or overfitting to a benchmark. This paper presents the first extensive, systematic analysis of plan compliance in programming agents, examining 16,991 trajectories from SWE-agent across four LLMs on SWE-bench Verified and SWE-bench Pro under eight plan variations. Without an explicit plan, agents fall back on workflows internalized during training, which are often incomplete, overfit, or inconsistently applied. Providing the standard plan improves issue resolution, and we observe that periodic plan reminders can mitigate plan violations and improve task success. A subpar plan hurts performance even more than no plan at all. Surprisingly, augmenting a plan with additional task-relevant phases in the early stage can degrade performance, particularly when these phases do not align with the model's internal problem-solving strategy. These findings highlight a research gap: fine-tuning paradigms that teach models to follow instructed plans, rather than encoding task-specific plans in them. This requires teaching models to reason and act adaptively, rather than memorizing workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents the first large-scale empirical analysis of plan compliance in autonomous programming agents. Using 16,991 trajectories generated by SWE-agent across four LLMs on SWE-bench Verified and SWE-bench Pro under eight plan variations, it finds that agents without an explicit plan default to incomplete or overfit internalized workflows; a standard plan improves issue resolution; periodic reminders reduce violations and increase success; a subpar plan degrades performance more than no plan; and augmenting plans with misaligned early phases can further harm results. The work concludes by identifying a gap in fine-tuning paradigms that would teach adaptive plan following rather than workflow memorization.

Significance. If the directional findings on plan effects and compliance hold after methodological strengthening, the study would provide concrete evidence that explicit plans are not merely optional scaffolding but can be performance-critical (or detrimental) in agentic software engineering. It would also motivate new training objectives focused on instructed-plan adherence, with potential implications for benchmark design and agent reliability on tasks like issue resolution.

major comments (2)
  1. [Methods / Evaluation of plan compliance] The central claims rest on quantifying plan compliance and violations from trajectories alone. The manuscript does not report how agent actions (commands, edits, observations) are mapped to discrete phases (navigation, reproduction, patch, validation), whether this mapping uses heuristics, LLM classifiers, or other rules, or any cross-validation against human judgments or model internals. Without such validation, correlations between compliance rates and success rates could reflect benchmark-specific patterns or inconsistent phase boundaries rather than genuine strategic adherence.
  2. [Results] The claim that a subpar plan hurts performance more than no plan at all, and that adding misaligned early phases degrades results, is load-bearing for the practical takeaway. The results section should include per-condition success rates with confidence intervals, statistical tests for the eight plan variations, and controls for multiple comparisons; the current directional statements would be strengthened by these details.
minor comments (1)
  1. [Abstract] The abstract contains a slightly awkward sentence ('Without such an analysis, determining the extent agents comply with a given plan, it is impossible...') that could be rephrased for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us strengthen the methodological transparency and statistical rigor of our work. We address each major comment below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Methods / Evaluation of plan compliance] The central claims rest on quantifying plan compliance and violations from trajectories alone. The manuscript does not report how agent actions (commands, edits, observations) are mapped to discrete phases (navigation, reproduction, patch, validation), whether this mapping uses heuristics, LLM classifiers, or other rules, or any cross-validation against human judgments or model internals. Without such validation, correlations between compliance rates and success rates could reflect benchmark-specific patterns or inconsistent phase boundaries rather than genuine strategic adherence.

    Authors: We agree that a more detailed account of the phase-mapping procedure is necessary for reproducibility and to address potential concerns about consistency. In the revised manuscript we have expanded Section 3.2 with a dedicated subsection (3.2.1) that fully specifies the mapping pipeline: a deterministic rule-based classifier that inspects command strings, file paths, and observation patterns, supplemented by an LLM judge (GPT-4o) for ambiguous cases. We now report the complete set of rules, the exact LLM prompt, and the results of a post-hoc human validation study on a stratified sample of 400 trajectories (Cohen’s κ = 0.79 with two independent annotators). We also discuss the limitations of this hybrid approach and the steps taken to mitigate benchmark-specific artifacts. These additions directly respond to the referee’s concern. revision: yes

  2. Referee: [Results] The claim that a subpar plan hurts performance more than no plan at all, and that adding misaligned early phases degrades results, is load-bearing for the practical takeaway. The results section should include per-condition success rates with confidence intervals, statistical tests for the eight plan variations, and controls for multiple comparisons; the current directional statements would be strengthened by these details.

    Authors: We concur that the load-bearing claims benefit from explicit statistical support. We have added a new table (Table 3) that reports success rates for all eight plan conditions on both SWE-bench Verified and SWE-bench Pro, together with 95% bootstrap confidence intervals. Pairwise χ² tests with Bonferroni correction across the eight conditions were performed; the key contrasts (subpar plan vs. no-plan; misaligned early phases vs. standard plan) remain significant after correction (p < 0.01). These results are now presented in Section 4.2, and the discussion has been updated to reference the corrected p-values. The directional findings are unchanged, but the quantitative backing is now substantially stronger. revision: yes
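
The machinery this response names is standard. A minimal sketch, with invented counts, of a percentile-bootstrap confidence interval for a success rate and a Bonferroni-corrected pairwise χ² test, assuming per-instance binary resolution outcomes and SciPy:

```python
import random
from scipy.stats import chi2_contingency

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a success rate over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    return rates[int(n_boot * alpha / 2)], rates[int(n_boot * (1 - alpha / 2)) - 1]

# Invented per-condition outcomes (1 = issue resolved, 0 = unresolved).
standard = [1] * 230 + [0] * 270
no_plan  = [1] * 190 + [0] * 310

# Pairwise chi-squared test on the 2x2 table of resolved/unresolved counts.
table = [
    [sum(standard), len(standard) - sum(standard)],
    [sum(no_plan),  len(no_plan)  - sum(no_plan)],
]
chi2, p, _, _ = chi2_contingency(table)

n_comparisons = 28  # all pairs among eight plan conditions: C(8, 2)
p_bonferroni = min(1.0, p * n_comparisons)
print(bootstrap_ci(standard), p, p_bonferroni)
```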

Circularity Check

0 steps flagged

No circularity: purely empirical trajectory analysis

Full rationale

The paper reports direct measurements of plan compliance and task success across 16,991 agent trajectories on SWE-bench benchmarks under controlled plan variations. All central claims (e.g., performance differences with/without plans, effects of reminders or subpar plans) are grounded in observed action sequences mapped to phases, with no equations, fitted parameters renamed as predictions, self-referential definitions, or load-bearing self-citations that reduce the results to their own inputs. The study is self-contained as an empirical comparison and does not derive any result by construction from its own measurements or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study with no mathematical model, free parameters, axioms, or invented entities; it rests on standard assumptions about agent trajectory logging and benchmark validity.

pith-pipeline@v0.9.0 · 5584 in / 1102 out tokens · 42827 ms · 2026-05-10T14:48:16.937192+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 16 canonical work pages · 7 internal anchors

[1] Trae Agent. 2026. Trae Agent System Prompt with Default Plan. https://github.com/bytedance/trae-agent/blob/main/trae_agent/prompt/agent_prompt.py

[2] Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, and Saurabh Sinha. 2024. TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved? https://arxiv.org/abs/2412.02883

[3] Yang Chen, Toufique Ahmed, Reyhaneh Jabbarvand, and Martin Hirzel. 2026. Can Old Tests do New Tricks for Resolving SWE Issues? In Symposium on the Foundations of Software Engineering (FSE).

[4] Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. 2024. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/

[5] DeepSeek-AI. 2025. DeepSeek-R1-0528. https://huggingface.co/deepseek-ai/DeepSeek-R1-0528

[6] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv preprint arXiv:2509.16941 (2025).

[7] Lutfi Eren Erdogan, Hiroki Furuta, Sehoon Kim, Nicholas Lee, Suhong Moon, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2025. Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=ybA4EcMmUZ

[8] Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. 2025. MONA: Myopic optimization with non-myopic approval can mitigate multi-step reward hacking. arXiv preprint arXiv:2501.13011 (2025).

[9] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).

[10] Hiroaki Hayashi, Bo Pang, Wenting Zhao, Ye Liu, Akash Gokul, Srijan Bansal, Caiming Xiong, Semih Yavuz, and Yingbo Zhou. 2025. Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement. arXiv preprint arXiv:2511.05931 (2025).

[11] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716 (2024).

[12] Allison Sihan Jia, Daniel Huang, Nikhil Vytla, Nirvika Choudhury, Shayak Sen, John C Mitchell, and Anupam Datta. 2025. What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment. arXiv preprint arXiv:2510.08847 (2025).

[13] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=VTF8yNQM66

[14] Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. 2025. Agent-Oriented Planning in Multi-Agent Systems. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=EqcLAU6gyU

[15] Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. 2025. The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason. arXiv preprint arXiv:2506.12286 (2025).

[16] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024).

[17] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173.

[18] Shuyang Liu, Yang Chen, Rahul Krishna, Saurabh Sinha, Jatin Ganhotra, and Reyhaneh Jabbarvand. 2025. Process-Centric Analysis of Agentic Software Systems. arXiv preprint arXiv:2512.02393 (2025).

[19] Shuyang Liu, Saman Dehghan, Jatin Ganhotra, Martin Hirzel, and Reyhaneh Jabbarvand. 2026. "From Plan to Action: How Well Do Agents Follow the Plan?" artifact website. https://github.com/Intelligent-CAT-Lab/Planning-Analysis

[20] Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics (1947), 50–60.

[21] Quinn McNemar. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika 12, 2 (1947), 153–157. https://doi.org/10.1007/BF02295996

[22] Mistral AI. 2025. Devstral-Small-2512. https://openrouter.ai/mistralai/devstral-2512

[23] OpenAI. 2025. GPT5-Mini. https://developers.openai.com/api/docs/models/gpt-5-mini

[24] OpenAI. 2026. Why SWE-bench Verified no longer measures frontier coding capabilities. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified

[25] OpenHands. 2025. OpenHands System Prompt with Default Plan. https://github.com/OpenHands/OpenHands/blob/08118d742b564add3e970921ac8910c265ece975/evaluation/benchmarks/swe_bench/prompts/swe_default.j2

[26] Karl Pearson. 1895. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58, 347-352 (12 1895), 240–242. https://doi.org/10.1098/rspl.1895.0041

[27] Thanosan Prathifkumar, Noble Saji Mathews, and Meiyappan Nagappan. 2025. Does SWE-Bench-Verified Test Agent Ability or Model Memory? arXiv preprint arXiv:2512.10218 (2025).

[28] Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. 2025. ToolRL: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958 (2025).

[29] IBM Research. 2026. From 73% to 11%: Revealing True SWE-Agent Capabilities with Discriminative Subsets. https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets.html

[30] Shuzheng Si, Haozhe Zhao, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, and Maosong Sun. 2025. A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks. https://arxiv.org/abs/2510.05608

[31] SWE-Agent. 2026. SWE-agent System Prompt with Default Plan. https://github.com/SWE-agent/SWE-agent/blob/main/config/default.yaml

[32] Scott Wu. 2024. Introducing Devin, the first AI software engineer. Cognition Labs Blog (2024).

[33] Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, and Dongkuan Xu. 2023. ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models. https://arxiv.org/abs/2305.18323

[34] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: agent-computer interfaces enable automated software engineering. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS '24). Curran Associates Inc., Red Hook, NY, …

[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=WE_vluYUL-X

[36] Cong Zhang, Xin Deik Goh, Dexun Li, Hao Zhang, and Yong Liu. 2025. Planning with multi-constraints via collaborative language agents. In Proceedings of the 31st International Conference on Computational Linguistics. 10054–10082.

[37] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604.