Helpful or Harmful? Evaluating LLM-Assisted Vulnerability Patching via a Human Study

Fabio Massacci; Giulian Biolo; Michael Tezza; Yuanjun Gong

arxiv: 2606.25973 · v1 · pith:ZOFOIAUMnew · submitted 2026-06-24 · 💻 cs.SE · cs.AI

Helpful or Harmful? Evaluating LLM-Assisted Vulnerability Patching via a Human Study

Giulian Biolo , Michael Tezza , Yuanjun Gong , Fabio Massacci This is my paper

Pith reviewed 2026-06-25 19:00 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords LLMvulnerability patchinghuman studysoftware securityempirical experimentcrossover designghost tests

0 comments

The pith

LLM assistance in vulnerability patching risks superficial repairs that pass functionality but fail security tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors hypothesize that LLM-assisted vulnerability patching, while potentially faster, may lead to more superficial repairs that satisfy standard functionality checks but fail security validation due to hallucinations or insecure code. To investigate this, they outline a controlled human study using a Balanced Crossover design on a custom WebApp equipped with hidden Ghost Tests that go beyond visible requirements. The experiment compares LLM assistance to manual debugging across training and evaluation scenarios, measuring speed, efficacy on both functional and security tests, and participant perceptions. A pilot study has been completed to inform the main experiment.

Core claim

LLM-assisted vulnerability patching has a higher likelihood of generating superficial repairs that bypass the standard functionality checks but fail the security validation.

What carries the argument

Balanced Crossover design with a WebApp integrating hidden Ghost Tests to verify patch integrity beyond functional requirements.

If this is right

LLM tools may speed up remediation but compromise security in real-world scenarios.
The study will reveal differences in remediation efficacy between assisted and manual approaches on security tests.
Participant perceptions may not align with actual patch security.
Training with the WebApp could highlight risks of relying on LLM suggestions without validation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The findings could inform guidelines for safe integration of LLMs in security-sensitive development tasks.
Similar hidden test methods might apply to evaluating AI assistance in other software engineering domains.
Developers might need new workflows that include automated security validation when using LLM tools.

Load-bearing premise

The hidden Ghost Tests will reliably detect insecure patches beyond what standard functionality tests reveal, and the Balanced Crossover design will adequately control for participant skill differences.

What would settle it

Observing in the experiment that LLM-assisted patches achieve similar or better pass rates on the security tests compared to manual patches would falsify the hypothesis of higher likelihood of superficial repairs.

Figures

Figures reproduced from arXiv: 2606.25973 by Fabio Massacci, Giulian Biolo, Michael Tezza, Yuanjun Gong.

**Figure 1.** Figure 1: Data Collection and Analysis: A Supabase database stores the experimental variables summarized in Table II; the data are analyzed with the statistical methods of Section III-C. B. Hypotheses and Variables The descriptions of the experimental variables are summarized in Table II. Independent variables (intervention type, scenario identifier) define the experimental design; dependent variables operationaliz… view at source ↗

**Figure 1.** Figure 1: Experimental workflow from participant onboarding to patch evalua [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Screenshot of the experimental WebApp interface [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Software vulnerability remediation is a cognitively demanding task that requires specialized security expertise often lacking in general developers. In the meantime, Large Language Models (LLMs) assisted tools show potential in vulnerability detection, location, and repair tasks. [Hypothesis:] While LLM-assistance is hypothesized to accelerate patching, it also risks introducing hallucinations or insecure code, leading to a higher likelihood of generating superficial repairs that bypass the standard functionality checks but fail the security validation. [Objective:] We aim to present an empirical experiment, unveiling the capability of LLM-assisted vulnerability patching compared to manual debugging on human participants in real-world scenarios. [Method:] We plan to conduct a controlled experiment using a Balanced Crossover design. For that, we have developed a WebApp for code execution and integrated hidden Ghost Tests to verify patch integrity beyond visible functional requirements. The experiment involves training and evaluation scenarios. The remediation speed, remediation efficacy for both standard functionality tests and security tests, and participant perception will be evaluated. [Pilot Study:] A pilot experiment with a small sample of participants has been conducted, providing insights for the following study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Protocol for a human study on LLM vs manual patching with a pilot, but ghost tests lack construction details.

read the letter

This paper sets out a plan for a controlled human experiment on whether LLM assistance helps or hurts vulnerability patching, plus results from a small pilot.

The design uses a balanced crossover so each participant works both with and without the LLM tool. They built a web app with hidden ghost tests that check security properties the visible functionality tests miss. Outcomes include speed, success on both test types, and participant feedback. The pilot gave them practical tweaks for the main run.

The setup is sensible for isolating the effect of LLM help while controlling for individual skill. Focusing on whether patches pass visible tests but fail hidden security ones directly targets the superficial-repair concern.

The main gap is that the ghost tests get almost no description: no account of how they were chosen, which vulnerabilities they target, or any validation that they catch insecure code the visible tests would miss. Without that, the measurement for the central hypothesis stays unanchored.

This is for people who run or review human studies in software engineering and security. A methods reader could take the crossover-plus-hidden-test idea for their own work.

It deserves peer review because the experimental plan is careful and the question is practical, even though the pilot is small and the test details need expansion.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a controlled human study to evaluate LLM-assisted versus manual vulnerability patching. It hypothesizes that LLM assistance may accelerate remediation but risks hallucinations or insecure code, producing superficial patches that pass standard functionality tests yet fail security validation. The planned method uses a balanced crossover design in a custom WebApp with hidden Ghost Tests for security validation beyond visible functionality, plus training/evaluation scenarios; metrics include remediation speed, efficacy on both test types, and participant perceptions. A small pilot has been run to inform the protocol.

Significance. If the study is executed with the proposed controls and the Ghost Tests prove reliable, the work would supply much-needed empirical evidence on the security trade-offs of LLM tools in vulnerability remediation—an area of growing practical importance as such assistants become widespread. The crossover design and dual-test approach are conceptually appropriate for isolating assistance effects.

major comments (2)

[Method (Ghost Tests)] The description of the Ghost Tests is insufficient to support the central hypothesis. The text states only that hidden tests were 'integrated into the WebApp to verify patch integrity beyond visible functional requirements,' with no account of their construction, the specific vulnerabilities they target, coverage of hallucination-induced insecure patterns, or any pilot validation against known superficial-but-insecure patches. Without these details the primary measurement instrument for the hypothesized risk remains unanchored.
[Pilot Study] No quantitative or qualitative results from the pilot study are reported, nor is it explained how pilot observations shaped the final protocol (e.g., task difficulty, time limits, or LLM prompt templates). This omission makes it impossible to evaluate whether the planned design is feasible or adequately powered.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our study protocol. The comments highlight important areas where additional detail will strengthen the manuscript. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses

Referee: [Method (Ghost Tests)] The description of the Ghost Tests is insufficient to support the central hypothesis. The text states only that hidden tests were 'integrated into the WebApp to verify patch integrity beyond visible functional requirements,' with no account of their construction, the specific vulnerabilities they target, coverage of hallucination-induced insecure patterns, or any pilot validation against known superficial-but-insecure patches. Without these details the primary measurement instrument for the hypothesized risk remains unanchored.

Authors: We agree that the current description of the Ghost Tests is too brief to adequately support the central hypothesis. In the revised manuscript we will expand this section with a detailed account of Ghost Test construction, the specific vulnerabilities each test targets, how the tests are designed to detect common hallucination-induced insecure patterns (e.g., missing input sanitization or improper access controls), and any validation performed against known superficial-but-insecure patches during the pilot. These additions will anchor the primary measurement instrument more firmly. revision: yes
Referee: [Pilot Study] No quantitative or qualitative results from the pilot study are reported, nor is it explained how pilot observations shaped the final protocol (e.g., task difficulty, time limits, or LLM prompt templates). This omission makes it impossible to evaluate whether the planned design is feasible or adequately powered.

Authors: We acknowledge that the manuscript currently provides only a brief mention of the pilot without reporting observations or explaining their influence on the protocol. Although the paper is a study protocol, we will add a dedicated subsection that describes the pilot (including qualitative observations on task difficulty, time limits, and prompt templates) and explicitly states how these observations informed the final design choices. This will allow readers to assess feasibility and power while preserving the protocol focus of the work. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study proposal with no derivations or fitted parameters

full rationale

The manuscript is a forward-looking experimental proposal describing a Balanced Crossover human study on LLM-assisted patching. It contains no equations, no parameter fitting, no uniqueness theorems, and no derivations that could reduce to inputs by construction. The Ghost Tests are described only at the level of integration into a WebApp; their construction is not presented as a derived result. No self-citation chains or ansatzes appear in the provided text. The work is therefore self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The proposal rests on the domain assumption that the chosen experimental controls and hidden tests will isolate the effect of LLM assistance on security outcomes; no free parameters or invented entities are introduced.

axioms (2)

domain assumption The Balanced Crossover design will control for individual participant differences in the remediation speed and efficacy measures.
Invoked in the Method section describing the experiment structure.
domain assumption Hidden Ghost Tests can verify patch integrity on security properties beyond visible functional requirements.
Stated in the description of the WebApp and evaluation criteria.

pith-pipeline@v0.9.1-grok · 5726 in / 1253 out tokens · 24779 ms · 2026-06-25T19:00:34.586337+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 13 canonical work pages · 4 internal anchors

[1]

How effective are neural networks for fixing security vul- nerabilities,

Y . Wu, N. Jiang, H. V . Pham, T. Lutellier, J. Davis, L. Tan, P. Babkin, and S. Shah, “How effective are neural networks for fixing security vul- nerabilities,” inProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). ACM, 2023, pp. 1282–1294

2023
[2]

Examining zero-shot vulnerability repair with large language models,

H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Examining zero-shot vulnerability repair with large language models,” in2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2023, pp. 2339–2356

2023
[3]

VulnRepairEval: An exploit-based evaluation framework for assessing large language model vulnerability repair capabilities,

X. Wanget al., “VulnRepairEval: An exploit-based evaluation framework for assessing large language model vulnerability repair capabilities,” 2025. [Online]. Available: https://arxiv.org/abs/2509.03331

work page arXiv 2025
[4]

Purple Llama CyberSecEval: A secure coding benchmark for language models,

M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov, D. Gabi, D. Song, F. Ahmad, C. Aschermann, L. Fontana, S. Frolov, R. P. Giri, D. Kapil, Y . Kozyrakis, D. LeBlanc, J. Milazzo, A. Straumann, G. Synnaeve, V . V ontimitta, S. Whitman, and J. Saxe, “Purple Llama CyberSecEval: A secure coding benchmark for language models,” 2023. [Online]. Availa...

work page arXiv 2023
[5]

Are AI-generated fixes secure? analyzing LLM and agent patches on SWE-bench,

A. Sajadi, K. Damevski, and P. Chatterjee, “Are AI-generated fixes secure? analyzing LLM and agent patches on SWE-bench,” 2025. [Online]. Available: https://arxiv.org/abs/2507.02976

work page arXiv 2025
[6]

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

B. Baker, J. Huizinga, L. Gao, Z. Dou, M. Y . Guan, A. Madry, W. Zaremba, J. Pachocki, and D. Farhi, “Monitoring reasoning models for misbehavior and the risks of promoting obfuscation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.11926

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Security weaknesses of Copilot-generated code in GitHub projects: An empirical study,

Y . Fu, P. Liang, A. Tahir, Z. Li, M. Shahin, J. Yu, and J. Chen, “Security weaknesses of Copilot-generated code in GitHub projects: An empirical study,”ACM Transactions on Software Engineering and Methodology,
[8]

ACM Transactions on Software Engineering and Methodology (2025),https://dl.acm.org/doi/10.1145/3716848

[Online]. Available: https://doi.org/10.1145/3716848

work page doi:10.1145/3716848
[9]

Do users write more insecure code with AI assistants?

N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write more insecure code with AI assistants?” inProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS),

2023
[10]

Do users write more insecure code with ai assistants?

[Online]. Available: https://doi.org/10.1145/3576915.3623157

work page doi:10.1145/3576915.3623157
[11]

Exploring the impact of intervention methods on developers’ security behavior in a manipulated ChatGPT study,

A. Serafini, M. Yardim, and A. Naiakshina, “Exploring the impact of intervention methods on developers’ security behavior in a manipulated ChatGPT study,” inProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025

2025
[12]

Using AI assistants in software development: A qualitative study on security practices and concerns,

J. H. Klemmeret al., “Using AI assistants in software development: A qualitative study on security practices and concerns,” inProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024

2024
[13]

Lost at C: A user study on the security implications of large language model code assistants,

G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan- Gavitt, “Lost at C: A user study on the security implications of large language model code assistants,” inProceedings of the 32nd USENIX Security Symposium, 2023. [Online]. Available: https: //www.usenix.org/conference/usenixsecurity23/presentation/sandoval

2023
[14]

A user-centered security evaluation of Copilot,

O. Asare, M. Nagappan, and N. Asokan, “A user-centered security evaluation of Copilot,” inProceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE), 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639154

work page doi:10.1145/3597503.3639154 2024
[15]

Closing the gap: A user study on the real-world usefulness of AI-powered vulnerability detection & repair in the IDE,

B. Steenhoeket al., “Closing the gap: A user study on the real-world usefulness of AI-powered vulnerability detection & repair in the IDE,” inProceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE), 2025. [Online]. Available: https://arxiv.org/abs/2412.14306

work page arXiv 2025
[16]

Codemirage: Hallucinations in code generated by large language models,

V . Agarwal, Y . Pei, S. Alamir, and X. Liu, “Codemirage: Hallucinations in code generated by large language models,”arXiv, 2025

2025
[17]

SWE-Bench+: Enhanced coding benchmark for LLMs,

R. Aleithan, M. Wessel, S. Wang, M. B. Kennedy, and A. E. Hassan, “SWE-Bench+: Enhanced coding benchmark for LLMs,” 2024. [Online]. Available: https://arxiv.org/abs/2410.06992

work page arXiv 2024
[18]

The patch overfitting problem in automated program repair: Practical magnitude and a baseline for realistic benchmarking,

J. Petkeet al., “The patch overfitting problem in automated program repair: Practical magnitude and a baseline for realistic benchmarking,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion), 2024

2024
[19]

The taguchi method,

A. Mitra, “The taguchi method,”Wiley Interdisciplinary Reviews: Com- putational Statistics, vol. 3, no. 5, pp. 472–480, 2011

2011
[20]

Addressing combinatorial ex- periments and scarcity of subjects by provably orthogonal and crossover experimental designs,

F. Massacci, P. Papotti, and R. Paramitha, “Addressing combinatorial ex- periments and scarcity of subjects by provably orthogonal and crossover experimental designs,”Journal of Systems and Software (JSS), vol. 211, p. 111990, 2024

2024
[21]

Evaluating large language models trained on code,

M. Chenet al., “Evaluating large language models trained on code,” arXiv preprint, 2021. [Online]. Available: https://arxiv.org/abs/2107. 03374

2021
[22]

Program Synthesis with Large Language Models

J. Austin, A. Odenaet al., “Program synthesis with large language models,”arXiv preprint, 2021. [Online]. Available: https://arxiv.org/abs/ 2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

C. E. Jimenezet al., “SWE-bench: Can language models resolve real-world GitHub issues?” inThe Twelfth International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

[Online]

OpenAI. [Online]. Available: https://openai.com
[25]

[Online]

Anthropic. [Online]. Available: https://claude.ai
[26]

[Online]

UpWork. [Online]. Available: https://www.upwork.com/
[27]

Asleep at the keyboard? assessing the security of GitHub Copilot’s code contributions,

H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of GitHub Copilot’s code contributions,” in2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 754–768

2022
[28]

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, N. Schiefer, R. Soklaski, A. Tamkin, J. Kaplan, B. Shlegeris, S. R. Bowman, E. Perez, and E. Hubinger, “Sycophancy to subterfuge: Investigating reward tampering in language models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.10162

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Instruction tuning for secure code generation,

J. He, M. Vero, G. Krasnopolska, and M. Vechev, “Instruction tuning for secure code generation,” inProceedings of the 41st International Conference on Machine Learning (ICML), 2024. [Online]. Available: https://arxiv.org/abs/2402.09497

work page arXiv 2024

[1] [1]

How effective are neural networks for fixing security vul- nerabilities,

Y . Wu, N. Jiang, H. V . Pham, T. Lutellier, J. Davis, L. Tan, P. Babkin, and S. Shah, “How effective are neural networks for fixing security vul- nerabilities,” inProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). ACM, 2023, pp. 1282–1294

2023

[2] [2]

Examining zero-shot vulnerability repair with large language models,

H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Examining zero-shot vulnerability repair with large language models,” in2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2023, pp. 2339–2356

2023

[3] [3]

VulnRepairEval: An exploit-based evaluation framework for assessing large language model vulnerability repair capabilities,

X. Wanget al., “VulnRepairEval: An exploit-based evaluation framework for assessing large language model vulnerability repair capabilities,” 2025. [Online]. Available: https://arxiv.org/abs/2509.03331

work page arXiv 2025

[4] [4]

Purple Llama CyberSecEval: A secure coding benchmark for language models,

M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov, D. Gabi, D. Song, F. Ahmad, C. Aschermann, L. Fontana, S. Frolov, R. P. Giri, D. Kapil, Y . Kozyrakis, D. LeBlanc, J. Milazzo, A. Straumann, G. Synnaeve, V . V ontimitta, S. Whitman, and J. Saxe, “Purple Llama CyberSecEval: A secure coding benchmark for language models,” 2023. [Online]. Availa...

work page arXiv 2023

[5] [5]

Are AI-generated fixes secure? analyzing LLM and agent patches on SWE-bench,

A. Sajadi, K. Damevski, and P. Chatterjee, “Are AI-generated fixes secure? analyzing LLM and agent patches on SWE-bench,” 2025. [Online]. Available: https://arxiv.org/abs/2507.02976

work page arXiv 2025

[6] [6]

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

B. Baker, J. Huizinga, L. Gao, Z. Dou, M. Y . Guan, A. Madry, W. Zaremba, J. Pachocki, and D. Farhi, “Monitoring reasoning models for misbehavior and the risks of promoting obfuscation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.11926

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Security weaknesses of Copilot-generated code in GitHub projects: An empirical study,

Y . Fu, P. Liang, A. Tahir, Z. Li, M. Shahin, J. Yu, and J. Chen, “Security weaknesses of Copilot-generated code in GitHub projects: An empirical study,”ACM Transactions on Software Engineering and Methodology,

[8] [8]

ACM Transactions on Software Engineering and Methodology (2025),https://dl.acm.org/doi/10.1145/3716848

[Online]. Available: https://doi.org/10.1145/3716848

work page doi:10.1145/3716848

[9] [9]

Do users write more insecure code with AI assistants?

N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write more insecure code with AI assistants?” inProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS),

2023

[10] [10]

Do users write more insecure code with ai assistants?

[Online]. Available: https://doi.org/10.1145/3576915.3623157

work page doi:10.1145/3576915.3623157

[11] [11]

Exploring the impact of intervention methods on developers’ security behavior in a manipulated ChatGPT study,

A. Serafini, M. Yardim, and A. Naiakshina, “Exploring the impact of intervention methods on developers’ security behavior in a manipulated ChatGPT study,” inProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025

2025

[12] [12]

Using AI assistants in software development: A qualitative study on security practices and concerns,

J. H. Klemmeret al., “Using AI assistants in software development: A qualitative study on security practices and concerns,” inProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024

2024

[13] [13]

Lost at C: A user study on the security implications of large language model code assistants,

G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan- Gavitt, “Lost at C: A user study on the security implications of large language model code assistants,” inProceedings of the 32nd USENIX Security Symposium, 2023. [Online]. Available: https: //www.usenix.org/conference/usenixsecurity23/presentation/sandoval

2023

[14] [14]

A user-centered security evaluation of Copilot,

O. Asare, M. Nagappan, and N. Asokan, “A user-centered security evaluation of Copilot,” inProceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE), 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639154

work page doi:10.1145/3597503.3639154 2024

[15] [15]

Closing the gap: A user study on the real-world usefulness of AI-powered vulnerability detection & repair in the IDE,

B. Steenhoeket al., “Closing the gap: A user study on the real-world usefulness of AI-powered vulnerability detection & repair in the IDE,” inProceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE), 2025. [Online]. Available: https://arxiv.org/abs/2412.14306

work page arXiv 2025

[16] [16]

Codemirage: Hallucinations in code generated by large language models,

V . Agarwal, Y . Pei, S. Alamir, and X. Liu, “Codemirage: Hallucinations in code generated by large language models,”arXiv, 2025

2025

[17] [17]

SWE-Bench+: Enhanced coding benchmark for LLMs,

R. Aleithan, M. Wessel, S. Wang, M. B. Kennedy, and A. E. Hassan, “SWE-Bench+: Enhanced coding benchmark for LLMs,” 2024. [Online]. Available: https://arxiv.org/abs/2410.06992

work page arXiv 2024

[18] [18]

The patch overfitting problem in automated program repair: Practical magnitude and a baseline for realistic benchmarking,

J. Petkeet al., “The patch overfitting problem in automated program repair: Practical magnitude and a baseline for realistic benchmarking,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion), 2024

2024

[19] [19]

The taguchi method,

A. Mitra, “The taguchi method,”Wiley Interdisciplinary Reviews: Com- putational Statistics, vol. 3, no. 5, pp. 472–480, 2011

2011

[20] [20]

Addressing combinatorial ex- periments and scarcity of subjects by provably orthogonal and crossover experimental designs,

F. Massacci, P. Papotti, and R. Paramitha, “Addressing combinatorial ex- periments and scarcity of subjects by provably orthogonal and crossover experimental designs,”Journal of Systems and Software (JSS), vol. 211, p. 111990, 2024

2024

[21] [21]

Evaluating large language models trained on code,

M. Chenet al., “Evaluating large language models trained on code,” arXiv preprint, 2021. [Online]. Available: https://arxiv.org/abs/2107. 03374

2021

[22] [22]

Program Synthesis with Large Language Models

J. Austin, A. Odenaet al., “Program synthesis with large language models,”arXiv preprint, 2021. [Online]. Available: https://arxiv.org/abs/ 2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021

[23] [23]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

C. E. Jimenezet al., “SWE-bench: Can language models resolve real-world GitHub issues?” inThe Twelfth International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

[Online]

OpenAI. [Online]. Available: https://openai.com

[25] [25]

[Online]

Anthropic. [Online]. Available: https://claude.ai

[26] [26]

[Online]

UpWork. [Online]. Available: https://www.upwork.com/

[27] [27]

Asleep at the keyboard? assessing the security of GitHub Copilot’s code contributions,

H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of GitHub Copilot’s code contributions,” in2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 754–768

2022

[28] [28]

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, N. Schiefer, R. Soklaski, A. Tamkin, J. Kaplan, B. Shlegeris, S. R. Bowman, E. Perez, and E. Hubinger, “Sycophancy to subterfuge: Investigating reward tampering in language models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.10162

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Instruction tuning for secure code generation,

J. He, M. Vero, G. Krasnopolska, and M. Vechev, “Instruction tuning for secure code generation,” inProceedings of the 41st International Conference on Machine Learning (ICML), 2024. [Online]. Available: https://arxiv.org/abs/2402.09497

work page arXiv 2024