
arxiv: 2607.00511 · v1 · pith:UG3RUVLYnew · submitted 2026-07-01 · 💻 cs.SE

Large Language Models for Multi-Lingual Equivalent Mutant Detection: An Extended Empirical Study

Pith reviewed 2026-07-02 09:04 UTC · model grok-4.3

classification 💻 cs.SE
keywords equivalent mutant detection · large language models · mutation testing · multi-lingual · empirical study · software testing

The pith

Fine-tuned large language models detect equivalent mutants more accurately than traditional methods across Java and C

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can solve the problem of equivalent mutants in mutation testing, where some mutants produce identical behavior to the original program and therefore add no value while increasing testing cost. It runs an empirical comparison of several LLM strategies against existing methods on 3,302 Java mutant pairs and 1,088 C mutant pairs. The results show that LLM approaches reach higher F1-scores, with the best performance coming from fine-tuned code embeddings, while also maintaining competitive speed and exhibiting some ability to generalize from one language to the other. A reader would care because equivalent mutants have long been a practical barrier that reduces the usefulness of mutation testing for improving software quality.
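
To make the notion of an equivalent mutant concrete, here is a minimal illustrative sketch. It is written in Python for brevity, although the paper studies Java and C, and every function name in it is hypothetical rather than taken from the paper's dataset. Replacing `>` with `>=` in a two-argument maximum yields an equivalent mutant, because both branches return the same value when the arguments are equal; replacing `>` with `<` changes the behavior and can be detected by a test.

```python
from itertools import product


def max_original(a: int, b: int) -> int:
    return a if a > b else b


def max_mutant_equivalent(a: int, b: int) -> int:
    # Relational operator replaced: > becomes >=.
    # When a == b both branches return the same value, so no input distinguishes it.
    return a if a >= b else b


def max_mutant_killable(a: int, b: int) -> int:
    # Relational operator replaced: > becomes <. This now returns the minimum.
    return a if a < b else b


def find_killing_input(original, mutant, domain):
    """Return the first input pair on which the two functions differ, or None."""
    for a, b in product(domain, repeat=2):
        if original(a, b) != mutant(a, b):
            return (a, b)
    return None


small_ints = range(-3, 4)
print(find_killing_input(max_original, max_mutant_equivalent, small_ints))
print(find_killing_input(max_original, max_mutant_killable, small_ints))
```

Searching a small input domain can only find an input that kills a mutant. Finding none does not prove equivalence in general, which is why automated detection remains hard.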

Core claim

LLM-based approaches achieve higher F1-scores than the evaluated traditional methods, with fine-tuned code embedding yielding the highest detection accuracy among the tested strategies. Moreover, fine-tuned LLMs demonstrate measurable generalization across programming languages.
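
For reference, the F1-score is the harmonic mean of precision and recall on the positive class. The sketch below assumes that "equivalent" is the positive class (encoded as 1) and uses made-up labels purely for illustration; the helper name `f1_score_binary` and the label encoding are assumptions, not the paper's evaluation code.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative labels only: 1 = equivalent, 0 = non-equivalent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall, f1 = f1_score_binary(y_true, y_pred)
print(precision, recall, f1)
```

Because F1 ignores true negatives, it is a reasonable summary when one class (here, equivalent mutants) is the one of interest.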

What carries the argument

Fine-tuned large language models that embed code and classify each mutant pair (an original program and its mutant) as equivalent or non-equivalent.
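
The general shape of an embedding-based pair classifier can be sketched as follows. This is not the paper's architecture: `toy_embed` is a hypothetical hashed bag-of-tokens stand-in for an LLM encoder, the feature layout (both embeddings plus their absolute difference) is one common choice for pair classification that is assumed here, and the weights are untrained placeholders. In fine-tuning, the encoder and the classification head would be updated on labeled mutant pairs.

```python
import hashlib
import math

DIM = 16  # toy embedding size; real code-embedding models are far larger


def toy_embed(code: str) -> list[float]:
    """Hypothetical stand-in for an LLM encoder: hashed bag of whitespace tokens."""
    vec = [0.0] * DIM
    for token in code.split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def pair_features(original: str, mutant: str) -> list[float]:
    """Concatenate both embeddings with their element-wise absolute difference."""
    u, v = toy_embed(original), toy_embed(mutant)
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def predict_equivalent_probability(features, weights, bias):
    """Linear classification head followed by a sigmoid."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


original = "return a > b ? a : b ;"
mutant = "return a >= b ? a : b ;"
features = pair_features(original, mutant)
weights = [0.0] * len(features)  # untrained placeholder weights
probability = predict_equivalent_probability(features, weights, bias=0.0)
print(len(features), probability)
```

With all-zero weights the head outputs 0.5 for every pair; any useful decision boundary comes from training on labeled data.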

If this is right

  • LLM approaches reach higher F1-scores than the compared traditional detection methods.
  • Fine-tuned code embedding gives the highest accuracy among the LLM strategies tested.
  • LLM methods maintain inference times comparable to existing machine-learning models.
  • Fine-tuned LLMs show measurable cross-language generalization between Java and C.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the accuracy gains hold on broader data, mutation testing could become less expensive to apply in practice.
  • The same embedding-based classification approach might transfer to other code-semantics tasks that require distinguishing behavioral equivalence.

Load-bearing premise

The 3,302 Java and 1,088 C mutant pairs used for benchmarking are representative of real-world equivalent mutants, and the state-of-the-art baselines were implemented without bias or implementation differences that favor the LLM approaches.

What would settle it

The central claim would be falsified by a new benchmark of mutant pairs, drawn from additional languages or larger real-world projects, on which the LLM strategies no longer achieve higher F1-scores than the traditional baselines.

Figures

Figures reproduced from arXiv: 2607.00511 by Dong Wang, Honglin Shu, Jiazhe Zhang, Junjie Chen, Junji Yu, Xuejie Cao, Yasutaka Kamei, Zhao Tian.

Figure 1. Overview of experimental design. ①/②/③/④ represent the workflows of EMD baselines, code embedding strategies, prompting strategies, and cross-lingual generalization, respectively.
Figure 2. Zero-shot prompt template and few-shot prompt template.
Figure 3. Unique correct detections (↑) and unique incorrect detections (↓) across studied techniques on Java. Techniques shown: fine-tuning with instruction, pre-trained code embedding, fine-tuned code embedding, zero-shot prompting, few-shot prompting.
Figure 4. Unique correct detections (↑) and unique incorrect detections (↓) across studied techniques on C.
    Surrounding text: "… groups, to determine statistical significance in detection performance differences among the EMD techniques for each mutation operator. Results …"
Figure 5. Detection performance on Top-10 mutation operators across various EMD techniques (x-axis shows …
Figure 6. F1-score by Top-10 mutant operators.
    Surrounding text: "… outperforms the single-lingual approach on AORB and ROR (0.92 vs. 0.88 and 0.74 vs. 0.70). Based on the quantitative results and operator-level analysis, fine-tuning on the cross-lingual dataset tends to improve the effectiveness when fine-tuned with the instruction strategy for structurally similar languages. However, we note that both C and Java are strongly typed and …"
Figure 7. t-SNE plots showing the embedding of mutant pairs. EQ/NEQ represents equivalent/non-equivalent, …
Figure 8. Case studies of incorrect prediction.
    Running footer in the source: J. ACM, Vol. 1, No. 1, Article 111. Publication date: July 2026.
Abstract

Mutation testing is a powerful technique for ensuring software quality. However, the presence of equivalent mutants introduces unnecessary costs and biases, limiting its practical effectiveness. Although numerous equivalent mutant detection (EMD) methods have been proposed, they often face distinct challenges: pure-code analysis methods can be limited by their reliance on specific compiler infrastructures, while existing machine-learning approaches remain constrained by scarce training data and limited generalization to unseen mutants. Large language models (LLMs) have recently demonstrated remarkable performance across diverse code-related tasks by better capturing program semantics. Yet their potential for EMD remains largely unexplored, particularly in the multi-lingual context. This paper presents the first comprehensive empirical study on LLMs for EMD, using 3,302 Java and 1,088 C mutant pairs to benchmark against state-of-the-art methods, explore strategy variations, assess efficiency, and evaluate cross-lingual generalization. Experimental results show that LLM-based approaches achieve higher F1-scores than the evaluated traditional methods, with fine-tuned code embedding yielding the highest detection accuracy among the tested strategies. Moreover, LLM-based approaches strike a practical balance between effectiveness and efficiency with inference times comparable to existing machine-learning models. Importantly, fine-tuned LLMs demonstrate measurable generalization across programming languages. These findings establish LLMs as a viable and efficient approach for tackling the longstanding challenge of equivalent mutant detection, offering new directions for advancing mutation testing in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper conducts the first comprehensive empirical study on large language models (LLMs) for equivalent mutant detection (EMD) in mutation testing. It benchmarks LLM strategies (including fine-tuned code embeddings) against traditional methods on a dataset of 3,302 Java and 1,088 C mutant pairs, reporting higher F1 scores for LLMs, best performance from fine-tuned embeddings, practical efficiency (inference times comparable to prior ML models), and measurable cross-lingual generalization.

Significance. If the results hold under rigorous validation, the work would advance mutation testing practice by offering a more effective and efficient solution to the equivalent-mutant problem, a longstanding barrier to adoption. The multi-lingual scope and explicit efficiency measurements are strengths; the study also supplies a sizable new benchmark that future EMD research can build upon.

major comments (3)
  1. [§4] §4 (Dataset and labeling): The construction and validation of the 4,390 ground-truth labels receive insufficient detail; no inter-rater agreement statistics, coverage thresholds, or external validation against test suites are reported. Because every F1 comparison rests on these labels, the absence is load-bearing for the central superiority claim.
  2. [§5] §5 (Baseline reproduction): The paper provides no source code, exact reproduction instructions, or compiler-configuration details for the state-of-the-art traditional and prior ML baselines. Systematic implementation differences could therefore artifactually inflate the reported LLM advantage.
  3. [§6.3] §6.3 (Cross-lingual evaluation): The cross-lingual generalization experiments do not state whether any mutant or code overlap exists between the Java and C collections, nor do they describe the precise train/test partitioning used to enforce language separation. This information is required to substantiate the generalization result.
minor comments (3)
  1. [Abstract] Abstract and §6: Include error bars, number of runs, or statistical significance tests alongside the F1 scores to allow readers to assess the stability of the reported improvements.
  2. [Table 1] Table 1 and Figure 3: Ensure axis labels, legend entries, and caption text are fully self-contained so that efficiency and accuracy trade-offs can be interpreted without reference to the main text.
  3. [§2] Related-work section: Add explicit citations to the most recent LLM-based code-analysis papers that post-date the baselines used, to clarify the novelty boundary.
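
The inter-rater agreement statistic requested in major comment 1 is typically reported as Cohen's kappa. The sketch below shows how it could be computed if a second independent rater labeled the same mutant pairs; the labels are invented for illustration and the helper name `cohens_kappa` is an assumption, not something the paper provides.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: EQ = equivalent, NEQ = non-equivalent.
rater_a = ["EQ", "EQ", "NEQ", "NEQ", "EQ", "NEQ", "NEQ", "EQ", "NEQ", "NEQ"]
rater_b = ["EQ", "NEQ", "NEQ", "NEQ", "EQ", "NEQ", "EQ", "EQ", "NEQ", "NEQ"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than percent agreement when one label dominates.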

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications where possible and committing to revisions that strengthen the manuscript without altering its core claims.

  1. Referee: [§4] §4 (Dataset and labeling): The construction and validation of the 4,390 ground-truth labels receive insufficient detail; no inter-rater agreement statistics, coverage thresholds, or external validation against test suites are reported. Because every F1 comparison rests on these labels, the absence is load-bearing for the central superiority claim.

    Authors: We agree that §4 would benefit from expanded description of the labeling process. The ground-truth labels were produced via a hybrid approach combining static analysis heuristics with author-led manual review on the collected mutant pairs; however, formal inter-rater agreement metrics were not computed because labeling was performed by a single primary researcher with internal consistency checks rather than multiple independent raters. In the revised manuscript we will add a dedicated subsection detailing the exact heuristics, any coverage or confidence thresholds applied, the manual review protocol, and any post-hoc validation steps performed against available test suites. This addition will directly address the concern that the F1 results rest on insufficiently documented labels. revision: yes

  2. Referee: [§5] §5 (Baseline reproduction): The paper provides no source code, exact reproduction instructions, or compiler-configuration details for the state-of-the-art traditional and prior ML baselines. Systematic implementation differences could therefore artifactually inflate the reported LLM advantage.

    Authors: We concur that reproducibility details for the baselines are essential. The traditional and prior ML baselines were re-implemented following the descriptions in their respective source papers, using the same datasets and evaluation protocol; however, the submission did not include the re-implementation code, compiler flags, or step-by-step reproduction scripts. In the revision we will add a reproducibility appendix (or supplementary repository link) that supplies the exact source code used, compiler configurations, and any deviations from the original baseline implementations, thereby allowing independent verification that the reported LLM advantage is not an artifact of implementation differences. revision: yes

  3. Referee: [§6.3] §6.3 (Cross-lingual evaluation): The cross-lingual generalization experiments do not state whether any mutant or code overlap exists between the Java and C collections, nor do they describe the precise train/test partitioning used to enforce language separation. This information is required to substantiate the generalization result.

    Authors: We confirm that the Java and C mutant-pair collections were assembled from entirely disjoint codebases with no shared mutants, functions, or files, ensuring zero overlap. For the cross-lingual experiments, models were trained exclusively on one language and evaluated on the other using a strict language-separated split (i.e., no intra-language mixing in the test sets). We will revise §6.3 to state these facts explicitly, including the precise train/test partitioning ratios and confirmation of zero overlap, thereby substantiating the reported cross-lingual generalization. revision: yes
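
The language-separated split and the zero-overlap check described in the third response could be verified along these lines. This is a minimal sketch under stated assumptions: the record fields (`language`, `original`, `mutant`, `label`), the whitespace normalization, and the sample pairs are all hypothetical and do not reflect the paper's data format.

```python
def cross_lingual_split(pairs, train_language, test_language):
    """Train on one language only and evaluate on the other only."""
    train = [p for p in pairs if p["language"] == train_language]
    test = [p for p in pairs if p["language"] == test_language]
    return train, test


def normalized(code: str) -> str:
    """Collapse whitespace so trivially reformatted duplicates compare equal."""
    return " ".join(code.split())


def overlapping_pairs(train, test):
    """Return test pairs whose normalized (original, mutant) text also occurs in train."""
    seen = {(normalized(p["original"]), normalized(p["mutant"])) for p in train}
    return [p for p in test
            if (normalized(p["original"]), normalized(p["mutant"])) in seen]


# Hypothetical sample records, for illustration only.
pairs = [
    {"language": "java", "original": "int inc(int a) { return a + 1; }",
     "mutant": "int inc(int a) { return a - 1; }", "label": "NEQ"},
    {"language": "java", "original": "int max(int a, int b) { return a > b ? a : b; }",
     "mutant": "int max(int a, int b) { return a >= b ? a : b; }", "label": "EQ"},
    {"language": "c", "original": "int dec(int a) { return a - 1; }",
     "mutant": "int dec(int a) { return a + 1; }", "label": "NEQ"},
]
train, test = cross_lingual_split(pairs, "java", "c")
leaked = overlapping_pairs(train, test)
print(len(train), len(test), len(leaked))
```

A textual check matters here because simple Java and C functions can be character-for-character identical, so disjoint codebases alone do not rule out duplicated pairs.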

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking with external baselines


This is an empirical study that benchmarks LLM variants against independently described state-of-the-art methods on fixed mutant-pair datasets. No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced whose outputs reduce by construction to the paper's own inputs or self-citations. All reported F1 scores and cross-lingual results are direct measurements against external baselines; the central claims therefore remain falsifiable by re-running the same experiments on the released data rather than being definitionally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical study whose central claims rest on the representativeness of the mutant dataset and fair comparison to baselines; no free parameters or invented entities visible in abstract.

axioms (1)
  • Domain assumption: The 3,302 Java and 1,088 C mutant pairs constitute a valid benchmark for evaluating equivalent mutant detection methods.
    All reported F1 scores and generalization claims depend on this dataset being representative and correctly labeled.


