
arxiv: 2509.09192 · v2 · submitted 2025-09-11 · 💻 cs.SE · cs.AI

ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?

Pith reviewed 2026-05-18 18:20 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords just-in-time defect prediction · code language models · code changes · counterfactual testing · revert commits · benchmark dataset · software defect prediction · input encodings

The pith

Code language models detect defects from superficial cues in diffs rather than genuine semantic understanding of changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs ReDef, a high-confidence dataset of function-level code modifications from 22 large C/C++ projects, using revert commits to anchor defective cases and post-hoc checks plus GPT-assisted multi-vote filtering to remove ambiguous ones. It then tests four code language models on five input encoding strategies for just-in-time defect prediction and introduces four counterfactual perturbations, such as swapping added and deleted blocks or inverting diff polarity, as probes for semantic grasp. Compact diff-style encodings outperform whole-function formats with statistical backing, yet model performance remains stable under the perturbations. This leads the authors to conclude that the models exploit surface patterns instead of truly comprehending code changes. The work matters for improving prioritization of risky changes during code review and continuous integration.

Core claim

ReDef supplies 3,164 defective and 10,268 clean modifications with reliable labels derived from revert commits and conservative GPT triage. Across CodeBERT, CodeT5+, UniXcoder, and Qwen2.5, compact diff-style encodings yield higher predictive performance than full-function formats. Four counterfactual strategies that distort change semantics leave performance essentially unchanged, which the paper interprets as evidence that models rely on superficial cues rather than semantic understanding of modifications.
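To make the encoding contrast concrete, here is a minimal sketch of two input formats for the same change: a whole-function format and a compact diff-style format. The five encodings the paper actually evaluates are not spelled out on this page, so the function names, the `[SEP]` separator, and the C snippet below are illustrative assumptions, not the authors' implementation.

```python
import difflib


def whole_function_encoding(before: str, after: str, sep: str = "\n[SEP]\n") -> str:
    """Concatenate the full pre-change and post-change function bodies.

    The separator string is a placeholder assumption, not a token from the paper.
    """
    return before + sep + after


def compact_diff_encoding(before: str, after: str, context: int = 0) -> str:
    """Keep only the changed lines, prefixed with '-' (deleted) or '+' (added)."""
    lines = list(
        difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm="", n=context
        )
    )
    # The first two lines are the '---'/'+++' file headers; '@@' lines are hunk headers.
    return "\n".join(ln for ln in lines[2:] if not ln.startswith("@@"))


# Hypothetical C function with a one-line change (an off-by-one bound check).
before = (
    "int get(int *a, int n, int i) {\n"
    "    if (i < n)\n"
    "        return a[i];\n"
    "    return -1;\n"
    "}"
)
after = before.replace("i < n", "i <= n")

print(compact_diff_encoding(before, after))
```

The compact form drops the unchanged lines entirely, so the model input is dominated by the edit itself rather than by the surrounding function body.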

What carries the argument

Counterfactual perturbation strategies (swapping added/deleted blocks, inverting diff polarity) applied as diagnostic probes on top of the ReDef dataset to test whether models capture change semantics.
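The two perturbations named above can be sketched on a list of diff lines. This is one plausible reading only: the page does not define the four strategies precisely, so `invert_polarity` and `swap_blocks` are hypothetical helpers, and the assumption is that each line starts with `+`, `-`, or a space for context.

```python
def invert_polarity(diff_lines):
    """Flip every '+' marker to '-' and every '-' to '+'; context lines are unchanged."""
    flip = {"+": "-", "-": "+"}
    return [flip.get(ln[:1], ln[:1]) + ln[1:] for ln in diff_lines]


def swap_blocks(diff_lines):
    """Reorder each deleted-then-added pair of blocks so the added block comes first.

    Markers are kept as they are; only the block order changes.
    """
    out, i, n = [], 0, len(diff_lines)
    while i < n:
        j = i
        while j < n and diff_lines[j].startswith("-"):
            j += 1
        k = j
        while k < n and diff_lines[k].startswith("+"):
            k += 1
        if j > i and k > j:      # a '-' block directly followed by a '+' block
            out.extend(diff_lines[j:k])
            out.extend(diff_lines[i:j])
            i = k
        elif j > i:              # a deletion-only block: nothing to swap with
            out.extend(diff_lines[i:j])
            i = j
        else:                    # a context line or an addition-only line
            out.append(diff_lines[i])
            i += 1
    return out


# Hypothetical diff for illustration.
diff = [" int x = 0;", "-if (i < n)", "+if (i <= n)", " return x;"]
print(invert_polarity(diff))
print(swap_blocks(diff))
```

Polarity inversion presents the change as if it ran in the opposite direction, so a model that truly read the change direction would be expected to respond differently to the perturbed input.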

If this is right

  • Compact diff-style encodings improve JIT defect prediction accuracy for current code language models.
  • Models that remain stable under semantic distortions will likely fail on novel or complex change patterns.
  • High-quality datasets anchored by revert commits enable more trustworthy comparisons of model robustness.
  • Current evaluation practices may overstate model capability for real-world code review assistance.
  • Training objectives that penalize performance on perturbed inputs could be needed to build genuine change understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same superficial-cue reliance may appear in other code-related tasks that involve diffs or edits.
  • One extension would be to fine-tune models explicitly on ReDef plus its perturbed variants and measure whether robustness improves.
  • The results connect to broader questions about whether language models learn causal semantics or statistical associations in structured inputs like code.
  • Similar counterfactual probes could be applied to non-CLM approaches such as graph-based or rule-based defect predictors.

Load-bearing premise

Revert commits accurately mark bug-inducing changes and the GPT-assisted multi-vote triage removes ambiguous cases without systematic bias in labels.
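As a rough illustration of what anchoring on revert commits involves, the sketch below extracts the reverted commit hash from a commit message. It relies only on the default message that `git revert` writes ("This reverts commit <sha>."); the helper name is hypothetical, and the paper's real pipeline, including how it rules out non-bug reverts, is not described on this page.

```python
import re

# Default body line produced by `git revert`; hand-edited messages may differ.
REVERT_BODY = re.compile(r"This reverts commit ([0-9a-f]{7,40})")


def reverted_sha(message: str):
    """Return the hash of the commit being reverted, or None if the message has none."""
    match = REVERT_BODY.search(message)
    return match.group(1) if match else None


# Hypothetical commit message for illustration.
msg = (
    'Revert "Fix buffer size"\n\n'
    "This reverts commit 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b.\n"
)
print(reverted_sha(msg))
```

Finding the reverted commit is only the first step: a revert can also undo a refactoring or a dependency update, which is why the premise above is load-bearing.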

What would settle it

A clear drop in model accuracy on a held-out set of verified bug-inducing changes after applying the same block-swap or polarity-inversion perturbations that produced stable results in the reported tests.

Figures

Figures reproduced from arXiv: 2509.09192 by Doha Nam, Duksan Ryu, Jongmoon Baik, Taehyoun Kim.

Figure 1: Overall approach of ReDef construction and PLM evaluation.
Figure 2: Number of defective and clean modifications per project in the ReDef corpus.
Figure 3: Cumulative distribution of function lengths (tokens) across PLM tokenizers.

Original abstract

Just-in-Time software defect prediction (JIT-SDP) plays a critical role in prioritizing risky code changes during code review and continuous integration. However, existing datasets often suffer from noisy labels and low precision in identifying bug-inducing commits. To address this, we present ReDef (Revert-based Defect dataset), a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. Ambiguous instances are conservatively filtered out via a GPT-assisted triage process involving multiple votes and audits. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than prior resources. Beyond dataset construction, we provide a systematic evaluation of how Code Language Models (CLMs), specifically CodeBERT, CodeT5+, UniXcoder, and Qwen2.5, reason about code modifications. We first investigate which input encodings most effectively expose change information under five different strategies. We then design four counterfactual perturbation strategies (e.g., swapping added/deleted blocks, inverting diff polarity) to serve as diagnostic probes. We posit that if models genuinely capture change semantics, such distortions should lead to a clear decline in predictive performance. Our results show that compact diff-style encodings consistently outperform whole-function formats across all CLMs, supported by rigorous statistical confirmation. However, under counterfactual tests, performance remains effectively stable, revealing that what appears to be robustness in fact reflects a reliance on superficial cues rather than true semantic understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReDef, a new high-confidence benchmark dataset for just-in-time software defect prediction (JIT-SDP) comprising 3,164 defective and 10,268 clean function-level modifications from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits and clean cases by post-hoc history checks plus GPT-assisted multi-vote triage to filter ambiguous instances. The authors systematically evaluate Code Language Models (CodeBERT, CodeT5+, UniXcoder, Qwen2.5) across five input encoding strategies for code changes and apply four counterfactual perturbation probes (e.g., swapping added/deleted blocks, inverting diff polarity). They report that compact diff-style encodings outperform whole-function formats with statistical confirmation, yet model performance remains stable under perturbations, concluding that apparent robustness reflects reliance on superficial cues rather than true semantic understanding of code changes.

Significance. If the ReDef labels are shown to be high-precision, the work offers a valuable diagnostic contribution to the field by demonstrating that current CLMs may not capture genuine change semantics in defect prediction tasks. The use of counterfactual tests as probes and the focus on encoding formats provide a useful framework for future evaluations. The dataset itself could serve as a stronger baseline for JIT-SDP research compared to noisier prior resources, provided label validity is established.

major comments (2)
  1. [Dataset Construction] The central interpretive claim—that stable performance under counterfactual perturbations reveals reliance on superficial cues rather than semantic understanding—depends entirely on the reliability of the ReDef labels. In the dataset construction pipeline, the assumption that revert commits reliably mark bug-inducing changes (and that GPT multi-vote triage introduces no systematic bias) is asserted but not supported by quantitative validation, such as manual audit rates or analysis of non-bug revert reasons (e.g., refactoring or dependency updates). This is load-bearing for both the performance gap results and the counterfactual conclusions.
  2. [Results and Analysis] The abstract states that compact diff-style encodings 'consistently outperform whole-function formats across all CLMs, supported by rigorous statistical confirmation.' However, without reported details on the specific statistical tests, p-values, effect sizes, or confidence intervals (presumably in the results section or associated tables), the strength of this outperformance claim cannot be fully assessed and risks overinterpretation of the encoding comparison.
minor comments (2)
  1. [Counterfactual Perturbation Strategies] The description of the four counterfactual perturbation strategies is high-level; including one or two concrete before/after examples (perhaps in a table or figure) would clarify how the probes are implemented and aid reproducibility.
  2. [Dataset Construction] Project selection criteria for the 22 C/C++ repositories are not detailed in the provided summary; adding a brief justification or table of project characteristics (size, domain, history length) would strengthen the dataset description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where we agree and what revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Dataset Construction] The central interpretive claim—that stable performance under counterfactual perturbations reveals reliance on superficial cues rather than semantic understanding—depends entirely on the reliability of the ReDef labels. In the dataset construction pipeline, the assumption that revert commits reliably mark bug-inducing changes (and that GPT multi-vote triage introduces no systematic bias) is asserted but not supported by quantitative validation, such as manual audit rates or analysis of non-bug revert reasons (e.g., refactoring or dependency updates). This is load-bearing for both the performance gap results and the counterfactual conclusions.

    Authors: We agree that explicit quantitative validation of label quality is important for supporting the interpretive claims. Revert commits are a standard anchoring mechanism in JIT-SDP literature precisely because they provide direct evidence of defect introduction, and our pipeline applies additional post-hoc checks plus conservative multi-vote GPT triage with audits to exclude ambiguous cases. To directly address the concern, we will add a dedicated subsection to the dataset construction section that reports manual audit results on a random sample of 200 instances (stratified by project and label), including agreement rates, and a breakdown of revert commit reasons to quantify the proportion attributable to non-bug factors such as refactoring. revision: yes

  2. Referee: [Results and Analysis] The abstract states that compact diff-style encodings 'consistently outperform whole-function formats across all CLMs, supported by rigorous statistical confirmation.' However, without reported details on the specific statistical tests, p-values, effect sizes, or confidence intervals (presumably in the results section or associated tables), the strength of this outperformance claim cannot be fully assessed and risks overinterpretation of the encoding comparison.

    Authors: The manuscript reports the statistical analysis supporting the encoding comparison in Section 5.2 and the associated tables, using Wilcoxon signed-rank tests with p-values and effect sizes. To improve clarity and prevent any risk of overinterpretation, we will revise the abstract to explicitly name the statistical test employed and will add a brief summary paragraph in the results section that highlights the key p-values, effect sizes, and confidence intervals for the encoding comparisons. revision: partial
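The kind of paired comparison the rebuttal refers to can be sketched with an exact Wilcoxon signed-rank test in pure Python. Everything here is an assumption for illustration: the function name is hypothetical, the scores are made-up numbers rather than results from the paper, and the paper's actual test configuration (pairing unit, corrections, effect-size measure) is not given on this page.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied |differences| share an average rank.
    All 2^n sign patterns are enumerated, so this suits small n only.
    Returns (W_plus, p_value).
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1            # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mean = sum(ranks) / 2                # expected W_plus under the null hypothesis
    observed = abs(w_plus - mean)
    extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if abs(sum(r for r, s in zip(ranks, signs) if s) - mean) >= observed - 1e-12
    )
    return w_plus, extreme / 2 ** n


# Made-up F1 scores over six hypothetical runs; NOT results from the paper.
diff_f1 = [0.41, 0.38, 0.44, 0.40, 0.43, 0.39]
whole_f1 = [0.36, 0.37, 0.40, 0.37, 0.37, 0.32]
w, p = wilcoxon_signed_rank(diff_f1, whole_f1)
print(w, p)
```

With only a handful of pairs the smallest achievable two-sided p-value is 2 / 2^n, which is one reason effect sizes and confidence intervals are worth reporting next to p-values. For larger samples a library routine such as `scipy.stats.wilcoxon` is the more practical choice.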

Circularity Check

0 steps flagged

No circularity: empirical dataset and evaluation study

full rationale

This is an empirical paper focused on constructing the ReDef dataset from external open-source project histories (revert commits for defective labels, post-hoc checks for clean labels, and GPT multi-vote filtering) and then running controlled experiments on CLMs with different input encodings and counterfactual perturbations. No mathematical derivations, equations, fitted parameters, or first-principles claims appear in the provided text. All performance comparisons and robustness conclusions are grounded in direct experimental measurements on the curated external data rather than reducing to self-definitions, self-citations, or renamings of inputs. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the domain assumption that revert commits provide high-confidence defective labels and that stable performance under perturbations indicates lack of semantic understanding rather than other factors.

axioms (2)
  • domain assumption: Revert commits reliably indicate bug-inducing changes.
    Used to anchor defective cases in the dataset construction pipeline.
  • domain assumption: GPT-assisted multi-vote triage accurately filters ambiguous instances without bias.
    Applied to produce the final high-confidence labels.
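A conservative multi-vote filter of the kind assumed above can be sketched as follows. The number of votes, the agreement threshold, and the label strings are assumptions made for illustration; the page does not state how the paper's triage aggregates votes.

```python
from collections import Counter


def triage(votes, min_agreement=1.0):
    """Return the majority label if enough votes agree, otherwise None (drop the instance).

    min_agreement=1.0 keeps only unanimous cases, the most conservative setting.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


# Hypothetical votes from repeated model queries on the same modification.
print(triage(["defective", "defective", "defective"]))   # unanimous, kept
print(triage(["defective", "defective", "clean"]))        # split, dropped
```

Raising `min_agreement` trades dataset size for label confidence. It does not remove bias on its own: if every vote shares the same blind spot, a unanimous result can still be wrong.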

pith-pipeline@v0.9.0 · 5820 in / 1352 out tokens · 70677 ms · 2026-05-18T18:20:26.135889+00:00 · methodology



Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 5 internal anchors

  1. [1]

    Parameter-efficient fine-tuning of pre-trained code models for just-in-time defect prediction.Neural Computing and Applications, 36(27):16911–16940, 2024

    Manar Abu Talib, Ali Bou Nassif, Mohammad Azzeh, Yaser Alesh, and Yaman Afadar. Parameter-efficient fine-tuning of pre-trained code models for just-in-time defect prediction.Neural Computing and Applications, 36(27):16911–16940, 2024

  2. [2]

    Empirical study: How issue classification influences software defect prediction.IEEE access, 11:11732–11748, 2023

    Petar Afric, Davor Vukadin, Marin Silic, and Goran Delac. Empirical study: How issue classification influences software defect prediction.IEEE access, 11:11732–11748, 2023

  3. [3]

    Vulnerability detection in popular programming languages with language models.arXiv preprint arXiv:2412.15905, 2024

    Syafiq Al Atiiq, Christian Gehrmann, and Kevin Dahlén. Vulnerability detection in popular programming languages with language models.arXiv preprint arXiv:2412.15905, 2024

  4. [4]

    Prediction of software fault-prone classes using ensemble random forest with adaptive synthetic sampling algorithm.Automated Software Engineering, 29(1):6, 2022

    A Balaram and S Vasundra. Prediction of software fault-prone classes using ensemble random forest with adaptive synthetic sampling algorithm.Automated Software Engineering, 29(1):6, 2022

  5. [5]

    The limited impact of individual developer data on software defect prediction.Empirical Software Engineering, 18(3):478–505, 2013

    Robert M Bell, Thomas J Ostrand, and Elaine J Weyuker. The limited impact of individual developer data on software defect prediction.Empirical Software Engineering, 18(3):478–505, 2013

  6. [6]

    Diversevul: A new vulnerable source code dataset for deep learning based vulnerability detection

    Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David Wagner. Diversevul: A new vulnerable source code dataset for deep learning based vulnerability detection. InProceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, pages 654–668, 2023

  7. [7]

    routledge, 2013

    Jacob Cohen.Statistical power analysis for the behavioral sciences. routledge, 2013

  8. [8]

    Semantic source code segmentation using small and large language models.arXiv preprint arXiv:2507.08992, 2025

    Abdelhalim Dahou, Ansgar Scherp, Sebastian Kurten, Brigitte Mathiak, and Madhu Chauhan. Semantic source code segmentation using small and large language models.arXiv preprint arXiv:2507.08992, 2025

  9. [9]

    Vulnerability detection with code language models: How far are we?arXiv preprint arXiv:2403.18624, 2024

    Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we?arXiv preprint arXiv:2403.18624, 2024. 16 Nam et al

  10. [10]

    Predicting defect-prone software modules using support vector machines

    Karim O Elish and Mahmoud O Elish. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software, 81(5):649–660, 2008

  11. [11]

    Ac/c++ code vulnerability dataset with code changes and cve summaries

    Jiahao Fan, Yi Li, Shaohua Wang, and Tien N Nguyen. Ac/c++ code vulnerability dataset with code changes and cve summaries. InProceedings of the 17th international conference on mining software repositories, pages 508–512, 2020

  12. [12]

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages.arXiv preprint arXiv:2002.08155, 2020

  13. [13]

    Learning in the wild: Towards leveraging unlabeled data for effectively tuning pre-trained code models

    Shuzheng Gao, Wenxin Mao, Cuiyun Gao, Li Li, Xing Hu, Xin Xia, and Michael R Lyu. Learning in the wild: Towards leveraging unlabeled data for effectively tuning pre-trained code models. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024

  14. [14]

    The earlybird catches the bug: On exploiting early layers of encoder models for more efficient code classification

    Anastasiia Grishina, Max Hort, and Leon Moonen. The earlybird catches the bug: On exploiting early layers of encoder models for more efficient code classification. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 895–907, 2023

  15. [15]

    UniXcoder: Unified Cross-Modal Pre-training for Code Representation

    Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. Unixcoder: Unified cross-modal pre-training for code representation.arXiv preprint arXiv:2203.03850, 2022

  16. [16]

    A study on the impact of pre-trained model on just-in-time defect prediction

    Yuxiang Guo, Xiaopeng Gao, Zhenyu Zhang, Wing Kwong Chan, and Bo Jiang. A study on the impact of pre-trained model on just-in-time defect prediction. In2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS), pages 105–116. IEEE, 2023

  17. [17]

    Problems with szz and features: An empirical study of the state of practice of defect prediction data collection.Empirical Software Engineering, 27(2):42, 2022

    Steffen Herbold, Alexander Trautsch, Fabian Trautsch, and Benjamin Ledel. Problems with szz and features: An empirical study of the state of practice of defect prediction data collection.Empirical Software Engineering, 27(2):42, 2022

  18. [18]

    Deepjit: an end-to-end deep learning framework for just-in-time defect prediction

    Thong Hoang, Hoa Khanh Dam, Yasutaka Kamei, David Lo, and Naoyasu Ubayashi. Deepjit: an end-to-end deep learning framework for just-in-time defect prediction. In2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pages 34–45. IEEE, 2019

  19. [19]

    Cc2vec: Distributed representations of code changes

    Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. Cc2vec: Distributed representations of code changes. In Proceedings of the ACM/IEEE 42nd international conference on software engineering, pages 518–529, 2020

  20. [20]

    A simple sequentially rejective multiple test procedure.Scandinavian journal of statistics, pages 65–70, 1979

    Sture Holm. A simple sequentially rejective multiple test procedure.Scandinavian journal of statistics, pages 65–70, 1979

  21. [21]

    A framework for software defect prediction and metric selection.IEEE access, 6:2844–2858, 2017

    Shamsul Huda, Sultan Alyahya, Md Mohsin Ali, Shafiq Ahmad, Jemal Abawajy, Hmood Al-Dossari, and John Yearwood. A framework for software defect prediction and metric selection.IEEE access, 6:2844–2858, 2017

  22. [22]

    Adversarial Examples for Evaluating Reading Comprehension Systems

    Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems.arXiv preprint arXiv:1707.07328, 2017

  23. [23]

    Just-in-time software defect prediction via bi-modal change representation learning.Journal of Systems and Software, 219:112253, 2025

    Yuze Jiang, Beijun Shen, and Xiaodong Gu. Just-in-time software defect prediction via bi-modal change representation learning.Journal of Systems and Software, 219:112253, 2025

  24. [24]

    Defects4j: A database of existing faults to enable controlled testing studies for java programs

    René Just, Darioush Jalali, and Michael D Ernst. Defects4j: A database of existing faults to enable controlled testing studies for java programs. InProceedings of the 2014 international symposium on software testing and analysis, pages 437–440, 2014

  25. [25]

    A large-scale empirical study of just-in-time quality assurance.IEEE Transactions on Software Engineering, 39(6): 757–773, 2012

    Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi. A large-scale empirical study of just-in-time quality assurance.IEEE Transactions on Software Engineering, 39(6): 757–773, 2012

  26. [26]

    Studying just-in-time defect prediction using cross-project models.Empirical Software Engineering, 21(5):2072–2106, 2016

    Yasutaka Kamei, Takafumi Fukushima, Shane McIntosh, Kazuhiro Yamashita, Naoyasu Ubayashi, and Ahmed E Hassan. Studying just-in-time defect prediction using cross-project models.Empirical Software Engineering, 21(5):2072–2106, 2016

  27. [27]

    Automating modern code review processes with code similarity measurement.Information and Software Technology, 173:107490, 2024

    Yusuf Kartal, E Kaan Akdeniz, and Kemal Özkan. Automating modern code review processes with code similarity measurement.Information and Software Technology, 173:107490, 2024

  28. [28]

    Tree-based software quality estimation models for fault prediction

    Taghi M Khoshgoftaar and Naeem Seliya. Tree-based software quality estimation models for fault prediction. In Proceedings Eighth IEEE Symposium on Software Metrics, pages 203–214. IEEE, 2002

  29. [29]

    Classifying software changes: Clean or buggy?IEEE Transactions on software engineering, 34(2):181–196, 2008

    Sunghun Kim, E James Whitehead, and Yi Zhang. Classifying software changes: Clean or buggy?IEEE Transactions on software engineering, 34(2):181–196, 2008

  30. [30]

    Logistic regression in rare events data.Political analysis, 9(2):137–163, 2001

    Gary King and Langche Zeng. Logistic regression in rare events data.Political analysis, 9(2):137–163, 2001

  31. [31]

    Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and anovas.Frontiers in psychology, 4:863, 2013

    Daniël Lakens. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and anovas.Frontiers in psychology, 4:863, 2013

  32. [32]

    Automating code review activities by large-scale pre-training

    Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svy- atkovskiy, Shengyu Fu, et al. Automating code review activities by large-scale pre-training. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1035–1047, 202...

  33. [33]

    Cct5: A code-change-oriented pre-trained model

    Bo Lin, Shangwen Wang, Zhongxin Liu, Yepang Liu, Xin Xia, and Xiaoguang Mao. Cct5: A code-change-oriented pre-trained model. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1509–1521, 2023

  34. [34]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017

  35. [35]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  36. [36]

    Evaluating szz implementations: An empirical study on the linux kernel.IEEE Transactions on Software Engineering, 50(9):2219–2239, 2024

    Yunbo Lyu, Hong Jin Kang, Ratnadira Widyasari, Julia Lawall, and David Lo. Evaluating szz implementations: An empirical study on the linux kernel.IEEE Transactions on Software Engineering, 50(9):2219–2239, 2024

  37. [37]

    A systematic review of machine learning techniques for software fault prediction.Applied Soft Computing, 27:504–518, 2015

    Ruchika Malhotra. A systematic review of machine learning techniques for software fault prediction.Applied Soft Computing, 27:504–518, 2015

  38. [38]

    Applying codebert for automated program repair of java simple bugs

    Ehsan Mashhadi and Hadi Hemmati. Applying codebert for automated program repair of java simple bugs. In2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pages 505–509. IEEE, 2021

  39. [39]

    Topic-based defect prediction (nier track)

    Tung Thanh Nguyen, Tien N Nguyen, and Tu Minh Phuong. Topic-based defect prediction (nier track). InProceedings of the 33rd international conference on software engineering, pages 932–935, 2011

  40. [40]

    The best of both worlds: integrating semantic features with expert features for defect prediction and localization

    Chao Ni, Wei Wang, Kaiwen Yang, Xin Xia, Kui Liu, and David Lo. The best of both worlds: integrating semantic features with expert features for defect prediction and localization. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 672–683, 2022

  41. [41]

    Function-level vulnerability detection through fusing multi-modal knowledge

    Chao Ni, Xinrong Guo, Yan Zhu, Xiaodan Xu, and Xiaohu Yang. Function-level vulnerability detection through fusing multi-modal knowledge. In2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1911–1918. IEEE, 2023

  42. [42]

    An empirical comparison of pre- trained models of source code

    Changan Niu, Chuanyi Li, Vincent Ng, Dongxiao Chen, Jidong Ge, and Bin Luo. An empirical comparison of pre- trained models of source code. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 2136–2148. IEEE, 2023

  43. [43]

    Refactoring ≠ bug-inducing: Improving defect prediction with code change tactics analysis.arXiv preprint arXiv:2507.19714, 2025

    Feifei Niu, Junqian Shao, Christoph Mayr-Dorn, Liguo Huang, Wesley KG Assunção, Chuanyi Li, Jidong Ge, and Alexander Egyed. Refactoring ≠ bug-inducing: Improving defect prediction with code change tactics analysis.arXiv preprint arXiv:2507.19714, 2025

  44. [44]

    Deep learning for software defect prediction: A survey

    Safa Omri and Carsten Sinz. Deep learning for software defect prediction: A survey. InProceedings of the IEEE/ACM 42nd international conference on software engineering workshops, pages 209–214, 2020

  45. [45]

    How to measure success of fault prediction models

    Thomas J Ostrand and Elaine J Weyuker. How to measure success of fault prediction models. InFourth international workshop on Software quality assurance: in conjunction with the 6th ESEC/FSE joint meeting, pages 25–30, 2007

  46. [46]

    Semantically equivalent adversarial rules for debugging nlp models

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Semantically equivalent adversarial rules for debugging nlp models. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (volume 1: long papers), pages 856–865, 2018

  47. [47]

    An industrial study on the risk of software changes

    Emad Shihab, Ahmed E Hassan, Bram Adams, and Zhen Ming Jiang. An industrial study on the risk of software changes. InProceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, pages 1–11, 2012

  48. [48]

    When do changes induce fixes?ACM sigsoft software engineering notes, 30(4):1–5, 2005

    Jacek Śliwerski, Thomas Zimmermann, and Andreas Zeller. When do changes induce fixes?ACM sigsoft software engineering notes, 30(4):1–5, 2005

  49. [49]

    Automatic code summarization via chatgpt: How far are we?arXiv preprint arXiv:2305.12865, 2023

    Weisong Sun, Chunrong Fang, Yudu You, Yun Miao, Yi Liu, Yuekang Li, Gelei Deng, Shenghan Huang, Yuchen Chen, Quanjun Zhang, et al. Automatic code summarization via chatgpt: How far are we?arXiv preprint arXiv:2305.12865, 2023

  50. [50]

    An introduction to the bootstrap.Monographs on statistics and applied probability, 57(1):1–436, 1993

    Robert J Tibshirani and Bradley Efron. An introduction to the bootstrap.Monographs on statistics and applied probability, 57(1):1–436, 1993

  51. [51]

    Practical considerations in deploying ai for defect prediction: a case study within the turkish telecommunication industry

    Ayşe Tosun, Burak Turhan, and Ayşe Bener. Practical considerations in deploying ai for defect prediction: a case study within the turkish telecommunication industry. InProceedings of the 5th International Conference on Predictor Models in Software Engineering, pages 1–9, 2009

  52. [52]

    On the relative value of cross-company and within-company data for defect prediction

    Burak Turhan, Tim Menzies, Ayşe B Bener, and Justin Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009

  53. [53]

    A systematic literature review of software defect prediction

    Romi Satria Wahono. A systematic literature review of software defect prediction. Journal of software engineering, 1(1):1–16, 2015

  54. [54]

    Compressed C4.5 models for software defect prediction

    Jun Wang, Beijun Shen, and Yuting Chen. Compressed C4.5 models for software defect prediction. In 2012 12th International Conference on quality software, pages 13–16. IEEE, 2012

  55. [55]

    Deep semantic feature learning for software defect prediction

    Song Wang, Taiyue Liu, Jaechang Nam, and Lin Tan. Deep semantic feature learning for software defect prediction. IEEE Transactions on Software Engineering, 46(12):1267–1293, 2018

  56. [56]

    Rap-gen: Retrieval-augmented patch generation with codet5 for automatic program repair

    Weishi Wang, Yue Wang, Shafiq Joty, and Steven CH Hoi. Rap-gen: Retrieval-augmented patch generation with codet5 for automatic program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 146–158, 2023

  57. [57]

    CodeT5+: Open Code Large Language Models for Code Understanding and Generation

    Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922, 2023

  58. [58]

    Line-level semantic structure learning for code vulnerability detection

    Ziliang Wang, Ge Li, Jia Li, Yihong Dong, Yingfei Xiong, and Zhi Jin. Line-level semantic structure learning for code vulnerability detection. arXiv preprint arXiv:2407.18877, 2024

  59. [59]

    Individual comparisons by ranking methods

    Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83, 1945

  60. [60]

    Automated program repair in the era of large pre-trained language models

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494. IEEE, 2023

  61. [61]

    Empirical study on transformer-based techniques for software engineering

    Yan Xiao, Xinyue Zuo, Lei Xue, Kailong Wang, Jin Song Dong, and Ivan Beschastnikh. Empirical study on transformer-based techniques for software engineering. arXiv preprint arXiv:2310.00399, 2023

  62. [62]

    Characterizing and identifying reverted commits

    Meng Yan, Xin Xia, David Lo, Ahmed E Hassan, and Shanping Li. Characterizing and identifying reverted commits. Empirical Software Engineering, 24(4):2171–2208, 2019

  63. [63]

    Deep learning for just-in-time defect prediction

    Xinli Yang, David Lo, Xin Xia, Yun Zhang, and Jianling Sun. Deep learning for just-in-time defect prediction. In 2015 IEEE International conference on software quality, reliability and security, pages 17–26. IEEE, 2015

  64. [64]

    Tram: A token-level retrieval-augmented mechanism for source code summarization

    Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu, Shouling Ji, and Wenhai Wang. Tram: A token-level retrieval-augmented mechanism for source code summarization. arXiv preprint arXiv:2305.11074, 2023

  65. [65]

    Revisiting sentiment analysis for software engineering in the era of large language models

    Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, and David Lo. Revisiting sentiment analysis for software engineering in the era of large language models. ACM Transactions on Software Engineering and Methodology, 34(3):1–30, 2025

  66. [66]

    Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection

    Ting Zhang, Chengran Yang, Yindu Su, Martin Weyssow, Hung Nguyen, Tan Bui, Hong Jin Kang, Yikun Li, Eng Lieh Ouh, Lwin Khin Shar, et al. Benchmarking large language models for multi-language software vulnerability detection. arXiv preprint arXiv:2503.01449, 2025

  67. [67]

    Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks

    Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems, 32, 2019