ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?

arxiv: 2509.09192 · v2 · submitted 2025-09-11 · 💻 cs.SE · cs.AI

Doha Nam, Taehyoun Kim, Duksan Ryu, Jongmoon Baik

Pith reviewed 2026-05-18 18:20 UTC · model grok-4.3

keywords just-in-time defect prediction · code language models · code changes · counterfactual testing · revert commits · benchmark dataset · software defect prediction · input encodings

The pith

Code language models detect defects from superficial cues in diffs rather than genuine semantic understanding of changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs ReDef, a high-confidence dataset of function-level code modifications from 22 large C/C++ projects, using revert commits to anchor defective cases and post-hoc checks plus GPT-assisted multi-vote filtering to remove ambiguous ones. It then tests four code language models on five input encoding strategies for just-in-time defect prediction and introduces four counterfactual perturbations, such as swapping added and deleted blocks or inverting diff polarity, as probes for semantic grasp. Compact diff-style encodings outperform whole-function formats with statistical backing, yet model performance remains stable under the perturbations. This leads the authors to conclude that the models exploit surface patterns instead of truly comprehending code changes. The work matters for improving prioritization of risky changes during code review and continuous integration.

Core claim

ReDef supplies 3,164 defective and 10,268 clean modifications with reliable labels derived from revert commits and conservative GPT triage. Across CodeBERT, CodeT5+, UniXcoder, and Qwen2.5, compact diff-style encodings yield higher predictive performance than full-function formats. Four counterfactual strategies that distort change semantics leave performance essentially unchanged, which the paper interprets as evidence that models rely on superficial cues rather than semantic understanding of modifications.
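As a quick sanity check on those counts, the class balance can be worked out directly. The sketch below uses only the two totals quoted above; the variable names are illustrative, and the always-clean baseline is an editorial reference point, not something the paper reports.

```python
# Totals quoted in the core claim above.
defective = 3_164
clean = 10_268

total = defective + clean
defect_rate = defective / total        # share of defective modifications
majority_accuracy = clean / total      # accuracy of always predicting "clean"

print(f"total modifications: {total}")
print(f"defect rate: {defect_rate:.1%}")
print(f"always-clean accuracy: {majority_accuracy:.1%}")
```

Roughly one modification in four is defective, so plain accuracy would flatter a model that never flags anything. That is one reason to read stability claims against an imbalance-aware metric.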

What carries the argument

Counterfactual perturbation strategies (swapping added/deleted blocks, inverting diff polarity) applied as diagnostic probes on top of the ReDef dataset to test whether models capture change semantics.
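To make the two named probes concrete, here is a minimal sketch of one plausible reading of them. The summary does not give the paper's exact definitions, so `invert_polarity`, `swap_blocks`, and the toy diff are assumptions for illustration only. The input is taken to be the body lines of a diff hunk, each starting with `+`, `-`, or a space, with no `---`/`+++` file headers.

```python
from itertools import groupby


def marker(line):
    """Return '+', '-', or ' ' (context) for a diff body line."""
    return line[0] if line[:1] in ("+", "-") else " "


def invert_polarity(diff_lines):
    """Flip '+' and '-' markers while keeping line order (assumed reading)."""
    flipped = []
    for line in diff_lines:
        if line.startswith("+"):
            flipped.append("-" + line[1:])
        elif line.startswith("-"):
            flipped.append("+" + line[1:])
        else:
            flipped.append(line)
    return flipped


def swap_blocks(diff_lines):
    """Put each added block before the deleted block it follows (assumed reading)."""
    # Collapse the hunk into runs of consecutive lines with the same marker.
    runs = [(m, list(group)) for m, group in groupby(diff_lines, key=marker)]
    out = []
    i = 0
    while i < len(runs):
        is_pair = (
            i + 1 < len(runs)
            and runs[i][0] == "-"
            and runs[i + 1][0] == "+"
        )
        if is_pair:
            out.extend(runs[i + 1][1])  # added block first
            out.extend(runs[i][1])      # then the deleted block
            i += 2
        else:
            out.extend(runs[i][1])
            i += 1
    return out


# Toy hunk, invented for illustration (not taken from ReDef).
diff = [
    " int f(int *p) {",
    "-    if (p == NULL) return -1;",
    "+    if (!p) return 0;",
    "     return *p;",
    " }",
]

for line in invert_polarity(diff):
    print(line)
```

Under this reading, polarity inversion turns the change into its reverse, while block swapping keeps the markers but changes their order. A model that reads the change semantically should score these inputs differently from the original.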

If this is right

Compact diff-style encodings improve JIT defect prediction accuracy for current code language models.
Models that remain stable under semantic distortions will likely fail on novel or complex change patterns.
High-quality datasets anchored by revert commits enable more trustworthy comparisons of model robustness.
Current evaluation practices may overstate model capability for real-world code review assistance.
Training objectives that penalize performance on perturbed inputs could be needed to build genuine change understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same superficial-cue reliance may appear in other code-related tasks that involve diffs or edits.
One extension would be to fine-tune models explicitly on ReDef plus its perturbed variants and measure whether robustness improves.
The results connect to broader questions about whether language models learn causal semantics or statistical associations in structured inputs like code.
Similar counterfactual probes could be applied to non-CLM approaches such as graph-based or rule-based defect predictors.

Load-bearing premise

Revert commits accurately mark bug-inducing changes and the GPT-assisted multi-vote triage removes ambiguous cases without systematic bias in labels.
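The premise is easier to weigh with the candidate-finding step in view. The sketch below relies on one fact about Git: `git revert` writes a default message whose body contains "This reverts commit <sha>.". Everything else is an assumption, including the helper name `reverted_sha`, the sample message, and the idea that the paper's pipeline starts this way.

```python
import re

# Matches the line that `git revert` adds to the commit body by default.
REVERT_BODY = re.compile(r"This reverts commit ([0-9a-f]{7,40})")


def reverted_sha(message):
    """Return the SHA a commit message says it reverts, or None."""
    match = REVERT_BODY.search(message)
    return match.group(1) if match else None


# Hypothetical commit message, for illustration only.
message = (
    'Revert "Fix buffer size"\n'
    "\n"
    "This reverts commit 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b.\n"
)

print(reverted_sha(message))
print(reverted_sha("Refactor parser"))
```

A match only shows that a change was undone, not why. Reverts made for refactoring, build, or dependency reasons pass this filter too, which is why the later triage step carries so much weight.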

What would settle it

A clear drop in model accuracy on a held-out set of verified bug-inducing changes after applying the same block-swap or polarity-inversion perturbations that produced stable results in the reported tests.
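The check described above amounts to scoring the same labels twice, once with predictions on the original inputs and once on the perturbed ones. A minimal sketch follows, assuming binary labels and F1 as the metric; the summary does not name the paper's metric, and the toy labels and predictions are invented.

```python
def f1_score(labels, preds):
    """F1 for the positive (defective = 1) class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def perturbation_drop(labels, preds_original, preds_perturbed):
    """Positive values mean the model got worse on perturbed inputs."""
    return f1_score(labels, preds_original) - f1_score(labels, preds_perturbed)


# Toy data, invented for illustration (not results from the paper).
labels = [1, 1, 0, 0, 1, 0]
preds_original = [1, 1, 0, 0, 0, 0]
preds_perturbed = [1, 0, 0, 1, 0, 0]

print(perturbation_drop(labels, preds_original, preds_perturbed))
```

A drop near zero across perturbations is the pattern the paper reports. A clearly positive drop on verified bug-inducing changes would count against its conclusion.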

Figures

Figures reproduced from arXiv: 2509.09192 by Doha Nam, Duksan Ryu, Jongmoon Baik, Taehyoun Kim.

**Figure 2.** Number of defective and clean modifications per project in the ReDef corpus.

**Figure 3.** Cumulative distribution of function lengths (tokens) across PLM tokenizers.

Original abstract

Just-in-Time software defect prediction (JIT-SDP) plays a critical role in prioritizing risky code changes during code review and continuous integration. However, existing datasets often suffer from noisy labels and low precision in identifying bug-inducing commits. To address this, we present ReDef (Revert-based Defect dataset), a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. Ambiguous instances are conservatively filtered out via a GPT-assisted triage process involving multiple votes and audits. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than prior resources. Beyond dataset construction, we provide a systematic evaluation of how Code Language Models (CLMs), specifically CodeBERT, CodeT5+, UniXcoder, and Qwen2.5, reason about code modifications. We first investigate which input encodings most effectively expose change information under five different strategies. We then design four counterfactual perturbation strategies (e.g., swapping added/deleted blocks, inverting diff polarity) to serve as diagnostic probes. We posit that if models genuinely capture change semantics, such distortions should lead to a clear decline in predictive performance. Our results show that compact diff-style encodings consistently outperform whole-function formats across all CLMs, supported by rigorous statistical confirmation. However, under counterfactual tests, performance remains effectively stable, revealing that what appears to be robustness in fact reflects a reliance on superficial cues rather than true semantic understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReDef builds a cleaner dataset from reverts for JIT defect prediction and tests CLM change understanding, but label noise remains a real risk.

ReDef is a new dataset for just-in-time defect prediction drawn from revert commits in 22 C/C++ projects, and the paper uses it to check whether code models actually read change semantics or just grab surface signals. The authors mark defective changes via reverts, clean the rest with history checks and multi-vote GPT filtering, and land on 3,164 defective plus 10,268 clean function edits. They then compare five input encodings and run four counterfactual perturbations, such as block swaps and polarity flips, to see if performance falls when the meaning is altered.

The dataset construction and the specific perturbation probes are the fresh pieces relative to earlier JIT-SDP work. The comparison of encodings is also useful: diff-style inputs beat whole-function ones across CodeBERT, CodeT5+, UniXcoder, and Qwen2.5, backed by statistical tests. That finding is straightforward and could guide practical choices.

The soft spot sits in the labels. Revert commits often mark non-bug work like refactors or updates, so some defective cases may be mislabeled. The GPT triage reduces ambiguity but can import its own model biases. If those labels carry noticeable noise, the stable results under perturbations lose force as evidence of missing semantic grasp; the tests might simply not be strong enough, or the data too mixed, to reveal real understanding. The abstract claims rigorous confirmation but leaves details on exact perturbation impacts and controls for the full text.

This paper is for software engineering groups that build or evaluate defect predictors and for researchers probing what code models actually capture in diffs. Readers who need better-labeled benchmarks or simple diagnostic tests will get something concrete from it. It deserves a serious referee because the dataset addresses a documented weakness in prior resources and the probe idea is worth checking even if the current evidence needs tightening. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReDef, a new high-confidence benchmark dataset for just-in-time software defect prediction (JIT-SDP) comprising 3,164 defective and 10,268 clean function-level modifications from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits and clean cases by post-hoc history checks plus GPT-assisted multi-vote triage to filter ambiguous instances. The authors systematically evaluate Code Language Models (CodeBERT, CodeT5+, UniXcoder, Qwen2.5) across five input encoding strategies for code changes and apply four counterfactual perturbation probes (e.g., swapping added/deleted blocks, inverting diff polarity). They report that compact diff-style encodings outperform whole-function formats with statistical confirmation, yet model performance remains stable under perturbations, concluding that apparent robustness reflects reliance on superficial cues rather than true semantic understanding of code changes.

Significance. If the ReDef labels are shown to be high-precision, the work offers a valuable diagnostic contribution to the field by demonstrating that current CLMs may not capture genuine change semantics in defect prediction tasks. The use of counterfactual tests as probes and the focus on encoding formats provide a useful framework for future evaluations. The dataset itself could serve as a stronger baseline for JIT-SDP research compared to noisier prior resources, provided label validity is established.

major comments (2)

[Dataset Construction] The central interpretive claim—that stable performance under counterfactual perturbations reveals reliance on superficial cues rather than semantic understanding—depends entirely on the reliability of the ReDef labels. In the dataset construction pipeline, the assumption that revert commits reliably mark bug-inducing changes (and that GPT multi-vote triage introduces no systematic bias) is asserted but not supported by quantitative validation, such as manual audit rates or analysis of non-bug revert reasons (e.g., refactoring or dependency updates). This is load-bearing for both the performance gap results and the counterfactual conclusions.
[Results and Analysis] The abstract states that compact diff-style encodings 'consistently outperform whole-function formats across all CLMs, supported by rigorous statistical confirmation.' However, without reported details on the specific statistical tests, p-values, effect sizes, or confidence intervals (presumably in the results section or associated tables), the strength of this outperformance claim cannot be fully assessed and risks overinterpretation of the encoding comparison.

minor comments (2)

[Counterfactual Perturbation Strategies] The description of the four counterfactual perturbation strategies is high-level; including one or two concrete before/after examples (perhaps in a table or figure) would clarify how the probes are implemented and aid reproducibility.
[Dataset Construction] Project selection criteria for the 22 C/C++ repositories are not detailed in the provided summary; adding a brief justification or table of project characteristics (size, domain, history length) would strengthen the dataset description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where we agree and what revisions we will make to strengthen the manuscript.

Referee: [Dataset Construction] The central interpretive claim—that stable performance under counterfactual perturbations reveals reliance on superficial cues rather than semantic understanding—depends entirely on the reliability of the ReDef labels. In the dataset construction pipeline, the assumption that revert commits reliably mark bug-inducing changes (and that GPT multi-vote triage introduces no systematic bias) is asserted but not supported by quantitative validation, such as manual audit rates or analysis of non-bug revert reasons (e.g., refactoring or dependency updates). This is load-bearing for both the performance gap results and the counterfactual conclusions.

Authors: We agree that explicit quantitative validation of label quality is important for supporting the interpretive claims. Revert commits are a standard anchoring mechanism in JIT-SDP literature precisely because they provide direct evidence of defect introduction, and our pipeline applies additional post-hoc checks plus conservative multi-vote GPT triage with audits to exclude ambiguous cases. To directly address the concern, we will add a dedicated subsection to the dataset construction section that reports manual audit results on a random sample of 200 instances (stratified by project and label), including agreement rates, and a breakdown of revert commit reasons to quantify the proportion attributable to non-bug factors such as refactoring. revision: yes
Referee: [Results and Analysis] The abstract states that compact diff-style encodings 'consistently outperform whole-function formats across all CLMs, supported by rigorous statistical confirmation.' However, without reported details on the specific statistical tests, p-values, effect sizes, or confidence intervals (presumably in the results section or associated tables), the strength of this outperformance claim cannot be fully assessed and risks overinterpretation of the encoding comparison.

Authors: The manuscript reports the statistical analysis supporting the encoding comparison in Section 5.2 and the associated tables, using Wilcoxon signed-rank tests with p-values and effect sizes. To improve clarity and prevent any risk of overinterpretation, we will revise the abstract to explicitly name the statistical test employed and will add a brief summary paragraph in the results section that highlights the key p-values, effect sizes, and confidence intervals for the encoding comparisons. revision: partial
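For readers unfamiliar with the test named in the second response, the following is a minimal pure-Python sketch of an exact two-sided Wilcoxon signed-rank test on paired scores. It is not the authors' analysis code. The function name `wilcoxon_exact` and the six paired scores are invented for illustration, and the brute-force enumeration only suits small samples. In practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice.

```python
from itertools import product


def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)

    # Rank absolute differences from smallest to largest.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = sum(ranks)
    observed = abs(w_plus - total / 2)

    # Under the null hypothesis every sign pattern is equally likely,
    # so count the patterns at least as extreme as the observed one.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Toy paired scores (e.g. one pair per run), NOT results from the paper.
diff_style = [0.61, 0.64, 0.58, 0.66, 0.63, 0.60]
whole_function = [0.59, 0.60, 0.57, 0.60, 0.58, 0.57]

w_plus, p_value = wilcoxon_exact(diff_style, whole_function)
print(w_plus, p_value)
```

With six pairs that all favor one side, the smallest two-sided p-value this test can return is 2/64. That is why the number of paired runs matters as much as the size of the gap when judging "rigorous statistical confirmation".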

Circularity Check

0 steps flagged

No circularity: empirical dataset and evaluation study

This is an empirical paper focused on constructing the ReDef dataset from external open-source project histories (revert commits for defective labels, post-hoc checks for clean labels, and GPT multi-vote filtering) and then running controlled experiments on CLMs with different input encodings and counterfactual perturbations. No mathematical derivations, equations, fitted parameters, or first-principles claims appear in the provided text. All performance comparisons and robustness conclusions are grounded in direct experimental measurements on the curated external data rather than reducing to self-definitions, self-citations, or renamings of inputs. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the domain assumption that revert commits provide high-confidence defective labels and that stable performance under perturbations indicates lack of semantic understanding rather than other factors.

axioms (2)

domain assumption: Revert commits reliably indicate bug-inducing changes.
Used to anchor defective cases in the dataset construction pipeline.
domain assumption: GPT-assisted multi-vote triage accurately filters ambiguous instances without bias.
Applied to produce the final high-confidence labels.
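The second axiom is easier to scrutinize with a concrete voting rule in view. The sketch below shows one conservative rule: keep an instance only when the votes agree strongly enough, and drop it otherwise. The function name `triage`, the default unanimity threshold, and the three-vote examples are all assumptions; the summary says only that the process involves multiple votes and audits.

```python
from collections import Counter


def triage(votes, min_agreement=1.0):
    """Return the agreed label, or None if the instance should be dropped.

    `votes` is a list of labels from repeated model queries (assumed setup).
    `min_agreement` is the fraction of votes the top label must reach.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


print(triage(["defective", "defective", "defective"]))
print(triage(["defective", "defective", "clean"]))
print(triage(["defective", "defective", "clean"], min_agreement=0.6))
```

A strict threshold lowers label noise at the cost of discarding hard cases, and any systematic bias in what gets discarded carries through to the final dataset. That trade-off is the concern the referee raises.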

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: compact diff-style encodings consistently outperform whole-function formats... performance remains effectively stable under counterfactual tests

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 5 internal anchors

[1]

Parameter-efficient fine-tuning of pre-trained code models for just-in-time defect prediction.Neural Computing and Applications, 36(27):16911–16940, 2024

Manar Abu Talib, Ali Bou Nassif, Mohammad Azzeh, Yaser Alesh, and Yaman Afadar. Parameter-efficient fine-tuning of pre-trained code models for just-in-time defect prediction.Neural Computing and Applications, 36(27):16911–16940, 2024

work page 2024
[2]

Empirical study: How issue classification influences software defect prediction.IEEE access, 11:11732–11748, 2023

Petar Afric, Davor Vukadin, Marin Silic, and Goran Delac. Empirical study: How issue classification influences software defect prediction.IEEE access, 11:11732–11748, 2023

work page 2023
[3]

Vulnerability detection in popular programming languages with language models.arXiv preprint arXiv:2412.15905, 2024

Syafiq Al Atiiq, Christian Gehrmann, and Kevin Dahlén. Vulnerability detection in popular programming languages with language models.arXiv preprint arXiv:2412.15905, 2024

work page arXiv 2024
[4]

Prediction of software fault-prone classes using ensemble random forest with adaptive synthetic sampling algorithm.Automated Software Engineering, 29(1):6, 2022

A Balaram and S Vasundra. Prediction of software fault-prone classes using ensemble random forest with adaptive synthetic sampling algorithm.Automated Software Engineering, 29(1):6, 2022

work page 2022
[5]

The limited impact of individual developer data on software defect prediction.Empirical Software Engineering, 18(3):478–505, 2013

Robert M Bell, Thomas J Ostrand, and Elaine J Weyuker. The limited impact of individual developer data on software defect prediction.Empirical Software Engineering, 18(3):478–505, 2013

work page 2013
[6]

Diversevul: A new vulnerable source code dataset for deep learning based vulnerability detection

Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David Wagner. Diversevul: A new vulnerable source code dataset for deep learning based vulnerability detection. InProceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, pages 654–668, 2023

work page 2023
[7]

routledge, 2013

Jacob Cohen.Statistical power analysis for the behavioral sciences. routledge, 2013

work page 2013
[8]

Semantic source code segmentation using small and large language models.arXiv preprint arXiv:2507.08992, 2025

Abdelhalim Dahou, Ansgar Scherp, Sebastian Kurten, Brigitte Mathiak, and Madhu Chauhan. Semantic source code segmentation using small and large language models.arXiv preprint arXiv:2507.08992, 2025

work page arXiv 2025
[9]

Vulnerability detection with code language models: How far are we?arXiv preprint arXiv:2403.18624, 2024

Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we?arXiv preprint arXiv:2403.18624, 2024. 16 Nam et al

work page arXiv 2024
[10]

Predicting defect-prone software modules using support vector machines

Karim O Elish and Mahmoud O Elish. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software, 81(5):649–660, 2008

work page 2008
[11]

Ac/c++ code vulnerability dataset with code changes and cve summaries

Jiahao Fan, Yi Li, Shaohua Wang, and Tien N Nguyen. Ac/c++ code vulnerability dataset with code changes and cve summaries. InProceedings of the 17th international conference on mining software repositories, pages 508–512, 2020

work page 2020
[12]

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages.arXiv preprint arXiv:2002.08155, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002
[13]

Learning in the wild: Towards leveraging unlabeled data for effectively tuning pre-trained code models

Shuzheng Gao, Wenxin Mao, Cuiyun Gao, Li Li, Xing Hu, Xin Xia, and Michael R Lyu. Learning in the wild: Towards leveraging unlabeled data for effectively tuning pre-trained code models. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024

work page 2024
[14]

The earlybird catches the bug: On exploiting early layers of encoder models for more efficient code classification

Anastasiia Grishina, Max Hort, and Leon Moonen. The earlybird catches the bug: On exploiting early layers of encoder models for more efficient code classification. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 895–907, 2023

work page 2023
[15]

UniXcoder: Unified Cross-Modal Pre-training for Code Representation

Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. Unixcoder: Unified cross-modal pre-training for code representation.arXiv preprint arXiv:2203.03850, 2022

work page internal anchor Pith review arXiv 2022
[16]

A study on the impact of pre-trained model on just-in-time defect prediction

Yuxiang Guo, Xiaopeng Gao, Zhenyu Zhang, Wing Kwong Chan, and Bo Jiang. A study on the impact of pre-trained model on just-in-time defect prediction. In2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS), pages 105–116. IEEE, 2023

work page 2023
[17]

Problems with szz and features: An empirical study of the state of practice of defect prediction data collection.Empirical Software Engineering, 27(2):42, 2022

Steffen Herbold, Alexander Trautsch, Fabian Trautsch, and Benjamin Ledel. Problems with szz and features: An empirical study of the state of practice of defect prediction data collection.Empirical Software Engineering, 27(2):42, 2022

work page 2022
[18]

Deepjit: an end-to-end deep learning framework for just-in-time defect prediction

Thong Hoang, Hoa Khanh Dam, Yasutaka Kamei, David Lo, and Naoyasu Ubayashi. Deepjit: an end-to-end deep learning framework for just-in-time defect prediction. In2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pages 34–45. IEEE, 2019

work page 2019
[19]

Cc2vec: Distributed representations of code changes

Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. Cc2vec: Distributed representations of code changes. In Proceedings of the ACM/IEEE 42nd international conference on software engineering, pages 518–529, 2020

work page 2020
[20]

A simple sequentially rejective multiple test procedure.Scandinavian journal of statistics, pages 65–70, 1979

Sture Holm. A simple sequentially rejective multiple test procedure.Scandinavian journal of statistics, pages 65–70, 1979

work page 1979
[21]

A framework for software defect prediction and metric selection.IEEE access, 6:2844–2858, 2017

Shamsul Huda, Sultan Alyahya, Md Mohsin Ali, Shafiq Ahmad, Jemal Abawajy, Hmood Al-Dossari, and John Yearwood. A framework for software defect prediction and metric selection.IEEE access, 6:2844–2858, 2017

work page 2017
[22]

Adversarial Examples for Evaluating Reading Comprehension Systems

Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems.arXiv preprint arXiv:1707.07328, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[23]

Just-in-time software defect prediction via bi-modal change representation learning.Journal of Systems and Software, 219:112253, 2025

Yuze Jiang, Beijun Shen, and Xiaodong Gu. Just-in-time software defect prediction via bi-modal change representation learning.Journal of Systems and Software, 219:112253, 2025

work page 2025
[24]

Defects4j: A database of existing faults to enable controlled testing studies for java programs

René Just, Darioush Jalali, and Michael D Ernst. Defects4j: A database of existing faults to enable controlled testing studies for java programs. InProceedings of the 2014 international symposium on software testing and analysis, pages 437–440, 2014

work page 2014
[25]

A large-scale empirical study of just-in-time quality assurance.IEEE Transactions on Software Engineering, 39(6): 757–773, 2012

Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi. A large-scale empirical study of just-in-time quality assurance.IEEE Transactions on Software Engineering, 39(6): 757–773, 2012

work page 2012
[26]

Studying just-in-time defect prediction using cross-project models.Empirical Software Engineering, 21(5):2072–2106, 2016

Yasutaka Kamei, Takafumi Fukushima, Shane McIntosh, Kazuhiro Yamashita, Naoyasu Ubayashi, and Ahmed E Hassan. Studying just-in-time defect prediction using cross-project models.Empirical Software Engineering, 21(5):2072–2106, 2016

work page 2072
[27]

Automating modern code review processes with code similarity measurement.Information and Software Technology, 173:107490, 2024

Yusuf Kartal, E Kaan Akdeniz, and Kemal Özkan. Automating modern code review processes with code similarity measurement.Information and Software Technology, 173:107490, 2024

work page 2024
[28]

Tree-based software quality estimation models for fault prediction

Taghi M Khoshgoftaar and Naeem Seliya. Tree-based software quality estimation models for fault prediction. In Proceedings Eighth IEEE Symposium on Software Metrics, pages 203–214. IEEE, 2002

work page 2002
[29]

Classifying software changes: Clean or buggy?IEEE Transactions on software engineering, 34(2):181–196, 2008

Sunghun Kim, E James Whitehead, and Yi Zhang. Classifying software changes: Clean or buggy?IEEE Transactions on software engineering, 34(2):181–196, 2008

work page 2008
[30]

Logistic regression in rare events data.Political analysis, 9(2):137–163, 2001

Gary King and Langche Zeng. Logistic regression in rare events data.Political analysis, 9(2):137–163, 2001

work page 2001
[31]

Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and anovas.Frontiers in psychology, 4:863, 2013

Daniël Lakens. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and anovas.Frontiers in psychology, 4:863, 2013

work page 2013
[32]

Automating code review activities by large-scale pre-training

Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svy- atkovskiy, Shengyu Fu, et al. Automating code review activities by large-scale pre-training. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1035–1047, 202...

work page 2022
[33]

Cct5: A code-change-oriented pre-trained model

Bo Lin, Shangwen Wang, Zhongxin Liu, Yepang Liu, Xin Xia, and Xiaoguang Mao. Cct5: A code-change-oriented pre-trained model. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1509–1521, 2023

work page 2023
[34]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017

work page 2017
[35]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

Evaluating szz implementations: An empirical study on the linux kernel.IEEE Transactions on Software Engineering, 50(9):2219–2239, 2024

Yunbo Lyu, Hong Jin Kang, Ratnadira Widyasari, Julia Lawall, and David Lo. Evaluating szz implementations: An empirical study on the linux kernel.IEEE Transactions on Software Engineering, 50(9):2219–2239, 2024

work page 2024
[37]

A systematic review of machine learning techniques for software fault prediction.Applied Soft Computing, 27:504–518, 2015

Ruchika Malhotra. A systematic review of machine learning techniques for software fault prediction.Applied Soft Computing, 27:504–518, 2015

work page 2015
[38]

Applying codebert for automated program repair of java simple bugs

Ehsan Mashhadi and Hadi Hemmati. Applying codebert for automated program repair of java simple bugs. In2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pages 505–509. IEEE, 2021

work page 2021
[39]

Topic-based defect prediction (nier track)

Tung Thanh Nguyen, Tien N Nguyen, and Tu Minh Phuong. Topic-based defect prediction (nier track). InProceedings of the 33rd international conference on software engineering, pages 932–935, 2011

work page 2011
[40]

The best of both worlds: integrating semantic features with expert features for defect prediction and localization

Chao Ni, Wei Wang, Kaiwen Yang, Xin Xia, Kui Liu, and David Lo. The best of both worlds: integrating semantic features with expert features for defect prediction and localization. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 672–683, 2022

work page 2022
[41]

Function-level vulnerability detection through fusing multi-modal knowledge

Chao Ni, Xinrong Guo, Yan Zhu, Xiaodan Xu, and Xiaohu Yang. Function-level vulnerability detection through fusing multi-modal knowledge. In2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1911–1918. IEEE, 2023

work page 1911
[42]

An empirical comparison of pre- trained models of source code

Changan Niu, Chuanyi Li, Vincent Ng, Dongxiao Chen, Jidong Ge, and Bin Luo. An empirical comparison of pre- trained models of source code. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 2136–2148. IEEE, 2023

work page 2023
[43]

Refactoring ≠ bug-inducing: Improving defect prediction with code change tactics analysis.arXiv preprint arXiv:2507.19714, 2025

Feifei Niu, Junqian Shao, Christoph Mayr-Dorn, Liguo Huang, Wesley KG Assunção, Chuanyi Li, Jidong Ge, and Alexander Egyed. Refactoring ≠ bug-inducing: Improving defect prediction with code change tactics analysis.arXiv preprint arXiv:2507.19714, 2025

[44] Safa Omri and Carsten Sinz. Deep learning for software defect prediction: A survey. In Proceedings of the IEEE/ACM 42nd international conference on software engineering workshops, pages 209–214, 2020

[45] Thomas J Ostrand and Elaine J Weyuker. How to measure success of fault prediction models. In Fourth international workshop on Software quality assurance: in conjunction with the 6th ESEC/FSE joint meeting, pages 25–30, 2007

[46] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Semantically equivalent adversarial rules for debugging nlp models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (volume 1: long papers), pages 856–865, 2018

[47] Emad Shihab, Ahmed E Hassan, Bram Adams, and Zhen Ming Jiang. An industrial study on the risk of software changes. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, pages 1–11, 2012

[48] Jacek Śliwerski, Thomas Zimmermann, and Andreas Zeller. When do changes induce fixes? ACM sigsoft software engineering notes, 30(4):1–5, 2005

[49] Weisong Sun, Chunrong Fang, Yudu You, Yun Miao, Yi Liu, Yuekang Li, Gelei Deng, Shenghan Huang, Yuchen Chen, Quanjun Zhang, et al. Automatic code summarization via chatgpt: How far are we? arXiv preprint arXiv:2305.12865, 2023

[50] Robert J Tibshirani and Bradley Efron. An introduction to the bootstrap. Monographs on statistics and applied probability, 57(1):1–436, 1993

[51] Ayşe Tosun, Burak Turhan, and Ayşe Bener. Practical considerations in deploying ai for defect prediction: a case study within the turkish telecommunication industry. In Proceedings of the 5th International Conference on Predictor Models in Software Engineering, pages 1–9, 2009

[52] Burak Turhan, Tim Menzies, Ayşe B Bener, and Justin Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009

[53] Romi Satria Wahono. A systematic literature review of software defect prediction. Journal of software engineering, 1(1):1–16, 2015

[54] Jun Wang, Beijun Shen, and Yuting Chen. Compressed c4.5 models for software defect prediction. In 2012 12th International Conference on quality software, pages 13–16. IEEE, 2012

[55] Song Wang, Taiyue Liu, Jaechang Nam, and Lin Tan. Deep semantic feature learning for software defect prediction. IEEE Transactions on Software Engineering, 46(12):1267–1293, 2018

[56] Weishi Wang, Yue Wang, Shafiq Joty, and Steven CH Hoi. Rap-gen: Retrieval-augmented patch generation with codet5 for automatic program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 146–158, 2023

[57] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922, 2023

[58] Ziliang Wang, Ge Li, Jia Li, Yihong Dong, Yingfei Xiong, and Zhi Jin. Line-level semantic structure learning for code vulnerability detection. arXiv preprint arXiv:2407.18877, 2024

[59] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83, 1945

[60] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494. IEEE, 2023

[61] Yan Xiao, Xinyue Zuo, Lei Xue, Kailong Wang, Jin Song Dong, and Ivan Beschastnikh. Empirical study on transformer-based techniques for software engineering. arXiv preprint arXiv:2310.00399, 2023

[62] Meng Yan, Xin Xia, David Lo, Ahmed E Hassan, and Shanping Li. Characterizing and identifying reverted commits. Empirical Software Engineering, 24(4):2171–2208, 2019

[63] Xinli Yang, David Lo, Xin Xia, Yun Zhang, and Jianling Sun. Deep learning for just-in-time defect prediction. In 2015 IEEE International conference on software quality, reliability and security, pages 17–26. IEEE, 2015

[64] Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu, Shouling Ji, and Wenhai Wang. Tram: A token-level retrieval-augmented mechanism for source code summarization. arXiv preprint arXiv:2305.11074, 2023

[65] Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, and David Lo. Revisiting sentiment analysis for software engineering in the era of large language models. ACM Transactions on Software Engineering and Methodology, 34(3):1–30, 2025

[66] Ting Zhang, Chengran Yang, Yindu Su, Martin Weyssow, Hung Nguyen, Tan Bui, Hong Jin Kang, Yikun Li, Eng Lieh Ouh, Lwin Khin Shar, et al. Benchmarking large language models for multi-language software vulnerability detection. arXiv preprint arXiv:2503.01449, 2025

[67] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems, 32, 2019
