pith. machine review for the scientific record.

arxiv: 2605.07678 · v1 · submitted 2026-05-08 · 💻 cs.SE

Recognition: no theorem link

Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel

Chen Yang, Dong Wang, Haichi Wang, Jiashuo Tian, Junjie Chen, Zan Wang

Pith reviewed 2026-05-11 02:19 UTC · model grok-4.3

classification 💻 cs.SE
keywords false-positive bug reports · Linux kernel · empirical study · large language models · retrieval-augmented generation · Bugzilla · Syzkaller · software maintenance

The pith

False-positive bug reports in the Linux kernel consume developer effort comparable to that of real bugs and can be filtered by retrieval-augmented LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs the first large-scale empirical study of false-positive bug reports in the Linux kernel by assembling, from Bugzilla and Syzkaller, a dataset of 2,006 reports in which 497 cases are manually labeled as false positives. Analysis shows these reports trigger extended discussions and have closure times similar to those of genuine bugs, appearing most frequently in file systems and drivers because of external dependencies and misunderstandings of correct behavior. The authors then evaluate large language models under different prompting strategies and report that retrieval-augmented generation reaches 91 percent recall and an 88 percent F1 score, indicating a workable way to reduce wasted triage effort.
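As a quick sanity check on the headline numbers, the sketch below reconstructs a confusion matrix consistent with 91 percent recall and an 88 percent F1 score over the 497 false-positive reports. The raw counts are illustrative back-calculations, not figures from the paper.

```python
# Hypothetical confusion-matrix counts over the 497 false positives,
# back-calculated to match the reported 91% recall and 88% F1.
tp, fn, fp = 452, 45, 79  # illustrative, not taken from the paper

recall = tp / (tp + fn)          # 452 / 497
precision = tp / (tp + fp)       # 452 / 531
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
# → recall=0.91 precision=0.85 f1=0.88
```

Note the implied precision of roughly 85 percent: at these rates, about one in seven reports flagged as false positives would actually be genuine bugs, which suggests using the classifier as a triage aid rather than an automatic closer.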

Core claim

We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models for automated false-positive bug report mitigation. Among various prompting strategies, retrieval-augmented generation performs best, achieving 91 percent recall and an F1 score of 88 percent.

What carries the argument

The manually labeled dataset of 2,006 Linux kernel bug reports paired with retrieval-augmented generation prompting to classify false positives.
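To make the mechanism concrete, here is a toy sketch of the retrieval step such a classifier depends on: rank previously labeled reports by similarity to the incoming one and prepend the nearest neighbors to the prompt. The example reports, labels, and bag-of-words similarity function are all stand-ins, not the paper's actual pipeline or data.

```python
# Toy retrieval step for a RAG-style false-positive classifier: fetch the k
# most similar previously labeled reports to ground the LLM's judgment.
# Bag-of-words cosine similarity stands in for a real embedding model.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

labeled = [  # (report text, label); invented examples
    ("oops in ext4 writeback under memory pressure", "genuine"),
    ("display glitch traced to userspace Qt libraries", "false positive"),
    ("warning fires only with out-of-tree driver loaded", "false positive"),
]

def retrieve(query: str, k: int = 2):
    qv = Counter(query.lower().split())
    return sorted(labeled,
                  key=lambda r: cosine(qv, Counter(r[0].lower().split())),
                  reverse=True)[:k]

# The retrieved (report, label) pairs would be prepended to the LLM prompt.
nearest = retrieve("rendering glitch disappears after rebuilding Qt from source")
print(nearest[0][1])  # nearest neighbor is the Qt report: "false positive"
```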

If this is right

  • Bug triage systems for the kernel should treat false positives as a distinct category rather than routing every report directly to developers.
  • Retrieval-augmented generation classifiers could be integrated into Bugzilla workflows to flag likely false positives before human review.
  • Documentation and testing practices in file systems and drivers could be strengthened around external dependencies to reduce semantic misunderstandings.
  • The measured effort cost implies that early filtering would free measurable developer time for resolving actual defects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dataset construction and RAG evaluation approach could be applied to other large open-source projects to test whether the effort patterns and component distributions hold beyond the kernel.
  • Adding kernel-specific code context to the retrieval step might raise precision without losing the reported recall level.
  • Long-term use of such a filter could shift reporting norms so that submitters provide clearer evidence against external-dependency misreads.

Load-bearing premise

The manual labeling of 497 reports as false positives is accurate and the collected sample from Bugzilla and Syzkaller is representative of all false-positive reports in the kernel.

What would settle it

Re-labeling the same set of reports by an independent group of kernel developers would settle it: a substantially different count or set of causes would undermine the effort comparison and the reported LLM performance numbers.

Figures

Figures reproduced from arXiv: 2605.07678 by Chen Yang, Dong Wang, Haichi Wang, Jiashuo Tian, Junjie Chen, Zan Wang.

Figure 1. Comparison of False-Positive and Genuine Bug Reports in the Linux Kernel.
Figure 2. False-Positive Bug Reports Distribution by Component.
Figure 3. False-Positive Bug Reports Root Causes. While a small number of reports could plausibly fit multiple categories, such cases account for only 4.8% of the dataset, indicating that category overlap is limited.
Figure 4. Illustrative Examples. In one case, a visualization problem reported in an environment using Qt 5.15.9 was traced by developers to the Qt libraries; the bug disappeared once Qt was rebuilt from source, illustrating how mismatched userspace dependencies lead reporters to misattribute failures to the kernel.
read the original abstract

False-positive bug reports represent a significant yet underexplored challenge in the development and maintenance of the Linux kernel. They occur when correct system behavior is mistakenly flagged as a defect, consuming developer effort without leading to actual code improvements. Such reports can mislead developers, waste debugging resources, and delay the resolution of real bugs. In this paper, we present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models (LLMs) for automated false-positive bug report mitigation. Among various prompting strategies, retrieval-augmented generation (RAG) performs best, achieving 91% recall and an F1 score of 88%. These findings highlight the non-negligible cost of false positive bug reports and show the promise of LLMs for more efficient false positive mitigation in the Linux kernel.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. It manually constructs a dataset of 2,006 bug reports (1,509 genuine bugs and 497 false positives) from Bugzilla and Syzkaller, analyzes their effort demands, component distribution (especially File Systems and Drivers), and root causes (external dependencies and semantic misunderstandings), and evaluates LLMs for automated mitigation, finding that retrieval-augmented generation (RAG) achieves the highest performance with 91% recall and 88% F1 score.

Significance. If the ground-truth labels and sample are reliable, the work offers the first systematic characterization of false positives in a major open-source kernel, quantifying their non-negligible cost and identifying actionable patterns. The LLM results indicate a viable path toward automated triage that could conserve developer resources, provided the evaluation is placed on firmer methodological footing.

major comments (2)
  1. [Dataset Construction] Dataset Construction section: the manual labeling of the 497 false-positive reports is described only at a high level, with no reported annotation criteria, number of labelers, inter-rater agreement (e.g., Cohen's kappa), or disagreement-resolution protocol. These labels constitute the ground truth for both the effort and component analyses and for the LLM performance numbers (91% recall, 88% F1), so the absence of validation metrics directly affects the interpretability of the central empirical claims.
  2. [LLM Evaluation] LLM Evaluation section: performance figures for the prompting strategies, including the headline RAG result, are presented without full prompting templates, non-LLM baselines, or statistical significance tests. This makes the comparative claim that RAG is best difficult to assess or reproduce and weakens the mitigation contribution.
minor comments (2)
  1. The abstract refers to 'various prompting strategies' without enumerating them; the main text should list each strategy and its exact configuration for clarity.
  2. A summary table of dataset statistics (reports per component, median closure time for false positives vs. genuine bugs) would improve readability of the characterization results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of methodological transparency that we will address in the revision. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset Construction section: the manual labeling of the 497 false-positive reports is described only at a high level, with no reported annotation criteria, number of labelers, inter-rater agreement (e.g., Cohen's kappa), or disagreement-resolution protocol. These labels constitute the ground truth for both the effort and component analyses and for the LLM performance numbers (91% recall, 88% F1), so the absence of validation metrics directly affects the interpretability of the central empirical claims.

    Authors: We agree that the current description of the labeling process is insufficient for establishing ground-truth reliability. In the revised manuscript we will expand the Dataset Construction section to specify the annotation criteria used to identify false positives, the number of labelers (two authors labeled all reports independently), the inter-rater agreement measured by Cohen's kappa, and the disagreement-resolution protocol (discussion until consensus was reached). These additions will directly support the validity of the effort, component, and LLM analyses. revision: yes
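The promised Cohen's kappa is cheap to compute once both annotators' labels sit side by side. A minimal sketch, with made-up label vectors rather than the authors' actual annotations:

```python
# Cohen's kappa for two annotators labeling reports as false positive (1)
# or genuine (0). The label vectors below are invented for illustration.
def cohens_kappa(a, b):
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n)
             for l in set(a) | set(b))          # agreement expected by chance
    return (po - pe) / (1 - pe)

ann1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
ann2 = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
print(round(cohens_kappa(ann1, ann2), 2))  # → 0.58, "moderate" agreement
```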

  2. Referee: [LLM Evaluation] LLM Evaluation section: performance figures for the prompting strategies, including the headline RAG result, are presented without full prompting templates, non-LLM baselines, or statistical significance tests. This makes the comparative claim that RAG is best difficult to assess or reproduce and weakens the mitigation contribution.

    Authors: We acknowledge that the LLM Evaluation section lacks the requested details. In the revision we will add the complete prompting templates for all strategies (including RAG) to an appendix, introduce non-LLM baselines such as keyword matching and metadata-based heuristics, and report statistical significance tests (McNemar's test) comparing RAG against the other prompting strategies. These changes will make the performance claims reproducible and strengthen the mitigation results. revision: yes
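McNemar's test, as proposed in the rebuttal, depends only on the discordant pairs: reports one classifier gets right and the other gets wrong. A minimal exact-binomial sketch with invented counts:

```python
# Exact (binomial) McNemar's test for comparing two classifiers evaluated
# on the same reports. b = reports only classifier A classified correctly,
# c = reports only classifier B classified correctly.
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided p-value

# e.g. RAG alone correct on 30 reports, a baseline alone correct on 12
# (invented discordant counts, not the paper's data):
p = mcnemar_exact(30, 12)
print(p < 0.05)  # these counts would indicate a significant difference
```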

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential steps

full rationale

The paper conducts an empirical study by manually labeling 2,006 bug reports (1,509 genuine, 497 false positives) from Bugzilla and Syzkaller, performs component analysis, and evaluates LLM prompting strategies on that fixed dataset. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The RAG result (91% recall, 88% F1) is a direct performance measurement against the manually assigned labels rather than a reduction to prior inputs by construction. Concerns about label quality or sample representativeness are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the accuracy of manual false-positive labeling and the assumption that the sampled reports reflect the broader population of kernel bug reports.

axioms (1)
  • domain assumption: Manual labeling of bug reports as false positive versus genuine is reliable and consistent across annotators.
    The study depends entirely on the 2,006 manually constructed labels without reported validation metrics.

pith-pipeline@v0.9.0 · 5530 in / 1070 out tokens · 42950 ms · 2026-05-11T02:19:03.652455+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 4 internal anchors

  1. [1]

    2025. Apache. https://www.apache.org

  2. [2]

    Bugzilla

    2025. Bugzilla. https://bugzilla.kernel.org

  3. [3]

    Deepseek

    2025. Deepseek. https://www.deepseek.com

  4. [4]

    2025. Eclipse. https://www.eclipse.org

  5. [5]

    Kernel Bugzilla Components

    2025. Kernel Bugzilla Components. https://bugzilla.kernel.org/describecomponents.cgi

  6. [6]

    2025. Mozilla. https://www.mozilla.org

  7. [7]

    2025. Qwen. https://chat.qwen.ai

  8. [8]

    Replication package

    2025. Replication package. https://github.com/tianjiashuo/False-Positive-from-Linux-Kernel

  9. [9]

    Syzkaller

    2025. Syzkaller. https://syzkaller.appspot.com/upstream

  10. [10]

    Trinity: Linux system call fuzzer

    2025. Trinity: Linux system call fuzzer. https://github.com/kernelslacker/trinity

  11. [11]

    Iago Abal, Claus Brabrand, and Andrzej Wasowski. 2014. 42 variability bugs in the Linux kernel: a qualitative analysis. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering. 421–432

  12. [12]

    John Anvik, Lyndon Hiew, and Gail C Murphy. 2005. Coping with an open bug repository. In Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange. 35–39

  13. [13]

    Gabriel Aracena, Kyle Luster, Fabio Santos, Igor Steinmacher, and Marco Aurelio Gerosa. 2024. Applying large language models to issue classification. In Proceedings of the Third ACM/IEEE International Workshop on NL-based Software Engineering. 57–60

  14. [14]

    Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. 2008. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, Vol. 8. 209–224

  15. [15]

    Junjie Chen, Xingyu Fan, Chen Yang, Shuang Liu, and Jun Sun. 2025. De-duplicating Silent Compiler Bugs via Deep Semantic Representation. Proceedings of the ACM on Software Engineering 2, FSE (2025), 2359–2381

  16. [16]

    Norman Cliff. 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin 114, 3 (1993), 494

  17. [17]

    Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960), 37–46

  18. [18]

    Juliet M Corbin and Anselm Strauss. 1990. Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology 13, 1 (1990), 3–21

  19. [19]

    Xiaoting Du, Zheng Zheng, Lei Ma, and Jianjun Zhao. 2021. An empirical study on common bugs in deep learning compilers. In 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 184–195

  20. [20]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)

  21. [21]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. 2024. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594 (2024)

  22. [22]

    Jianjun He, Ling Xu, Yuanrui Fan, Zhou Xu, Meng Yan, and Yan Lei. 2020. Deep learning based valid bug reports determination and explanation. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE). IEEE, 184–194

  23. [23]

    Kim Herzig, Sascha Just, and Andreas Zeller. 2013. It's not a bug, it's a feature: how misclassification impacts bug prediction. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 392–401

  24. [24]

    Felicitas Hetzelt, Martin Radev, Robert Buhren, Mathias Morbitzer, and Jean-Pierre Seifert. 2021. Via: Analyzing device interfaces of protected virtual machines. In Proceedings of the 37th Annual Computer Security Applications Conference. 273–284

  25. [25]

    Zheyue Jiang, Yuan Zhang, Jun Xu, Qi Wen, Zhenghe Wang, Xiaohan Zhang, Xinyu Xing, Min Yang, and Zhemin Yang. 2020. PDiff: Semantic-based patch presence testing for downstream kernels. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 1149–1163

  27. [27]

    Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large language models are few-shot testers: Exploring LLM-based general bug reproduction. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2312–2323

  28. [28]

    Kyungtae Kim, Dae R Jeong, Chung Hwan Kim, Yeongjin Jang, Insik Shin, and Byoungyoung Lee. 2020. HFL: Hybrid Fuzzing on the Linux Kernel. In NDSS

  29. [29]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 35 (2022), 22199–22213

  30. [30]

    Hiroki Kuramoto, Dong Wang, Masanari Kondo, Yutaro Kashiwa, Yasutaka Kamei, and Naoyasu Ubayashi. 2024. Understanding the characteristics and the role of visual issue reports. Empirical Software Engineering 29, 4 (2024), 89

  31. [31]

    Muhammad Laiq, Nauman bin Ali, Jürgen Börstler, and Emelie Engström. 2024. Industrial adoption of machine learning techniques for early identification of invalid bug reports. Empirical Software Engineering 29, 5 (2024), 130

  32. [32]

    Muhammad Laiq, Nauman bin Ali, Jürgen Börstler, and Emelie Engström. 2022. Early identification of invalid bug reports in industrial settings – a case study. In International Conference on Product-Focused Software Process Improvement. Springer, 497–507

  33. [33]

    Muhammad Laiq, Nauman bin Ali, Jürgen Börstler, and Emelie Engström. 2023. A data-driven approach for understanding invalid bug reports: An industrial case study. Information and Software Technology 164 (2023), 107305

  34. [34]

    Frank Li and Vern Paxson. 2017. A large-scale empirical study of security patches. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2201–2215

  35. [35]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  36. [36]

    Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia Medica 22, 3 (2012), 276–282

  37. [37]

    Patrick E McKnight and Julius Najab. 2010. Mann-Whitney U test. The Corsini Encyclopedia of Psychology (2010), 1–1

  38. [38]

    Dongliang Mu, Alejandro Cuevas, Limin Yang, Hang Hu, Xinyu Xing, Bing Mao, and Gang Wang. 2018. Understanding the reproducibility of crowd-reported security vulnerabilities. In 27th USENIX Security Symposium (USENIX Security 18). 919–936

  39. [39]

    Dongliang Mu, Yuhang Wu, Yueqi Chen, Zhenpeng Lin, Chensheng Yu, Xinyu Xing, and Gang Wang. 2022. An in-depth analysis of duplicated Linux kernel bug reports. In Network and Distributed Systems Security Symposium (NDSS)

  40. [40]

    Shankara Pailoor, Andrew Aday, and Suman Jana. 2018. MoonShine: Optimizing OS fuzzer seed selection with trace distillation. In 27th USENIX Security Symposium (USENIX Security 18). 729–743

  41. [41]

    Nitish Pandey, Debarshi Kumar Sanyal, Abir Hudait, and Amitava Sen. 2017. Automated classification of software issue reports using machine learning techniques: an empirical study. Innovations in Systems and Software Engineering 13, 4 (2017), 279–297

  42. [42]

    Hui Peng and Mathias Payer. 2020. USBFuzz: A framework for fuzzing USB drivers by device emulation. In 29th USENIX Security Symposium (USENIX Security 20). 2559–2575

  43. [43]

    Robin L Plackett. 1983. Karl Pearson and the chi-squared test. International Statistical Review / Revue Internationale de Statistique (1983), 59–72

  44. [44]

    Jeanine Romano, Jeffrey D Kromrey, Jesse Coraggio, and Jeff Skowronek. 2006. Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys. In Annual Meeting of the Florida Association of Institutional Research, Vol. 177

  45. [45]

    Sergej Schumilo, Cornelius Aschermann, Robert Gawlik, Sebastian Schinzel, and Thorsten Holz. 2017. kAFL: Hardware-assisted feedback fuzzing for OS kernels. In 26th USENIX Security Symposium (USENIX Security 17). 167–182

  46. [46]

    Qingchao Shen, Haoyang Ma, Junjie Chen, Yongqiang Tian, Shing-Chi Cheung, and Xiang Chen. 2021. A comprehensive study of deep learning compiler bugs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 968–980

  47. [47]

    Dokyung Song, Felicitas Hetzelt, Dipanjan Das, Chad Spensky, Yeoul Na, Stijn Volckaert, Giovanni Vigna, Christopher Kruegel, Jean-Pierre Seifert, and Michael Franz. 2019. PeriScope: An effective probing and fuzzing framework for the hardware-OS boundary. In 2019 Network and Distributed Systems Security Symposium (NDSS). Internet Society, 1–15

  48. [48]

    Donna Spencer. 2009. Card Sorting: Designing Usable Categories. Rosenfeld Media

  49. [49]

    Hao Sun, Yuheng Shen, Cong Wang, Jianzhong Liu, Yu Jiang, Ting Chen, and Aiguo Cui. 2021. Healer: Relation learning guided kernel fuzzing. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 344–358

  50. [50]

    Jian Sun. 2011. Why are bug reports invalid? In 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation. IEEE, 407–410

  51. [51]

    Xin Tan, Yuan Zhang, Jiadong Lu, Xin Xiong, Zhuang Liu, and Min Yang. 2023. SyzDirect: Directed greybox fuzzing for Linux kernel. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 1630–1644

  52. [52]

    Pannavat Terdchanakul, Hideaki Hata, Passakorn Phannachitta, and Kenichi Matsumoto. 2017. Bug or not? Bug report classification using n-gram IDF. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 534–538

  53. [53]

    Dong Wang, Masanari Kondo, Yasutaka Kamei, Raula Gaikovina Kula, and Naoyasu Ubayashi. 2023. When conversations turn into work: a taxonomy of converted discussions and issues in GitHub. Empirical Software Engineering 28, 6 (2023), 138

  54. [54]

    Daimeng Wang, Zheng Zhang, Hang Zhang, Zhiyun Qian, Srikanth V Krishnamurthy, and Nael Abu-Ghazaleh. 2021. SyzVegas: Beating kernel fuzzing odds with reinforcement learning. In 30th USENIX Security Symposium (USENIX Security 21). 2741–2758

  55. [55]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022)

  56. [56]

    Robert F Woolson. 2007. Wilcoxon signed-rank test. Wiley Encyclopedia of Clinical Trials (2007), 1–3

  57. [57]

    Zhengzi Xu, Yulong Zhang, Longri Zheng, Liangzhao Xia, Chenfu Bao, Zhi Wang, and Yang Liu. 2020. Automatic hot patch generation for Android kernels. In 29th USENIX Security Symposium (USENIX Security 20). 2397–2414

  58. [58]

    Chen Yang, Junjie Chen, Xingyu Fan, Jiajun Jiang, and Jun Sun. 2023. Silent compiler bug de-duplication via three-dimensional analysis. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 677–689

  59. [59]

    Chen Yang, Junjie Chen, Bin Lin, Ziqi Wang, and Jianyi Zhou. 2025. Advancing Code Coverage: Incorporating Program Analysis with Large Language Models. ACM Transactions on Software Engineering and Methodology (2025)

  60. [60]

    Chen Yang, Ziqi Wang, Yanjie Jiang, Lin Yang, Yuteng Zheng, Jianyi Zhou, and Junjie Chen. 2025. Reflective Unit Test Generation for Precise Type Error Detection with Large Language Models. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 2834–2845. doi:10.1109/ASE63991.2025.00233

  61. [61]

    Chen Yang, Ziqi Wang, Lin Yang, Dong Wang, Shutao Gao, Yanjie Jiang, and Junjie Chen. 2026. WiseUT: An Intelligent Framework for Unit Test Generation. In 2026 IEEE/ACM 48th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)

  62. [62]

    Chen Yang, Lin Yang, Ziqi Wang, Dong Wang, Jianyi Zhou, and Junjie Chen. 2025. Clarifying Semantics of In-Context Examples for Unit Test Generation. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 3046–3057. doi:10.1109/ASE63991.2025.00250

  63. [63]

    Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, et al. 2024. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1607–1619

  64. [64]

    Feng Zhang, Foutse Khomh, Ying Zou, and Ahmed E Hassan. 2012. An empirical study on factors impacting bug fixing time. In 2012 19th Working Conference on Reverse Engineering. IEEE, 225–234

  65. [65]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025)

  66. [66]

    Yu Zhou, Yanxiang Tong, Ruihang Gu, and Harald Gall. 2016. Combining text mining and data mining for bug report classification. Journal of Software: Evolution and Process 28, 3 (2016), 150–176