Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel
Pith reviewed 2026-05-11 02:19 UTC · model grok-4.3
The pith
False-positive bug reports in the Linux kernel consume developer effort comparable to that spent on real bugs, and retrieval-augmented LLMs can filter them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models for automated false-positive bug report mitigation. Among various prompting strategies, retrieval-augmented generation performs best, achieving 91 percent recall and an F1 score of 88 percent.
What carries the argument
The manually labeled dataset of 2,006 Linux kernel bug reports paired with retrieval-augmented generation prompting to classify false positives.
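The paper's exact prompts are not reproduced here, so the following is a minimal sketch of how retrieval-augmented classification of an incoming report could work; the bag-of-words retriever, the prompt wording, and the chat-completion callable `ask_llm` are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of RAG-style false-positive triage. Assumptions not taken
# from the paper: the bag-of-words retriever, the prompt wording, and the
# chat-completion callable `ask_llm` are illustrative stand-ins.
import math
from collections import Counter

def tokenize(text):
    return [t for t in text.lower().split() if t.isalnum()]

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, corpus, k=3):
    # corpus: list of (report_text, label) with label "bug" or "false-positive".
    q = Counter(tokenize(query))
    ranked = sorted(corpus, key=lambda r: cosine(q, Counter(tokenize(r[0]))), reverse=True)
    return ranked[:k]

def classify(new_report, corpus, ask_llm):
    # Build a prompt from the k most similar labeled historical reports.
    examples = "\n\n".join(f"Report: {t}\nLabel: {l}" for t, l in retrieve(new_report, corpus))
    prompt = (
        "You triage Linux kernel bug reports. Given the labeled examples, decide "
        "whether the new report describes a genuine bug or a false positive.\n\n"
        f"{examples}\n\nNew report: {new_report}\nAnswer 'bug' or 'false-positive':"
    )
    return ask_llm(prompt).strip().lower()
```

Swapping the toy retriever for dense embeddings changes only `retrieve`; the prompt assembly and decision step stay the same.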
If this is right
- Bug triage systems for the kernel should treat false positives as a distinct category rather than routing every report directly to developers.
- Retrieval-augmented generation classifiers could be integrated into Bugzilla workflows to flag likely false positives before human review (see the sketch after this list).
- Documentation and testing practices in file systems and drivers could be strengthened around external dependencies to reduce semantic misunderstandings.
- The measured effort cost implies that early filtering would free meaningful developer time for resolving actual defects.
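On the Bugzilla point above, here is a hedged sketch of what flagging before human review could look like, assuming the `classify` function from the earlier sketch and Bugzilla's v5 REST API; the instance URL, credential, and whiteboard tag are placeholders, and the endpoint details should be verified against the actual deployment.

```python
# Hedged sketch of a pre-review filter over a Bugzilla queue. Assumes the
# `classify` function from the sketch above and Bugzilla's v5 REST API;
# the instance URL, API key, and whiteboard tag are illustrative.
import requests

BUGZILLA = "https://bugzilla.kernel.org/rest"
API_KEY = "..."  # placeholder credential

def triage_queue(bug_ids, corpus, ask_llm):
    for bug_id in bug_ids:
        resp = requests.get(f"{BUGZILLA}/bug/{bug_id}", params={"api_key": API_KEY})
        bug = resp.json()["bugs"][0]
        if classify(bug["summary"], corpus, ask_llm) == "false-positive":
            # Tag rather than close: 91% recall is not 100%, so a human
            # still makes the final call on every flagged report.
            requests.put(
                f"{BUGZILLA}/bug/{bug_id}",
                params={"api_key": API_KEY},
                json={"whiteboard": "likely-false-positive"},
            )
```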
Where Pith is reading between the lines
- The same dataset construction and RAG evaluation approach could be applied to other large open-source projects to test whether the effort patterns and component distributions hold beyond the kernel.
- Adding kernel-specific code context to the retrieval step might raise precision without losing the reported recall level (a sketch follows this list).
- Long-term use of such a filter could shift reporting norms so that submitters provide clearer evidence against external-dependency misreads.
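For the retrieval-context point above, one low-effort way to add kernel code context is to expand each corpus entry with a source snippet before it enters the index or prompt; the file-path heuristic, checkout location, and snippet length below are illustrative assumptions, not the paper's method.

```python
# Hedged sketch: expand a report with a source snippet before it enters
# the retrieval corpus. The path heuristic is an illustrative assumption.
from pathlib import Path

def with_code_context(report_text, label, kernel_root="linux"):
    # Naive heuristic: any token that names an existing .c file in a local
    # kernel checkout is treated as a pointer to relevant source context.
    for token in report_text.split():
        path = Path(kernel_root) / token
        if token.endswith(".c") and path.exists():
            snippet = path.read_text(errors="ignore")[:1000]
            return (f"{report_text}\n--- source: {token} ---\n{snippet}", label)
    return (report_text, label)
```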
Load-bearing premise
The manual labeling of 497 reports as false positives is accurate and the collected sample from Bugzilla and Syzkaller is representative of all false-positive reports in the kernel.
What would settle it
Re-labeling the same set of reports by an independent group of kernel developers and obtaining a substantially different count or set of causes would undermine the effort comparison and the reported LLM performance numbers.
Original abstract
False-positive bug reports represent a significant yet underexplored challenge in the development and maintenance of the Linux kernel. They occur when correct system behavior is mistakenly flagged as a defect, consuming developer effort without leading to actual code improvements. Such reports can mislead developers, waste debugging resources, and delay the resolution of real bugs. In this paper, we present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models (LLMs) for automated false-positive bug report mitigation. Among various prompting strategies, retrieval-augmented generation (RAG) performs best, achieving 91% recall and an F1 score of 88%. These findings highlight the non-negligible cost of false positive bug reports and show the promise of LLMs for more efficient false positive mitigation in the Linux kernel.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. It manually constructs a dataset of 2,006 bug reports (1,509 genuine bugs and 497 false positives) from Bugzilla and Syzkaller, analyzes their effort demands, component distribution (especially File Systems and Drivers), and root causes (external dependencies and semantic misunderstandings), and evaluates LLMs for automated mitigation, finding that retrieval-augmented generation (RAG) achieves the highest performance with 91% recall and 88% F1 score.
Significance. If the ground-truth labels and sample are reliable, the work offers the first systematic characterization of false positives in a major open-source kernel, quantifying their non-negligible cost and identifying actionable patterns. The LLM results indicate a viable path toward automated triage that could conserve developer resources, provided the evaluation is placed on firmer methodological footing.
major comments (2)
- [Dataset Construction] Dataset Construction section: the manual labeling of the 497 false-positive reports is described only at a high level, with no reported annotation criteria, number of labelers, inter-rater agreement (e.g., Cohen's kappa), or disagreement-resolution protocol. These labels constitute the ground truth for both the effort and component analyses and for the LLM performance numbers (91% recall, 88% F1), so the absence of validation metrics directly affects the interpretability of the central empirical claims.
- [LLM Evaluation] LLM Evaluation section: performance figures for the prompting strategies, including the headline RAG result, are presented without full prompting templates, non-LLM baselines, or statistical significance tests. This makes the comparative claim that RAG is best difficult to assess or reproduce and weakens the mitigation contribution.
minor comments (2)
- The abstract refers to 'various prompting strategies' without enumerating them; the main text should list each strategy and its exact configuration for clarity.
- A summary table of dataset statistics (reports per component, median closure time for false positives vs. genuine bugs) would improve readability of the characterization results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important aspects of methodological transparency that we will address in the revision. Below we respond point by point to the major comments.
Point-by-point responses
- Referee: [Dataset Construction] Dataset Construction section: the manual labeling of the 497 false-positive reports is described only at a high level, with no reported annotation criteria, number of labelers, inter-rater agreement (e.g., Cohen's kappa), or disagreement-resolution protocol. These labels constitute the ground truth for both the effort and component analyses and for the LLM performance numbers (91% recall, 88% F1), so the absence of validation metrics directly affects the interpretability of the central empirical claims.
Authors: We agree that the current description of the labeling process is insufficient for establishing ground-truth reliability. In the revised manuscript we will expand the Dataset Construction section to specify the annotation criteria used to identify false positives, the number of labelers (two authors labeled all reports independently), the inter-rater agreement measured by Cohen's kappa, and the disagreement-resolution protocol (discussion until consensus was reached). These additions will directly support the validity of the effort, component, and LLM analyses. revision: yes
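For concreteness, the agreement statistic the authors commit to reporting reduces to a few lines over two annotators' label vectors; this is a sketch of the statistic, not their analysis code.

```python
# Cohen's kappa for two annotators over the same reports: observed
# agreement corrected for the agreement expected by chance.
def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```

By the usual reading of the scale, values above roughly 0.8 indicate almost perfect agreement, which is the bar the expanded section would need to clear for the ground truth to be convincing.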
- Referee: [LLM Evaluation] LLM Evaluation section: performance figures for the prompting strategies, including the headline RAG result, are presented without full prompting templates, non-LLM baselines, or statistical significance tests. This makes the comparative claim that RAG is best difficult to assess or reproduce and weakens the mitigation contribution.
Authors: We acknowledge that the LLM Evaluation section lacks the requested details. In the revision we will add the complete prompting templates for all strategies (including RAG) to an appendix, introduce non-LLM baselines such as keyword matching and metadata-based heuristics, and report statistical significance tests (McNemar's test) comparing RAG against the other prompting strategies. These changes will make the performance claims reproducible and strengthen the mitigation results. revision: yes
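The promised significance test is cheap to state: McNemar's test conditions on the reports where the two classifiers disagree. Below is a sketch of the exact binomial form, assuming counts `b` and `c` are tallied from paired predictions on the same test set.

```python
# Exact (binomial) McNemar test for comparing RAG against another prompting
# strategy on the same labeled reports; the discordant counts b and c are
# the only information the test uses.
from math import comb

def mcnemar_exact(b, c):
    # b: RAG correct, baseline wrong; c: RAG wrong, baseline correct.
    n = b + c
    p_one_sided = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * p_one_sided)  # two-sided p-value under H0: p = 0.5
```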
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential steps
full rationale
The paper conducts an empirical study by manually labeling 2,006 bug reports (1,509 genuine, 497 false positives) from Bugzilla and Syzkaller, performs component analysis, and evaluates LLM prompting strategies on that fixed dataset. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The RAG result (91% recall, 88% F1) is a direct performance measurement against the manually assigned labels rather than a reduction to prior inputs by construction. Concerns about label quality or sample representativeness are validity issues, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: manual labeling of bug reports as false positive versus genuine is reliable and consistent across annotators.