Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel
Pith reviewed 2026-05-11 02:19 UTC · model grok-4.3
The pith
False-positive bug reports in the Linux kernel consume developer effort comparable to that spent on real bugs, and retrieval-augmented LLMs can filter them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models for automated false-positive bug report mitigation. Among various prompting strategies, retrieval-augmented generation performs best, achieving 91 percent recall and an F1 score of 88 percent.
What carries the argument
The manually labeled dataset of 2,006 Linux kernel bug reports paired with retrieval-augmented generation prompting to classify false positives.
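The paper's exact prompts are not reproduced here, so the following is a minimal sketch of how retrieval-augmented classification of an incoming report could work; the bag-of-words retriever, the prompt wording, and the chat-completion callable `ask_llm` are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of RAG-style false-positive triage. Assumptions not taken
# from the paper: the bag-of-words retriever, the prompt wording, and the
# chat-completion callable `ask_llm` are illustrative stand-ins.
import math
from collections import Counter

def tokenize(text):
    return [t for t in text.lower().split() if t.isalnum()]

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, corpus, k=3):
    # corpus: list of (report_text, label) with label "bug" or "false-positive".
    q = Counter(tokenize(query))
    ranked = sorted(corpus, key=lambda r: cosine(q, Counter(tokenize(r[0]))), reverse=True)
    return ranked[:k]

def classify(new_report, corpus, ask_llm):
    # Build a prompt from the k most similar labeled historical reports.
    examples = "\n\n".join(f"Report: {t}\nLabel: {l}" for t, l in retrieve(new_report, corpus))
    prompt = (
        "You triage Linux kernel bug reports. Given the labeled examples, decide "
        "whether the new report describes a genuine bug or a false positive.\n\n"
        f"{examples}\n\nNew report: {new_report}\nAnswer 'bug' or 'false-positive':"
    )
    return ask_llm(prompt).strip().lower()
```

Swapping the toy retriever for dense embeddings changes only `retrieve`; the prompt assembly and decision step stay the same.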
If this is right
- Bug triage systems for the kernel should treat false positives as a distinct category rather than routing every report directly to developers.
- Retrieval-augmented generation classifiers could be integrated into Bugzilla workflows to flag likely false positives before human review (see the sketch after this list).
- Documentation and testing practices in file systems and drivers could be strengthened around external dependencies to reduce semantic misunderstandings.
- The measured effort cost implies that early filtering would free meaningful developer time for resolving actual defects.
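On the Bugzilla point above, here is a hedged sketch of what flagging before human review could look like, assuming the `classify` function from the earlier sketch and Bugzilla's v5 REST API; the instance URL, credential, and whiteboard tag are placeholders, and the endpoint details should be verified against the actual deployment.

```python
# Hedged sketch of a pre-review filter over a Bugzilla queue. Assumes the
# `classify` function from the sketch above and Bugzilla's v5 REST API;
# the instance URL, API key, and whiteboard tag are illustrative.
import requests

BUGZILLA = "https://bugzilla.kernel.org/rest"
API_KEY = "..."  # placeholder credential

def triage_queue(bug_ids, corpus, ask_llm):
    for bug_id in bug_ids:
        resp = requests.get(f"{BUGZILLA}/bug/{bug_id}", params={"api_key": API_KEY})
        bug = resp.json()["bugs"][0]
        if classify(bug["summary"], corpus, ask_llm) == "false-positive":
            # Tag rather than close: 91% recall is not 100%, so a human
            # still makes the final call on every flagged report.
            requests.put(
                f"{BUGZILLA}/bug/{bug_id}",
                params={"api_key": API_KEY},
                json={"whiteboard": "likely-false-positive"},
            )
```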
Where Pith is reading between the lines
- The same dataset construction and RAG evaluation approach could be applied to other large open-source projects to test whether the effort patterns and component distributions hold beyond the kernel.
- Adding kernel-specific code context to the retrieval step might raise precision without losing the reported recall level (a sketch follows this list).
- Long-term use of such a filter could shift reporting norms so that submitters provide clearer evidence against external-dependency misreads.
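For the retrieval-context point above, one low-effort way to add kernel code context is to expand each corpus entry with a source snippet before it enters the index or prompt; the file-path heuristic, checkout location, and snippet length below are illustrative assumptions, not the paper's method.

```python
# Hedged sketch: expand a report with a source snippet before it enters
# the retrieval corpus. The path heuristic is an illustrative assumption.
from pathlib import Path

def with_code_context(report_text, label, kernel_root="linux"):
    # Naive heuristic: any token that names an existing .c file in a local
    # kernel checkout is treated as a pointer to relevant source context.
    for token in report_text.split():
        path = Path(kernel_root) / token
        if token.endswith(".c") and path.exists():
            snippet = path.read_text(errors="ignore")[:1000]
            return (f"{report_text}\n--- source: {token} ---\n{snippet}", label)
    return (report_text, label)
```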
Load-bearing premise
The manual labeling of 497 reports as false positives is accurate and the collected sample from Bugzilla and Syzkaller is representative of all false-positive reports in the kernel.
What would settle it
Re-labeling the same set of reports by an independent group of kernel developers and obtaining a substantially different count or set of causes would undermine the effort comparison and the reported LLM performance numbers.
Original abstract
False-positive bug reports represent a significant yet underexplored challenge in the development and maintenance of the Linux kernel. They occur when correct system behavior is mistakenly flagged as a defect, consuming developer effort without leading to actual code improvements. Such reports can mislead developers, waste debugging resources, and delay the resolution of real bugs. In this paper, we present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. We manually construct a dataset of 2,006 bug reports comprising 1,509 genuine bugs and 497 false positives collected from Bugzilla and Syzkaller. Our analysis indicates that false positives demand effort comparable to real bugs, often requiring extended discussions and non-trivial closure time. They occur in several components, especially File Systems and Drivers, mainly due to external dependencies and semantic misunderstandings. To address this challenge, we evaluate large language models (LLMs) for automated false-positive bug report mitigation. Among various prompting strategies, retrieval-augmented generation (RAG) performs best, achieving 91% recall and an F1 score of 88%. These findings highlight the non-negligible cost of false positive bug reports and show the promise of LLMs for more efficient false positive mitigation in the Linux kernel.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present the first comprehensive empirical study of false-positive bug reports in the Linux kernel. It manually constructs a dataset of 2,006 bug reports (1,509 genuine bugs and 497 false positives) from Bugzilla and Syzkaller, analyzes their effort demands, component distribution (especially File Systems and Drivers), and root causes (external dependencies and semantic misunderstandings), and evaluates LLMs for automated mitigation, finding that retrieval-augmented generation (RAG) achieves the highest performance with 91% recall and 88% F1 score.
Significance. If the ground-truth labels and sample are reliable, the work offers the first systematic characterization of false positives in a major open-source kernel, quantifying their non-negligible cost and identifying actionable patterns. The LLM results indicate a viable path toward automated triage that could conserve developer resources, provided the evaluation is placed on firmer methodological footing.
major comments (2)
- [Dataset Construction] Dataset Construction section: the manual labeling of the 497 false-positive reports is described only at a high level, with no reported annotation criteria, number of labelers, inter-rater agreement (e.g., Cohen's kappa), or disagreement-resolution protocol. These labels constitute the ground truth for both the effort and component analyses and for the LLM performance numbers (91% recall, 88% F1), so the absence of validation metrics directly affects the interpretability of the central empirical claims.
- [LLM Evaluation] LLM Evaluation section: performance figures for the prompting strategies, including the headline RAG result, are presented without full prompting templates, non-LLM baselines, or statistical significance tests. This makes the comparative claim that RAG is best difficult to assess or reproduce and weakens the mitigation contribution.
minor comments (2)
- The abstract refers to 'various prompting strategies' without enumerating them; the main text should list each strategy and its exact configuration for clarity.
- A summary table of dataset statistics (reports per component, median closure time for false positives vs. genuine bugs) would improve readability of the characterization results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important aspects of methodological transparency that we will address in the revision. Below we respond point by point to the major comments.
Point-by-point responses
- Referee: [Dataset Construction] Dataset Construction section: the manual labeling of the 497 false-positive reports is described only at a high level, with no reported annotation criteria, number of labelers, inter-rater agreement (e.g., Cohen's kappa), or disagreement-resolution protocol. These labels constitute the ground truth for both the effort and component analyses and for the LLM performance numbers (91% recall, 88% F1), so the absence of validation metrics directly affects the interpretability of the central empirical claims.
Authors: We agree that the current description of the labeling process is insufficient for establishing ground-truth reliability. In the revised manuscript we will expand the Dataset Construction section to specify the annotation criteria used to identify false positives, the number of labelers (two authors labeled all reports independently), the inter-rater agreement measured by Cohen's kappa, and the disagreement-resolution protocol (discussion until consensus was reached). These additions will directly support the validity of the effort, component, and LLM analyses. revision: yes
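For concreteness, the agreement statistic the authors commit to reporting reduces to a few lines over two annotators' label vectors; this is a sketch of the statistic, not their analysis code.

```python
# Cohen's kappa for two annotators over the same reports: observed
# agreement corrected for the agreement expected by chance.
def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```

By the usual reading of the scale, values above roughly 0.8 indicate almost perfect agreement, which is the bar the expanded section would need to clear for the ground truth to be convincing.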
- Referee: [LLM Evaluation] LLM Evaluation section: performance figures for the prompting strategies, including the headline RAG result, are presented without full prompting templates, non-LLM baselines, or statistical significance tests. This makes the comparative claim that RAG is best difficult to assess or reproduce and weakens the mitigation contribution.
Authors: We acknowledge that the LLM Evaluation section lacks the requested details. In the revision we will add the complete prompting templates for all strategies (including RAG) to an appendix, introduce non-LLM baselines such as keyword matching and metadata-based heuristics, and report statistical significance tests (McNemar's test) comparing RAG against the other prompting strategies. These changes will make the performance claims reproducible and strengthen the mitigation results. revision: yes
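The promised significance test is cheap to state: McNemar's test conditions on the reports where the two classifiers disagree. Below is a sketch of the exact binomial form, assuming counts `b` and `c` are tallied from paired predictions on the same test set.

```python
# Exact (binomial) McNemar test for comparing RAG against another prompting
# strategy on the same labeled reports; the discordant counts b and c are
# the only information the test uses.
from math import comb

def mcnemar_exact(b, c):
    # b: RAG correct, baseline wrong; c: RAG wrong, baseline correct.
    n = b + c
    p_one_sided = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * p_one_sided)  # two-sided p-value under H0: p = 0.5
```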
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential steps
full rationale
The paper conducts an empirical study by manually labeling 2,006 bug reports (1,509 genuine, 497 false positives) from Bugzilla and Syzkaller, performs component analysis, and evaluates LLM prompting strategies on that fixed dataset. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The RAG result (91% recall, 88% F1) is a direct performance measurement against the manually assigned labels rather than a reduction to prior inputs by construction. Concerns about label quality or sample representativeness are validity issues, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: manual labeling of bug reports as false positive versus genuine is reliable and consistent across annotators.