Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts
Pith reviewed 2026-05-10 04:01 UTC · model grok-4.3
The pith
Sentence transformer embeddings classify static code analysis alerts as actionable or non-actionable with an F1 score of 89%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STAF applies a transformer-based model with sentence embeddings to label each static analysis finding as actionable or non-actionable. On a dataset of Java SCA reports, the model attains an F1 score of 89%, outperforming existing SCA warning filters by at least 11% in within-project settings and by at least 6% in cross-project settings.
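To make the two regimes concrete, here is a minimal sketch of how within-project and cross-project splits are typically constructed; the protocol, the data shapes, and the 384-dimensional embedding size are assumptions for illustration, not the paper's documented setup.

```python
# Sketch of the two evaluation regimes (assumed protocol, for illustration).
import numpy as np
from sklearn.model_selection import train_test_split, LeaveOneGroupOut

X = np.random.rand(100, 384)             # placeholder alert embeddings
y = np.random.randint(0, 2, 100)         # placeholder actionability labels
projects = np.random.randint(0, 5, 100)  # source project of each alert

# Within-project: train and test alerts may come from the same project.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-project: hold out every alert of one project at a time.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=projects):
    pass  # fit on train_idx, report F1 on the held-out project's alerts
```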
What carries the argument
STAF (Sentence Transformer-based Actionability Filtering), a classifier that converts alert descriptions into sentence embeddings and uses them to predict actionability.
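A minimal sketch of such an embed-then-classify pipeline, assuming a generic pretrained sentence-transformer model and a logistic-regression head; the paper's actual model variant, fine-tuning setup, and classifier may differ, and the toy alerts are invented.

```python
# Minimal embed-then-classify sketch (assumed pipeline, not STAF's exact setup).
# Requires: pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled alerts: 1 = actionable, 0 = non-actionable.
alerts = [
    "Possible null pointer dereference of 'conn' in close()",
    "Method name does not match naming convention",
]
labels = [1, 0]

# Any pretrained sentence-embedding model would slot in here.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(alerts)  # one fixed-size vector per alert description

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(encoder.encode(["Unclosed resource 'stream' may leak"])))
```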
If this is right
- Static analysis tools produce shorter reports that developers are more likely to review.
- Fewer non-actionable alerts reduce the chance that genuine issues are overlooked.
- The same filtering model works across projects without retraining from scratch.
- Overall effectiveness of static analysis in software development increases.
Where Pith is reading between the lines
- The same embedding approach could be tested on alerts from languages other than Java to check if the performance gain generalizes.
- Embedding the filter inside an IDE could let developers see only the predicted actionable alerts while they code.
- Combining the sentence embeddings with additional signals such as code change history might raise accuracy further; a sketch of such feature fusion follows this list.
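On that last point, a minimal sketch of how extra signals could be fused with the sentence embedding; the specific features (file churn, alert age) are hypothetical illustrations, not inputs the paper uses.

```python
# Hypothetical feature fusion: concatenate text embedding with numeric signals.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def featurize(alert_text, file_churn, alert_age_days):
    """file_churn and alert_age_days are illustrative extra signals."""
    text_vec = encoder.encode([alert_text])[0]
    extra = np.array([file_churn, alert_age_days], dtype=np.float32)
    return np.concatenate([text_vec, extra])  # feed this to any classifier
```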
Load-bearing premise
Human labels that mark alerts as actionable or non-actionable are consistent and match the decisions developers would actually make on new code.
What would settle it
A study in which practicing developers review a fresh set of SCA reports, decide which alerts they would act on, and compare those decisions against the model's predictions to measure mismatch rate.
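Scoring such a study reduces to a per-alert comparison; a toy sketch, assuming paired developer decisions and model predictions with placeholder values:

```python
# Mismatch rate between developer triage decisions and model predictions.
developer_decisions = [1, 0, 1, 1, 0]   # 1 = developer would act on the alert
model_predictions   = [1, 0, 0, 1, 0]   # 1 = model predicts actionable

mismatch = sum(d != m for d, m in zip(developer_decisions, model_predictions))
print(f"mismatch rate: {mismatch / len(developer_decisions):.2f}")  # 0.20
```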
Original abstract
Static code analysis (SCA) tools are widely used as effective ways to detect bugs and vulnerabilities in software systems. However, the reports generated by these tools often contain a large number of non-actionable findings, which can overwhelm developers to the point of ignoring them altogether -- this phenomenon is known as "alert fatigue". In this paper, we combat alert fatigue by proposing STAF: Sentence Transformer-based Actionability Filtering. Our approach leverages a transformer based architecture with sentence embeddings to classify findings into actionable and non-actionable categories. Evaluating STAF on a large dataset of reports from Java projects, we demonstrate that our method can effectively reduce the number of non-actionable findings while maintaining a high level of accuracy in identifying actionable issues. The results show that our approach can improve the usability of static analysis tools reaching an F1 score of 89%, outperforming existing methods for SCA warning filtering by at least 11% in a within-project setting and by at least 6% in a cross-project setting. By providing a more focused and relevant set of findings, we aim to enhance the overall effectiveness of static analysis in software development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes STAF, a sentence-transformer embedding approach to classify static code analysis alerts as actionable or non-actionable. On a large Java dataset it reports an F1 of 89%, with gains of at least 11% over baselines in within-project evaluation and 6% in cross-project evaluation, with the goal of mitigating alert fatigue.
Significance. If the human actionability labels prove stable and predictive of actual developer triage behavior, the work could meaningfully improve the practical utility of SCA tools. The application of sentence transformers to this filtering task is a natural extension of recent embedding techniques, and the within- versus cross-project split provides a useful empirical distinction. The result would be more significant if accompanied by evidence that the reported margins are not artifacts of label noise.
major comments (2)
- [Abstract] The headline F1 of 89% and the 11%/6% outperformance margins are obtained from supervised classification of human judgments of actionability. The manuscript supplies no information on how labels were collected (number of annotators per alert, majority-vote protocol, or inter-annotator agreement), nor any external validation linking the labels to actual fixes or ignored warnings. This directly affects the credibility of the central performance claims.
- [Evaluation] The results section (presumed) gives no details on baseline implementations, choice of classification threshold, statistical significance tests, or cross-validation procedure. Without these, the exact magnitude of the reported gains cannot be independently assessed.
minor comments (1)
- [Abstract] The abstract and introduction would benefit from an explicit definition of 'actionable' versus 'non-actionable' used in the labeling guidelines.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of transparency that we will address in the revision to strengthen the presentation of our results.
Point-by-point responses
Referee: [Abstract] The headline F1 of 89% and the 11%/6% outperformance margins are obtained from supervised classification of human judgments of actionability. The manuscript supplies no information on how labels were collected (number of annotators per alert, majority-vote protocol, or inter-annotator agreement), nor any external validation linking the labels to actual fixes or ignored warnings. This directly affects the credibility of the central performance claims.
Authors: We agree that the manuscript would be improved by providing explicit details on the label collection process. We will revise the paper to add a dedicated subsection describing the annotation procedure, including the number of annotators involved per alert, the majority-vote protocol used to resolve disagreements, and the computed inter-annotator agreement. We will also add a discussion of the limitation that our labels are based on human judgments of actionability rather than direct observation of developer triage actions such as fixes or suppressions; while this proxy is standard in the SCA filtering literature, we acknowledge that external validation against real developer behavior would further strengthen the claims and will note this as future work.
Revision: yes
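For reference, the promised inter-annotator agreement can be computed with Cohen's kappa; a minimal sketch with invented labels for two annotators (the paper's protocol may involve more annotators or a different statistic):

```python
# Cohen's kappa for two annotators labeling the same alerts (illustrative data).
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = actionable, 0 = non-actionable
annotator_b = [1, 0, 0, 0, 1, 0, 1, 1]
print(f"kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```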
Referee: [Evaluation] The results section (presumed) gives no details on baseline implementations, choice of classification threshold, statistical significance tests, or cross-validation procedure. Without these, the exact magnitude of the reported gains cannot be independently assessed.
Authors: We apologize for the omission of these methodological details in the submitted version. We will expand the Evaluation and Experimental Setup sections in the revised manuscript to fully describe the baseline implementations (including reimplementation choices and libraries), the method used to select the classification threshold, the statistical significance tests applied to the performance differences, and the cross-validation procedures employed for the within-project and cross-project settings. These additions will allow independent verification and assessment of the reported gains.
Revision: yes
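As one hedged illustration of how the promised details could look, a sketch of threshold selection on a validation split plus a paired bootstrap comparison; both procedures are assumptions, not the paper's documented choices, and all data below is placeholder.

```python
# Threshold selection on validation F1 and a paired bootstrap test (assumed).
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 200)   # placeholder validation labels
p_val = rng.random(200)           # placeholder predicted probabilities

# Pick the classification threshold that maximizes validation F1.
thresholds = np.linspace(0.1, 0.9, 81)
best_t = max(thresholds, key=lambda t: f1_score(y_val, (p_val >= t).astype(int)))

def paired_bootstrap(y, pred_a, pred_b, n=1000):
    """Fraction of resampled test sets on which model A beats model B in F1."""
    wins = 0
    for _ in range(n):
        idx = rng.integers(0, len(y), len(y))
        if f1_score(y[idx], pred_a[idx]) > f1_score(y[idx], pred_b[idx]):
            wins += 1
    return wins / n
```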
Circularity Check
No significant circularity; standard empirical ML evaluation on held-out data
full rationale
The paper reports training a sentence-transformer classifier on human-labeled actionability data and measuring F1 (89%) plus gains over baselines on within-project and cross-project held-out test sets. No equations, derivations, or self-citations are presented that reduce the reported metrics to the training inputs by construction. The evaluation uses external benchmarks (baselines and unseen data splits), satisfying the self-contained criterion. Label consistency is a validity concern, not a circularity issue per the guidelines.
Axiom & Free-Parameter Ledger
free parameters (2)
- classification threshold
- transformer model variant and fine-tuning hyperparameters
axioms (1)
- Domain assumption: sentence embeddings from the chosen transformer capture semantic features relevant to whether a warning is actionable.